Amit's Oracle DBA Blog: Oracle SQL Tuning Bobby Durrett

Introduction

“Tuning a SQL statement” means making the SQL query run as fast as you need it to. In

many cases this means taking a long running query and reducing its run time to a small

fraction of its original length - hours to seconds for example. Books have been written on

this topic and saying something useful in a short paper seems impossible. Oracle’s

database software has so many options. What should we talk about first? We contend

that at the core of SQL tuning are three fundamental concepts - join order, join method,

and access method. This paper will explain these three concepts and show that tuning

SQL boils down to making the proper choice in each category. It is hard to imagine an

area of SQL tuning that doesn’t touch on one or more of these three areas and in any case

the author knows from experience that tuning the join order, join methods, and access

methods consistently leads to fantastic results.

The following example will be used throughout this paper. It was chosen because it joins

three tables. Fewer wouldn’t show everything needed, and more would be too complex.

First, here are our three tables in a fictional sales application:

create table sales

(

sale_date date,

product_number number,

customer_number number,

amount number

);

create table products

(

product_number number,

product_type varchar2(12),

product_name varchar2(12)

);

create table customers

(

customer_number number,

customer_state varchar2(14),

customer_name varchar2(17)

);

The sales table has one row for each sale which is one product to one customer for a

given amount. The products and customers tables have one row per product or customer.

The columns product_number and customer_number are used to join the tables together.

They uniquely identify one product or customer.

Here is the sample data:

insert into sales values (to_date('01/02/2012','MM/DD/YYYY'),1,1,100);

insert into sales values (to_date('01/03/2012','MM/DD/YYYY'),1,2,200);

insert into sales values (to_date('01/04/2012','MM/DD/YYYY'),2,1,300);

insert into sales values (to_date('01/05/2012','MM/DD/YYYY'),2,2,400);

insert into products values (1,'Cheese','Chedder');

insert into products values (2,'Cheese','Feta');

insert into customers values (1,'FL','Sunshine State Co');

insert into customers values (2,'FL','Green Valley Inc');

Here is our sample query that we will use throughout and its output:

SQL> select

2 sale_date, product_name, customer_name, amount

3 from sales, products, customers

4 where

5 sales.product_number=products.product_number and

6 sales.customer_number=customers.customer_number and

7 sale_date between

8 to_date('01/01/2012','MM/DD/YYYY') and

9 to_date('01/31/2012','MM/DD/YYYY') and

10 product_type = 'Cheese' and

11 customer_state = 'FL';

SALE_DATE PRODUCT_NAME CUSTOMER_NAME AMOUNT

--------- ------------ ----------------- ----------

04-JAN-12 Feta Sunshine State Co 300

02-JAN-12 Chedder Sunshine State Co 100

05-JAN-12 Feta Green Valley Inc 400

03-JAN-12 Chedder Green Valley Inc 200

This query returns the sales for the month of January 2012 for products that are a type of

cheese and customers that reside in Florida. It joins the sales table to products and

customers to get the product and customer names and uses the product_type and

customer_state columns to limit the sales records by product and customer criteria.

Join Order

The Oracle optimizer translates this query into an execution tree that represents reads

from the underlying tables and joins between two sets of row data at a time. Rows must

be retrieved from the sales, products, and customers tables. These rows must be joined

together, but only two at a time. So, sales and products could be joined first and then the

result could be joined to customers. In this case the “join order” would be sales,

products, customers.

Some join orders have much worse performance than others. Consider the join order

customers, products, sales. This means first join customers to products and then join the

result to sales.

Note that there are no conditions in the where clause that relate the customers and

products tables.

A join between the two makes a Cartesian product which means

combine each row in products with every row in customers. The following query joins

customers and products with the related conditions from the original query, doing a

Cartesian join.

SQL> -- joining products and customers

SQL> -- cartesian join

SQL>

SQL> select

2 product_name,customer_name

3 from products, customers

4 where

5 product_type = 'Cheese' and

6 customer_state = 'FL';

PRODUCT_NAME CUSTOMER_NAME

------------ -----------------

Chedder Sunshine State Co

Chedder Green Valley Inc

Feta Sunshine State Co

Feta Green Valley Inc

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS

1 0 MERGE JOIN (CARTESIAN)

2 1 TABLE ACCESS (FULL) OF 'PRODUCTS' (TABLE)

3 1 BUFFER (SORT)

4 3 TABLE ACCESS (FULL) OF 'CUSTOMERS' (TABLE)

In our case we have two products and two customers so this is just 2 x 2 = 4 rows. But

imagine if you had 100,000 of each. Then a Cartesian join would return 100,000 x

100,000 = 10,000,000,000 rows. Probably not the most efficient join order. But, if you

have a small number of rows in both tables there is nothing wrong with a Cartesian join

and in some cases this is the best join order. But, clearly there are cases where a

Cartesian join blows up a join of modest sized tables into billions of resulting rows and

this is not the best way to optimize the query.

There are many possible join orders. If the variable n represents the number of tables in

your from clause there are n! possible join orders. So, our example query has 3! = 6 join

orders:

sales, products, customers

sales, customers, products

products, customers, sales

products, sales, customers

customers, products, sales

customers, sales, products

The number of possible join orders explodes as you increase the number of tables in the

from clause:

Tables in from

clause

Possible join

orders

1 1

2 2

3 6

4 24

5 120

6 720

7 5040

8 40320

9 362880

10 3628800

You might wonder if these join orders were really different:

sales, products, customers

products, sales, customers

In both cases sales and products are joined first and then the result is joined to customers:

Join order 1:

Join order 2:

These differ because of asymmetry in the join methods. Typically a join will be more

efficient if the first table has fewer rows and the second has more. So, if you have more

sales rows and fewer products then the second join order would be faster if the join

method preferred fewer rows in the first table in the join.

Generally the best join order lists the tables in the order of the selectivity of their

predicates. If your query returns 1% of the sales rows and 50% of the products rows then

you probably should list sales before products in the join order. Let’s look at our

example query again:

select

sale_date, product_name, customer_name, amount

from sales, products, customers

where

sales.product_number=products.product_number and

sales.customer_number=customers.customer_number and

sale_date between

to_date('01/01/2012','MM/DD/YYYY') and

to_date('01/31/2012','MM/DD/YYYY') and

product_type = 'Cheese' and

customer_state = 'FL';

If your sales table has only January and February 2012 data in equal amounts then the

selectivity of the condition on sales_date is only 50%. If sales had ten years of equally

distributed data the selectivity would be less than 1%. You can easily determine the

selectivity by running two count(*) queries against the tables using the non-joining where

clause conditions:

products sales

join 1 customers

join 2

-- # selected rows

select

count(*)

from sales

where

sale_date between

to_date('01/01/2012','MM/DD/YYYY') and

to_date('01/31/2012','MM/DD/YYYY');

-- total #rows

select

count(*)

from sales;

You should do this for every table in the from clause to see the real selectivity.

In the same way, you need to know the selectivity of each subtree in the plan - i.e. each

combination of tables that have already been joined together. So, in our three table

example this would just be the three possible combinations - sales and products, sales and

customers, and customers and products. Here is the sales and products combination:

SQL> select count(*)

2 from sales, products

3 where

4 sales.product_number=products.product_number and

5 sale_date between

6 to_date('01/01/2012','MM/DD/YYYY') and

7 to_date('01/31/2012','MM/DD/YYYY') and

8 product_type = 'Cheese';

COUNT(*)

----------

SQL> select count(*)

2 from sales, products

3 where

4 sales.product_number=products.product_number;

COUNT(*)

----------

In this simple example the selectivity of the sales and products combination is 100%

because the count without the conditions against the constants equals the count with the

conditions with constants. But, in a real query these would differ and in some cases the

combination of conditions on two tables is much more selective than the conditions on

each individual table.

Extreme differences in selectivity cause reordering the joins to produce the best tuning

improvement. We started out this paper with the idea that you have a SQL statement that

is slow and you want to speed it up. You may want a query that runs for hours to run in

seconds. To find this kind of improvement by looking at join order and predicate

selectivity you have to look for a situation where the predicates on one of the tables has

an extremely high selectivity compared with the others. So, say your conditions on your

sales rows reduced the number of sales rows considered down to a handful of rows out of

the millions in the table. Putting the sales table first in the join order if it isn’t already put

there by the optimizer could produce a huge improvement, especially if the conditions on

the other large tables in the join were not very selective. What if your criteria for

products reduced you down to one product row out of hundreds of thousands? Put

products first. On the other hand if there isn’t a big gap in the selectivity against the

tables in the from clause then simply ordering the joins by selectivity is not guaranteed to

help.

You change the join order by giving the optimizer more information about the selectivity

of the predicates on the tables or by manually overriding the optimizer’s choice. If the

optimizer knows that you have an extremely selective predicate it will put the associated

table at the front of the join order. You can help the optimizer by gathering statistics on

the tables in the from clause. Accurate statistics help the optimizer know how selective

the conditions are. In some cases higher estimate percentages on your statistics gathering

can make the difference. In other cases histograms on the columns that are so selective

can help the optimizer realize how selective the conditions really are.

Here is how to set the optimizer preferences on a table to set the estimate percentage and

choose a column to have a histogram and gather statistics on the table:

-- 1 - set preferences

begin

DBMS_STATS.SET_TABLE_PREFS(NULL,'SALES','ESTIMATE_PERCENT','10');

DBMS_STATS.SET_TABLE_PREFS(NULL,'SALES','METHOD_OPT',

'FOR COLUMNS SALE_DATE SIZE 254 PRODUCT_NUMBER SIZE 1 '||

'CUSTOMER_NUMBER SIZE 1 AMOUNT SIZE 1');

end;

-- 2 - regather table stats with new preferences

execute DBMS_STATS.GATHER_TABLE_STATS (NULL,'SALES');

This set the estimate percentage to 10% for the sales table. It put a histogram on the

sales_date column only.

You can use a cardinality hint to help the optimizer know which table has the most

selective predicates. In this example we tell the optimizer that the predicates on the sales

table will only return one row:

SQL> select /*+cardinality(sales 1) */

2 sale_date, product_name, customer_name, amount

3 from sales, products, customers

4 where

5 sales.product_number=products.product_number and

6 sales.customer_number=customers.customer_number and

7 sale_date between

8 to_date('01/01/2012','MM/DD/YYYY') and

9 to_date('01/31/2012','MM/DD/YYYY') and

10 product_type = 'Cheese' and

11 customer_state = 'FL';

SALE_DATE PRODUCT_NAME CUSTOMER_NAME AMOUNT

--------- ------------ ----------------- ----------

04-JAN-12 Feta Sunshine State Co 300

02-JAN-12 Chedder Sunshine State Co 100

05-JAN-12 Feta Green Valley Inc 400

03-JAN-12 Chedder Green Valley Inc 200

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS

1 0 HASH JOIN

2 1 HASH JOIN

3 2 TABLE ACCESS (FULL) OF 'SALES' (TABLE)

4 2 TABLE ACCESS (FULL) OF 'PRODUCTS' (TABLE

5 1 TABLE ACCESS (FULL) OF 'CUSTOMERS' (TABLE)

As a result, the optimizer puts sales at the beginning of the join order. Note that in this

case for an example we lied to the optimizer since really there are four rows. Sometimes

giving the optimizer the true cardinality will work, but in others you will need to under or

overestimate to get the join order that is optimal.

You can use a leading hint to override the optimizer’s choice of join order. For example

if sales had the highly selective predicate you could do this:

SQL> select /*+leading(sales) */

2 sale_date, product_name, customer_name, amount

3 from sales, products, customers

4 where

5 sales.product_number=products.product_number and

6 sales.customer_number=customers.customer_number and

7 sale_date between

8 to_date('01/01/2012','MM/DD/YYYY') and

9 to_date('01/31/2012','MM/DD/YYYY') and

10 product_type = 'Cheese' and

11 customer_state = 'FL';

SALE_DATE PRODUCT_NAME CUSTOMER_NAME AMOUNT

--------- ------------ ----------------- ----------

04-JAN-12 Feta Sunshine State Co 300

02-JAN-12 Chedder Sunshine State Co 100

05-JAN-12 Feta Green Valley Inc 400

03-JAN-12 Chedder Green Valley Inc 200

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS

1 0 HASH JOIN

2 1 HASH JOIN

3 2 TABLE ACCESS (FULL) OF 'SALES' (TABLE)

4 2 TABLE ACCESS (FULL) OF 'PRODUCTS'

5 1 TABLE ACCESS (FULL) OF 'CUSTOMERS' (TABLE)

The leading hint on sales forced it to be the first table in the join order just like the

cardinality hint. It left it up to the optimizer to choose either products or customers as the

next table.

If all else fails and you can’t get the optimizer to join the tables together in an efficient

order you can break the query into multiple queries saving the intermediate results in a

global temporary table. Here is how to break our three table join into two joins - first

sales and products, and then the results of that join with customers:

SQL> create global temporary table sales_product_results

2 (

3 sale_date date,

4 customer_number number,

5 amount number,

6 product_type varchar2(12),

7 product_name varchar2(12)

8 ) on commit preserve rows;

Table created.

SQL> insert /*+append */

2 into sales_product_results

3 select

4 sale_date,

5 customer_number,

6 amount,

7 product_type,

8 product_name

9 from sales, products

10 where

11 sales.product_number=products.product_number and

12 sale_date between

13 to_date('01/01/2012','MM/DD/YYYY') and

14 to_date('01/31/2012','MM/DD/YYYY') and

15 product_type = 'Cheese';

4 rows created.

SQL> commit;

Commit complete.

SQL> select

2 sale_date, product_name, customer_name, amount

3 from sales_product_results spr, customers c

4 where

5 spr.customer_number=c.customer_number and

6 c.customer_state = 'FL';

SALE_DATE PRODUCT_NAME CUSTOMER_NAME AMOUNT

--------- ------------ ----------------- ----------

02-JAN-12 Chedder Sunshine State Co 100

03-JAN-12 Chedder Green Valley Inc 200

04-JAN-12 Feta Sunshine State Co 300

05-JAN-12 Feta Green Valley Inc 400

Breaking a query up like this is a very powerful method of tuning. If you have the ability

to modify an application in this way you have total control over how the query is run

because you decide which joins are done and in what order. I’ve seen dramatic run time

improvement using this simple technique.

JoinMethods

The choice of the best join method is determined by the number of rows being joined

from each source and the availability of an index on the joining columns. Nested loops

joins work best when the first table in the join has few rows and the second has a unique

index on the joined columns. If there is no index on the larger table or if both tables have

a large number of rows to be joined then hash join them with the smaller table first in the

join.

Here is our sample query’s plan with all nested loops:

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'CUSTOMERS' (TABLE)

2 1 NESTED LOOPS

3 2 NESTED LOOPS

4 3 TABLE ACCESS (FULL) OF 'SALES' (TABLE)

5 3 TABLE ACCESS (BY INDEX ROWID) OF 'PRODUCTS'

6 5 INDEX (RANGE SCAN) OF 'PRODUCTS_INDEX' (INDEX)

7 2 INDEX (RANGE SCAN) OF 'CUSTOMERS_INDEX' (INDEX)

In this example let’s say that the date condition on the sales table means that only a few

rows would be returned from sales. But, what if the conditions on products and

customers were not very limiting? Maybe all your customers are in Florida and you

mainly sell cheese. If you have indexes on product_number and customer_number or if

you add them then a nested loops join makes sense here.

But, what if the sales table had many rows in the date range the conditions on products

and customers were not selective either? In this case the nested joops join using the

index can be very inefficient. We have seen many cases were doing a full scan on both

tables in a join and using a hash join produces the fastest query runtime. Here is a plan

with all hash joins that we saw in our leading hint example:

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS

1 0 HASH JOIN

2 1 HASH JOIN

3 2 TABLE ACCESS (FULL) OF 'SALES' (TABLE)

4 2 TABLE ACCESS (FULL) OF 'PRODUCTS'

5 1 TABLE ACCESS (FULL) OF 'CUSTOMERS' (TABLE)

Of course you can mix and match and use a hash join in one case and nested loops in

another.

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'CUSTOMERS' (TABLE)

2 1 NESTED LOOPS

3 2 HASH JOIN

4 3 TABLE ACCESS (FULL) OF 'SALES' (TABLE)

5 3 TABLE ACCESS (FULL) OF 'PRODUCTS' (TABLE)

6 2 INDEX (RANGE SCAN) OF 'CUSTOMERS_INDEX' (INDEX)

As with join order you can cause the optimizer to use the optimal join method by either

helping it know how many rows will be joined from each row source or by forcing the

join method using a hint. The following hints forced the join methods used above:

/*+ use_hash(sales products) use_nl(products customers) */

Use_hash forces a hash join, use_nl forces a nested loops join. The table with the fewest

rows to be joined should go first. A great way to get the optimizer to use a nested loops

join is to create an index on the join columns of the table with the least selective where

clause predicates.

For example, these statements create indexes on the product_number

and customer_number columns of the products and customers tables:

create index products_index on products(product_number);

create index customers_index on customers(customer_number);

You can encourage hash joins by manually setting the hash_area_size to a large number.

The init parameter hash_area_size is an area of memory in each session’s server

process’s local memory (PGA) that buffers the on-disk hash table. The more memory

dedicated to hashing the more efficient a hash join is and the more likely the optimizer is

to choose a hash join over a nested loops join. You have to first turn off automatic pga

memory management by setting workarea_size_policy to manual and then setting sort

and hash area sizes.

Here is an example from a working production system:

NAME TYPE VALUE

------------------------------------ ----------- ---------

hash_area_size integer 100000000

sort_area_size integer 100000000

workarea_size_policy string MANUAL

Access Methods

For clarity we limit our discussion of access method to two options: index scan and full

table scan. There are many possible access methods but at its core the choice of access

method comes down to how many rows will be retrieved from the table. Access a table

using an index scan when you want to read a very small percentage of the rows,

otherwise use a full table scan. An old rule of thumb was that if you read 10% of the

table or less use an index scan. Today that rule should be 1% or even much less. If you

are accessing only a handful of rows of a huge table and you are not using an index then

switch to an index. If the table is small or if you are accessing even a modest percentage

of the rows you will want a full scan.

The parameters optimizer_index_caching, optimizer_index_cost_adj, and

db_file_multiblock_read_count all affect how likely the optimizer is to use an index

versus a full scan. You may find that you have many queries using indexes where a full

scan is more efficient or vice versa. These parameters can improve the performance of

many queries since they are system wide. On one Exadata system where we wanted to

discourage index use we set optimizer_index_cost_adj to 1000, ten times its normal

setting of 100. This caused many queries to begin using full scans, which Exadata

processes especially well. Here is the command we used to set this parameter value:

alter system set optimizer_index_cost_adj=1000 scope=both sid='*';

The greater the degree of parallelism set on a table the more likely the optimizer is to use

a full scan. So, if a table is set to parallel 8 the optimizer assumes a full scan takes 1/8th

the time it would serially. But there are diminishing marginal returns on increasing the

degree of parallelism. You get less benefit as you go above 8 and if you have many users

querying with a high degree of parallelism you can overload your system. This is how

you set a table to a parallel degree 8:

alter table sales parallel 8;

Last, but certainly not least in importance are index and full scan hints. Here are

examples of both:

/*+ full(sales) index(customers) index(products) */

These hints force the sales table to be accessed using a full scan, but require the

customers and products tables to be accessed using an index.

Conclusion

We have demonstrated how to tune an Oracle SQL query by examining and manipulating

the join order, join methods, and access methods used in the query’s execution plan. First

you must query the tables to determine the true number of rows returned from each table

and from each likely join. Then you identify which join order, join method, or access

method chosen by the optimizer is inefficient. Then you use one of the suggested

methods to change the optimizer’s choice to be the efficient one. These methods include

gathering optimizer statistics, hints, setting init parameters, breaking the query into

smaller pieces, and adjusting the parallel degree. Left unsaid is that you must test your

changed plan to make sure the new combination of join order, join methods and access

methods actually improve the performance. In SQL*Plus turn timing on and check the

elapsed time with the optimizer’s original execution plan and compare the elapsed time

with the new plan:

SQL> set timing on

SQL>

SQL> select

2 sale_date, product_name, customer_name, amount

3 from sales, products, customers

4 where

5 sales.product_number=products.product_number and

6 sales.customer_number=customers.customer_number and

7 sale_date between

8 to_date('01/01/2012','MM/DD/YYYY') and

9 to_date('01/31/2012','MM/DD/YYYY') and

10 product_type = 'Cheese' and

11 customer_state = 'FL';

SALE_DATE PRODUCT_NAME CUSTOMER_NAME AMOUNT

--------- ------------ ----------------- ----------

02-JAN-12 Chedder Sunshine State Co 100

03-JAN-12 Chedder Green Valley Inc 200

04-JAN-12 Feta Sunshine State Co 300

05-JAN-12 Feta Green Valley Inc 400

Elapsed: 00:00:00.00

This is the final test that our analysis has brought us to our initial goal - of vastly

improved run time for our problem query.

Amit's Oracle DBA Blog

Disclaimer

Thursday, 10 February 2022

Oracle SQL Tuning Bobby Durrett

No comments:

Post a Comment

Oracle Exadata

Labels