
Chapter 21

Query Processing

Transparencies

© Pearson Education Limited 1995, 2005

Chapter 21 - Objectives

Objectives of query processing and optimization.

Static versus dynamic query optimization.

How a query is decomposed and semantically analyzed.

How to create a R.A.T. to represent a query.

Rules of equivalence for RA operations.

How to apply heuristic transformation rules to improve efficiency of a query.

Types of database statistics required to estimate cost of operations.

Different strategies for implementing selection.

How to evaluate cost and size of selection.

Different strategies for implementing join.

How to evaluate cost and size of join.

Different strategies for implementing projection.

How to evaluate cost and size of projection.

How to evaluate the cost and size of other RA operations.

How pipelining can be used to improve efficiency of queries.

Difference between materialization and pipelining.

Advantages of left-deep trees.

Approaches to finding optimal execution strategy.

How Oracle handles QO.


Introduction

In network and hierarchical DBMSs, low-level procedural query language is generally embedded in high-level programming language.

Programmer's responsibility to select most appropriate execution strategy.

With declarative languages such as SQL, user specifies what data is required rather than how it is to be retrieved.

Relieves user of knowing what constitutes good execution strategy.


Introduction

Also gives DBMS more control over system performance.

Two main techniques for query optimization:

– heuristic rules that order operations in a query;

– comparing different strategies based on relative costs, and selecting one that minimizes resource usage.

Disk access tends to be dominant cost in query processing for centralized DBMS.


Query Processing

Activities involved in retrieving data from the database.

Aims of QP:

– transform query written in high-level language (e.g. SQL), into correct and efficient execution strategy expressed in low-level language (implementing RA);

– execute strategy to retrieve required data.


Query Optimization

Activity of choosing an efficient execution strategy for processing query.

As there are many equivalent transformations of same high-level query, aim of QO is to choose one that minimizes resource usage.

Generally, reduce total execution time of query.

May also reduce response time of query.

Problem computationally intractable with large number of relations, so strategy adopted is reduced to finding near optimum solution.


Example 21.1 - Different Strategies

Find all Managers who work at a London branch.

SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
      (s.position = 'Manager' AND b.city = 'London');



Three equivalent RA queries are:

(1) σ_{(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)}(Staff × Branch)

(2) σ_{(position='Manager') ∧ (city='London')}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)

(3) (σ_{position='Manager'}(Staff)) ⋈_{Staff.branchNo=Branch.branchNo} (σ_{city='London'}(Branch))



Assume:

– 1000 tuples in Staff; 50 tuples in Branch;

– 50 Managers; 5 London branches;

– no indexes or sort keys;

– results of any intermediate operations stored on disk;

– cost of the final write is ignored;

– tuples are accessed one at a time.


Example 21.1 - Cost Comparison

Costs (in disk accesses) are:

(1) (1000 + 50) + 2*(1000 * 50) = 101 050
(2) 2*1000 + (1000 + 50) = 3 050
(3) 1000 + 2*50 + 5 + (50 + 5) = 1 160

Cartesian product and join operations much more expensive than selection, and third option significantly reduces size of relations being joined together.
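
The arithmetic behind the three estimates can be reproduced with a short Python sketch. The variable and function names are illustrative and not from the source. The breakdown of strategy (3) into read and write steps is an interpretation that agrees with the stated total: the 2*50 term is taken to cover writing the 50 Manager tuples and reading the 50 Branch tuples.

```python
# Figures from the assumptions of Example 21.1; names are illustrative.
n_staff = 1000     # tuples in Staff
n_branch = 50      # tuples in Branch
n_managers = 50    # Staff tuples with position = 'Manager'
n_london = 5       # Branch tuples with city = 'London'


def cost_strategy_1():
    # Read both relations, write the Cartesian product to disk,
    # then read it back to apply the selection.
    product = n_staff * n_branch
    return (n_staff + n_branch) + 2 * product


def cost_strategy_2():
    # Read both relations to join them. The 2*1000 term implies the join
    # yields 1000 tuples, written once and read back for the selection.
    join_size = n_staff
    return 2 * join_size + (n_staff + n_branch)


def cost_strategy_3():
    # Read Staff and write the Managers, read Branch and write the London
    # branches, then read both reduced relations to join them.
    select_staff = n_staff + n_managers
    select_branch = n_branch + n_london
    join = n_managers + n_london
    return select_staff + select_branch + join


costs = {1: cost_strategy_1(), 2: cost_strategy_2(), 3: cost_strategy_3()}
print(costs)
```

Pushing the selections below the join shrinks both join inputs, which is why strategy (3) is roughly 87 times cheaper than strategy (1) under these assumptions.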


Phases of Query Processing

QP has four main phases:

– decomposition (consisting of parsing and validation);

– optimization;

– code generation;

– execution.


Phases of Query Processing

[Figure: phases of query processing]


Dynamic versus Static Optimization

Two times when first three phases of QP can be carried out:

– dynamically every time query is run;

– statically when query is first submitted.

Advantages of dynamic QO arise from fact that information is up to date.

Disadvantages are that performance of query is affected, time may limit finding optimum strategy.


Dynamic versus Static Optimization

Advantages of static QO are removal of runtime overhead, and more time to find optimum strategy.

Disadvantages arise from fact that chosen execution strategy may no longer be optimal when query is run.

Could use a hybrid approach to overcome this.


Query Decomposition

Aims are to transform high-level query into RA query and check that query is syntactically and semantically correct.

Typical stages are:

– analysis,
– normalization,
– semantic analysis,
– simplification,
– query restructuring.


Analysis

Analyze query lexically and syntactically using compiler techniques.

Verify relations and attributes exist.

Verify operations are appropriate for object type.


Analysis - Example

SELECT staff_no
FROM Staff
WHERE position > 10;

This query would be rejected on two grounds:

– staff_no is not defined for Staff relation (should be staffNo).

– Comparison '>10' is incompatible with type position, which is variable character string.


Analysis

Finally, query transformed into some internal representation more suitable for processing.

Some kind of query tree is typically chosen, constructed as follows:

– Leaf node created for each base relation.
– Non-leaf node created for each intermediate relation produced by RA operation.
– Root of tree represents query result.
– Sequence is directed from leaves to root.


Example 21.1 - R.A.T.

[Figure: relational algebra tree for Example 21.1]


Normalization

Converts query into a normalized form for easier manipulation.

Predicate can be converted into one of two forms:

Conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')

Disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
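
That the two normal forms describe the same predicate can be checked by brute force over all truth assignments. The sketch below treats each comparison as an opaque boolean; the function and parameter names are illustrative only.

```python
from itertools import product


def cnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' OR salary > 20000) AND branchNo = 'B003'
    return (is_manager or high_salary) and in_b003


def dnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and in_b003) or (high_salary and in_b003)


# Compare both forms on all 2^3 combinations of the three conditions.
equivalent = all(
    cnf(*values) == dnf(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```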


Semantic Analysis

Rejects normalized queries that are incorrectly formulated or contradictory.

Query is incorrectly formulated if components do not contribute to generation of result.

Query is contradictory if its predicate cannot be satisfied by any tuple.

Algorithms to determine correctness exist only for queries that do not contain disjunction and negation.


Semantic Analysis

For these queries, could construct:

– A relation connection graph.

– Normalized attribute connection graph.

Relation connection graph

Create node for each relation and node for result. Create edges between two nodes that represent a join, and edges between nodes that represent projection.

If not connected, query is incorrectly formulated.


Semantic Analysis - Normalized Attribute Connection Graph

Create node for each reference to an attribute, or constant 0.

Create directed edge between nodes that represent a join, and directed edge between attribute node and 0 node that represents selection.

Weight edges a → b with value c, if it represents inequality condition (a ≤ b + c); weight edges 0 → a with -c, if it represents inequality condition (a ≥ c).

If graph has cycle for which valuation sum is negative, query is contradictory.


Example 21.2 - Checking Semantic Correctness

SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.clientNo = v.clientNo AND
      c.maxRent >= 500 AND
      c.prefType = 'Flat' AND p.ownerNo = 'CO93';

Relation connection graph not fully connected, so query is not correctly formulated.

Have omitted the join condition (v.propertyNo = p.propertyNo).



[Figure: relation connection graph and normalized attribute connection graph]


SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.maxRent > 500 AND
      c.clientNo = v.clientNo AND
      v.propertyNo = p.propertyNo AND
      c.prefType = 'Flat' AND c.maxRent < 200;

Normalized attribute connection graph has cycle between nodes c.maxRent and 0 with negative valuation sum, so query is contradictory.
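
The negative-cycle test can be sketched with a Bellman-Ford style relaxation. This is one standard way to detect negative cycles, not necessarily the algorithm a given DBMS uses. The sketch models only the two selection conditions on c.maxRent, treats the strict inequalities as non-strict, and leaves out the join edges; node names and the helper function are illustrative.

```python
def has_negative_cycle(nodes, edges):
    """Bellman-Ford check; edges are (source, target, weight) triples."""
    # Starting every node at distance 0 acts like a virtual source linked
    # to all nodes, so a negative cycle anywhere in the graph is found.
    dist = {node: 0 for node in nodes}
    for _ in range(len(nodes) - 1):
        for source, target, weight in edges:
            if dist[source] + weight < dist[target]:
                dist[target] = dist[source] + weight
    # If any edge can still be relaxed, a negative cycle exists.
    return any(dist[s] + w < dist[t] for s, t, w in edges)


nodes = ["0", "c.maxRent"]
edges = [
    ("0", "c.maxRent", -500),  # c.maxRent > 500: edge 0 -> a with weight -c
    ("c.maxRent", "0", 200),   # c.maxRent < 200: edge a -> 0 with weight c
]
contradictory = has_negative_cycle(nodes, edges)
print(contradictory)  # cycle weight is -500 + 200 = -300
```

For comparison, replacing the first condition with c.maxRent > 100 gives a cycle weight of +100, and the check reports no contradiction.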


Simplification

– Detects redundant qualifications,
– eliminates common sub-expressions,
– transforms query to semantically equivalent but more easily and efficiently computed form.

Typically, access restrictions, view definitions, and integrity constraints are considered.

Assuming user has appropriate access privileges, first apply well-known idempotency rules of boolean algebra.


Transformation Rules for RA Operations

Conjunctive Selection operations can cascade into individual Selection operations (and vice versa).

σ_{p∧q∧r}(R) = σ_{p}(σ_{q}(σ_{r}(R)))

Sometimes referred to as cascade of Selection.

σ_{branchNo='B003' ∧ salary>15000}(Staff) = σ_{branchNo='B003'}(σ_{salary>15000}(Staff))



Commutativity of Selection.

σ_{p}(σ_{q}(R)) = σ_{q}(σ_{p}(R))

For example:

σ_{branchNo='B003'}(σ_{salary>15000}(Staff)) = σ_{salary>15000}(σ_{branchNo='B003'}(Staff))
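
The cascade and commutativity rules for Selection can be checked on a small in-memory relation. The sample Staff tuples and the `select` helper below are made up for illustration.

```python
# Hypothetical Staff tuples, invented for this illustration.
staff = [
    {"staffNo": "SG37", "branchNo": "B003", "salary": 12000},
    {"staffNo": "SG14", "branchNo": "B003", "salary": 18000},
    {"staffNo": "SL21", "branchNo": "B005", "salary": 30000},
    {"staffNo": "SG5", "branchNo": "B003", "salary": 24000},
]


def select(predicate, relation):
    """Selection: keep the tuples of relation that satisfy predicate."""
    return [t for t in relation if predicate(t)]


def in_b003(t):
    return t["branchNo"] == "B003"


def earns_over_15000(t):
    return t["salary"] > 15000


# One conjunctive selection.
conjunctive = select(lambda t: in_b003(t) and earns_over_15000(t), staff)
# Cascade: the same predicate split into two nested selections.
cascade = select(in_b003, select(earns_over_15000, staff))
# Commuted: the two selections applied in the opposite order.
commuted = select(earns_over_15000, select(in_b003, staff))

print([t["staffNo"] for t in conjunctive])
```

All three expressions return the same tuples, so an optimizer is free to apply whichever selection is cheaper or more restrictive first.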



In a sequence of Projection operations, only the last in the sequence is required.

Π_{L}Π_{M} … Π_{N}(R) = Π_{L}(R)

For example:

Π_{lName}Π_{branchNo, lName}(Staff) = Π_{lName}(Staff)



Commutativity of Selection and Projection.

If predicate p involves only attributes in projection list, Selection and Projection operations commute:

Π_{A1, …, Am}(σ_{p}(R)) = σ_{p}(Π_{A1, …, Am}(R)) where p ∈ {A1, A2, …, Am}

For example:

Π_{fName, lName}(σ_{lName='Beech'}(Staff)) = σ_{lName='Beech'}(Π_{fName, lName}(Staff))



Commutativity of Theta join (and Cartesian product).

R ⋈_{p} S = S ⋈_{p} R

R × S = S × R

Rule also applies to Equijoin and Natural join.

For example:

Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch = Branch ⋈_{Staff.branchNo=Branch.branchNo} Staff



Commutativity of Selection and Theta join (or Cartesian product).

If selection predicate involves only attributes of one of join relations, Selection and Join (or Cartesian product) operations commute:

σ_{p}(R ⋈_{r} S) = (σ_{p}(R)) ⋈_{r} S

σ_{p}(R × S) = (σ_{p}(R)) × S

where p ∈ {A1, A2, …, An}



If selection predicate is conjunctive predicate having form (p ∧ q), where p only involves attributes of R, and q only attributes of S, Selection and Theta join operations commute as:

σ_{p ∧ q}(R ⋈_{r} S) = (σ_{p}(R)) ⋈_{r} (σ_{q}(S))

σ_{p ∧ q}(R × S) = (σ_{p}(R)) × (σ_{q}(S))



For example:

σ_{position='Manager' ∧ city='London'}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch) =
(σ_{position='Manager'}(Staff)) ⋈_{Staff.branchNo=Branch.branchNo} (σ_{city='London'}(Branch))



Commutativity of Projection and Theta join (or Cartesian product).

If projection list is of form L = L1 ∪ L2, where L1 only has attributes of R, and L2 only has attributes of S, provided join condition only contains attributes of L, Projection and Theta join commute:

Π_{L1∪L2}(R ⋈_{r} S) = (Π_{L1}(R)) ⋈_{r} (Π_{L2}(S))



If join condition contains additional attributes not in L (M = M1 ∪ M2 where M1 only has attributes of R, and M2 only has attributes of S), a final projection operation is required:

Π_{L1∪L2}(R ⋈_{r} S) = Π_{L1∪L2}((Π_{L1∪M1}(R)) ⋈_{r} (Π_{L2∪M2}(S)))



For example:

Π_{position, city, branchNo}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch) =
(Π_{position, branchNo}(Staff)) ⋈_{Staff.branchNo=Branch.branchNo} (Π_{city, branchNo}(Branch))

and using the latter rule:

Π_{position, city}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch) =
Π_{position, city}((Π_{position, branchNo}(Staff)) ⋈_{Staff.branchNo=Branch.branchNo} (Π_{city, branchNo}(Branch)))



Commutativity of Union and Intersection (but not set difference).

R ∪ S = S ∪ R

R ∩ S = S ∩ R



Commutativity of Selection and set operations (Union, Intersection, and Set difference).

σ_{p}(R ∪ S) = σ_{p}(R) ∪ σ_{p}(S)

σ_{p}(R ∩ S) = σ_{p}(R) ∩ σ_{p}(S)

σ_{p}(R - S) = σ_{p}(R) - σ_{p}(S)



Commutativity of Projection and Union.

Π_{L}(R ∪ S) = Π_{L}(S) ∪ Π_{L}(R)

Associativity of Union and Intersection (but not Set difference).

(R ∪ S) ∪ T = S ∪ (R ∪ T)

(R ∩ S) ∩ T = S ∩ (R ∩ T)



Associativity of Theta join (and Cartesian product).

Cartesian product and Natural join are always associative:

(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)

(R × S) × T = R × (S × T)

If join condition q involves attributes only from S and T, then Theta join is associative:

(R ⋈_{p} S) ⋈_{q ∧ r} T = R ⋈_{p ∧ r} (S ⋈_{q} T)



For example:

(Staff ⋈_{Staff.staffNo=PropertyForRent.staffNo} PropertyForRent) ⋈_{ownerNo=Owner.ownerNo ∧ Staff.lName=Owner.lName} Owner =
Staff ⋈_{Staff.staffNo=PropertyForRent.staffNo ∧ Staff.lName=lName} (PropertyForRent ⋈_{ownerNo} Owner)


Example 21.3 Use of Transformation Rules

For prospective renters of flats, find properties that match requirements and owned by CO93.

SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.prefType = 'Flat' AND
      c.clientNo = v.clientNo AND
      v.propertyNo = p.propertyNo AND
      c.maxRent >= p.rent AND
      c.prefType = p.type AND
      p.ownerNo = 'CO93';





Heuristical Processing Strategies

Perform Selection operations as early as possible.

– Keep predicates on same relation together.

Combine Cartesian product with subsequent Selection whose predicate represents join condition into a Join operation.

Use associativity of binary operations to rearrange leaf nodes so leaf nodes with most restrictive Selection operations executed first.


Heuristical Processing Strategies

Perform Projection as early as possible.

– Keep projection attributes on same relation together.

Compute common expressions once.

– If common expression appears more than once, and result not too large, store result and reuse it when required.

– Useful when querying views, as same expression is used to construct view each time.


Cost Estimation for RA Operations

Many different ways of implementing RA operations.

Aim of QO is to choose most efficient one.

Use formulae that estimate costs for a number of options, and select one with lowest cost.

Consider only cost of disk access, which is usually dominant cost in QP.

Many estimates are based on cardinality of the relation, so need to be able to estimate this.


Database Statistics

Success of estimation depends on amount and currency of statistical information DBMS holds.

Keeping statistics current can be problematic.

If statistics updated every time tuple is changed, this would impact performance.

DBMS could update statistics on a periodic basis, for example nightly, or whenever the system is idle.


Typical Statistics for Relation R

nTuples(R) - number of tuples in R.

bFactor(R) - blocking factor of R.

nBlocks(R) - number of blocks required to store R:

nBlocks(R) = [nTuples(R)/bFactor(R)]


Typical Statistics for Attribute A of Relation R

nDistinctA(R) - number of distinct values that appear for attribute A in R.

minA(R), maxA(R) - minimum and maximum possible values for attribute A in R.

SCA(R) - selection cardinality of attribute A in R. Average number of tuples that satisfy an equality condition on attribute A.


Statistics for Multilevel Index I on Attribute A

nLevelsA(I) - number of levels in I.

nLfBlocksA(I) - number of leaf blocks in I.



Selection Operation

Predicate may be simple or composite.

Number of different implementations, depending on file structure, and whether attribute(s) involved are indexed/hashed.

Main strategies are:

– Linear Search (Unordered file, no index).
– Binary Search (Ordered file, no index).
– Equality on hash key.
– Equality condition on primary key.


Selection Operation

– Inequality condition on primary key.

– Equality condition on clustering (secondary) index.

– Equality condition on a non-clustering (secondary) index.

– Inequality condition on a secondary B+-tree index.


Estimating Cardinality of Selection

Assume attribute values are uniformly distributed within their domain and attributes are independent.

nTuples(S) = SCA(R)

For any attribute B ≠ A of S, nDistinctB(S) =

nTuples(S) if nTuples(S) < nDistinctB(R)/2

nDistinctB(R) if nTuples(S) > 2*nDistinctB(R)

[(nTuples(S) + nDistinctB(R))/3] otherwise
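
The three-case estimate of nDistinctB(S) translates directly into a small function. The sketch assumes the square brackets denote rounding up to the next integer; the function and parameter names are illustrative.

```python
import math


def n_distinct_after_selection(n_tuples_s, n_distinct_b_r):
    """Estimate nDistinctB(S) for the result S of a selection on R.

    n_tuples_s     -- nTuples(S), estimated size of the selection result
    n_distinct_b_r -- nDistinctB(R), distinct values of B in R
    """
    if n_tuples_s < n_distinct_b_r / 2:
        # Few tuples survive: assume each carries a different B value.
        return n_tuples_s
    if n_tuples_s > 2 * n_distinct_b_r:
        # Many tuples survive: assume every B value still appears.
        return n_distinct_b_r
    # In between: the "otherwise" case, rounded up.
    return math.ceil((n_tuples_s + n_distinct_b_r) / 3)


print(n_distinct_after_selection(10, 100))
print(n_distinct_after_selection(500, 100))
print(n_distinct_after_selection(100, 100))
```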


Linear Search (Unordered File, No Index)

May need to scan each tuple in each block to check whether it satisfies predicate.

For equality condition on key attribute, cost estimate is:

[nBlocks(R)/2]

For any other condition, entire file may need to be searched, so more general cost estimate is:

nBlocks(R)


Binary Search (Ordered File, No Index)

If predicate is of form A = x, and file is ordered on key attribute A, cost estimate:

[log2(nBlocks(R))]

Generally, cost estimate is:

[log2(nBlocks(R))] + [SCA(R)/bFactor(R)] - 1

First term represents cost of finding first tuple using binary search.

Expect there to be SCA(R) tuples satisfying predicate.


Equality of Hash Key

If attribute A is hash key, apply hashing algorithm to calculate target address for tuple.

If there is no overflow, expected cost is 1.

If there is overflow, additional accesses may be necessary.


Equality Condition on Primary Key

Can use primary index to retrieve single record satisfying condition.

Need to read one more block than number of index accesses, equivalent to number of levels in index, so estimated cost is:

nLevelsA(I) + 1


Inequality Condition on Primary Key

Can first use index to locate record satisfying predicate (A = x).

Provided index is sorted, records can be found by accessing all records before/after this one.

Assuming uniform distribution, would expect half the records to satisfy inequality, so estimated cost is:

nLevelsA(I) + [nBlocks(R)/2]


Equality Condition on Clustering Index

Can use index to retrieve required records.

Estimated cost is:

nLevelsA(I) + [SCA(R)/bFactor(R)]

Second term is estimate of number of blocks that will be required to store number of tuples that satisfy equality condition, represented as SCA(R).


Equality Condition on Non-Clustering Index

Can use index to retrieve required records.

Have to assume that tuples are on different blocks (index is not clustered this time), so estimated cost becomes:

nLevelsA(I) + [SCA(R)]


Inequality Condition on a Secondary B+-Tree Index

From leaf nodes of tree, can scan keys from smallest value up to x (< or <=) or from x up to maximum value (> or >=).

Assuming uniform distribution, would expect half the leaf node blocks to be accessed and, via index, half the file records to be accessed.

Estimated cost is:

nLevelsA(I) + [nLfBlocksA(I)/2 + nTuples(R)/2]
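
The selection cost formulas can be gathered into a set of Python functions for side-by-side comparison. This is a sketch: the function names are illustrative, square brackets are read as rounding up, and the sample statistics (100 blocks, 30 tuples per block, a two-level index, 50 leaf blocks, selection cardinality 6) are hypothetical values chosen for the example.

```python
import math


def linear_search_cost(n_blocks, equality_on_key=False):
    # On average half the file is read to find a unique key value.
    return math.ceil(n_blocks / 2) if equality_on_key else n_blocks


def binary_search_cost(n_blocks, sc=1, b_factor=1):
    # Find the first matching tuple, then read the blocks holding the
    # remaining sc matching tuples.
    return math.ceil(math.log2(n_blocks)) + math.ceil(sc / b_factor) - 1


def primary_key_equality_cost(n_levels):
    return n_levels + 1


def primary_key_inequality_cost(n_levels, n_blocks):
    return n_levels + math.ceil(n_blocks / 2)


def clustering_index_equality_cost(n_levels, sc, b_factor):
    return n_levels + math.ceil(sc / b_factor)


def non_clustering_index_equality_cost(n_levels, sc):
    # Each matching tuple is assumed to sit on a different block.
    return n_levels + math.ceil(sc)


def secondary_btree_inequality_cost(n_levels, n_lf_blocks, n_tuples):
    return n_levels + math.ceil(n_lf_blocks / 2 + n_tuples / 2)


# Hypothetical statistics for a relation R and an index I.
n_tuples, b_factor, n_levels, n_lf_blocks, sc = 3000, 30, 2, 50, 6
n_blocks = math.ceil(n_tuples / b_factor)

print(linear_search_cost(n_blocks))
print(binary_search_cost(n_blocks, sc, b_factor))
print(clustering_index_equality_cost(n_levels, sc, b_factor))
print(non_clustering_index_equality_cost(n_levels, sc))
print(secondary_btree_inequality_cost(n_levels, n_lf_blocks, n_tuples))
```

With these figures the secondary B+-tree inequality scan (1527 accesses) costs far more than a linear search (100 accesses), a reminder that an index does not always pay off when many tuples qualify.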


Composite Predicates - Conjunction without Disjunction

May consider following approaches:

- If one attribute has index or is ordered, can use one of above selection strategies. Can then check each retrieved record.

- For equality on two or more attributes, with composite index (or hash key) on combined attributes, can search index directly.

- With secondary indexes on one or more attributes (involved only in equality conditions in predicate), could use record pointers if exist.


Composite Predicates - Selections with Disjunction

If one term contains an ∨ (OR), and term requires linear search, entire selection requires linear search.

Only if index or sort order exists on every term can selection be optimized by retrieving records that satisfy each condition and applying union operator.

Again, record pointers can be used if they exist.


Join Operation

Main strategies for implementing join:

– Block Nested Loop Join.

– Indexed Nested Loop Join.

– Sort-Merge Join.

– Hash Join.


Estimating Cardinality of Join

Cardinality of Cartesian product is:

nTuples(R) * nTuples(S)

More difficult to estimate cardinality of any join as depends on distribution of values.

Worst case, cannot be any greater than this value.


Estimating Cardinality of Join

If assume uniform distribution, can estimate for Equijoins with a predicate (R.A = S.B) as follows:

– If A is key of R: nTuples(T) ≤ nTuples(S)

– If B is key of S: nTuples(T) ≤ nTuples(R)

Otherwise, could estimate cardinality of join as:

nTuples(T) = SCA(R)*nTuples(S) or

nTuples(T) = SCB(S)*nTuples(R)


Block Nested Loop Join

Simplest join algorithm is nested loop that joins two relations together a tuple at a time.

Outer loop iterates over each tuple in R, and inner loop iterates over each tuple in S.

As basic unit of reading/writing is a disk block, better to have two extra loops that process blocks.

Estimated cost of this approach is:

nBlocks(R) + (nBlocks(R) * nBlocks(S))


Block Nested Loop Join

Could read as many blocks as possible of smaller relation, R say, into database buffer, saving one block for inner relation and one for result.

New cost estimate becomes:

nBlocks(R) + [nBlocks(S)*(nBlocks(R)/(nBuffer-2))]

If can read all blocks of R into the buffer, this reduces to:

nBlocks(R) + nBlocks(S)
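
The effect of buffer size on the block nested loop estimate can be seen with a short function. This is a sketch: the function name and the example block counts are hypothetical, and the square brackets are read as rounding up.

```python
import math


def block_nested_loop_cost(n_blocks_r, n_blocks_s, n_buffer=None):
    """Estimated disk accesses, with R as the (smaller) outer relation.

    n_buffer is the number of buffer blocks available; None models the
    basic algorithm that holds one block of each relation at a time.
    """
    if n_buffer is None:
        return n_blocks_r + n_blocks_r * n_blocks_s
    if n_buffer <= 2:
        raise ValueError("need one block for S, one for the result, "
                         "and at least one for R")
    if n_buffer - 2 >= n_blocks_r:
        # All of R fits in the buffer, so S is scanned only once.
        return n_blocks_r + n_blocks_s
    return n_blocks_r + math.ceil(n_blocks_s * (n_blocks_r / (n_buffer - 2)))


# Hypothetical sizes: R occupies 100 blocks, S occupies 2000 blocks.
print(block_nested_loop_cost(100, 2000))
print(block_nested_loop_cost(100, 2000, n_buffer=52))
print(block_nested_loop_cost(100, 2000, n_buffer=102))
```

Under these figures the estimate drops from 200 100 accesses with no extra buffering to 4 100 with 52 buffer blocks, and to 2 100 once all of R fits in memory.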


Indexed Nested Loop Join

If have index (or hash function) on join attributes of inner relation, can use index lookup.

For each tuple in R, use index to retrieve matching tuples of S.

Cost of scanning R is nBlocks(R), as before.

Cost of retrieving matching tuples in S depends on type of index and number of matching tuples.

If join attribute A in S is PK, cost estimate is:

nBlocks(R) + nTuples(R)*(nLevelsA(I) + 1)


Sort-Merge Join

For Equijoins, most efficient join is when both relations are sorted on join attributes.

Can look for qualifying tuples merging relations.

May need to sort relations first.

Now tuples with same join value are in order.

If assume join is many-to-many (*:*) and each set of tuples with same join value can be held in database buffer at same time, then each block of each relation need only be read once.


Sort-Merge Join

Cost estimate for the sort-merge join is:

nBlocks(R) + nBlocks(S)

If a relation has to be sorted, R say, add:

nBlocks(R)*[log2(nBlocks(R))]


Hash Join

For Natural or Equijoin, hash join may be used.

Idea is to partition relations according to some hash function that provides uniformity and randomness.

Each equivalent partition should hold same value for join attributes, although it may hold more than one value.

Cost estimate of hash join as:

3(nBlocks(R) + nBlocks(S))
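
The estimates for indexed nested loop, sort-merge, and hash join can be compared on one set of statistics. The sketch below uses hypothetical sizes (R: 100 blocks and 3000 tuples; S: 2000 blocks; a two-level primary key index on S) and illustrative function names; square brackets are read as rounding up.

```python
import math


def indexed_nested_loop_cost_pk(n_blocks_r, n_tuples_r, n_levels):
    # Scan R once; for each tuple of R do a primary key lookup into S.
    return n_blocks_r + n_tuples_r * (n_levels + 1)


def sort_cost(n_blocks):
    return n_blocks * math.ceil(math.log2(n_blocks))


def sort_merge_cost(n_blocks_r, n_blocks_s, sort_r=False, sort_s=False):
    # Merge phase reads each block once; add sorting cost where needed.
    cost = n_blocks_r + n_blocks_s
    if sort_r:
        cost += sort_cost(n_blocks_r)
    if sort_s:
        cost += sort_cost(n_blocks_s)
    return cost


def hash_join_cost(n_blocks_r, n_blocks_s):
    return 3 * (n_blocks_r + n_blocks_s)


estimates = {
    "indexed nested loop": indexed_nested_loop_cost_pk(100, 3000, 2),
    "sort-merge (already sorted)": sort_merge_cost(100, 2000),
    "sort-merge (sort both)": sort_merge_cost(100, 2000, True, True),
    "hash join": hash_join_cost(100, 2000),
}
cheapest = min(estimates, key=estimates.get)
print(estimates)
print(cheapest)
```

A cost-based optimizer makes this kind of comparison for each candidate operator and keeps the cheapest; which operator wins depends entirely on the statistics supplied.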


Projection Operation

To implement projection need to:

– remove attributes that are not required;
– eliminate any duplicate tuples produced from previous step. Only required if projection attributes do not include a key.

Two main approaches to eliminating duplicates:

– sorting;

– hashing.


Estimating Cardinality of Projection

When projection contains key, cardinality is:

nTuples(S) = nTuples(R)

If projection consists of a single non-key attribute, estimate is:

nTuples(S) = SCA(R)

Otherwise, could estimate cardinality as:

nTuples(S) ≤ min(nTuples(R), Π_{i=1}^{m} nDistinct_{a_i}(R))


Duplicate Elimination using Sorting

Sort tuples of reduced relation using all remaining attributes as sort key.

Duplicates will now be adjacent and can be removed easily.

Estimated cost of sorting is:

nBlocks(R)*[log2(nBlocks(R))].

Combined cost is:

nBlocks(R) + nBlocks(R)*[log2(nBlocks(R))]


Duplicate Elimination using Hashing

Two phases: partitioning and duplicate elimination.

In partitioning phase, for each tuple in R, remove unwanted attributes and apply hash function to combination of remaining attributes, and write reduced tuple to hashed value.

Two tuples that belong to different partitions are guaranteed not to be duplicates.

Estimated cost is: nBlocks(R) + nB


Set Operations

Can be implemented by sorting both relations on same attributes, and scanning through each of sorted relations once to obtain desired result.

Could use sort-merge join as basis.

Estimated cost in all cases is:

nBlocks(R) + nBlocks(S) + nBlocks(R)*[log2(nBlocks(R))] + nBlocks(S)*[log2(nBlocks(S))]

Could also use hashing algorithm.


Estimating Cardinality of Set Operations

As duplicates are eliminated when performing Union, difficult to estimate cardinality, but can give an upper and lower bound as:

max(nTuples(R), nTuples(S)) ≤ nTuples(T) ≤ nTuples(R) + nTuples(S)

For Set Difference, can also give upper and lower bound:

0 ≤ nTuples(T) ≤ nTuples(R)


Aggregate Operations

SELECT AVG(salary)
FROM Staff;

To implement query, could scan entire Staff relation and maintain running count of number of tuples read and sum of all salaries.

Easy to compute average from these two running counts.
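
A single-pass scan that keeps the two running values might look as follows. The function name and the sample salaries are illustrative, and returning None for an empty relation mirrors SQL's NULL result for AVG over no rows.

```python
def running_average(salaries):
    """Average computed in one scan from a running count and sum."""
    count = 0
    total = 0
    for salary in salaries:
        count += 1
        total += salary
    # An empty input has no average; None stands in for SQL NULL.
    return total / count if count else None


print(running_average([12000, 18000, 30000]))
```

Only two values are held in memory regardless of the size of Staff, so the cost is that of one scan of the relation.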


Aggregate Operations

SELECT AVG(salary)
FROM Staff
GROUP BY branchNo;

For grouping queries, can use sorting or hashing algorithms similar to duplicate elimination.

Can estimate cardinality of result using estimates derived earlier for selection.


Enumeration of Alternative Strategies

Fundamental to efficiency of QO is the search space of possible execution strategies and the enumeration algorithm used to search this space.

Query with 2 joins gives 12 join orderings:

R ⋈ (S ⋈ T)   R ⋈ (T ⋈ S)   (S ⋈ T) ⋈ R   (T ⋈ S) ⋈ R

S ⋈ (R ⋈ T)   S ⋈ (T ⋈ R)   (R ⋈ T) ⋈ S   (T ⋈ R) ⋈ S

T ⋈ (R ⋈ S)   T ⋈ (S ⋈ R)   (R ⋈ S) ⋈ T   (S ⋈ R) ⋈ T

With n relations, (2(n – 1))!/(n – 1)! orderings.

If n = 4 this is 120; if n = 10 this is > 17.6 billion.

Compounded by different selection/join methods.
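
The growth of the formula (2(n – 1))!/(n – 1)! is easy to tabulate. The function name below is illustrative.

```python
from math import factorial


def join_orderings(n):
    """Number of join orderings for n relations: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (3, 4, 10):
    print(n, join_orderings(n))
```

For n = 3 the function returns 12 and for n = 4 it returns 120; for n = 10 it returns 17 643 225 600, which shows why exhaustive enumeration quickly becomes impractical.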

Pipelining

Materialization - output of one operation is stored in temporary relation for processing by next.

Could also pipeline results of one operation to another without creating temporary relation.

Known as pipelining or on-the-fly processing.

Pipelining can save on cost of creating temporary relations and reading results back in again.

Generally, pipeline is implemented as separate process or thread.
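
The contrast between materialization and pipelining can be illustrated in miniature with Python generators. This is an in-process analogy, not the separate process or thread arrangement described above, and the sample tuples and function names are made up.

```python
# Hypothetical Staff tuples, invented for this illustration.
staff = [
    {"staffNo": "SL21", "lName": "White", "salary": 30000},
    {"staffNo": "SG37", "lName": "Beech", "salary": 12000},
    {"staffNo": "SG14", "lName": "Ford", "salary": 18000},
]


def earns_over_15000(t):
    return t["salary"] > 15000


# Materialization: the selection result is stored in full as a temporary
# relation before the projection starts.
temporary = [t for t in staff if earns_over_15000(t)]
materialized = [{"lName": t["lName"]} for t in temporary]


# Pipelining: each operator hands tuples on one at a time, so no
# temporary relation is ever built.
def pipelined_select(predicate, relation):
    for t in relation:
        if predicate(t):
            yield t


def pipelined_project(attributes, relation):
    for t in relation:
        yield {a: t[a] for a in attributes}


pipelined = list(
    pipelined_project(["lName"], pipelined_select(earns_over_15000, staff))
)
print(pipelined)
```

Both plans produce the same result; the pipelined version simply never holds the intermediate selection result in full.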


Types of Trees

[Figure: types of trees]


Pipelining

With linear trees, relation on one side of eachoperator is always a base relation.operator is always a base relation.However, as need to examine entire inner relationfor each tuple of outer relation inner relationsfor each tuple of outer relation, inner relationsmust always be materialized.Thi k l ft d t li iThis makes left-deep trees appealing as innerrelations are always base relations.Reduces search space for optimum strategy, andallows QO to use dynamic processing.Not all execution strategies are considered.



Physical Operators & Strategies

Term physical operator refers to specific algorithm that implements a logical operation, such as selection or join.

For example, can use sort-merge join to implement the join operation.

Replacing logical operations in a R.A.T. with physical operators produces an execution strategy (or query evaluation plan or access plan).





Reducing the Search Space

Restriction 1: Unary operations processed on-the-fly: selections processed as relations are accessed for first time; projections processed as results of other operations are generated.

Restriction 2: Cartesian products are never formed unless query itself specifies one.

Restriction 3: Inner operand of each join is a base relation, never an intermediate result. This uses fact that with left-deep trees inner operand is a base relation and so already materialized.

Restriction 3 excludes many alternative strategies but significantly reduces number to be considered.



Dynamic Programming

Enumeration of left-deep trees using dynamic programming first proposed for System R QO.

Algorithm based on assumption that the cost model satisfies principle of optimality.

Thus, to obtain optimal strategy for query with n joins, only need to consider optimal strategies for subexpressions with (n – 1) joins and extend those strategies with an additional join. Remaining suboptimal strategies can be discarded.



Dynamic Programming

To ensure some potentially useful strategies are not discarded, algorithm retains strategies with interesting orders: an intermediate result has an interesting order if it is sorted by a final ORDER BY attribute, GROUP BY attribute, or any attributes that participate in subsequent joins.



Dynamic Programming

SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.maxRent < 500 AND
c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo;

Attributes c.clientNo, v.clientNo, v.propertyNo, and p.propertyNo are interesting.

If any intermediate result is sorted on any of these attributes, then corresponding partial strategy must be included in search.



Dynamic Programming

Algorithm proceeds from the bottom up and constructs all alternative join trees that satisfy the restrictions above, as follows:

Pass 1: Enumerate the strategies for each base relation using a linear search and all available indexes on the relation. These partial strategies are partitioned into equivalence classes based on any interesting orders. An additional equivalence class is created for the partial strategies with no interesting order.



Dynamic Programming

For each equivalence class, strategy with lowest cost is retained for consideration in next pass. Do not retain equivalence class with no interesting order if its lowest cost strategy is not lower than all other strategies.

For a given relation R, any selections involving only attributes of R are processed on-the-fly. Similarly, any attributes of R that are not part of the SELECT clause and do not contribute to any subsequent join can be projected out at this stage (restriction 1 above).



Dynamic Programming

Pass 2: Generate all 2-relation strategies by considering each strategy retained after Pass 1 as outer relation, discarding any Cartesian products generated (restriction 2 above). Again, any on-the-fly processing is performed and lowest cost strategy in each equivalence class is retained.

Pass n: Generate all n-relation strategies by considering each strategy retained after Pass (n – 1) as outer relation, discarding any Cartesian products generated. After pruning, now have lowest-cost overall strategy for processing the query.
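The pass structure can be sketched as below. This is a heavily simplified illustration and not the System R algorithm itself: it ignores interesting orders and the choice of access paths and join methods, keeping only one strategy per set of relations. The cost model (sum of intermediate result sizes), the function name, and all cardinalities and selectivities are assumptions made for the example.

```python
from fractions import Fraction

def best_left_deep_plan(cards, selectivity):
    """Return (cost, cardinality, join order) of the cheapest left-deep plan.

    cards: relation name -> estimated cardinality after on-the-fly selections.
    selectivity: frozenset({a, b}) -> join selectivity, one entry per join predicate.
    Assumed cost model: sum of the sizes of all intermediate results.
    Raises KeyError if the relations cannot be joined without a Cartesian product.
    """
    # Pass 1: one single-relation strategy per base relation.
    best = {frozenset([r]): (0, c, (r,)) for r, c in cards.items()}
    # Passes 2..n: extend each retained strategy (outer) with one base relation (inner).
    for _ in range(len(cards) - 1):
        nxt = {}
        for subset, (cost, card, order) in best.items():
            for r in cards:
                if r in subset:
                    continue
                # Join predicates linking r to the relations already joined.
                preds = [selectivity[frozenset([r, s])]
                         for s in subset if frozenset([r, s]) in selectivity]
                if not preds:
                    continue  # restriction 2: would be a Cartesian product
                new_card = card * cards[r]
                for sel in preds:
                    new_card *= sel
                new_cost = cost + new_card
                key = subset | {r}
                # Prune: keep only the cheapest strategy for each set of relations.
                if key not in nxt or new_cost < nxt[key][0]:
                    nxt[key] = (new_cost, new_card, order + (r,))
        best = nxt
    return best[frozenset(cards)]

# Hypothetical statistics for Client (c), Viewing (v), PropertyForRent (p).
cards = {"c": 20, "v": 1000, "p": 500}
selectivity = {
    frozenset(["c", "v"]): Fraction(1, 100),
    frozenset(["v", "p"]): Fraction(1, 500),
}
cost, card, order = best_left_deep_plan(cards, selectivity)
print(cost, card, order)
```

With these made-up numbers, joining Client with Viewing first keeps the intermediate result small, so PropertyForRent is joined last.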



Dynamic Programming

Although algorithm is still exponential, there are query forms for which it only generates O(n³) strategies, so for n = 10 the number is 1,000, which is significantly better than the roughly 17.6 billion different join orders given by (2(n – 1))!/(n – 1)! for n = 10.



Semantic Query Optimization

Based on constraints specified on the database schema to reduce the search space.

For example, a constraint states that staff cannot supervise more than 100 properties, so any query searching for staff who supervise more than 100 properties will produce zero rows. Now consider:

CREATE ASSERTION ManagerSalary
CHECK (salary > 20000 AND position = 'Manager')

SELECT s.staffNo, fName, lName, propertyNo
FROM Staff s, PropertyForRent p
WHERE s.staffNo = p.staffNo AND
position = 'Manager';


Semantic Query Optimization

Can rewrite this query as:

SELECT s.staffNo, fName, lName, propertyNo
FROM Staff s, PropertyForRent p
WHERE s.staffNo = p.staffNo AND
salary > 20000 AND position = 'Manager';

Additional predicate may be very useful if only index for Staff is a B+-tree on the salary attribute.

However, additional predicate would complicate query if no such index existed.



Query Optimization in Oracle

Oracle supports two approaches to query optimization: rule-based and cost-based.

Rule-based

15 rules, ranked in order of efficiency. Particular access path for a table only chosen if statement contains a predicate or other construct that makes that access path available.

Score assigned to each execution strategy using these rankings and strategy with best (lowest) score selected.



QO in Oracle – Rule-Based

When 2 strategies have same score, tie-break resolved by making decision based on order in which tables occur in the SQL statement.



QO in Oracle – Rule-based: Example

SELECT propertyNo
FROM PropertyForRent
WHERE rooms > 7 AND city = 'London'

Single-column access path using index on city from WHERE condition (city = 'London'). Rank 9.

Unbounded range scan using index on rooms from WHERE condition (rooms > 7). Rank 11.

Full table scan - rank 15.

Although there is index on propertyNo, column does not appear in WHERE clause and so is not considered by optimizer.

Based on these paths, rule-based optimizer will choose to use index based on city column.
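The selection step of this example amounts to taking the available access path with the lowest rank. The sketch below only mirrors that comparison using the three ranks quoted above. It is not Oracle's implementation, and the names are hypothetical.

```python
# Access paths available for the example query, with the ranks quoted above.
access_paths = [
    ("single-column index on city", 9),
    ("unbounded range scan using index on rooms", 11),
    ("full table scan", 15),
]

def choose_access_path(paths):
    """Pick the path with the best (lowest) rank; in this sketch ties keep the first listed."""
    return min(paths, key=lambda path: path[1])

chosen = choose_access_path(access_paths)
print(chosen)
```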


QO in Oracle – Cost-Based

To improve QO, Oracle introduced cost-based optimizer in Oracle 7, which selects strategy that requires minimal resource use necessary to process all rows accessed by query (avoiding above tie-break anomaly).

User can select whether minimal resource usage is based on throughput or based on response time, by setting the OPTIMIZER_MODE initialization parameter.

Cost-based optimizer also takes into consideration hints that the user may provide.



QO in Oracle – Statistics

Cost-based optimizer depends on statistics for all tables, clusters, and indexes accessed by query.

Users' responsibility to generate these statistics and keep them current.

Package DBMS_STATS can be used to generate and manage statistics.

Whenever possible, Oracle uses a parallel method to gather statistics, although index statistics are collected serially.

EXECUTE DBMS_STATS.GATHER_SCHEMA_STATS('Manager');



QO in Oracle – Histograms

Previously made assumption that data values within columns of a table are uniformly distributed.

Histogram of values and their relative frequencies gives optimizer improved selectivity estimates in presence of non-uniform distribution.




(a) uniform distribution of rooms; (b) actual non-uniform distribution.

(a) can be stored compactly as low value (1) and high value (10), and as total count of all frequencies (in this case, 100).


Histogram is data structure that can improve estimates of number of tuples in result.

Two types of histogram:

– width-balanced histogram, which divides data into a fixed number of equal-width ranges (called buckets) each containing count of number of values falling within that bucket;

– height-balanced histogram, which places approximately same number of values in each bucket so that end points of each bucket are determined by how many values are in that bucket.
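The two histogram types can be sketched as follows. This is an illustrative construction only, not Oracle's internal representation; the function names and the frequency distribution for rooms are made up for the example (100 properties, rooms between 1 and 10).

```python
def width_balanced(values, low, high, buckets):
    """Equal-width ranges over [low, high]; returns ((start, end), count) per bucket.

    Assumes integer values and that the range divides evenly into the buckets.
    """
    width = (high - low + 1) // buckets
    counts = [0] * buckets
    for v in values:
        counts[min((v - low) // width, buckets - 1)] += 1
    return [((low + i * width, low + (i + 1) * width - 1), counts[i])
            for i in range(buckets)]

def height_balanced(values, buckets):
    """End point of each bucket when every bucket holds about the same number of values.

    Assumes there are at least as many values as buckets.
    """
    ordered = sorted(values)
    return [ordered[(i + 1) * len(ordered) // buckets - 1] for i in range(buckets)]

# Hypothetical non-uniform distribution: number of rooms -> number of properties.
frequencies = {1: 5, 2: 15, 3: 20, 4: 25, 5: 15, 6: 10, 7: 5, 8: 3, 9: 1, 10: 1}
rooms = [r for r, count in frequencies.items() for _ in range(count)]

wb = width_balanced(rooms, 1, 10, 5)
hb = height_balanced(rooms, 5)

# Estimated number of tuples with rooms >= 7 under each assumption.
uniform_estimate = len(rooms) * (10 - 7 + 1) / (10 - 1 + 1)
histogram_estimate = sum(count for (start, _end), count in wb if start >= 7)
print(wb, hb, uniform_estimate, histogram_estimate)
```

For this made-up data the uniform assumption overestimates the result size of rooms >= 7 by a factor of four, while the width-balanced histogram gives the exact count because the predicate falls on a bucket boundary.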




(a) width-balanced for rooms with 5 buckets. Each bucket of equal width with 2 values (1-2, 3-4, etc.)

(b) height-balanced – height of each column is 20 (100/5).


QO in Oracle – Viewing Execution Plan


