L21: Joins 2 - Northeastern University

Post on 23-Dec-2021


208

L21: Joins 2

CS3200 Database design (sp18 s2)
https://course.ccs.neu.edu/cs3200sp18s2/
4/2/2018

209

Announcements!

• Please pick up your exam if you have not yet
• Changed class calendar
• Outline today
  - Joins
  - Relational algebra
• Next class
  - Query Optimizations

210

211

212

Group Projects: what is your experience?

Source: Found on the Web as variation of http://www.inquisitr.com/160288/graph-what-i-learned-from-group-projects/

213

214

BNLJ: Some quick facts.

• We use M buffer pages as:
  - 1 page for S
  - 1 page for output
  - M-2 pages for R
• If P(R) <= M-2
  - then we do one pass over S, and we run in time P(R) + P(S) + OUT.
  - Note: This is optimal for our cost model!
  - Thus, if min{P(R), P(S)} <= M-2 we should always use BNLJ
• We use this at the end of hash join. We define the end condition: one of the buckets is smaller than M-2!

Cost: $P(R) + \frac{P(R)}{M-2} P(S) + OUT$
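The cost formula above can be sketched as a small function. This is a minimal illustration, not part of the original slides: the name `bnlj_cost` is hypothetical, OUT is left out, and the number of passes over S is rounded up to a whole number (an assumption; the slide formula uses the unrounded fraction).

```python
import math


def bnlj_cost(p_r: int, p_s: int, m: int) -> int:
    """Estimated I/O cost of BNLJ, excluding OUT.

    R is read in chunks of M-2 pages; S is scanned once per chunk.
    Rounding the chunk count up is an assumption of this sketch.
    """
    if m < 3:
        raise ValueError("BNLJ needs at least 3 buffer pages")
    chunks = math.ceil(p_r / (m - 2))  # number of passes over S
    return p_r + chunks * p_s


# One-pass case: P(R) <= M-2, so the cost is P(R) + P(S).
print(bnlj_cost(98, 500, 100))    # 98 + 1 * 500 = 598
# Larger R: ceil(500 / 98) = 6 passes over S.
print(bnlj_cost(500, 1000, 100))  # 500 + 6 * 1000 = 6500
```

With the smaller relation in the M-2 pages, the quadratic term disappears as soon as that relation fits in memory.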

215

Smarter than Cross-Products: From Quadratic to Nearly Linear

• All joins that compute the full cross-product have some quadratic term
  - For example we saw:
    NLJ: $P(R) + T(R) P(S) + OUT$
    BNLJ: $P(R) + \frac{P(R)}{M-2} P(S) + OUT$
• Next we’ll see some (nearly) linear joins:
  - ~O(P(R) + P(S) + OUT), where again OUT could be quadratic but is usually better

We get this gain by taking advantage of structure - moving to equality constraints (“equijoin”) only!

216

Index Nested Loop Join (INLJ)

Compute R ⋈ S on A:
  Given index idx on S.A:
  for r in R:
    for s in idx(r[A]):
      yield r, s

Cost: P(R) + T(R)*L + OUT

where L is the IO cost to access all the distinct values in the index; assuming these fit on one page, L ~ 3 is a good estimate.

→ We can use an index (e.g. B+ Tree) to avoid doing the full cross-product!
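The INLJ pseudocode can be made runnable as follows. This is only a sketch under stated assumptions: an in-memory dictionary stands in for the B+ Tree index, the join attribute A is the first field of each tuple, and the helper names `build_index` and `index_nested_loop_join` are illustrative rather than taken from the lecture.

```python
from collections import defaultdict


def build_index(s_tuples, key=0):
    """Build a lookup from join-key value to the matching S tuples.

    A dict stands in for the on-disk index (e.g. a B+ Tree).
    """
    idx = defaultdict(list)
    for s in s_tuples:
        idx[s[key]].append(s)
    return idx


def index_nested_loop_join(r_tuples, idx, key=0):
    """For each r in R, probe the index on S.A instead of scanning S."""
    for r in r_tuples:
        for s in idx.get(r[key], []):
            yield r, s


R = [(5, "b"), (3, "j"), (0, "a")]
S = [(7, "f"), (0, "j"), (3, "g")]
result = list(index_nested_loop_join(R, build_index(S)))
print(result)  # [((3, 'j'), (3, 'g')), ((0, 'a'), (0, 'j'))]
```

Each probe touches only the S tuples that match, which is where the T(R)*L term in the cost comes from.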

217

Better Join Algorithms

• 2. Sort-Merge Join (SMJ)
• 3. Hash Join (HJ)
• Comparison: SMJ vs. HJ

218

2. Sort-Merge Join (SMJ)

219

What we will learn next

• Sort-Merge Join
• “Backup” & Total Cost
• Optimizations

220

Sort Merge Join (SMJ): Basic Procedure

• To compute R ⋈ S on A:
  - Sort R, S on A using external merge sort
  - Scan sorted files and “merge”
  - [May need to “backup” - see next subsection]

Note that if R, S are already sorted on A, SMJ will be awesome!

Note that we are only considering equality join conditions here

221

SMJ Example: R ⋈ S on A with 3 page buffer

• For simplicity: Let each page be one tuple, and let the first value be A

Disk: R = (5,b), (3,j), (0,a); S = (7,f), (0,j), (3,g)
Main Memory: Buffer empty

We show the file HEAD, which is the next value to be read!

222

SMJ Example: R ⋈ S on A with 3 page buffer

• 1. Sort the relations R, S on the join key (first value)

Disk: R = (0,a), (3,j), (5,b); S = (0,j), (3,g), (7,f)
Main Memory: Buffer empty

223

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Scan and “merge” on join key!

Disk: R = (3,j), (5,b); S = (3,g), (7,f)
Main Memory: Buffer holds (0,a) and (0,j)
Output: none yet

224

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Scan and “merge” on join key!

Disk: R = (3,j), (5,b); S = (3,g), (7,f)
Main Memory: Buffer holds (0,a) and (0,j)
Output: (0,a,j)

225

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Scan and “merge” on join key!

Disk: R = (5,b); S = (7,f)
Main Memory: Buffer holds (3,j) and (3,g)
Output: (0,a,j), (3,j,g)

226

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Done!

Output: (0,a,j), (3,j,g)
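The sort-then-merge steps of this example can be traced in code. The sketch below is illustrative only: it sorts in memory rather than with external merge sort, assumes the join key is the first field, and assumes no duplicate join keys (as in this example). The function name `sort_merge_join_unique` is hypothetical.

```python
def sort_merge_join_unique(r_tuples, s_tuples):
    """Sort-merge join for inputs with no duplicate join keys.

    In-memory sorting stands in for external merge sort.
    """
    r_sorted = sorted(r_tuples, key=lambda t: t[0])
    s_sorted = sorted(s_tuples, key=lambda t: t[0])
    out = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        r, s = r_sorted[i], s_sorted[j]
        if r[0] < s[0]:
            i += 1  # advance the side with the smaller key
        elif r[0] > s[0]:
            j += 1
        else:
            out.append((r[0], r[1], s[1]))  # keys match: emit joined tuple
            i += 1
            j += 1
    return out


R = [(5, "b"), (3, "j"), (0, "a")]
S = [(7, "f"), (0, "j"), (3, "g")]
joined = sort_merge_join_unique(R, S)
print(joined)  # [(0, 'a', 'j'), (3, 'j', 'g')]
```

Each pointer only moves forward, so the merge reads every tuple at most once.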

227

What happens with duplicate join keys?

228

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Figure: sorted R begins (0,a), (0,b); sorted S is (0,j), (0,g), (7,f). The buffer holds (0,a) from R and (0,j) from S. Output: none yet.]

229

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Figure: (0,a) and (0,j) match on the join key. Output: (0,a,j).]

230

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Figure: the scan of S advances to (0,g), which also matches (0,a). Output: (0,a,j), (0,a,g).]

231

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Figure: the scan of R advances to (0,b), which must be joined with (0,j) again. Output so far: (0,a,j), (0,a,g).]

Have to “backup” in the scan of S and read tuple we’ve already read!

232

Backup

• At best, no backup → scan takes P(R)+P(S) reads
  - For ex: if no duplicate values in join attribute
• At worst (e.g. full backup each time), scan could take P(R)*P(S) reads!
  - For ex: if all duplicate values in join attribute, i.e. all tuples in R and S have the same value for the join attribute
  - Roughly: For each page of R, we’ll have to back up and read each page of S…
• Often not that bad however, plus we can:
  - Leave more data in buffer (for larger buffers)
  - Can “zig-zag” (see animation)
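The backup behavior can be sketched by remembering where a run of equal keys starts in S and rewinding to it for each matching R tuple. This is a simplified in-memory illustration, not the lecture's implementation: the name `sort_merge_join`, the `mark` variable, and the sample tuples are assumptions chosen to mirror the duplicate-key example.

```python
def sort_merge_join(r_tuples, s_tuples):
    """Sort-merge join that handles duplicate join keys by backing up in S."""
    r_sorted = sorted(r_tuples, key=lambda t: t[0])
    s_sorted = sorted(s_tuples, key=lambda t: t[0])
    out = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        if r_sorted[i][0] < s_sorted[j][0]:
            i += 1
        elif r_sorted[i][0] > s_sorted[j][0]:
            j += 1
        else:
            key = r_sorted[i][0]
            mark = j  # start of the run of matching S tuples
            while i < len(r_sorted) and r_sorted[i][0] == key:
                j = mark  # "back up": re-read S tuples already read
                while j < len(s_sorted) and s_sorted[j][0] == key:
                    out.append((key, r_sorted[i][1], s_sorted[j][1]))
                    j += 1
                i += 1
    return out


R = [(0, "a"), (0, "b")]
S = [(0, "j"), (0, "g"), (7, "f")]
print(sort_merge_join(R, S))
# [(0, 'a', 'j'), (0, 'a', 'g'), (0, 'b', 'j'), (0, 'b', 'g')]
```

If every tuple in both inputs shares one key, the inner loop rescans all of S for each R tuple, which is the quadratic worst case described above.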

233

SMJ: Total cost

• Cost of SMJ is cost of sorting R and S…
• Plus the cost of scanning: ~P(R)+P(S)
  - Because of backup: in worst case P(R)*P(S); but this would be very unlikely
• Plus the cost of writing out: ~P(R)+P(S) but in worst case T(R)*T(S)

Total: ~Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT

Recall: $Sort(N) \approx 2N\left(\left\lceil \log_{M-1} \frac{N}{2M} \right\rceil + 1\right)$

Note: this is using repacking, where we estimate that we can create initial runs of length ~2M

External merge: slides p26
External merge sort: slides p43
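The total-cost formula can be evaluated directly. The sketch below is an illustration under assumptions: the function names `sort_cost` and `smj_cost` are hypothetical, OUT is excluded, and the number of merge passes is clamped at zero for inputs that already fit in a single initial run.

```python
import math


def sort_cost(n: int, m: int) -> int:
    """Sort(N) ~ 2N * (ceil(log_{M-1}(N / 2M)) + 1), with repacking.

    Initial runs are assumed to be ~2M pages long; each pass reads and
    writes every page once (hence the factor 2N).
    """
    if m < 3:
        raise ValueError("need at least 3 buffer pages")
    if n <= 0:
        return 0
    merge_passes = max(0, math.ceil(math.log(n / (2 * m), m - 1)))
    return 2 * n * (merge_passes + 1)


def smj_cost(p_r: int, p_s: int, m: int) -> int:
    """Sort both inputs, then one merge scan over each (no backup), excluding OUT."""
    return sort_cost(p_r, m) + sort_cost(p_s, m) + p_r + p_s


print(sort_cost(1000, 100))       # 2 passes: 2 * 2 * 1000 = 4000
print(smj_cost(1000, 500, 100))   # 4000 + 2000 + 1500 = 7500
```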

234

SMJ Illustrated

[Figure: given M buffer pages, the unsorted input relations R and S each pass through the Sort Phase (Ext. Merge Sort: split & sort, then merge passes), followed by the Merge / Join Phase, which creates the joined output file.]

235

SMJ vs. BNLJ: Comparison

• If we have M=100 buffer pages, P(R) = 1000 pages and P(S) = 500 pages:
• Cost for SMJ:
  - Sort:
  - Merge:
  - Sum:
• What is BNLJ?

236

SMJ vs. BNLJ: Comparison

• If we have M=100 buffer pages, P(R) = 1000 pages and P(S) = 500 pages:
• Cost for SMJ:
  - Sort: Sort both in two passes: 2*2*1000 + 2*2*500 = 6,000 IOs
  - Merge: Merge phase 1000 + 500 = 1,500 IOs
  - Sum: 7,500 IOs + OUT
• What is BNLJ?
  - $500 + 1000 \cdot \frac{500}{98} \approx 5{,}500$ IOs + OUT
• But, if we have M=35 buffer pages?
  - Sort Merge has same behavior (still 2 passes)
  - BNLJ? 15,500 IOs + OUT!

SMJ is ~linear vs. BNLJ is quadratic… But it’s all about the memory.
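The arithmetic on this slide can be reproduced with two short functions. This is a sketch with hypothetical names (`bnlj_estimate`, `smj_two_pass`); it uses the unrounded fraction P(S)/(M-2), so the BNLJ figures come out slightly above the slide's 5,500 and 15,500, which correspond to rounding 500/98 and 500/33 down to 5 and 15.

```python
def bnlj_estimate(p_outer: int, p_inner: int, m: int) -> float:
    """P(outer) + P(inner) * P(outer) / (M-2), excluding OUT, no rounding."""
    return p_outer + p_inner * p_outer / (m - 2)


def smj_two_pass(p_r: int, p_s: int) -> int:
    """Two sort passes (read + write each) per relation, plus one merge scan."""
    return 2 * 2 * p_r + 2 * 2 * p_s + (p_r + p_s)


for m in (100, 35):
    bnlj = bnlj_estimate(500, 1000, m)  # S (500 pages) held in the M-2 pages
    smj = smj_two_pass(1000, 500)
    print(f"M={m}: BNLJ ~ {bnlj:.0f} IOs, SMJ = {smj} IOs (both + OUT)")
```

The SMJ figure stays at 7,500 for both buffer sizes because two sort passes still suffice, while the BNLJ estimate roughly triples when M drops from 100 to 35.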

237

Takeaway points from SMJ

• If input already sorted on join key, skip the sorts.
  - SMJ is basically linear.
  - Nasty but unlikely case: Many duplicate join keys.
• SMJ needs to sort both relations
  - If max{P(R), P(S)} < M^2 then cost is 3(P(R)+P(S)) + OUT

239

L21: The Relational Model

CS3200 Database design (sp18 s2)
https://course.ccs.neu.edu/cs3200sp18s2/
4/2/2018

240

Our next focus

• The Relational Model
• Relational Algebra
• Relational Algebra Pt. II [Optional: may skip]

241

1. The Relational Model & Relational Algebra

242

What you will learn about in this section

• The Relational Model
• Relational Algebra: Basic Operators
• Execution

243

Motivation

The Relational model is precise, implementable, and we can operate on it (query / update, etc.)

Database maps internally into this procedural language.

244

A Little History

• Relational model due to Edgar “Ted” Codd, a mathematician at IBM in 1970
  - "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387
• IBM didn’t want to use relational model (take money from IMS)
  - Apparently used in the moon landing…

Won Turing award 1981

245

The Relational Model: Schemata

• Relational Schema:

Students(sid: string, name: string, gpa: float)

Relation name: Students
Attributes: sid, name, gpa
String, float, int, etc. are the domains of the attributes

246

The Relational Model: Data

Student

sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5

An attribute (or column) is a typed data entry present in each tuple in the relation

The number of attributes is the arity of the relation

247

The Relational Model: Data

Student

sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5

A tuple or row (or record) is a single entry in the table having the attributes specified by the schema

The number of tuples is the cardinality of the relation

248

The Relational Model: Data

A relational instance is a set of tuples all conforming to the same schema

Recall: In practice DBMSs relax the set requirement, and use multisets (or bags).

Student

sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5

249

To Reiterate

• A relational schema describes the data that is contained in a relational instance

Let R(f1: Dom1, …, fm: Domm) be a relational schema; then an instance of R is a subset of Dom1 x Dom2 x … x Domm

In this way, a relational schema R is a total function from attribute names to types
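Viewing a schema as a total function from attribute names to types translates naturally into a dictionary. The sketch below is only an illustration: the names `STUDENTS_SCHEMA` and `conforms` are hypothetical, Python types stand in for domains, and a list is used for the instance because Python dictionaries are not hashable. The rows come from the Student table shown earlier; the failing tuple is made up for the example.

```python
# Schema: a total function from attribute names to domains (types).
STUDENTS_SCHEMA = {"sid": str, "name": str, "gpa": float}


def conforms(tup: dict, schema: dict) -> bool:
    """True if the tuple has exactly the schema's attributes, each in its domain."""
    return tup.keys() == schema.keys() and all(
        isinstance(tup[attr], dom) for attr, dom in schema.items()
    )


instance = [
    {"sid": "001", "name": "Bob", "gpa": 3.2},
    {"sid": "002", "name": "Joe", "gpa": 2.8},
    {"sid": "003", "name": "Mary", "gpa": 3.8},
    {"sid": "004", "name": "Alice", "gpa": 3.5},
]

print(all(conforms(t, STUDENTS_SCHEMA) for t in instance))  # True

# Hypothetical tuple outside Dom1 x Dom2 x Dom3: gpa is not a float.
print(conforms({"sid": "005", "name": "Eve", "gpa": "Apple"}, STUDENTS_SCHEMA))  # False
```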

250

One More Time

• A relational schema describes the data that is contained in a relational instance

A relation R of arity t is a function: R: Dom1 x … x Domt → {0,1}

I.e. returns whether or not a tuple of matching types is a member of it

Then, the schema is simply the signature of the function

Note here that order matters, attribute name doesn’t… We’ll (mostly) work with the other model (last slide) in which attribute name matters, order doesn’t!
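The function view of a relation can be sketched as a membership test over positional tuples. This is an illustrative sketch, not lecture code: `make_relation` is a hypothetical helper, and the rows are taken from the Student table shown earlier.

```python
from typing import Callable


def make_relation(tuples) -> Callable[..., int]:
    """Return R: Dom1 x ... x Domt -> {0, 1} as a characteristic function."""
    members = frozenset(tuples)

    def relation(*values) -> int:
        return 1 if values in members else 0

    return relation


student = make_relation([
    ("001", "Bob", 3.2),
    ("002", "Joe", 2.8),
    ("003", "Mary", 3.8),
    ("004", "Alice", 3.5),
])

print(student("001", "Bob", 3.2))  # 1: the tuple is a member
print(student("Bob", "001", 3.2))  # 0: same values, wrong positions
```

The second call shows why order matters in this model: arguments are identified by position, not by attribute name.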

251

A relational database

• A relational database schema is a set of relational schemata, one for each relation
• A relational database instance is a set of relational instances, one for each relation

Two conventions:
1. We call relational database instances simply databases
2. We assume all instances are valid, i.e., satisfy the domain constraints

252

A Course Management System (CMS)

• Relation DB Schema
  - Students(sid: string, name: string, gpa: float)
  - Courses(cid: string, cname: string, credits: int)
  - Enrolled(sid: string, cid: string, grade: string)

Relation Instances

Students

Sid   Name   Gpa
101   Bob    3.2
123   Mary   3.8

Courses

cid   cname   credits
564   564-2   4
308   417     2

Enrolled

sid   cid   Grade
123   564   A

Note that the schemas impose effective domain / type constraints, i.e. Gpa can’t be “Apple”

253

2nd Part of the Model: Querying

“Find names of all students with GPA > 3.5”

    SELECT S.name
    FROM Students S
    WHERE S.gpa > 3.5;

We don’t tell the system how or where to get the data - just what we want, i.e., Querying is declarative

Actually, I showed how to do this translation for a much richer language!

To make this happen, we need to translate the declarative query into a series of operators… we’ll see this next!
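To see the declarative query run, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The choice of SQLite and the column types are assumptions for illustration; the rows are the Student table shown earlier.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [("001", "Bob", 3.2), ("002", "Joe", 2.8),
     ("003", "Mary", 3.8), ("004", "Alice", 3.5)],
)

# We state what we want; the engine decides how to get it.
rows = conn.execute(
    "SELECT S.name FROM Students S WHERE S.gpa > 3.5"
).fetchall()
print(rows)  # [('Mary',)]
conn.close()
```

Alice is excluded because her GPA is exactly 3.5 and the predicate is a strict inequality.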

254

Virtues of the model

• Physical independence (logical too), Declarative
• Simple, elegant, clean: Everything is a relation
• Why did it take multiple years?
  - Doubted it could be done efficiently.

255

2. Relational Algebra

256

RDBMS Architecture

• How does a SQL engine work?

SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution

  - SQL Query: Declarative query (from user)
  - Relational Algebra (RA) Plan: Translate to relational algebra expression
  - Optimized RA Plan: Find logically equivalent - but more efficient - RA expression
  - Execution: Execute each operator of the optimized plan!
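The idea of two logically equivalent RA plans can be sketched with toy operators over lists of dictionaries. Everything here is an assumption made for illustration: the operator functions `select`, `project` and `join`, the choice of query, and pushing the selection below the join as the "optimization". The data is the Students and Enrolled instance from the CMS slide.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def project(attrs, rel):
    """Projection: keep only the listed attributes (bag semantics, no dedup)."""
    return [{a: t[a] for a in attrs} for t in rel]


def join(left, right, attr):
    """Equijoin on a shared attribute, using simple nested loops."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


def high_gpa(t):
    return t["gpa"] > 3.5


students = [{"sid": "101", "name": "Bob", "gpa": 3.2},
            {"sid": "123", "name": "Mary", "gpa": 3.8}]
enrolled = [{"sid": "123", "cid": "564", "grade": "A"}]

# Plan A: join first, then filter.
plan_a = project(["name"], select(high_gpa, join(students, enrolled, "sid")))
# Plan B: filter first, so the join sees fewer tuples.
plan_b = project(["name"], join(select(high_gpa, students), enrolled, "sid"))

print(plan_a == plan_b, plan_b)  # True [{'name': 'Mary'}]
```

Both plans return the same answer; an optimizer would prefer the one that feeds fewer tuples into the join.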
