Querying Relational Data: Algebra
Gerome Miklau
UMass Amherst
CMPSCI 645 – Database Systems
Jan 21, 2010
Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom
Thursday, January 21, 2010
Next lectures
• Today: relational model, relational algebra
• Next Tuesday: SQL
• Homework 1 will be on these topics
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
– Instance: a table, with rows and columns.
– Schema: specifies name of relation, plus name and type/domain of each column.
Restriction: all attributes are of atomic type, no nested tables
Students(sid: string, name: string, login: string, age: integer, gpa: real).
Relational instances: tables
[Figure: an instance of the Students relation, with callouts labeling a column (attribute, field), a row (tuple), and an attribute value.]
• Arity (number of attributes) is 5
• A relation is a set of tuples: no tuple can occur more than once
– Real systems may allow duplicates for efficiency or other reasons; we'll come back to this.
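A minimal Python sketch of these definitions, assuming a relation instance is modelled as a set of tuples and a schema as a name plus (attribute, type) pairs. The sample rows are hypothetical and are not taken from the slides.

```python
# Schema: relation name plus (attribute, type) pairs, as in the Students example.
students_schema = ("Students", (("sid", str), ("name", str), ("login", str),
                                ("age", int), ("gpa", float)))

# Instance: a set of tuples. These rows are hypothetical sample data.
students = {
    ("s1", "Ann", "ann@cs", 19, 3.5),
    ("s2", "Raj", "raj@cs", 20, 3.2),
}

# Adding a tuple that is already present leaves the set unchanged.
students.add(("s1", "Ann", "ann@cs", 19, 3.5))

arity = len(students_schema[1])   # number of attributes
cardinality = len(students)       # number of tuples
```

Using a set rather than a list is what enforces "no tuple can occur more than once"; a list would model a bag instead.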
Relational Query Languages
• Query languages: Allow manipulation and retrieval of data from a database.
• Query Languages != programming languages!
– QLs not expected to be “Turing complete”.
– QLs not intended to be used for complex calculations.
– QLs support easy, efficient access to large data sets.
Preliminaries
• A query is applied to one or more relation instances
• The result of a query is a relation instance.
• Input and output schema:
– Schemas of the input relations for a query are fixed
– The schema for the result of a given query is also fixed: determined by definition of query language constructs.
Query Q: R1..Rn → R’
What is an “Algebra”
• Mathematical system consisting of:
– Operands: variables or values from which new values can be constructed.
– Operators: symbols denoting procedures that construct new values from given values.
What is the Relational Algebra?
• An algebra whose operands are relations or variables that represent relations.
• Operators are designed to do the most common things that we need to do with relations in a database.
– The result is an algebra that can be used as a query language for relations.
Relational Algebra
• Operates on relations, i.e. sets
– Later: we discuss how to extend this to bags
• Five operators:
– Union: ∪
– Difference: -
– Selection: σ
– Projection: Π
– Cartesian Product: ×
• Derived or auxiliary operators:
– Intersection, complement
– Joins (natural, equi-join, theta join)
– Renaming: ρ
– Division: /
Example Database
STUDENT
sid  name
1    Jill
2    Bo
3    Maya

Takes
sid  cid
1    645
1    683
3    635

COURSE
cid  name  sem
645  DB    F05
683  AI    S05
635  Arch  F05

PROFESSOR
fid  name
1    Diao
2    Saul
8    Weems

Teaches
fid  cid
1    645
2    683
8    635
1. Union and 2. Difference
R1
sid  name
1    Jill
2    Bo
3    Maya

R2
sid  name
1    Jill
4    Bob

R1 ∪ R2
sid  name
1    Jill
2    Bo
3    Maya
4    Bob

R1 – R2
sid  name
2    Bo
3    Maya
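The two results above can be reproduced with a short Python sketch, assuming each relation is modelled as a set of (sid, name) tuples; Python's built-in set operators then play the role of ∪ and –.

```python
# R1 and R2 from the slide, as sets of (sid, name) tuples.
R1 = {(1, "Jill"), (2, "Bo"), (3, "Maya")}
R2 = {(1, "Jill"), (4, "Bob")}

union = R1 | R2        # R1 ∪ R2
difference = R1 - R2   # R1 – R2
```

Both operators require the two inputs to have the same schema, which the tuple layout here satisfies by construction.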
What about Intersection ?
• It is a derived operator
• R1 ∩ R2 = R1 – (R1 – R2)
• Also expressed as a join (we’ll see later)
[Figure: Venn diagram of R1 and R2, with the region R1 – R2 marked.]
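As a quick check of the identity, here is a sketch that computes R1 – (R1 – R2) and compares it with Python's built-in set intersection. The relations reuse the R1 and R2 instances from the union and difference slide.

```python
R1 = {(1, "Jill"), (2, "Bo"), (3, "Maya")}
R2 = {(1, "Jill"), (4, "Bob")}

# Intersection derived from difference alone.
derived = R1 - (R1 - R2)

# Built-in intersection, used here only as a reference result.
builtin = R1 & R2
```

Removing from R1 everything that is not in R2 leaves exactly the tuples the two relations share.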
3. Selection
• Returns all tuples which satisfy a condition
• Notation: σ_c(R)
• Examples
– σ_{cid > 600}(Course)
– σ_{name = “AI”}(Course)
• The condition c compares attributes and constants using =, <, ≤, >, ≥, <>

Course
cid  name  sem
645  DB    F05
683  AI    S05
635  Arch  F05
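A minimal sketch of selection, assuming Course is a set of (cid, name, sem) tuples and the condition is passed as a Python function. The helper name `select` is an illustrative choice.

```python
def select(relation, predicate):
    """σ_predicate(relation): keep the tuples for which predicate holds."""
    return {t for t in relation if predicate(t)}

# Course(cid, name, sem) from the slide.
course = {(645, "DB", "F05"), (683, "AI", "S05"), (635, "Arch", "F05")}

cid_over_600 = select(course, lambda t: t[0] > 600)   # σ_{cid > 600}(Course)
named_ai = select(course, lambda t: t[1] == "AI")     # σ_{name = "AI"}(Course)
```

Selection never changes the schema: every output tuple is an unmodified input tuple.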
4. Projection
• Eliminates columns, then removes duplicates
• Notation: Π_{A1,…,An}(R)
• Example: project cid and name
– Π_{cid, name}(Course)
– Output schema: Answer(cid, name)

Course
cid  name  sem
645  DB    F05
683  AI    S05
645  DB    S05

Answer
cid  name
645  DB
683  AI
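A sketch of projection under the same set-of-tuples assumption, with attributes identified by position. The helper name `project` is illustrative.

```python
def project(relation, positions):
    """Π over the given column positions; the set removes duplicates."""
    return {tuple(t[i] for i in positions) for t in relation}

# Course(cid, name, sem) as shown on this slide (DB appears in two semesters).
course = {(645, "DB", "F05"), (683, "AI", "S05"), (645, "DB", "S05")}

answer = project(course, (0, 1))   # Π_{cid, name}(Course)
```

The two DB rows collapse into one because the result is built as a set, which mirrors the "then removes duplicates" step.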
5. Cartesian Product
• Each tuple in R1 with each tuple in R2
• Notation: R1 × R2
• Very rare in practice; mainly used to express joins
Also called “Cross Product”
Cartesian Product
Student
sid  name
1    Jill
2    Bo

Takes
sid  cid
1    645
1    683
3    635

Student × Takes
sid  name  sid  cid
1    Jill  1    645
1    Jill  1    683
1    Jill  3    635
2    Bo    1    645
2    Bo    1    683
2    Bo    3    635
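The product table above can be generated with a short sketch, assuming tuples are concatenated positionally and `itertools.product` enumerates all pairs.

```python
from itertools import product

student = {(1, "Jill"), (2, "Bo")}
takes = {(1, 645), (1, 683), (3, 635)}

# Student × Takes: every Student tuple paired with every Takes tuple.
student_x_takes = {s + t for s, t in product(student, takes)}
```

The result has |Student| · |Takes| = 2 · 3 = 6 tuples, including pairs such as (2, Bo, 3, 635) whose sid values do not match.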
Renaming
• Changes the schema, not the instance
• Notation: ρ_{B1,…,Bn}(R)
• Example: ρ_{courseID, cname, term}(Course)

Course
cid  name  sem
645  DB    F05
683  AI    S05
645  DB    S05

ρ_{courseID, cname, term}(Course)
courseID  cname  term
645       DB     F05
683       AI     S05
645       DB     S05
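A sketch of renaming, assuming a relation is represented as a pair (schema, rows) where the schema is a tuple of attribute names. This representation and the helper name `rename` are illustrative choices.

```python
def rename(relation, new_names):
    """ρ_{new_names}(relation): replace the attribute names, keep the rows."""
    schema, rows = relation
    if len(new_names) != len(schema):
        raise ValueError("rename needs one new name per attribute")
    return (tuple(new_names), rows)

course = (("cid", "name", "sem"),
          {(645, "DB", "F05"), (683, "AI", "S05"), (645, "DB", "S05")})

renamed = rename(course, ("courseID", "cname", "term"))
```

Only the first component of the pair changes, which is the sense in which renaming affects the schema but not the instance.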
Natural Join
• Notation: R1 ⋈ R2
• Meaning: R1 ⋈ R2 = Π_A(σ_C(R1 × R2))
• Where:
– The selection σ_C checks equality of all common attributes
– The projection eliminates the duplicate common attributes
Natural join example
Student
sid  name
1    Jill
2    Bo
3    Maya

Takes
sid  cid
1    645
1    683
3    635

Student ⋈ Takes
sid  name  cid
1    Jill  645
1    Jill  683
3    Maya  635
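The example can be reproduced with a nested-loop sketch of natural join, assuming each relation is a (schema, rows) pair with attribute names in the schema. The representation and the name `natural_join` are illustrative; real systems use far more efficient join algorithms.

```python
def natural_join(r, s):
    """R ⋈ S: match on all common attributes, keep each common attribute once."""
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = [a for a in r_schema if a in s_schema]
    s_extra = [a for a in s_schema if a not in common]
    out_rows = set()
    for rt in r_rows:
        r_val = dict(zip(r_schema, rt))
        for st in s_rows:
            s_val = dict(zip(s_schema, st))
            # Selection step: equality on every common attribute.
            if all(r_val[a] == s_val[a] for a in common):
                # Projection step: drop S's copy of the common attributes.
                out_rows.add(rt + tuple(s_val[a] for a in s_extra))
    return (tuple(r_schema) + tuple(s_extra), out_rows)

student = (("sid", "name"), {(1, "Jill"), (2, "Bo"), (3, "Maya")})
takes = (("sid", "cid"), {(1, 645), (1, 683), (3, 635)})

schema, rows = natural_join(student, takes)
```

Bo does not appear in the result because no Takes tuple has sid 2.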
Natural join questions
• Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S?
– R(A,B,C,D,E)
• Given R(A, B, C), S(D, E), what is R ⋈ S?
– Cartesian Product
• Given R(A, B), S(A, B), what is R ⋈ S?
– Intersection
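The schema side of these three answers can be checked with a small sketch, assuming a schema is a tuple of attribute names. The helper `join_schema` is hypothetical and computes only the output schema, not the joined rows.

```python
def join_schema(r_attrs, s_attrs):
    """Schema of R ⋈ S: R's attributes, then S's attributes not already in R."""
    return tuple(r_attrs) + tuple(a for a in s_attrs if a not in r_attrs)

q1 = join_schema(("A", "B", "C", "D"), ("A", "C", "E"))  # two common attributes
q2 = join_schema(("A", "B", "C"), ("D", "E"))            # no common attributes
q3 = join_schema(("A", "B"), ("A", "B"))                 # all attributes common
```

With no common attributes the equality check is vacuous and every pair of tuples matches, which is why the second case is a Cartesian product; with identical schemas a tuple survives only if it occurs in both relations, which is why the third case is an intersection.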
Theta Join
• A join that involves a predicate
• R1 ⋈_θ R2 = σ_θ(R1 × R2)
• Here θ can be any condition built from the comparisons =, <, ≠, ≤, >, ≥
Example: Student ⋈_{Student.sid < Takes.sid} Takes
Equi-join
• A theta join where θ is an equality
• R1 ⋈_{A=B} R2 = σ_{A=B}(R1 × R2)
• Very useful join in practice
• Example: Student ⋈_{Student.sid = Takes.sid} Takes
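Theta join and equi-join can both be sketched as a selection over a Cartesian product, assuming relations are sets of tuples and θ is a Python function of one tuple from each side. The helper name `theta_join` is illustrative.

```python
from itertools import product

def theta_join(r, s, theta):
    """R ⋈_θ S = σ_θ(R × S), with θ given as a function of (r_tuple, s_tuple)."""
    return {rt + st for rt, st in product(r, s) if theta(rt, st)}

student = {(1, "Jill"), (2, "Bo"), (3, "Maya")}   # (sid, name)
takes = {(1, 645), (1, 683), (3, 635)}            # (sid, cid)

# Theta join with Student.sid < Takes.sid
less_than = theta_join(student, takes, lambda s, t: s[0] < t[0])

# Equi-join with Student.sid = Takes.sid
equi = theta_join(student, takes, lambda s, t: s[0] == t[0])
```

Unlike the natural join, the equi-join keeps both sid columns, so its tuples have four fields.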
Semijoin
R ⋉ S = Π_{A1,…,An}(R ⋈ S)
where A1, …, An are the attributes in R
The semijoin of R and S is the set of tuples of R that agree with at least one tuple of S on all attributes common to the schema of R and S.

Student
sid  name
1    Jill
2    Bo
3    Maya

Takes
sid  cid
1    645
1    683
3    635

Student ⋉ Takes
sid  name
1    Jill
3    Maya
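A sketch of the semijoin, assuming (schema, rows) pairs as the relation representation. Rather than computing the full join and projecting, it collects the values S has on the common attributes and filters R against them; the helper name `semijoin` is illustrative.

```python
def semijoin(r, s):
    """R ⋉ S: tuples of R that match at least one tuple of S on common attributes."""
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = [a for a in r_schema if a in s_schema]
    r_idx = [r_schema.index(a) for a in common]
    s_idx = [s_schema.index(a) for a in common]
    # Values that S takes on the common attributes.
    s_keys = {tuple(st[i] for i in s_idx) for st in s_rows}
    kept = {rt for rt in r_rows if tuple(rt[i] for i in r_idx) in s_keys}
    return (r_schema, kept)

student = (("sid", "name"), {(1, "Jill"), (2, "Bo"), (3, "Maya")})
takes = (("sid", "cid"), {(1, 645), (1, 683), (3, 635)})

schema, rows = semijoin(student, takes)
```

Jill appears once even though she matches two Takes tuples, since the output contains tuples of R only.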
Division
• A derived operator useful for queries like: Find students who have enrolled in all systems courses.
• Let R have 2 fields, x and y; S have only field y:
– R/S = { (x) | ∀ (y) ∈ S, (x,y) ∈ R }, where x ranges over Π_x(R)
• i.e., R/S contains all x tuples (students) such that for every y tuple (course) in S, there is an xy tuple in R.
• Or: If the set of y values (courses) associated with an x value (student) in R contains all y values in S, the x value is in R/S.
• In general: attributes of S must be a subset of attributes of R:
– R(A1 ... An, B1, ... Bm) and S(B1 ... Bm)
Division examples

A
sno  pno
s1   p1
s1   p2
s1   p3
s1   p4
s2   p1
s2   p2
s3   p2
s4   p2
s4   p4

B1
pno
p2

B2
pno
p2
p4

B3
pno
p1
p2
p4

A/B1
sno
s1
s2
s3
s4

A/B2
sno
s1
s4

A/B3
sno
s1
Expressing division using basic operators
Idea: For R/S, compute all x values that are not “disqualified” by some y value in S.
– An x value is disqualified if, by attaching a y value from S, we obtain an xy tuple that is not in R.
Disqualified x values: Π_x((Π_x(R) × S) – R)
R / S: Π_x(R) – all disqualified tuples
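The construction can be traced in a short sketch, assuming R is a set of (x, y) pairs and S is a plain set of y values, which is a simplification of a one-column relation. The data are A, B1, B2 and B3 from the division examples; the helper name `divide` is illustrative.

```python
def divide(r, s_values):
    """R/S via Π_x(R) – Π_x((Π_x(R) × S) – R)."""
    xs = {x for x, _ in r}                                # Π_x(R)
    candidates = {(x, y) for x in xs for y in s_values}   # Π_x(R) × S
    disqualified = {x for x, _ in candidates - r}         # Π_x(... – R)
    return xs - disqualified

A = {("s1", "p1"), ("s1", "p2"), ("s1", "p3"), ("s1", "p4"),
     ("s2", "p1"), ("s2", "p2"), ("s3", "p2"),
     ("s4", "p2"), ("s4", "p4")}

a_div_b1 = divide(A, {"p2"})
a_div_b2 = divide(A, {"p2", "p4"})
a_div_b3 = divide(A, {"p1", "p2", "p4"})
```

Only the basic operators are used: projection, Cartesian product, and difference.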
Combining operators: complex expressions
Π_{name,sid}(σ_{name=“DB”}(Students ⋈ (Takes ⋈ Course)))
[Figure: query tree with Π_{name,sid} at the root, above σ_{name=“DB”}, above the joins of Students, Takes, and Course.]
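A sketch of how such an expression composes, assuming (schema, rows) pairs and the example database instance. One assumption differs from the slide and is made loudly: the course name attribute is renamed to cname here, because Students and Course both have an attribute called name and a natural join would otherwise equate them. All helper names are illustrative.

```python
def natural_join(r, s):
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = [a for a in r_schema if a in s_schema]
    s_extra = [a for a in s_schema if a not in common]
    rows = set()
    for rt in r_rows:
        rv = dict(zip(r_schema, rt))
        for st in s_rows:
            sv = dict(zip(s_schema, st))
            if all(rv[a] == sv[a] for a in common):
                rows.add(rt + tuple(sv[a] for a in s_extra))
    return (tuple(r_schema) + tuple(s_extra), rows)

def select_eq(r, attr, value):
    schema, rows = r
    i = schema.index(attr)
    return (schema, {t for t in rows if t[i] == value})

def project(r, attrs):
    schema, rows = r
    idx = [schema.index(a) for a in attrs]
    return (tuple(attrs), {tuple(t[i] for i in idx) for t in rows})

students = (("sid", "name"), {(1, "Jill"), (2, "Bo"), (3, "Maya")})
takes = (("sid", "cid"), {(1, 645), (1, 683), (3, 635)})
# Assumption: Course.name is renamed to cname to avoid clashing with Students.name.
course = (("cid", "cname", "sem"),
          {(645, "DB", "F05"), (683, "AI", "S05"), (635, "Arch", "F05")})

joined = natural_join(students, natural_join(takes, course))
result = project(select_eq(joined, "cname", "DB"), ("name", "sid"))
```

Each operator takes relations and returns a relation, which is what lets the calls nest in the same shape as the query tree.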
Algebraic Equivalences
• Relational algebra has laws of commutativity, associativity, etc. that imply certain expressions are equivalent.
Definition: Query Equivalence
Two queries Q and Q’ are equivalent if:
for all instances D, Q(D) = Q’(D)
Query Optimization Is Based on Algebraic Equivalences
• Equivalent expressions may be different in cost of evaluation!
– Cascading selection: σ_{c ∧ d}(R) ≡ σ_c(σ_d(R))
– Pushing selections: σ_c(R ⋈ S) ≡ σ_c(R) ⋈ S, when c refers only to attributes of R
– Join associativity: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
• Query optimization finds the most efficient representation to evaluate (or one that’s not bad)
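Two of these laws can be illustrated on the example instance with a small sketch. Agreement on one instance does not prove an equivalence, which must hold for all instances; the conditions c and d below are hypothetical, and the join is hard-coded on sid for brevity.

```python
def select(rows, pred):
    return {t for t in rows if pred(t)}

def join_on_sid(students, takes):
    """Student ⋈ Takes for Student(sid, name) and Takes(sid, cid)."""
    return {(sid, name, cid)
            for sid, name in students
            for tsid, cid in takes if sid == tsid}

students = {(1, "Jill"), (2, "Bo"), (3, "Maya")}
takes = {(1, 645), (1, 683), (3, 635)}

c = lambda t: t[0] >= 2      # hypothetical condition on sid (position 0)
d = lambda t: t[1] != "Bo"   # hypothetical condition on name (position 1)

# Cascading selection: σ_{c ∧ d}(R) versus σ_c(σ_d(R))
cascade_lhs = select(students, lambda t: c(t) and d(t))
cascade_rhs = select(select(students, d), c)

# Pushing selections: σ_c(R ⋈ S) versus σ_c(R) ⋈ S, where c uses only R's sid
push_lhs = select(join_on_sid(students, takes), c)
push_rhs = join_on_sid(select(students, c), takes)
```

The pushed form filters Students before joining, so the join examines fewer tuple pairs; that difference in work is what an optimizer exploits.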
Operations on Bags
A bag = a set with repeated elements
Relational Engines work on bags, not sets!
All operations need to be defined carefully on bags
• {a,b,b,c} ∪ {a,b,b,b,e,f,f} = {a,a,b,b,b,b,b,c,e,f,f}
• {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b}
• σ_C(R): preserves the number of occurrences
• Π_A(R): no duplicate elimination
• Cartesian product, join: no duplicate elimination
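The two bag examples above can be checked with Python's `collections.Counter`, used here as a stand-in for a bag: `+` adds occurrence counts and `-` subtracts them, dropping any element whose count would fall to zero or below.

```python
from collections import Counter

# Bag union adds multiplicities.
bag_union = Counter("abbc") + Counter("abbbeff")

# Bag difference subtracts multiplicities, never going below zero.
bag_difference = Counter("abbbcc") - Counter("bcccd")

union_elements = sorted(bag_union.elements())
difference_elements = sorted(bag_difference.elements())
```

In the difference, c occurs twice on the left and three times on the right, so it disappears entirely, and d is ignored because it never occurred on the left.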
Beware: Bag Laws != Set Laws
• Some, but not all algebraic laws that hold for sets also hold for bags.
• Example: the commutative law for union (R ∪ S = S ∪ R) does hold for bags.
– Since addition is commutative, adding the number of times x appears in R and S doesn’t depend on the order of R and S.
Example of the Difference
• Set union is idempotent, meaning that S ∪ S = S.
• However, for bags, if x appears n times in S, then it appears 2n times in S ∪ S.
• Thus S ∪ S != S in general.
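A short sketch contrasting the two laws, with Python sets standing in for set semantics and `collections.Counter` for bag semantics. The contents of the bags are hypothetical.

```python
from collections import Counter

# Set semantics: union is idempotent.
s_set = {"x", "y"}
set_union = s_set | s_set

# Bag semantics: union adds counts, so S ∪ S doubles every multiplicity.
s_bag = Counter({"x": 2, "y": 1})
r_bag = Counter({"x": 1, "z": 3})
bag_self_union = s_bag + s_bag

# Commutativity still holds for bags.
commutes = (r_bag + s_bag) == (s_bag + r_bag)
```

An optimizer that rewrites S ∪ S to S is therefore correct under set semantics but changes the answer under bag semantics.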
Relational calculus
• What is a “calculus”?
– The term "calculus" means a system of computation
– The relational calculus is a system of computing with relations
Relational calculus (in 1 slide)
English: Name and sid of students who are taking the course “DB”
RC: { x_name, x_sid | ∃x_cid ∃x_term Students(x_sid, x_name) ∧ Takes(x_sid, x_cid) ∧ Course(x_cid, “DB”, x_term) }
RA: Π_{name,sid}(Students ⋈ Takes ⋈ σ_{name=“DB”}(Course))
Where are the joins?
We will study another logic-based formalism for queries called Datalog later.
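A Python set comprehension reads much like the calculus query, which may help with the question of where the joins went. This is a sketch over the example database instance, with variable names chosen to mirror the calculus variables.

```python
students = {(1, "Jill"), (2, "Bo"), (3, "Maya")}                          # (sid, name)
takes = {(1, 645), (1, 683), (3, 635)}                                    # (sid, cid)
course = {(645, "DB", "F05"), (683, "AI", "S05"), (635, "Arch", "F05")}   # (cid, name, sem)

# {x_name, x_sid | ∃x_cid ∃x_term Students(...) ∧ Takes(...) ∧ Course(x_cid, "DB", x_term)}
answer = {
    (x_name, x_sid)
    for x_sid, x_name in students
    for t_sid, x_cid in takes
    for c_cid, c_name, x_term in course
    if t_sid == x_sid and c_cid == x_cid and c_name == "DB"
}
```

In the calculus, a join is expressed by reusing the same variable in two atoms; in the comprehension the same constraint appears as the equality tests in the `if` clause.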
Algebra v. Calculus
• Relational Algebra: More operational; very useful for representing execution plans.
• Relational Calculus: More declarative, basis of SQL
• The calculus and algebra have equivalent expressive power (Codd)
A language that can express this core class of queries is called Relationally Complete
What can’t you express in RA,RC?
• Can I get from Oakland to Boston in 2 flights?
• Can I get from Oakland to Reno?

depart   arrive
NYC      Reno
NYC      Oakland
Boston   Tampa
Oakland  Boston
Tampa    NYC
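The difference between the two questions can be sketched in Python, assuming the table is a set of (depart, arrive) pairs. A question about a fixed number of flights needs a fixed number of self-joins, while general reachability needs a join repeated until nothing new appears. That loop is a recursion that no single relational algebra expression provides for every instance. The helper names are illustrative.

```python
flights = {("NYC", "Reno"), ("NYC", "Oakland"), ("Boston", "Tampa"),
           ("Oakland", "Boston"), ("Tampa", "NYC")}

def compose(r, s):
    """Join r and s on r.arrive = s.depart, keeping (r.depart, s.arrive)."""
    return {(a, c) for a, b in r for b2, c in s if b == b2}

def transitive_closure(edges):
    """Repeat the join until a fixpoint: all pairs connected by 1 or more flights."""
    closure = set(edges)
    while True:
        extended = closure | compose(closure, edges)
        if extended == closure:
            return closure
        closure = extended

exactly_two = compose(flights, flights)   # one self-join: trips of exactly 2 flights
reachable = transitive_closure(flights)   # needs an unbounded number of joins
```

The number of loop iterations depends on the data rather than on the query, which is why a fixed expression cannot capture it.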