CSE544: Introduction
Monday, March 27, 2006
Staff
• Instructor: Dan Suciu
– CSE 662, [email protected]
– Office hours: Wednesdays, 12pm-1pm
• TA: Bhushan Mandhani
– Office hours: TBA
• Mailing list: [email protected]
– http://mailman.cs.washington.edu/mailman/private/cse544
• Web page (a lot of stuff already there): http://www.cs.washington.edu/544
Course Times
• Mon, Wed, 10:30-12
• Final:
– 8:30-10:20 a.m. Monday, Jun. 5, 2006
– In this room
Goals of the Course
• Using database systems
• Foundations of data management.
• Issues in building database systems.
• Current research topics in databases.
Format
Basic structure:
• Lectures on Wednesdays
• Paper discussions on Mondays (reviews !)
Content
• Data modeling basics (3 weeks):
– SQL/XQuery with homeworks on Postgres/Galax
– Logical foundations of databases
• Transactions (2 weeks):
– concurrency control (locks, timestamps)
– recovery (undo, redo, undo/redo)
• Topics in query execution/optimization (3 weeks)
• Database security (1 week)
• Databases + IR, Probabilistic databases (1 week)
Textbooks
We won’t follow any book, but you may want to consult these if you need more detail on a topic:
• Database Management Systems, Ramakrishnan
• The Complete Book, Garcia-Molina, Ullman, Widom
• XQuery from the Experts, Howard Katz, Ed.
• Data on the Web, Abiteboul, Buneman, Suciu
• Theory of database systems, Abiteboul, Hull, Vianu
Grading
• Homework: 20%
• Paper reviews: 20%
• Participation in the discussions: 10%
• Project: 30%
• Final: 20%
Homework: 20%
HW1:
minor programming in SQL and XQuery
HW2:
problem sets, no programming
theory, optimizations, query execution, transactions
Project: 30%
• Choose from a list of mini-research topics, or come up with your own
• Open ended
• Write short research paper (2-3 pages)
• Conference-style presentation
Project: 30%
• Goals: apply database principles to a new problem
– Understand and model the problem
– Research and understand related work (1-2 papers)
– Propose some new approach (creativity will be evaluated)
– Implement some part
• NOT intended to be a major software development
• Amount of work may vary widely between groups
Project: 30%
Milestones:
• Groups of 1-3 assembled by 4/5
• Proposals due by 4/10
• Short research papers (2-3 pages) due by 5/30
• Presentations on 5/31 in class (MAY TAKE LONGER THAN 12pm)
Paper Reviews: 20%
• There will be reading assignments
• Papers are discussed Mondays
• You have to write the reviews by Sunday night
Final: 20%
• June 5, 8:30-10:30, same room
• Challenging and fun
Database
What is a database ?
• A collection of files storing related data
Give examples of databases
• Accounts database; payroll database; UW’s students database; Amazon’s products database; airline reservation database
Database Management System
What is a DBMS ?
• A big C program written by someone else that lets us manage a large database efficiently and keeps it persistent over long periods of time
Give examples of DBMS
• DB2 (IBM), SQL Server (MS), Oracle, Sybase
• MySQL, Postgres, …
Market Shares
From 2004 www.computerworld.com
• IBM: 35% market with $2.5BN in sales
• Oracle: 33% market with $2.3BN in sales
• Microsoft: 19% market with $1.3BN in sales
An Example
The Internet Movie Database, http://www.imdb.com
• Entities: Actors (800k), Movies (400k), Directors, …
• Relationships: who played where, who directed what, …
Want to store and process locally; what functions do we need ?
Functionality
1. Create/store large datasets
2. Search/query/update
3. Change the structure
4. Concurrent access by many users
5. Recover from crashes
6. Security (not here, but in other apps)
Possible Organizations
• Files
• Spreadsheets
• DBMS
1. Create/store Large Datasets
• Files: Yes, but…
• Spreadsheets: Not really…
• DBMS: Yes
2. Search/Query/Update
• Simple query:
– In what year was ‘Rain man’ produced ?
• Multi-table query:
– Find all movies by ‘Coppola’
• Complex query:
– For each actor, count her/his movies
• Updating
– Insert a new movie; add an actor to a movie; etc
2. Search/Query/Update
• Files: Simple queries
• Spreadsheets: Multi-table queries (maybe)
• DBMS: All
Updates: generally OK
3. Change the Structure
Add Address to each Actor
• Files: Very hard
• Spreadsheets: Yes
• DBMS: Yes
4. Concurrent Access
Multiple users access/update the data concurrently
• What can go wrong ?
– Lost update; resulting in inconsistent data
• How do we protect against that in OS ?
– Locks
• This is insufficient in databases; why ?
4. Concurrent Access
Transfer $100 from account A to B:
    X = Read(Accounts, A);
    X.amount = X.amount - 100;
    Write(Accounts, A, X);
    Y = Read(Accounts, B);
    Y.amount = Y.amount + 100;
    Write(Accounts, B, Y);
Find total amount in A and B:
    X = Read(Accounts, A);
    Y = Read(Accounts, B);
    S = X.amount + Y.amount;
    return S
What can go wrong ? Do locks help ?
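One way to see what can go wrong is to run the two programs above with the "find total" program scheduled in the middle of the transfer. The sketch below is only an illustration: the `accounts` dictionary is a hypothetical in-memory stand-in for the Accounts table, the starting balances are made up, and a Python generator is used to pause the transfer between its two writes.

```python
# Hypothetical in-memory stand-in for the Accounts table (balances are made up).
accounts = {"A": 100, "B": 100}

def transfer_steps(db, amount):
    """Transfer `amount` from A to B, pausing after each write so that
    another program can be scheduled in between."""
    x = db["A"]              # X = Read(Accounts, A)
    db["A"] = x - amount     # Write(Accounts, A, X)
    yield "debited A"
    y = db["B"]              # Y = Read(Accounts, B)
    db["B"] = y + amount     # Write(Accounts, B, Y)
    yield "credited B"

def total(db):
    """Find the total amount in A and B."""
    return db["A"] + db["B"]

t = transfer_steps(accounts, 100)
before = total(accounts)     # total before the transfer starts
next(t)                      # the transfer debits A, then is paused
during = total(accounts)     # the total is computed mid-transfer
next(t)                      # the transfer credits B
after = total(accounts)      # total after the transfer finishes

print(before, during, after)  # prints "200 100 200"
```

The middle reading is $100 short even though no money ever left the system. Each individual read and write here is already indivisible, so protecting single operations with a lock would not change the outcome; the whole transfer has to be isolated from the whole total.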
5. Recover from crashes
    X = Read(Accounts, A);
    X.amount = X.amount - 100;
    Write(Accounts, A, X);
    Y = Read(Accounts, B);
    Y.amount = Y.amount + 100;
    Write(Accounts, B, Y);
CRASH !
What is the problem ?
Enter a DBMS
[Figure: Applications connect (via ODBC, JDBC) to a database server (someone else’s C program), which manages the data files]
“Two tier system” or “client-server”
DBMS = Collection of Tables
Directors:
    id      fName          lName
    15901   Francis Ford   Coppola
    . . .
Movie_Directors:
    id      mid
    15901   130128
    . . .
Movies:
    mid      Title           Year
    130128   The Godfather   1972
    . . .
Still implemented as files, but behind the scenes can be quite complex
“data independence”
1. Create/store Large Datasets
Use SQL to create and populate tables:
    CREATE TABLE Actors (
        Name CHAR(30),
        DateOfBirth CHAR(20)
    ) . . .

    INSERT INTO Actors
    VALUES('Tom Hanks', . . .)
Size and physical organization are handled by the DBMS
We focus on modeling the database
We will study data modeling in this course
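To try the create-and-populate step without installing a server, the statements can be run against SQLite through Python's standard `sqlite3` module. This is a minimal sketch, not the course setup (the homework uses Postgres); the in-memory database and the date-of-birth string are illustrative choices, since the slide elides the remaining values.

```python
import sqlite3

# An in-memory SQLite database stands in for the DBMS (assumption: the
# course itself uses Postgres; SQLite is used here only because it ships
# with Python).
conn = sqlite3.connect(":memory:")

# Create the table from the slide.
conn.execute("CREATE TABLE Actors (Name CHAR(30), DateOfBirth CHAR(20))")

# Populate it. The date string is an illustrative value.
conn.execute("INSERT INTO Actors VALUES (?, ?)", ("Tom Hanks", "1956-07-09"))
conn.commit()

rows = conn.execute("SELECT Name, DateOfBirth FROM Actors").fetchall()
print(rows)
```

Nothing in the code says where or how the rows are stored; that is the part the DBMS handles.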
2. Searching/Querying/Updating
• Find all movies by ‘Coppola’
• What happens behind the scenes ?
    SELECT title
    FROM Movies, Directors, Movie_Directors
    WHERE Directors.lname = 'Coppola'
      AND Movies.mid = Movie_Directors.mid
      AND Movie_Directors.id = Directors.id
We will study SQL in gory detail in this course
We will discuss the query optimizer in class.
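The query can be tried end to end on the three sample rows shown earlier. The sketch below again uses SQLite via Python's `sqlite3` module as a stand-in DBMS; the column types are assumptions, since the slides only show column names. It also asks SQLite for its query plan with `EXPLAIN QUERY PLAN`, which gives a small glimpse of what happens behind the scenes (the plan text depends on the SQLite version, so it is printed rather than checked).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema and sample rows from the slides; the column types are assumed.
conn.executescript("""
    CREATE TABLE Directors (id INTEGER, fName TEXT, lName TEXT);
    CREATE TABLE Movies (mid INTEGER, Title TEXT, Year INTEGER);
    CREATE TABLE Movie_Directors (id INTEGER, mid INTEGER);
    INSERT INTO Directors VALUES (15901, 'Francis Ford', 'Coppola');
    INSERT INTO Movies VALUES (130128, 'The Godfather', 1972);
    INSERT INTO Movie_Directors VALUES (15901, 130128);
""")

query = """
    SELECT title
    FROM Movies, Directors, Movie_Directors
    WHERE Directors.lname = 'Coppola'
      AND Movies.mid = Movie_Directors.mid
      AND Movie_Directors.id = Directors.id
"""

# Run the three-table join.
titles = [row[0] for row in conn.execute(query)]
print(titles)

# Ask the engine how it intends to evaluate the join.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)
```

The query states what is wanted; the order in which the tables are scanned and joined is chosen by the engine.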
3. Changing the Structure
Add Address to each Actor
    ALTER TABLE Actors ADD address CHAR(50) DEFAULT 'unknown'
Lots of cleverness goes on behind the scenes
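A small sketch of the schema change, using SQLite through Python's `sqlite3` module as a stand-in DBMS. The existing row and its date string are illustrative, and the statement is written with the explicit `ADD COLUMN` form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table that already holds data before the structure changes.
conn.execute("CREATE TABLE Actors (Name CHAR(30), DateOfBirth CHAR(20))")
conn.execute("INSERT INTO Actors VALUES ('Tom Hanks', '1956-07-09')")

# Add Address to each Actor, with a default for rows that already exist.
conn.execute("ALTER TABLE Actors ADD COLUMN address CHAR(50) DEFAULT 'unknown'")

row = conn.execute("SELECT Name, address FROM Actors").fetchone()
print(row)  # prints ('Tom Hanks', 'unknown')
```

The existing row picks up the default value without any application code rewriting the stored data.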
4 & 5. Concurrency & Recovery: Transactions
• A transaction = sequence of statements that either all succeed, or all fail
• E.g. Transfer $100
    BEGIN TRANSACTION;
    UPDATE Accounts
    SET amount = amount - 100
    WHERE number = 4662;
    UPDATE Accounts
    SET amount = amount + 100
    WHERE number = 7199;
    COMMIT;
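The all-or-nothing behavior can be observed directly. In this sketch (SQLite via Python's `sqlite3`), the `transfer` helper, its `fail_midway` flag, and the starting balances are hypothetical; the flag simulates a failure between the two updates so that the rollback path is exercised.

```python
import sqlite3

# isolation_level=None: the module does not open transactions implicitly,
# so BEGIN / COMMIT / ROLLBACK are issued explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Accounts (number INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)",
                 [(4662, 500), (7199, 200)])  # made-up balances

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst inside one transaction."""
    conn.execute("BEGIN TRANSACTION")
    try:
        conn.execute("UPDATE Accounts SET amount = amount - ? WHERE number = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("simulated failure between the two updates")
        conn.execute("UPDATE Accounts SET amount = amount + ? WHERE number = ?",
                     (amount, dst))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo the partial work
        raise

def balances(conn):
    return dict(conn.execute("SELECT number, amount FROM Accounts"))

transfer(conn, 4662, 7199, 100)
after_ok = balances(conn)

try:
    transfer(conn, 4662, 7199, 100, fail_midway=True)
except RuntimeError:
    pass
after_fail = balances(conn)

print(after_ok)    # prints {4662: 400, 7199: 300}
print(after_fail)  # prints {4662: 400, 7199: 300}
```

The failed transfer leaves both balances exactly as they were: the debit that had already run is undone by the rollback.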
Transactions
• Transactions have the ACID properties:
A = atomicity
C = consistency
I = isolation
D = durability
4. Concurrent Access
• Serializable execution of transactions
– The I (= isolation) in ACID
We study three techniques in this course:
• Locks
• Timestamps
• Validation
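As a rough illustration of the first technique, the sketch below reruns the transfer and find-total programs from the earlier slide as threads, each holding a lock for its entire duration rather than for a single read or write. The single coarse `table_lock` and the in-memory `accounts` dictionary are simplifying assumptions; this is not the finer-grained locking a real DBMS uses.

```python
import threading

accounts = {"A": 100, "B": 100}   # hypothetical in-memory Accounts table
table_lock = threading.Lock()     # one coarse lock covering both accounts
observed_totals = []

def transfer(amount):
    # The lock is held across BOTH writes, so no other program can
    # observe the state between the debit and the credit.
    with table_lock:
        accounts["A"] -= amount
        accounts["B"] += amount

def total():
    # The lock is held across BOTH reads.
    with table_lock:
        observed_totals.append(accounts["A"] + accounts["B"])

threads = []
for _ in range(50):
    threads.append(threading.Thread(target=transfer, args=(1,)))
    threads.append(threading.Thread(target=total))
for t in threads:
    t.start()
for t in threads:
    t.join()

print(set(observed_totals))  # prints {200}
```

Whatever order the threads run in, every total sees either all of a transfer or none of it, which is the same result as some serial order of the programs.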
5. Recovery from crashes
• Every transaction either executes completely, or doesn’t execute at all
– The A (= atomicity) in ACID
We study three types of log files in this course:
• Undo log file
• Redo log file
• Undo/Redo log file
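The idea behind the first kind of log can be sketched in a few lines: before overwriting a value, record the old value; after a crash, restore the old values written by transactions that never committed. Everything below is a simplified in-memory illustration. The record layout, the function names, and the balances are hypothetical, and a real log lives on disk with rules about when log records and data must be flushed.

```python
db = {"A": 100, "B": 100}   # hypothetical in-memory database
undo_log = []               # a list standing in for the undo log file

def logged_write(txn, key, new_value):
    """Record the old value in the log, then overwrite it."""
    undo_log.append(("WRITE", txn, key, db[key]))
    db[key] = new_value

def commit(txn):
    undo_log.append(("COMMIT", txn))

def recover():
    """Undo, newest first, every write of a transaction with no COMMIT record."""
    committed = {rec[1] for rec in undo_log if rec[0] == "COMMIT"}
    for rec in reversed(undo_log):
        if rec[0] == "WRITE" and rec[1] not in committed:
            _, _, key, old_value = rec
            db[key] = old_value

# T0 transfers $10 from A to B and commits.
logged_write("T0", "A", db["A"] - 10)
logged_write("T0", "B", db["B"] + 10)
commit("T0")

# T1 debits A by $100, then the system "crashes" before crediting B.
logged_write("T1", "A", db["A"] - 100)

recover()
print(db)  # prints {'A': 90, 'B': 110}
```

After recovery the committed transfer T0 is still in place, and the half-finished T1 has left no trace.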