Top Banner
Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
34

Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

Dec 19, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

Introduction

Zachary G. IvesUniversity of Pennsylvania

CIS 550 – Database & Information Systems

September 9, 2004

Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

Page 2: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

2

Welcome to CIS 550, Database and Information Systems!

Instructor: Zachary Ives, zives@cis 576 Levine Hall North Office hours: Tuesday, 3:00-4:00PM (after class)

TA: T.J. Green, tjgreen@cis Office hours: Thursday, 3:00-4:00PM

Newsgroup: upenn.cis.cis550Home page: www.seas.upenn.edu/~zives/cis550/

Texts and readings: Ramakrishnan & Gerke, Database Systems, 3rd ed. Supplementary papers (to be handed out in class) Other books may be useful, esp. Brundage’s Using

XQuery

Page 3: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

3

Course Format and Grading

Roughly one major topic area per week to two weeks Readings in the text & research papers Occasionally, summaries/commentary on papers (5%) Homework assignment for each topic area (30%)

One midterm (10%), one final exam (20%) Project (30%) – groups of 3-4:

Build a “GMail”/Hotmail clone on top of a database, or Build a P2P system for synchronizing tables (Or propose your own idea)

General participation, discussion, intangibles (5%)

Page 4: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

4

Why This Course?

Most CS courses concentrate on code – our interest is managing and representing data

Warning: this course doesn’t focus on teaching SQL or how to be an Oracle DBA (though it will get you started)

… So what in the world are we studying for 14 weeks???

Page 5: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

5

What Do We Do with Data?

Page 6: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

6

Some Ways to Represent Information

Page 7: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

7

Example: An Encyclopedia Entry(www.wikipedia.com)

A database is an information set with a regular structure. Its front-end allows data access, searching and sorting routines. Its back-end affords data inputting and updating. A database is usually but not necessarily stored in some machine-readable format accessed by a computer. There are a wide variety of databases, from simple tables stored in a single file to very large databases with many millions of records, stored in rooms full of disk drives or other peripheral electronic storage devices.

Databases resembling modern versions were first developed in the 1960s. A pioneer in the field was Charles Bachman.

The most useful way of classifying databases is by the programming model associated with the database. Several models have been in wide use for some time. Historically, the hierarchical model was implemented first, then the network model, then the relational model overcame with the so-called flat model accompanying it for low-end usage…

Page 8: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

8

Example: To-Do List

Buy school supplies due 9/7 Go to orientation on 9/7 Exercise every M/W/F Buy Philly postcards

How does this differ from the plain text model? What might you do with it that you couldn’t?

Page 9: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

9

Example: Your PDA/Cell Phone

Event Day When Who WhereLunch 10/24 1pm Zack Cavanaugh’sAdvice10/25 9am Dr. Smith 599 LevineBiking 10/26 9am Jane PottruckDinner 10/26 6PM Jane Food Court

Calendar

Who Phone Email OfficeZack 6-2789 zives 576 Levine NDr. Smith 6-1234 drsmith 599 LevineJane 543-2198 jane 2220 Walnut St.

Contacts

Page 10: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

10

What If We Want to Include Contact Info on Our Calendar?

Do we also want to keep e-mail addresses, telephone numbers etc.?

Should we expand the number of columns in our table:

Event When Who-name Who-email Who-tel …. Where

Lunch 1pm Zack zives 6-2789…. Cav……

What is the trade-off in terms of entering data?

Page 11: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

11

“Link” Calendar with Contacts?

Why can’t we “link” calendar entries with contact info, and show the results of the two? The link could be based on something as simple as

the person's name (What’s the danger here? What else might work

better?)

This brings up an issue – how to “follow links” If we were to do this in Java, how might it be done?

Page 12: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

12

Another Kind of Link: Classes and Subclasses

Person has attributes: ssn PennID set of user IDs given name family name …

Student IS A person who: takes courses is given grades is taught listens to lectures in class, OR over the Web, OR on videotape

This is yet another kind of information How have you previously seen such relationships encoded?

Page 13: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

13

Data Representation and Modeling

All of the data we’ve seen have an implicit data model

The data model includes some basic assumptions about what’s an “item” of data, how to interpret it, and so on

The relational data model was the first model for data that is independent of its data structures and implementation A theory of normalization guides you in designing

relations Concepts from the relational data model have been

adapted to form object-oriented data models (with classes and subclasses), XML models, etc.

There are “sibling” fields to databases that consider: natural language models (how to understand words) document models (how to match words and documents) ontologies (how to define relationships between classes)

Page 14: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

14

The DBMS Provides an Interfaceover the Database

A database (DB) is a large, integrated collection of data Generally is cohesive in “some” way

A DB models a real-world organization or unit A database management system (DBMS) is a

software package designed to store and manage databases Reliable storage & recovery of 100s of GB Querying/updating interface and API (for applications and

Web pages) Support for many concurrent users

Why do we need a DBMS, instead of coding in Java?

Page 15: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

15

DBMS Benefit #1: Generality and Declarativity

Don’t require the programmer or user to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.

Instead, the programmer/user programs with a logical model in mind

The DBMS “makes it happen” based on an understanding of relative costs of different methods

Page 16: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

16

Benefit #2: Efficiency and Scale

Size of personal address book is probably less than 100 entries, but there are things we'd like to do quickly and efficiently: “Give me all appointments on 10/28” “When am I next meeting Jim?”

“Program” these as quickly as possible (and make them resilient to data format changes)

Scale to a corporate calendar with hundreds of thousands of entries

Page 17: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

17

Benefit #3: Management of Concurrency and Reliability

Suppose other people are allowed access to your calendar and are allowed to modify it? How do we stop two people changing the file at the same time and leaving it in a physical (or logical) mess?

Suppose the system crashes while we are changing the calendar. How do we recover our work?

This requires a basic concept…

Page 18: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

18

Transactions

Key concept for concurrency is that of a transaction : an atomic sequence of database actions (read/write) on data items (e.g. calendar entry).

Key concept for recoverability is that of a log: keeping track of all actions carried out by the db.

Page 19: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

19

The Layers of the DBMS

API/GUI

Optimizer

Storage Mgr

Exec. Engine

Storage

Catalog

Query

Physical plan

Pages

RequestsData

Pages

Stats

Schemas

(Simplification!)

Buffer Mgr

Index/file/rec MgrData/etc Requests

RequestsData/etc

Logging, recovery

Red = logicalBlue = physical

Page 20: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

20

The Database Abstraction Provided by the DBMS

We think of databases at two levels: Logical structure:

What users/programmers see – program or query interface

Physical structure: Organization on disk, indices, etc.

The logical level is further split into: Overall database design (conceptual; seen by

the DB designer) Views that various users get to see

Page 21: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

21

The Three-level Architecture forDatabases

View 1 View 2 … View N

Physical Level(file organization, indexing)

Schema Logical,Conceptual Level

Page 22: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

22

Data Independence

A user of a relational database system should be able to use the database without knowing about how the precisely how data is stored, e.g.

After all, you don't worry IEEE floating-point when you do division in a Java program or with a calculator

SELECT When, WhereFROM CalendarWHERE Who = “Jane"

Page 23: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

23

More on Data Independence

Logical data independence Protects the user from changes in the logical structure of the data:

could reorganize the calendar “schema” without changing how we query it

Physical data independenceProtects the user from changes in the physical structure of data:

could add an index on who (or sort by when) without changing how the user would write the query, but the query would execute faster (query optimization)

Page 24: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

24

Presentation Layer (4th Tier): Data-Driven Web Sites

“Data driven web sites” also add an HTML “presentation” layer on top of what we’ve seen

Or they use XML plus “style sheets” to get the same effect

view

HTML

Processing

Page 25: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

25

An Issue: 80% of the World’s Data is Not in a DB!

Examples: scientific data

(large images, complex programs that analyze the data) personal data WWW and email

(some of it is stored in something resembling a DBMS)

Data management is expanding to tackle these problems

Flexibility – data management imposes many constraints to make problems solvable

Must deal with entities outside our control

In this course, we’ll start by focusing on databases, but eventually look “outside the box” at the Web and at gluing together data from many places

Page 26: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

26

Combining Databases with Mediators(a kind of middleware)

A layer above the three-tiered architecture, to combine multiple databases/sources on the Web Some of these are databases over which we have no control Some must be accessed in special ways We generally need to think about how to translate between

different database formats

XML

“Mediated Schema”

Page 27: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

27

How Does One Build a Database?

Start with a conceptual model Design & implement schema Write applications using DBMS and other tools

Many ways of doing this where the hard problems are taken care of by other people (DBMS, API writers, library authors, web server, etc.)

Common applications include PHP/JSP/servlet-driven web sites

The DBMS takes care of query optimization and execution

Page 28: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

28

Conceptual Design

STUDENT COURSETakes

namesid cid name

PROFESSOR

Teaches

semester

fid name

Page 29: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

29

Designing a Schema (Set of Relations)

Convert to tables +constraints

Then need to do “physical” design: the layout on disk, indices, etc.

sid name

1 Jill

2 Bo

3 Maya

fid name

1 Ives

2 Saul

8 Roth

sid cid

1 550-0103

1 700-1003

3 500-0103

cid name sem

550-0103 DB F03

700-1003 AI S03

501-0103 Arch F03

fid cid

1 550-0103

2 700-1003

8 501-0103

STUDENT Takes COURSE

PROFESSOR Teaches

Page 30: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

30

Applications Use Queries in SQL

<html><body> <!-- hypotheticalEmbeddedSQL: SELECT *

FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid --></body></html>

Structured Query Language Based on restricted first-order logic expressions over relations Not procedural – defines constraints on the output

Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine

Page 31: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

31

Processing the Query

SELECT * FROM STUDENT, Takes, COURSE

WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid

STUDENT

Takes COURSE

Merge

Hash

by cid by cidOptimizer

ExecutionEngine

StorageSubsystem

Web Server / UI / etc

Page 32: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

32

DBMSs in the Real World

A huge industry for 20% of the world’s data! Big, mature relational databases

IBM, Oracle, Microsoft “Middleware” above these

SAP, PeopleSoft, dozens of special-purpose apps “Application servers” Integration and warehousing systems Current trends:

Web services; XML everywhere Smarter, self-tuning systems

Page 33: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

33

So What about Database Research?

Not focusing on the problems of Oracle… Understanding what’s possible to do with XML

Better query processing Better languages for meta-info (e.g., constraints)

Data streams Peer-to-peer architectures Integrating data from different formats Lots of theory and systems-building

You’ll see familiar concepts in this course from operating systems and from complexity theory/logic

… And from programming languages, AI planning, …

Page 34: Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan.

34

In this Course...

Study relational databases, their design, how to query, what forms of indices to use.

Beyond relational algebra: a logical model of data (Datalog), recursion

XML and semi-structured data models Understanding DB internals

How DBs are built Performance implications

Integrating and mediating between databases (a huge problem today)