File and Databases CS208. File Organization The three principal file organizations are Sequential Direct Indexed Sequential.

Post on 21-Dec-2015


File and Databases

CS208

File Organization

The three principal file organizations are:
- Sequential
- Direct
- Indexed Sequential

Sequential File Organization

Records are physically stored one after another, in an order determined by a key field.

Advantages:
- Very efficient when many or all records in a file need to be accessed
- Cost of tape and drives is very low

Disadvantages:
- Major drawback is very slow access time for a particular record
- Inserting a record means rewriting all the records that follow it

Direct File Organization

Records are stored at a specific address, determined by their key field.

A mathematical technique called hashing converts the key field value into a corresponding address.

Advantages:
- A record can be quickly accessed by going directly to its address

Disadvantages:
- Must be done using random-access storage (disk/optical), which has a higher cost than sequential storage
- Can only use one key
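The hashing idea above can be sketched in a few lines. This is only an illustration: the slot count, the division-remainder hash, and the names `address_of`, `store`, and `fetch` are assumptions made for the example, not something the slides specify, and collisions are simply reported rather than resolved.

```python
# Hypothetical direct file: a fixed number of slots, addressed by hashing the key.
NUM_SLOTS = 7  # assumed file size, chosen only for illustration

slots = [None] * NUM_SLOTS


def address_of(key: int) -> int:
    """Convert a numeric key into a slot address (division-remainder hashing)."""
    return key % NUM_SLOTS


def store(key: int, record: str) -> None:
    """Place the record directly at the address computed from its key."""
    addr = address_of(key)
    if slots[addr] is not None and slots[addr][0] != key:
        # Two keys hashed to the same address; a real system needs a strategy here.
        raise ValueError(f"collision at address {addr}")
    slots[addr] = (key, record)


def fetch(key: int):
    """Go straight to the computed address; no scan of other records is needed."""
    entry = slots[address_of(key)]
    return entry[1] if entry is not None and entry[0] == key else None


store(1001, "Doe")
store(1003, "Smith")
```

Note that `fetch` can only find a record by the one key that was hashed, which is the "only one key" limitation listed above.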

Indexed Sequential File Organization

- Compromise between direct and sequential methods
- Records stored sequentially
- Index created that records the address of each record

Advantages:
- Good compromise between previous two methods
- Can have multiple index tables to use multiple keys

Disadvantages:
- Slower than direct-access
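A rough sketch of the indexed sequential idea, assuming an in-memory list stands in for the sequential file and list positions stand in for addresses. The sample records and the names `key_index`, `name_index`, and `lookup` are made up for illustration.

```python
# Records are kept sequentially, in key order (position in the list = "address").
records = [
    (101, "Doe"),
    (205, "Smith"),
    (309, "Johnson"),
]

# Primary index: key -> address of the record.
key_index = {key: pos for pos, (key, _name) in enumerate(records)}

# A second index table allows lookup by a different key (here, the name).
name_index = {name: pos for pos, (_key, name) in enumerate(records)}


def lookup(key):
    """Consult the index first, then jump to the record's address."""
    pos = key_index.get(key)
    return None if pos is None else records[pos]


def scan():
    """Sequential processing is still possible by reading the records in order."""
    return [key for key, _name in records]
```

The extra index lookup before each access is the reason this organization is slower than direct access, while `scan` shows that sequential processing of the whole file is still available.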

What is a “database”?

A “database” is just a collection of related data.

Databases can exist in many forms. Examples:
- Sheets of paper in folders in a file cabinet
- A book (think of it as a collection of sentences and illustrations)
- Books in a collection (e.g., a library)
- Sets of 3"x5" cards containing notes
- Maps and other geographic information systems
- Blood samples in a medical laboratory

Common Elements

Sets of data and information composed of, and/or represented by:
- bits
- alphanumeric symbols
- lines and shapes in drawings, pictures, and maps
- audio and video recordings
- actual substances

A means by which the sets of data and information are organized in order to facilitate access to individual desired sets

Examples of Access Methods

Phone book - Collection of several independent databases, each consisting of names and corresponding phone numbers

Blue-pages governmental listings:
- primary arrangement alphabetical by type of government (city, county, state, federal),
- secondary arrangement alphabetical by agency within type of government,
- tertiary arrangement alphabetical by office within agency

Examples of Access Methods (continued)

White-pages personal listings:
- arranged alphabetically by surname
- within surname by first names

Yellow-pages listings:
- primary arrangement by type of business,
- secondary arrangement alphabetically by company within type of business,
- plus various special groupings (e.g., restaurants by ethnic type)

Flat File Databases (DBs)

Flat-file DBs are like the DBs you can construct in a single spreadsheet page.

All the information in the DB is in one file consisting of one array of rows and columns.

For example:

SSN          Surname  First Name(s)  Telephone Number
123-45-6789  Doe      John X.        303-555-1234
987-65-4321  Smith    Martin         720-111-2222
567-89-0123  Johnson  Billy Bob      303-444-5555

Flat File Advantages/Disadvantages

Advantages:
- Simple
- Good for few records with few fields

Disadvantages:
- Unnecessary duplication of data or data redundancy
- Inconsistent, incomplete or inaccurate data, lacking data integrity
- Changes in data are difficult to implement
- Separate and isolated data with limited data sharing

Data Hierarchy in Computerized Databases

Character - single letter, number or special character

Field - a set of related characters

Record - collection of related fields

File - collection of related records

Database - collection of related files

Types of Databases

Individual - often on a PC used by one person

Company or shared - usually on a mainframe and managed by a database administrator. Example: common operational databases contain information about company operations

Distributed databases - have data stored in multiple locations, but the data is accessible through communications networks

Proprietary databases - created by an organization, and stored information is offered to others for a fee. Examples include Dialog Information Services and Dow Jones Interactive Publishing

Database Terminology

Database Management System (DBMS) – Allows a user to deal with data in logical terms, without having to understand the computer's physical view.

Logical data view – How humans see things

Physical data view – How things are stored in a computer

DBMS – provides storage, retrieval, analysis, sorting, and printing of information in a database.

DataBase Management System (DBMS)

DBMS is:

• A collection of program-independent, interrelated data

• A set of programs to access the data

• Information about a particular enterprise

• An environment that is both convenient and efficient to use.

Current DBMS Systems

Mainframe database vendors:
- Oracle
- IBM DB2
- Microsoft (SQL Server)
- Sybase
- Informix

Desktop:
- MS Access
- Borland Paradox

Some free database systems (Unix):
- Postgres
- MySQL, mSQL
- Predator

DBMS Uses

The processing power of a DBMS allows it to:
- Sort
- Match
- Link
- Aggregate
- Skip fields
- Calculate
- Arrange

Database Administrator (DBA)

Coordinates all the activities of the database system.

Must have a good understanding of the enterprise’s information resources and needs.

Database Administrator (DBA) (continued)

Database administrator's duties include:
- Database definition
- Database modification
- Granting user authority to access the database for security and privacy
- Specifying integrity constraints
- Acting as liaison with users
- Monitoring DB performance and responding to changes in requirements
- Deciding strategy for backup and recovery

Types of DBMS Users

End-User:
- Non-specialist accessing data via a query language
- Naïve user accessing data via a special-purpose interface
- Performs data retrieval and update (extend/modify)

Applications Programmer:
- Writes programs that use the DB by embedding queries to the DB in a high-level language (HLL)
- Develops interfaces for the naïve user
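As a small illustration of embedding a query in a high-level language, the sketch below uses Python's built-in sqlite3 module with a throwaway in-memory database. The `Student` table layout, the sample rows, and the function `honor_roll` are hypothetical, chosen only to show an applications programmer wrapping SQL in an interface that a naïve user could call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE Student (last_name TEXT, first_name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [("Doe", "John", 3.7), ("Smith", "Martin", 3.2)],
)


def honor_roll(min_gpa):
    """Special-purpose interface: the caller never sees the embedded SQL."""
    rows = conn.execute(
        "SELECT last_name FROM Student WHERE gpa > ? ORDER BY last_name",
        (min_gpa,),  # the value is passed as a parameter, not pasted into the SQL
    )
    return [last_name for (last_name,) in rows]
```

A caller would simply write `honor_roll(3.5)` and get back a list of names.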

DBMS Organization

The four principal DBMS organizations are:
- Hierarchical
- Network
- Relational
- Object-Oriented

Hierarchical Databases

- Viewed as branches of an upside-down tree
- Each item is subordinate to its parent item
- Only one parent per item
- Any element (node) in the database is linked only to the elements directly above it and directly below it.
- If a parent node is deleted, all of its child nodes are deleted as well
- A new parent node must be created before adding a new child node
- No direct relationships between child nodes
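These rules can be illustrated with a tiny tree structure. This is a sketch under assumed names (`Node`, `find`, `delete`, and the State/City/Branch/Account levels are all hypothetical), not a model of any particular hierarchical DBMS.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # exactly one parent (None only for the root)
        self.children = []
        if parent is not None:
            # The parent must already exist before a child can be added.
            parent.children.append(self)


def find(node, name):
    """Every search starts at the top and walks downward through the tree."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def delete(node):
    """Detach a non-root node; everything beneath it goes with it."""
    node.parent.children.remove(node)


state = Node("State")
city = Node("City", parent=state)
branch = Node("Branch", parent=city)
account = Node("Account", parent=branch)
```

After `delete(city)`, a search from `state` can no longer reach `branch` or `account`, which mirrors the rule that deleting a parent removes its children.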

Hierarchical Databases (continued)

- Limited by rigid structure
- Typically require custom programming
- Example: Original computer-based databases were designed for banking. Hierarchical databases were appropriate for such purposes, e.g.:
  - individual accounts can be grouped by family or business;
  - sets of accounts, grouped by branch;
  - accounts in different branches, grouped by city;
  - accounts in different cities, grouped by state.

Hierarchical Model

- Must always begin at top to search for data
- One parent per child, no other relationships allowed

[Figure: tree diagram of four record types: Department (DeptNum, Name), Courses (CourseNum, Name, Section), Students (IDNum, Name, Course), Professors (IDNum, Name)]

Network Databases

Permits links among all components (i.e. elements can be linked to other elements anywhere in the database, not just those directly above and below)

The interconnected design allows for access via multiple pathways

Can be extremely difficult to manage

The World-Wide Web is a very large example of a network database.

Network Model

- Added data paths across the tree (instead of just up and down)
- Reduced time required to access data, but increased overhead space requirements

[Figure: the same four record types with additional links between them: Department (DeptNum, Name), Courses (CourseNum, Name, Section), Students (IDNum, Name, Course), Professors (IDNum, Name)]

Relational Databases

A relational database is a set of one or more tables that together embody information about a set of related concepts and entities.

The tables are connected (related) via fields within the table that are shared by a pair of tables.

Relational Model

Based on tables of objects (the data), rather than specific paths (ways to access the data)

[Figure: four tables, shown with their column headings and empty rows]

Department: DeptNum, Name

Professors: IDNum, Name, DeptNum, Course

Courses: Num, ProfID, Section

Students: IDNum, Name, Course

Relational Database Rules

- Each row is unique (distinct)
- Each column name is unique within a table
- It is permissible to have the same name for a column in two different tables in the same database.
- To distinguish between them we use a qualified name: Pet.Name vs Family.Name

Relational Databases

In a Relational DB, the information content of a table does not depend on either
- The order of the rows; or
- The order of the columns

In other words, the rows and columns of a table can be rearranged at will without affecting the table's information content

Relational DB Terminology

- File (Table) = Relation
- Rows (Tuples) = Records
- Columns (Attributes) = Fields

Relationships between objects are defined by common attributes

Courses

Num    Name             Credits
CS208  CS Fundamentals  3
MT360  Calculus         4
CS320  C Programming    3

Primary Keys

In a Relational DB, each table
- Must have a primary key (unique identifier)
- Must have no duplicate rows

A primary key is
- A data attribute (column), or a combination of attributes, that uniquely identifies each record in the table.
- A simple key consists of a single attribute
- A composite key consists of two or more attributes

Primary Keys (continued)

Primary Key
- Provides a unique way to identify each record
- Can be obvious from the structure of the table.
- If there is no easy natural choice, you can add a column containing a unique identifier.
- May consist of the entire record (especially with two-column tables, which occur often in the development of RDBs)
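A minimal sketch of simple and composite primary keys, using Python's built-in sqlite3 module. The table layouts, the sample rows, and the helper `try_insert` are assumptions made for illustration; other DBMSs declare keys with similar but not identical syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simple key: a single attribute identifies each record.
conn.execute("CREATE TABLE Student (student_id INTEGER PRIMARY KEY, last_name TEXT)")

# Composite key: two attributes together identify each record.
conn.execute(
    """CREATE TABLE Phone (
           student_id   INTEGER,
           phone_type   TEXT,
           phone_number TEXT,
           PRIMARY KEY (student_id, phone_type))"""
)

conn.execute("INSERT INTO Student VALUES (1, 'Doe')")
conn.execute("INSERT INTO Phone VALUES (1, 'Home', '303-555-1234')")
conn.execute("INSERT INTO Phone VALUES (1, 'Cell', '720-111-2222')")


def try_insert(sql):
    """Return True if the row was accepted, False if a key rule rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Both attempts repeat an existing key value, so the DBMS refuses them.
duplicate_simple = try_insert("INSERT INTO Student VALUES (1, 'Smith')")
duplicate_composite = try_insert("INSERT INTO Phone VALUES (1, 'Home', '303-444-5555')")
```

The same student can have several phone rows because only the pair (student_id, phone_type) has to be unique.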

Other Types of Keys

Secondary Key
- A column that is used to aid in the retrieval of information from a table.
- A secondary key is not required to have unique values in each of its rows.

Foreign Key
- A column used to retrieve information from one table (i.e., is a secondary key) that is also the primary key in another table.
- Foreign keys are a major tool in Relational DBs.
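A foreign key can be sketched with sqlite3 as follows. The table and column names and the course code 'XX999' are illustrative assumptions. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, which is a SQLite-specific detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.execute("CREATE TABLE Course (course_id TEXT PRIMARY KEY, course_name TEXT)")
conn.execute(
    """CREATE TABLE StudentCourse (
           student_id INTEGER,
           course_id  TEXT REFERENCES Course(course_id))"""  # foreign key
)

conn.execute("INSERT INTO Course VALUES ('CS208', 'CS Fundamentals')")
conn.execute("INSERT INTO StudentCourse VALUES (1, 'CS208')")  # matches a Course row
conn.execute("INSERT INTO StudentCourse VALUES (2, 'CS208')")  # values may repeat

try:
    # No Course row has this primary key, so the reference is rejected.
    conn.execute("INSERT INTO StudentCourse VALUES (1, 'XX999')")
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False
```

In StudentCourse, `course_id` behaves like a secondary key (its values can repeat), while in Course it is the primary key.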

Relational Database Design

Design Goals:

Avoid redundant data

Ensure that relationships among attributes are represented

Facilitate checking updates for violation of database integrity constraints

Creating a Student Database

- Student ID
- First Name
- Middle Initial
- Last Name
- Home Address
- School Address
- Street Address
- City
- State
- Zip
- Home Phone
- Work Phone
- Cell Phone
- Course ID
- Course Name
- Course Instructor

Database Normalization

Optimize the tables by:

Storing each piece of data once and only once (i.e. Eliminate redundant data).

Ensuring data dependencies make sense (only storing related data within a particular table).

Maintaining data integrity.

First Normal Form

Eliminate repeating groups in individual tables.

Create a separate table for each set of related data.

Identify each set of related data with a primary key.

1st Normal Student Database

Student: Student ID, First Name, Middle Initial, Last Name

Address: Student ID, Address Type, Street Address, City, State, Zip

Phone: Student ID, Phone Type, Phone Number

StudentCourse: Student ID, Course ID, Course Name, Course Instructor

Second Normal Form

Create separate tables for sets of values that apply to multiple records

i.e. Remove partial data dependencies.

Relate these tables using a foreign key.

2nd Normal Student Database

Student: Student ID, First Name, Middle Initial, Last Name

Address: Student ID, Address Type, Street Address, Zip

Phone: Student ID, Phone Type, Phone Number

StudentCourse: Student ID, Course ID, Course Name, Course Instructor

ZipCode: Zip, City, State

Third Normal Form

Eliminate any fields that do not depend on the key.

3rd Normal Student Database

Student: Student ID, First Name, Middle Initial, Last Name

Address: Student ID, Address Type, Street Address, Zip

Phone: Student ID, Phone Type, Phone Number

StudentCourse: Student ID, Course ID

ZipCode: Zip, City, State

Course: Course ID, Course Name, Course Instructor
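To see what the final split buys, the sketch below builds three of the tables in sqlite3 and changes a course's instructor. The column spellings, the sample students, and the instructor names 'Lee' and 'Patel' are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Student (student_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE Course (course_id TEXT PRIMARY KEY,
                         course_name TEXT, course_instructor TEXT);
    CREATE TABLE StudentCourse (student_id INTEGER, course_id TEXT,
                                PRIMARY KEY (student_id, course_id));

    INSERT INTO Student VALUES (1, 'Doe'), (2, 'Smith');
    INSERT INTO Course VALUES ('CS208', 'CS Fundamentals', 'Lee');
    INSERT INTO StudentCourse VALUES (1, 'CS208'), (2, 'CS208');
    """
)

# The instructor is stored once, so one UPDATE fixes it for every enrolled student.
conn.execute("UPDATE Course SET course_instructor = 'Patel' WHERE course_id = 'CS208'")

# Joining the tables reassembles the combined view whenever it is needed.
enrollment = conn.execute(
    """SELECT Student.last_name, Course.course_name, Course.course_instructor
       FROM Student
       JOIN StudentCourse ON Student.student_id = StudentCourse.student_id
       JOIN Course ON StudentCourse.course_id = Course.course_id
       ORDER BY Student.last_name"""
).fetchall()
```

With the earlier StudentCourse layout, where Course Name and Course Instructor sat in every enrollment row, the same change would have had to be repeated for each student.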

Relational Design Summary

Data is stored in records, inside of tables

Primary keys uniquely identify a record

Foreign keys link data in one table to the primary key in another table

Designs should maintain data integrity

Normalization concepts should be used

Dis/Advantages of Relational Databases

Disadvantages:
- Require more overhead

Advantages:
- Cut down on needless repetition of information
- Ensure more accuracy
- Facilitate updating and deletion of information.
- Design avoids errors that occur when adding/deleting information from flat files

Query Languages

Query languages allow non-programmers to question the DBMS

Structured Query Language (SQL) – the only standard structured query language

Structured Query Language (SQL)

Pronounced either "S, Q, L" or "sequel"

Widely used standard set of commands and syntax for doing things with Relational DBMSs

Used especially for query and retrieval

Includes commands for defining Relational DBs, conducting transactions, storing data, etc.

Each Relational DBMS also has additional features unique to it, because SQL does not handle all the practical details involved in using a Relational DB

Select

The select clause directs the DBMS to choose a subset of fields from one or more tables.

For example,

SELECT last_name, first_name

chooses all last names and first names and will place them in this order in the result set.

From clause

The from clause directs the query toward one or more tables.

from Student (directs the query to use the student table)

Used with the SELECT clause:

SELECT last_name, first_name FROM Student

chooses the last and first name of all rows in the Student table.

Note: As long as the column names within the tables used in the from clause are unique, qualified names are not required.

Where clause

A where clause specifies a selection criterion that we will use to limit our choice of records.

A where clause follows the name of the table.

Format:

SELECT field1,…, fieldN

FROM tablename

WHERE (boolean expression)

Boolean Expressions

Boolean expressions used in where clauses typically involve a field compared to another field or a field compared to a value.

For example:

WHERE GPA > 3.50

WHERE last_name > 'M'

WHERE amount_owed >= amount_paid
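The three conditions above can be tried against a small table. This sketch assumes a `Student` table holding all four fields and three made-up rows, and the helper `names` exists only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (last_name TEXT, GPA REAL, amount_owed REAL, amount_paid REAL)"
)
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?)",
    [
        ("Doe", 3.8, 500.0, 500.0),
        ("Smith", 3.2, 900.0, 400.0),
        ("Johnson", 3.6, 100.0, 250.0),
    ],
)


def names(where):
    """Run a query with the given WHERE condition and return the matching last names.

    The condition is pasted into the SQL text, which is acceptable here only
    because the conditions are fixed strings written in this example.
    """
    sql = "SELECT last_name FROM Student WHERE " + where + " ORDER BY last_name"
    return [last_name for (last_name,) in conn.execute(sql)]


high_gpa = names("GPA > 3.50")               # field compared to a number
after_m = names("last_name > 'M'")           # field compared to a string
owes = names("amount_owed >= amount_paid")   # field compared to another field
```

String comparison is alphabetical, so `last_name > 'M'` keeps Smith and drops Doe and Johnson.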

Comparison Operators

= is equal to

> is greater than

< is less than

>= is greater than or equal to

<= is less than or equal to

<> is not equal to

Simple SQL Query

Textbooks

Title       Price   Category   Publisher
Java Intro  109.99  Computers  Prentice Hall
Calculus    129.99  Math       Kaufman
Advanced C  115.99  Computers  Tech Inc
Philosophy  83.99   Lib Arts   Prentice Hall

SELECT *
FROM Textbooks
WHERE category = 'Computers'

Result ("selection"):

Title       Price   Category   Publisher
Java Intro  109.99  Computers  Prentice Hall
Advanced C  115.99  Computers  Tech Inc

Simple SQL Query

Textbooks

Title       Price   Category   Publisher
Java Intro  109.99  Computers  Prentice Hall
Calculus    129.99  Math       Kaufman
Advanced C  115.99  Computers  Tech Inc
Philosophy  83.99   Lib Arts   Prentice Hall

SELECT Title, Price, Publisher
FROM Textbooks
WHERE Price < 110

Result ("selection" and "projection"):

Title       Price   Publisher
Java Intro  109.99  Prentice Hall
Philosophy  83.99   Prentice Hall
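Both Textbooks queries can be reproduced with Python's built-in sqlite3 module. The sketch loads the four sample rows from the slides into an in-memory table. The column types and the `ORDER BY` clauses are additions made here so that the row order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Textbooks (Title TEXT, Price REAL, Category TEXT, Publisher TEXT)"
)
conn.executemany(
    "INSERT INTO Textbooks VALUES (?, ?, ?, ?)",
    [
        ("Java Intro", 109.99, "Computers", "Prentice Hall"),
        ("Calculus", 129.99, "Math", "Kaufman"),
        ("Advanced C", 115.99, "Computers", "Tech Inc"),
        ("Philosophy", 83.99, "Lib Arts", "Prentice Hall"),
    ],
)

# Selection: keep whole rows that satisfy the condition.
selection = conn.execute(
    "SELECT * FROM Textbooks WHERE Category = 'Computers' ORDER BY Title"
).fetchall()

# Selection plus projection: filter the rows, then keep only some columns.
projection = conn.execute(
    "SELECT Title, Price, Publisher FROM Textbooks WHERE Price < 110 ORDER BY Title"
).fetchall()
```

Here `selection` holds the two Computers rows with all four columns, and `projection` holds the Java Intro and Philosophy rows with the Category column dropped.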

Joins

You can use more than one table in a query.

The join clause tells the DBMS that you are using two tables.

You must also specify the fields on which you are linking the tables, using the ON clause.

Example

SELECT Textbooks.Title, Price, Required
FROM Textbooks JOIN CourseTexts
ON Textbooks.Title = CourseTexts.Title
WHERE Category = 'Computers'

The dot (.) notation names the table and then the field within it, which resolves the ambiguity when both tables have a field called Title.

Join SQL Query

Textbooks

Title       Price   Category   Publisher
Java Intro  109.99  Computers  Prentice Hall
Calculus    129.99  Math       Kaufman
Adv C       115.99  Computers  Tech Inc
Philosophy  83.99   Lib Arts   Prentice Hall

CourseTexts

Title       Course  Required
Java Intro  CS434   Yes
Adv C       CS422   No

SELECT Textbooks.Title, Price, Required
FROM Textbooks JOIN CourseTexts
ON Textbooks.Title = CourseTexts.Title
WHERE Category = 'Computers'

Result:

Title       Price   Required
Java Intro  109.99  Yes
Adv C       115.99  No
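The join can be checked in the same way. This sketch loads the two sample tables from the slide into SQLite and writes the query in the explicit `JOIN ... ON` form, which is the standard way to pair `ON` with a join. The column types and the `ORDER BY` are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Textbooks (Title TEXT, Price REAL, Category TEXT, Publisher TEXT)"
)
conn.execute("CREATE TABLE CourseTexts (Title TEXT, Course TEXT, Required TEXT)")
conn.executemany(
    "INSERT INTO Textbooks VALUES (?, ?, ?, ?)",
    [
        ("Java Intro", 109.99, "Computers", "Prentice Hall"),
        ("Calculus", 129.99, "Math", "Kaufman"),
        ("Adv C", 115.99, "Computers", "Tech Inc"),
        ("Philosophy", 83.99, "Lib Arts", "Prentice Hall"),
    ],
)
conn.executemany(
    "INSERT INTO CourseTexts VALUES (?, ?, ?)",
    [("Java Intro", "CS434", "Yes"), ("Adv C", "CS422", "No")],
)

# Title exists in both tables, so it must be qualified; Price and Required do not.
joined = conn.execute(
    """SELECT Textbooks.Title, Price, Required
       FROM Textbooks JOIN CourseTexts
       ON Textbooks.Title = CourseTexts.Title
       WHERE Category = 'Computers'
       ORDER BY Textbooks.Title"""
).fetchall()
```

Only titles that appear in both tables survive the join, so Calculus and Philosophy would be absent even without the WHERE condition.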

Database Concerns

Privacy/Unauthorized Access to Data

Data is easier to gather and exploit using a computer, so precautions must be taken to guard the data.

Accuracy/Completeness

Owner of database must ensure accuracy of the data.

Users must take data with a grain of salt and verify it.
