djmoon/db/db-notes/db-intro.pdf · 2020-01-20

Intro: Databases

• Database (DB):

– Collection of related data

• Has following characteristics:

– Logically coherent collection of data with inherent meaning

– Designed for specific purpose

– Represents some aspect of the real world: the miniworld or universe of discourse

• DB is not a random collection of facts

• Concept of DB independent of Database Management System (DBMS)

• Prior to DBMS concept, DBs maintained as flat file (traditional) systems

Intro: Flat File Systems

• Flat file system:

– One or more data files accessed via dedicated programs

• Typical large organization (e.g., university) has many departments, each with specific needs

– Each department has own set of data

– Each department has own set of apps for processing data

– Data stored in one or more data files accessed by app programs which define

∗ Record/field structure (object/attribute structure)

– Apps written in high-level language

– Generally, significant overlap in data stored in various departments

• This approach leads to many problems:

1. Data redundancy

– Data stored multiple places

– Ramifications:

(a) Wasted resources

∗ Wasted disk space

∗ Wasted effort

∗ Both result in wasted money

(b) Inconsistent data (as a result of updates)

∗ Doubtful if a datum that occurs in multiple files will be updated simultaneously

· Files will be inconsistent for a time interval

∗ Clerical errors may result in permanent inconsistencies

2. Concurrency control

– Have many independent programs with uncontrolled access to same data

Intro: Flat File Systems (2)

3. Interdependence of data and programs

– Data structure of DB embedded in app programs

– Based on structures defined using data types inherent in host language

– Ramifications:

(a) Difficulty in modifying DB/apps

∗ If modify record structure or field data type, all apps must be modified accordingly

(b) Difficulty sharing data (among departments)

∗ Most likely incompatible record formats in different locations, so cannot share files easily

(c) Difficulty extracting new info

∗ If want to extract info not anticipated at design time, either

i. Extract manually using results from existing software

ii. Create new app

– The interdependence tends to promote a proliferation of apps

∗ This tends to result in much ad hoc code

∗ This in turn tends to result in problems related to

(a) Data integrity

· Checks that data meet certain constraints

(b) Security

· Limiting users to authorized data

(c) Concurrency control

· Limiting number of users accessing a given piece of data at any one time

• These problems motivated the development of DBMSs
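The interdependence of data and programs can be illustrated with a minimal sketch. The record layouts, field names, and sample values are hypothetical. The point is only that the layout lives in each app's code, so a second app that assumes a different layout silently misreads the same bytes.

```python
import struct

# Hypothetical record layout, hard-coded in the "registrar" app:
# 4-byte student id, 20-byte name, 4-byte float GPA.
REGISTRAR_FORMAT = "<i20sf"

# A second department's app assumes a layout with a 16-byte name.
HOUSING_FORMAT = "<i16sf"


def write_record(student_id, name, gpa):
    """Pack one record exactly as the registrar app would store it."""
    return struct.pack(REGISTRAR_FORMAT, student_id, name.encode("ascii"), gpa)


def read_record(fmt, raw):
    """Unpack a record using whatever layout the calling app assumes."""
    size = struct.calcsize(fmt)
    student_id, name, gpa = struct.unpack(fmt, raw[:size])
    return student_id, name.rstrip(b"\x00").decode("ascii"), gpa


raw = write_record(42, "Ada Lovelace", 3.5)

# The app that owns the layout reads the record back correctly.
registrar_view = read_record(REGISTRAR_FORMAT, raw)

# The other app reads the same bytes with the wrong layout: the id and
# name happen to line up, but the GPA is read from the wrong offset.
housing_view = read_record(HOUSING_FORMAT, raw)
```

No error is raised in the second read; the GPA is simply wrong, which is the kind of sharing problem described above.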

Intro: DBMS Overview

• DBMS:

– General purpose software system that facilitates definition, construction, manipulation, and maintenance of a DB

– Definition:

∗ Specifying objects/entities, logical structure of data, data types, constraints, relationships, ...

– Construction:

∗ Creating the DB - installing data

– Manipulation:

∗ Insertion, deletion, retrieval, update of data

– Maintenance:

∗ Monitoring performance, changing storage structures, changing access paths
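A minimal sketch of the four activities, using Python's built-in sqlite3 module as a stand-in for a DBMS. The table, column names, and values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Definition: declare structure, data types, and constraints.
conn.execute(
    """
    CREATE TABLE student (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        gpa  REAL CHECK (gpa BETWEEN 0.0 AND 4.0)
    )
    """
)

# Construction: install the initial data.
conn.executemany(
    "INSERT INTO student (id, name, gpa) VALUES (?, ?, ?)",
    [(1, "Ada", 3.9), (2, "Grace", 3.7), (3, "Edgar", 3.1)],
)

# Manipulation: update, deletion, and retrieval.
conn.execute("UPDATE student SET gpa = 3.8 WHERE id = 2")
conn.execute("DELETE FROM student WHERE id = 3")
rows = conn.execute("SELECT id, name, gpa FROM student ORDER BY id").fetchall()

# Maintenance: add an access path without touching the data or the queries.
conn.execute("CREATE INDEX idx_student_name ON student (name)")
conn.commit()
```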

• DBMS Characteristics

– General purpose

∗ Independent of any world of discourse or app

∗ Can represent any type of data (within restrictions of DBMS)

– Single, central repository for data

∗ Eliminates redundancy problem (and associated problems, e.g., cost, consistency)

∗ Data logically related

∗ Note: Some duplication required to create associations among data

· Referred to as controlled redundancy

– Data independent of apps

∗ Data accessed by queries that are independent of physical record structure

∗ User deals with conceptual representation of data, not implementational details

∗ Achieved via data abstraction

∗ Implemented in terms of data structure/module called catalog (dictionary/data dictionary/data directory)

Intro: DBMS Overview (2)

– Access controlled by a module called the DB Manager

∗ Manager responsible for

· Concurrency control

· Security

· Backup and recovery

· Integrity constraints

∗ Since all data access controlled by manager, problems associated with above eliminated

– Multiple interfaces (usually provided)

∗ Each presents data in format specific to type of user

∗ Limits access to what is needed by that user

Intro: DBMS Architecture - History

• Conference on Data Systems Languages (CODASYL)

– Purpose was to establish standards for DBs

– 1967 - created Database Task Group (DBTG)

– DBTG charged with generating standards for environment for creation of DB and manipulation of data

– 1969 - DBTG initial report

• Codd

– 1970 - seminal paper proposing relational model

– 1982 - proposed 8 services that should be supported by any full DBMS:

1. Data storage, retrieval, and update

∗ Primary purpose of DBMS

∗ Physical details of data storage should be hidden from user

2. User-accessible catalog

∗ Stores metadata - data about data

∗ Stores info pertaining to all aspects of DB design, usage, and maintenance

3. Transaction support

∗ Transaction: an atomic operation on the DB

· Consists of a read and/or write

∗ Transaction executes in its entirety, or not at all

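∗ Ensures consistency of DB in case of DBMS failure

4. Concurrency control

∗ Ensures DB is updated correctly when multiple users access it concurrently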

5. Recovery

∗ Enables DB to be returned to a consistent state in case of failure

6. Restriction of unauthorized access

7. Support for data communication

∗ Provide for remote access of DB

8. Integrity support

∗ Ensure data is correct and consistent
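The all-or-nothing behavior of a transaction (service 3) can be sketched with Python's built-in sqlite3 module. The account table, the balances, and the transfer helper are hypothetical. The CHECK constraint stands in for an integrity rule (service 8) whose violation aborts the transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts as one all-or-nothing transaction."""
    try:
        # `with conn` commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # both updates are committed together
failed = transfer(conn, 1, 2, 500)  # debit violates CHECK, so the credit is undone too
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
```

After the failed transfer the balances are those left by the first transfer only, since the partial credit was rolled back.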

Intro: DBMS Architecture - History (2)

• DBTG

– 1971 - formal proposal

– DBMS should consist of 3 components:

1. Network schema

∗ Describes logical organization of DB

∗ Includes

(a) DB name

(b) Structure of each record type

(c) Data types of each record field

2. Subschema

∗ Describes DB as seen by users

3. Data management language

∗ Used to define structure of data

∗ Used to manipulate data

∗ Proposed 3 sub-languages:

(a) Schema data definition language (DDL)

· Defines schema

· Used by DBA

(b) Subschema DDL

· Defines parts of DB required by apps

(c) Data manipulation language

· For manipulation of data

· Used by anyone querying DB

– Not adopted by ANSI

Intro: DBMS Architecture - History (3)

• ANSI Standards Planning and Requirements Committee (SPARC)

– 1975 - proposed 3-level architecture with data dictionary

– Based on IBM (Codd) proposals

– Reflected need for independent layer between implementational and application levels

– Purpose of 3-level architecture:

∗ Users should be able to access same data, but with customized view

· Should be able to change one view without affecting others

∗ Physical data storage should be invisible to user

∗ Changes to physical storage should not affect user view

∗ Changes to physical storage should not affect internal structure of DB

∗ Changes to internal structure should not affect user view

• ANSI-SPARC proposal did not become formal standard

– Is basis for modern DBMS architectures

Intro: ANSI-SPARC Architecture

• 3-level ANSI-SPARC Architecture:

1. External level

– Presents subsets (views) of the DB

– Each view customized for a particular user

∗ Limits what is accessible to the user

∗ Display of data may be different for same data in different views

– May contain data not actually stored in DB (derived data)

2. Conceptual level

– "Community" representation of entire DB

– Data represented logically (structure in terms of components, data types, and sizes), independent of physical storage (not in terms of bytes)

– Types of information represented:

∗ Entities, attributes, relations

∗ Constraints

∗ Semantic info

∗ Security and integrity info

Intro: ANSI-SPARC Architecture (2)

3. Internal level

– Physical (implementational) representation of data

∗ Storage allocation

∗ Access mechanisms

∗ Record disk address

∗ Record structure

∗ Data compression and encryption

4. Physical level

– Machine level
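The external and conceptual levels can be sketched with Python's built-in sqlite3 module. Tables play the role of the conceptual level and views play the role of external views. The employee table, the two views, and the fixed reference year 2020 are hypothetical. The internal and physical levels are handled by SQLite itself and do not appear in the code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: the community-wide logical structure.
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
    "birth_year INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ada", 1990, 90000), (2, "Grace", 1985, 120000)],
)

# External level, view 1: a directory that hides salary and exposes a
# derived column (age as of a fixed reference year) that is not stored.
conn.execute(
    "CREATE VIEW directory AS "
    "SELECT name, 2020 - birth_year AS age FROM employee"
)

# External level, view 2: payroll sees salary but not birth year.
conn.execute("CREATE VIEW payroll AS SELECT id, salary FROM employee")

directory_rows = conn.execute("SELECT * FROM directory ORDER BY name").fetchall()
payroll_cols = [d[0] for d in conn.execute("SELECT * FROM payroll").description]
```

Each view limits what its user can reach, and the same underlying data is presented differently in each.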

Intro: Schemas

• Schema:

– Logical structure of data

• Need schema for each level of ANSI-SPARC architecture

– Schema for a level describes data at that level only

– One internal schema that describes

1. Data fields of records (physical)

2. Indices

3. Access methods

4. ...

– One conceptual schema that describes

1. Data fields of records (logical)

2. Relationships among data

3. Constraints

4. ...

– Multiple external schemas (subschemas) that describe

1. Same aspects as subset of conceptual level

• Schema diagram used to represent structure graphically

• Schemas created by DB designers for a particular domain

• Schema is relatively static

• Often referred to as an intension of the DB

• A populated DB is often referred to as a(n)

– Instance

– State

– Extension

– Snapshot
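The distinction between the schema (intension) and a populated DB (state, extension) can be sketched with Python's built-in sqlite3 module. The course table and the two helper functions are hypothetical; sqlite_master is SQLite's own schema table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT NOT NULL)")


def intension(conn):
    """Return the stored table definitions (relatively static)."""
    return conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()


def extension(conn):
    """Return the current state: the rows present right now."""
    return conn.execute("SELECT code, title FROM course ORDER BY code").fetchall()


schema_before = intension(conn)
state_before = extension(conn)   # empty state

conn.execute("INSERT INTO course VALUES ('CS101', 'Intro to Databases')")

schema_after = intension(conn)   # unchanged by the insert
state_after = extension(conn)    # a new state of the same schema
```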

Intro: Data Models

• Data model:

– Integrated collection of concepts for describing data, relationships between data, and constraints on data in an organization (Connolly and Begg)

– Schemas represented in terms of a data model

• Data models are used at each level of the DBMS

• Models do not need to be the same across DBMS levels

• Data models are categorized in terms of their degree of abstraction

1. High-level/conceptual/object-based models

– Highest degree of abstraction

– Describe DB in terms of

(a) Entities

∗ Concepts to be represented

(b) Attributes

∗ Characteristics of the entities

(c) Relationships

∗ Associations among entities

– Example paradigms:

(a) Entity-relationship (ER) model

(b) Object-oriented

(c) Semantic

(d) Functional

Intro: Data Models (2)

2. Representational/Implementational/record-based models

– Describe DB in terms of logical records

– Correspond closely to way data represented physically, while hiding the details

– Applicable to external and conceptual DBMS levels

– Better than OO for representing structure

– Poorer than OO for representing constraints


(a) Relational model

∗ Represent DB as tables

(b) Network (legacy)

∗ Represent DB as collections of records

∗ Represent relationships as sets of records

∗ Graph-based representation

(c) Hierarchical (legacy)

∗ Same representation as Network, but record may only have 1 parent

∗ Tree-based representation

– Relational preferred because the other 2 require knowledge of physical representation

3. Physical models

– Describe DB at physical level

– Example paradigms:


(a) Unifying

(b) Frame memory

Intro: Data Definition Languages

• Once models and schemas have been established for a particular domain, the schemas must be installed in the DBMS

• Schemas defined in terms of data definition languages (DDLs)

• Need a DDL for each model used in the DB

• Potentially, need 3:

1. One for the model at the external level (view DDL)

2. One for the model at the conceptual level (DDL)

3. One for the model at the internal level (storage DDL)

• In practice, usually a single DDL used for all levels

• DDL statements compiled and results stored in catalog
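A minimal sketch of DDL results landing in the catalog, using Python's built-in sqlite3 module. The dept table and index are hypothetical. sqlite_master and PRAGMA table_info are SQLite's own catalog interfaces; other DBMSs expose their catalogs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL statements: the DBMS processes them and records the result in its catalog.
conn.execute(
    "CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.execute("CREATE INDEX idx_dept_name ON dept (name)")

# Catalog lookup 1: which user-defined objects exist, and of what type?
objects = conn.execute(
    "SELECT type, name FROM sqlite_master "
    "WHERE name NOT LIKE 'sqlite_%' ORDER BY type, name"
).fetchall()

# Catalog lookup 2: column names, declared types, and NOT NULL flags.
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [
    (name, col_type, bool(notnull))
    for _cid, name, col_type, notnull, _default, _pk in conn.execute(
        "PRAGMA table_info(dept)"
    )
]
```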

Intro: Data Manipulation Languages

• Queries (requests for data) are posed in terms of data manipulation languages (DMLs)

• Used to add, delete, retrieve, and modify data

• 2 general types:

1. Non-procedural

– High-level

– Declarative - specify what to retrieve, not how

– Retrieve sets of records per query

2. Procedural

– Specify what to retrieve and how to retrieve it

– Retrieve single record per query

• Stand-alone DML called query language

• Fourth generation languages

– Higher-level than set-at-a-time languages

– Non-procedural

– Examples:

∗ Form generators

∗ Report generators

∗ Graphics generators

∗ Application generators
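The contrast between the two DML types can be sketched with Python's built-in sqlite3 module. The student table and the 3.5 threshold are hypothetical. The record-at-a-time loop only imitates a procedural DML in the host language; it is not an actual navigational DML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Ada", 3.9), (2, "Grace", 3.7), (3, "Edgar", 3.1)],
)

# Non-procedural (declarative, set-at-a-time): state what is wanted;
# the DBMS decides how to find it and returns the whole set.
declarative = conn.execute(
    "SELECT name FROM student WHERE gpa >= 3.5 ORDER BY name"
).fetchall()

# Procedural style (record-at-a-time), imitated in the host language:
# the program steps through the records and applies the test itself.
procedural = []
cursor = conn.execute("SELECT name, gpa FROM student")
while True:
    record = cursor.fetchone()  # fetch a single record
    if record is None:
        break
    name, gpa = record
    if gpa >= 3.5:
        procedural.append((name,))
procedural.sort()
```

Both approaches yield the same records; they differ in who specifies how the retrieval is carried out.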

Intro: Data Abstraction

• One of key concepts underlying ANSI-SPARC 3-level architecture is data abstraction

– The user is insulated from implementational details

• Data abstraction is achieved by having 3 levels, each potentially with its own data model

• The key to data abstraction is the catalog

– Catalog stores

1. Schemas for each level and mappings

2. Data names, types, sizes

3. Relation names

4. Integrity constraints

5. Indices

6. Access paths

7. Authorized user names

– In addition to enabling data abstraction, other benefits include

1. Metadata stored centrally; provides control

2. May id who owns data

3. Redundancies/inconsistencies more easily id’d

4. Impact of change determined prior to implementation

5. Security enforced

6. Integrity enforced

• When a query is posed in a DML, it must be converted into the implementational representation (physical level), and the results converted into a user-oriented representation (view level)

• In order to do this, the DBMS must support mapping between

1. The schema at the external level and the schema at the conceptual level, and

2. The schema at the conceptual level and the schema at the internal level

Intro: Data Abstraction (2)

• These mappings are stored in the catalog

• By storing all schemas and mappings in the catalog, a DBMS can be independent of any particular domain

• Physical data independence is the ability to change the internal schema without having to alter higher-level schemas or applications

– Examples of such changes are

1. File reorganization

2. Change of access path

– This does not include changes to the data itself

• Logical data independence is the ability to change the conceptual schema without having to alter the external schema or applications

– Examples of such changes are

1. Adding/deleting record types

2. Extending a record type

3. Changing constraints

– More difficult to achieve than physical data independence

• The above refer to program-data independence
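Both kinds of program-data independence can be sketched with Python's built-in sqlite3 module. The person table, the contact view, and the application query are hypothetical. The view plays the role of the external schema, the table the conceptual schema, and the index a change of access path in the internal schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
# External schema: the only thing the application is written against.
conn.execute("CREATE VIEW contact AS SELECT name, city FROM person")

APP_QUERY = "SELECT name, city FROM contact ORDER BY name"
baseline = conn.execute(APP_QUERY).fetchall()

# Physical data independence: a new access path (index) changes the
# internal schema only; the application query is untouched.
conn.execute("CREATE INDEX idx_person_city ON person (city)")
after_index = conn.execute(APP_QUERY).fetchall()

# Logical data independence: extend a record type in the conceptual
# schema; the view and the application query are untouched.
conn.execute("ALTER TABLE person ADD COLUMN email TEXT")
after_extend = conn.execute(APP_QUERY).fetchall()
```

Extending a record type is the easy case. Deleting or restructuring record types would require redefining the view mapping, which is one reason logical independence is harder to achieve.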

Intro: Data Abstraction (3)

• Program-operation independence refers to the ability to change implementation of data operations without having to change the interface

– This is primarily related to OO models
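A minimal sketch of program-operation independence in plain Python. The class and method names are hypothetical. The client function is written once against the interface and works unchanged when the implementation of the operations is swapped.

```python
class Account:
    """Interface: deposit() and balance(). Callers rely only on these."""

    def __init__(self):
        # Implementation detail: a running total.
        self._total = 0

    def deposit(self, amount):
        self._total += amount

    def balance(self):
        return self._total


class LedgerAccount(Account):
    """Same interface, different implementation: keeps every entry."""

    def __init__(self):
        self._entries = []

    def deposit(self, amount):
        self._entries.append(amount)

    def balance(self):
        return sum(self._entries)


def pay_in(account):
    """Client code: uses only the interface, never the internals."""
    account.deposit(10)
    account.deposit(5)
    return account.balance()


results = [pay_in(Account()), pay_in(LedgerAccount())]
```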

Intro: People Involved with DBMS

• DBMS staff

1. DataBase Administrator (DBA)

– Person or group of people with overall responsibility for DBMS

– Involved with

(a) Designing DB (schemas, etc.)

(b) Monitoring performance

(c) Modifying DB as needed

(d) Granting privileges

(e) Evaluating and acquiring supplementary software

2. Systems analysts

– Determine requirements of end users

– Develop specs for canned transactions

3. Applications programmers

4. Systems designers and implementers

– Create DBMS itself

5. Tool developers

– Design and implement software packages to facilitate use of DBMS

6. Operators and maintenance personnel

– Sys admin personnel

• Types of end users:

1. Naive/parametric

– Have no knowledge of DBMS details

– Interact via canned transactions

2. Casual users

– Interact via query languages

3. Sophisticated users

– Interact at all levels

Intro: Primary DB Modules

1. Stored data manager

• Controls all access to the data

Intro: Primary DB Modules (2)

2. Compilers

(a) DDL compiler

• Accessed by DB staff

• Handles DB definition and privileged commands

• Results stored in catalog

(b) Query compiler

• Accessed by casual users

• Handles general DB queries

• Results passed to query optimizer for efficient data access

(c) Precompiler

• Accessed by application programmers

• Converts embedded DB code in app to object code

• Rest of program compiled by host language compiler

• 2 results linked into single program

3. Run time DB processor

• Executes user "programs"

(a) Privileged commands

(b) Executable query plans

(c) Canned transactions

• Interacts with catalog and data manager


5. Recovery and backup

Intro: Secondary DB Software

1. Utilities

• Loaders

– Convert existing data files into format accessible by DBMS

– Conversion tool converts from one DBMS format to another

• Backup

• DB storage reorganization

– Convert from one file organization to another

– Used to improve performance

• Performance monitors

2. Tools and environments

• CASE tools

– Computer Aided Software Engineering

– DB design

• Data dictionary

– Expanded catalog

– Stores additional info:

∗ Usage statistics

∗ Design rationale

∗ Semantics

– Benefits of catalog

∗ Serves as documentation of DB design

∗ Useful for maintenance and performance monitoring

• Application development environments

• Communications software

Intro: Client/Server Model

• Early DBMSs centralized

– All processing performed on a central machine

– User access via remote terminals with no processing capability

• Client/server model enabled by intelligent terminals

– Server hosts DBMS software

– Client executes apps locally

– Client accesses server when needs specialized resources

– Connected via network

• Client/server architectures characterized in terms of tiers

1. 2-tier model

– 1 server, multiple clients

– Functionality can be allocated in several ways

(a) Transaction/query server model

∗ Server hosts DB, query, and transaction functionality

∗ Client executes apps that contact server when need to access DB

∗ Connections achieved using standards like ODBC and JDBC

(b) OODBMS approach

∗ Uses a "more integrated" (i.e., arbitrary) approach

∗ Much functionality migrated to client

· Client may host user interface, data dictionary functions, compilers, optimizers, ...

· Server hosts DB (data storage), concurrency control, recovery

∗ Server often referred to as data server as primary task is repository for DB

2. n-tier model

– Has 1 or more intermediate layers

– Middle tier often called application/web server

– Stores rules, checks client credentials, ...

– Accepts client requests and forwards to server

– Forwards (partially) processed results to client
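The flow of a request through three tiers can be sketched in a single Python process. Each tier is a plain function; in a real deployment the tiers would be separate machines connected via a network. The grade table, the token store, and all function names are hypothetical.

```python
import sqlite3

# Data tier: hosts the DB.
_db = sqlite3.connect(":memory:")
_db.execute("CREATE TABLE grade (student TEXT, course TEXT, mark INTEGER)")
_db.executemany(
    "INSERT INTO grade VALUES (?, ?, ?)",
    [("ada", "CS101", 95), ("grace", "CS101", 88)],
)


def data_server(student):
    """Data tier: runs the query and returns raw rows."""
    return _db.execute(
        "SELECT course, mark FROM grade WHERE student = ?", (student,)
    ).fetchall()


# Middle tier: hypothetical credential store.
_TOKENS = {"token-ada": "ada"}


def app_server(token):
    """Middle tier: checks credentials, forwards the request, post-processes."""
    student = _TOKENS.get(token)
    if student is None:
        return {"status": "denied", "grades": []}
    rows = data_server(student)
    # Partially process the results before returning them to the client.
    return {"status": "ok", "grades": [f"{c}: {m}" for c, m in rows]}


def client(token):
    """Client tier: presentation only; never touches the DB directly."""
    reply = app_server(token)
    return reply["grades"] if reply["status"] == "ok" else ["access denied"]


allowed = client("token-ada")
refused = client("token-bogus")
```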

Intro: DBMS Classification

• DBMSs can be classified along a number of dimensions:

1. Data model

– Primary means of identification

– Primary model is relational

– Newer models include object and object-relational

– Legacy models include hierarchical and network

2. Number of users

– Single

– Multi-user

3. Number of sites hosting DBMS

– Centralized

∗ Single site

– Distributed

∗ Software resides on multiple servers connected via network

∗ Variations:

· Homogeneous: Same software at all sites

· Heterogeneous: Multiple autonomous DBs at several sites

· Federated DBMS: Loosely coupled DBMSs with some autonomy

4. Cost

5. Access path

6. General v special purpose

– General not associated with any application

– Special purpose designed for one particular app

Intro: Advantages of DBMS Approach

1. Control of data redundancy

2. Data consistency

3. More info from the same data

4. Sharing of data

5. Improved data integrity

• Integrity results from consistency and validity

• Expressed in terms of constraints

• Integrity constraints (referred to as business rules) implemented as rules that verify data on entry, modification, or deletion

6. Improved security

7. Enforcement of standards

• Design controlled by central authority

8. Economy of scale

• As a result of reduced redundancy

9. Balance among conflicting requirements

• Needs of different users may be at odds

• DBA can make informed decisions based on various needs

• Resulting schemas should be those with greatest overall benefit to organization

10. Improved accessibility to data and responsiveness

• As a result of software utilities like report generators, query languages, etc.

11. Increased productivity

• Users are insulated from implementational details

12. Improved maintenance

• Data independence allows changes to be made at one level without needing to make changes at other levels

Intro: Advantages of DBMS Approach (2)

13. Increased concurrency

14. Improved backup and recovery

• Result of transaction and concurrency control, and centralized access of DB

15. Persistent storage of program objects

• Relevant to OO DB

• Objects exist independently of apps

• DBMS provides for storage and conversion between program representation and DBMS format

16. Efficient query processing

• DBMS provides access paths for efficient retrieval of data

• Compilers may optimize queries

17. Multiple interfaces

18. Ability to represent complex relationships among data

19. Allow inference and actions via rules

• Rules may be associated with DB

• Allow inference of new data

• DBMS may also allow procedures stored independently of apps
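Item 5 above notes that integrity constraints verify data on entry, modification, and deletion. A minimal sketch with Python's built-in sqlite3 module follows. The tables, the salary rule, and the rejected helper are hypothetical. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY,"
    " salary INTEGER CHECK (salary > 0),"
    " dept_id INTEGER NOT NULL REFERENCES dept (id))"
)
conn.execute("INSERT INTO dept VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (1, 50000, 1)")
conn.commit()


def rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        with conn:
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


checks = {
    # Entry: department 99 does not exist.
    "entry": rejected("INSERT INTO employee VALUES (2, 40000, 99)"),
    # Modification: salary must stay positive.
    "modification": rejected("UPDATE employee SET salary = -1 WHERE id = 1"),
    # Deletion: the department is still referenced by an employee.
    "deletion": rejected("DELETE FROM dept WHERE id = 1"),
}
```

The rules are declared once with the data, so every app that touches the tables is subject to them.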

Intro: Disadvantages of DBMS Approach

1. Complexity

• DBMS is a complex piece of software

• To use to advantage, must understand all aspects

• Poor decisions at early stages of DB design can be costly

2. Cost

• DBMS is expensive

• Often requires additional hardware (processing power, disk space, memory, ...)

• Cost of conversion

(a) Converting existing apps

(b) Converting existing data

(c) Training personnel

3. Size (see above)

4. Performance

• Due to domain-independent nature of DBMS software, queries may be slower because of the mappings between levels

5. Higher impact of failure

• If DBMS crashes, entire organization affected

6. Higher impact of security breach

• While DBMS provides greater security measures, if security is breached the entire organization is affected
