File Management
Bit: smallest unit of data a computer can process (0 or 1)
Byte: 8 bits (one character, e.g. a letter, digit or symbol)
UNICODE: UTF-8, UTF-16, UTF-32
Field: a logical grouping of characters, e.g. a student's name appears in the name field
Record: a logical group of related fields (student name, course, marks, rank)
File: a logical group of related records
Database: a logical group of related files
Primary Key: a field in a record that uniquely identifies that record
Secondary Key: carries some identifying information but need not be unique, e.g. student surname
Foreign Key: provides a relationship between two tables. A foreign key is a field in a relational table that matches the primary key of another table.
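The key and record concepts above can be sketched with a small SQLite example (Python's built-in sqlite3 module; the table and column names here are invented for illustration):

```python
import sqlite3

# Hypothetical student/enrolment schema illustrating primary, secondary
# and foreign keys.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite
conn.execute("""
    CREATE TABLE student (
        student_cd TEXT PRIMARY KEY,  -- primary key: uniquely identifies a record
        surname    TEXT               -- secondary key: identifying, not unique
    )""")
conn.execute("""
    CREATE TABLE enrolment (
        course     TEXT,
        student_cd TEXT REFERENCES student(student_cd)  -- foreign key
    )""")
conn.execute("INSERT INTO student VALUES ('K105', 'ABC')")
conn.execute("INSERT INTO enrolment VALUES ('DBMS', 'K105')")

# The foreign key rejects an enrolment for a student that does not exist.
try:
    conn.execute("INSERT INTO enrolment VALUES ('DBMS', 'K999')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note that the foreign key only works because it matches the primary key of the `student` table, exactly as the definition above states.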
Sequential File Organization: data records are retrieved in the same physical sequence in which they are stored (e.g. magnetic tapes in tape recorders)
Direct File Access Method: records can be retrieved in any sequence, without regard to their actual physical order on the storage medium (e.g. CD).
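The difference between the two access methods can be sketched with an in-memory "file" of fixed-length records (record size and contents are invented for illustration):

```python
import io

RECLEN = 5
f = io.BytesIO(b"AAAAABBBBBCCCCCDDDDD")  # four fixed-length records

# Sequential access: read records in the order they are physically stored.
f.seek(0)
first_two = [f.read(RECLEN) for _ in range(2)]

# Direct access: jump straight to the fourth record via its computed offset,
# without reading the records before it.
f.seek(3 * RECLEN)
fourth = f.read(RECLEN)

print(first_two, fourth)  # [b'AAAAA', b'BBBBB'] b'DDDDD'
```

A tape drive can only do the first pattern efficiently; a disk or CD supports the second.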
File Environment: Major Problems
i) Data Redundancy: duplicate data in several files linked with different applications developed over time (wastes data entry effort and physical storage)
ii) Data Inconsistency: caused by data redundancy; data values across files are not synchronized (a change in a customer address should be reflected in all associated files)
iii) Data Isolation: with applications uniquely designed and implemented, data files are likely to be organized differently and stored in different formats (sales in units of thousands vs lakhs). Manual comparison of printed outputs from different applications (product-wise daily sales, product-wise inventory) defeats the purpose of introducing IT.
iv) Data Integrity: difficult to enforce data integrity constraints uniformly across multiple data files (e.g. a student roll number should not contain an alphabetic character)
v) Concurrency Problems: while one application is updating a record, another may access that record; the second application may not get the desired result.
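The concurrency problem above is the classic "lost update". A deterministic sketch (the record and values are invented; real systems interleave the steps unpredictably across processes):

```python
# Two applications each read the same record, modify their own copy,
# and write back; whichever writes last silently erases the other's update.
record = {"balance": 100}

# Both applications read the record before either writes.
a1_view = record["balance"]
a2_view = record["balance"]

# Each writes back a value computed from its stale copy.
record["balance"] = a1_view + 50   # app 1 deposits 50
record["balance"] = a2_view - 30   # app 2 withdraws 30: the deposit is lost

print(record["balance"])  # 70, not the expected 120
```

A DBMS avoids this by locking the record (or versioning it) between the read and the write.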
vi) Security: new files are added on an ad-hoc basis; more applications mean more people accessing data.
vii) Data Dependency: in a file environment, applications and their associated data files are interdependent.
viii) No Central Listing of Files: with hundreds of applications and data files, no one knows what each application does or what data it requires.
Centralized Databases: Advantages
All files are at one location; database administration is easy; maintaining consistency, protecting against unauthorized access and implementing disaster recovery are easy.
Disadvantages
Vulnerable to a single point of failure: access for all users is affected
Transmission costs and delays in access when users are at far-off distances
Distributed Databases:
Replicated: complete copies of the entire database at many locations, to alleviate the single-point-of-failure problem of a centralized database and to increase user responsiveness.
However, maintenance overheads are high (keeping the replicated databases consistent).
DBMS
Persistence: attributes are permanently stored on a hard drive until removed.
Query Ability: requesting attribute information, e.g. how many two-door cars in Texas are green?
Concurrency: many users can change or read data simultaneously.
Backup: replication done to guard against failure.
Rule Enforcement: rules defined for filling a particular attribute field; they can easily be added and removed as required.
Security: limits on who can access or change data.
Computation: calculations such as sums, averages and sorts are done easily.
Change and Access Logging: it is often necessary to know who accessed which attributes, and what was changed and when. This is done by logging services that keep a record of access occurrences and changes.
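Two of the features above, query ability and rule enforcement, can be sketched with SQLite (the `cars` table and its values are invented, echoing the two-door-car question above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cars (
        model  TEXT,
        doors  INTEGER CHECK (doors BETWEEN 2 AND 5),  -- rule enforcement
        state  TEXT,
        colour TEXT
    )""")
conn.executemany("INSERT INTO cars VALUES (?,?,?,?)", [
    ("coupe", 2, "Texas", "green"),
    ("sedan", 4, "Texas", "green"),
    ("coupe", 2, "Texas", "red"),
])

# Query ability: how many two-door cars in Texas are green?
n = conn.execute(
    "SELECT COUNT(*) FROM cars "
    "WHERE doors = 2 AND state = 'Texas' AND colour = 'green'"
).fetchone()[0]
print(n)  # 1

# Rule enforcement: a nine-door car violates the CHECK rule and is rejected.
try:
    conn.execute("INSERT INTO cars VALUES ('bus', 9, 'Texas', 'blue')")
except sqlite3.IntegrityError:
    print("rejected by CHECK constraint")
```

The rule lives in the schema, so every application inserting into the table is bound by it, unlike in a file environment.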
1. Data Model: defines the way data are conceptually structured (network, hierarchical, relational, multidimensional, etc.)
2. Data Definition Language: language used by programmers to specify the type and structure of the database (the link between logical and physical views)
3. Data Manipulation Language: helps extract data from the database as per information requests; helps to sort, display and delete data, e.g. SQL (Structured Query Language)
4. Data Dictionary: stores definitions of data elements (fields) and data characteristics such as ownership, authorization and security. It reduces data inconsistencies (as data elements are defined once), makes program development faster (as data details are predefined) and makes modification easier (the presence of elements in the database is documented).
Metadata: data dictionaries are a form of metadata. Metadata is information about information; it helps companies track transactions and analyse their success.
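A short sketch of a data manipulation language at work, using SQL through SQLite (the `marks` table and its rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student TEXT, score INTEGER)")
conn.executemany("INSERT INTO marks VALUES (?, ?)",
                 [("ABC", 72), ("DEF", 91), ("GHI", 58)])

# Extract and sort: display students in rank order of score.
ranked = conn.execute(
    "SELECT student, score FROM marks ORDER BY score DESC").fetchall()
print(ranked)  # [('DEF', 91), ('ABC', 72), ('GHI', 58)]

# Delete: remove records matching a condition.
conn.execute("DELETE FROM marks WHERE score < 60")
remaining = conn.execute("SELECT COUNT(*) FROM marks").fetchone()[0]
print(remaining)  # 2
```

Each request names *what* data is wanted; the DBMS decides *how* to fetch it, which is the point of a DML.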
Normalization is a set of rules for transforming data into a form with more desirable properties: internal consistency, minimal redundancy and maximum stability.
Properties of a normalized design
The process systematically removes duplication of data, which leads to minimum storage requirements.
Since data items are stored in a minimum number of places, the chances of data inconsistency are minimized.
Normalized structures are optimal for updates (insert, update and delete), as such an operation needs access to a minimum amount of data (at the cost of retrieval!).
An attribute A is 'fully functionally dependent' on a set of attributes B if there does not exist a proper subset of B on which A is functionally dependent.
Student_cd   Student_nm   Address
K105         ABC          A-123, XYZ
K106         DEF          Z-333, ABC
Attribute student_nm is functionally dependent on attribute student_cd, because a value of student_cd determines the value of student_nm.
The same logic applies to Address being functionally dependent on student_cd.
Address is functionally dependent on (student_cd + student_nm). However, it is not fully functionally dependent on this composite key, because it is functionally dependent on student_cd alone.
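A functional dependency B → A can be checked mechanically over sample rows: it holds when no two rows agree on B but differ on A. A small sketch using the student data above (the `holds` helper is hypothetical):

```python
rows = [
    {"student_cd": "K105", "student_nm": "ABC", "address": "A-123, XYZ"},
    {"student_cd": "K106", "student_nm": "DEF", "address": "Z-333, ABC"},
]

def holds(rows, lhs, rhs):
    """True if the attributes in lhs functionally determine those in rhs."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same determinant value, different dependent value
    return True

print(holds(rows, ["student_cd"], ["student_nm"]))             # True
print(holds(rows, ["student_cd", "student_nm"], ["address"]))  # True, but not
# a *full* dependency: address already depends on student_cd alone, as the
# next check shows.
print(holds(rows, ["student_cd"], ["address"]))                # True
```

Such a check only refutes a dependency on sample data; whether it truly holds is a design decision about the domain.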
OLTP (Online Transaction Processing) Systems
Record business interactions as they occur (in real time); they support the day-to-day operations of an organization.
Optimized for efficiently processing and storing transactions
Based on rules of normalization
Not designed to deliver large aggregates (producing them may take large processing power and time, and many records will be locked while aggregates are produced; this can have a serious impact on the efficiency of OLTP systems).
Benefits
Online Transaction Processing has two key benefits: simplicity and efficiency. Reduced paper trails and faster, more accurate forecasts for revenues and expenses are both examples of how OLTP makes things simpler for businesses.
BI measures are often aggregates of hundreds, thousands or millions of transactions. Providing these aggregates efficiently requires a different type of data storage optimization.
Problems arise when we try to use OLTP systems as the source for BI.
The issue of Archiving (an archive is a collection of historical records, as well as the place where they are located)
If transaction tables become large, OLTP systems become slow. Archiving causes problems for BI (business intelligence), as we may want comparative trends over several years.
The issue of Disparate Systems
An organization may have several OLTP systems for different operations: order processing, accounting, manufacturing, personnel. Even in ERP systems it is unlikely that all transactional data is in one place.
BI measures may be based on data from different systems; they treat the organization as a whole.
e.g. a measure could be 'profit margin for a particular product'. For this we need the list of raw materials from the manufacturing system, the cost of those materials from the accounting system, the cost of labor required to produce the product from the time-entry system, and the amount paid for the product from the order-processing system. Thus, to compute this measure, we need to combine data across systems.
Further, the coding schemes, calendars, etc. may be different. The same product may be '123' in one system and 'ABC' in another. We need to find some common ground to combine the data from these systems.
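Finding that common ground usually means a mapping table that conforms each system's code to one shared code before combining. A minimal sketch (the mapping, systems and amounts are invented; real conformance handles calendars and units too):

```python
# Hypothetical mapping: each (system, local code) pair -> one warehouse code.
code_map = {
    ("order_processing", "123"): "P-001",
    ("manufacturing",    "ABC"): "P-001",
}

orders = [("order_processing", "123", 250.0)]  # amount paid for the product
costs  = [("manufacturing",    "ABC", 180.0)]  # cost of its raw materials

# Conform the codes, then combine data from both systems per product.
margin = {}
for system, code, amount in orders:
    margin[code_map[(system, code)]] = amount
for system, code, cost in costs:
    margin[code_map[(system, code)]] -= cost

print(margin)  # {'P-001': 70.0}: a margin figure no single system could give
```

Neither source system knows the product as "P-001"; the shared code exists only for analysis.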
Components of a Data Warehouse
1. Transactional or operational databases and external sources from which the data warehouse is populated. (Note that you would not handle a savings account or manage inventory with a data warehouse.)
2. A process to extract data from databases and bring it into the data warehouse. This process must often transform the data from database structure to the internal formats of the data warehouse.
3. A process to cleanse the data to ensure data quality.
4. A process to load the cleansed data into the data warehouse.
(The four processes, extract, transform, cleanse and load, are collectively called data staging.)
5. A process to create desired summaries of the data: pre-calculated totals, averages, etc., which are expected to be requested often.
6. Metadata, "data about data": information about the contents of the data warehouse.
7. The data warehouse database itself (detailed and summary data).
[A data warehouse is not used for processing individual transactions. Its database need not be optimized for individual record retrieval based on keys; instead, it is optimized for the access patterns used in analysis.]
8. Query tools to enable OLAP (online analytical processing): the end-user interface for posing questions to the DW.
9. A DW may also include automated tools for uncovering patterns in the data (data mining).
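The staging steps in the list above (extract, transform, cleanse, load) can be sketched end to end; the source rows, field names and formats here are invented for illustration:

```python
# Pretend operational data: inconsistent names, one unparseable amount.
source_rows = [
    {"cust": " alice ", "amount": "100.50", "date": "2023-01-05"},
    {"cust": "BOB",     "amount": "n/a",    "date": "2023-01-06"},
]

def extract():
    """Pull rows from the operational source."""
    return list(source_rows)

def transform(rows):
    """Convert source structure to the warehouse's internal format."""
    return [{"customer": r["cust"].strip().title(),
             "amount": r["amount"],
             "date": r["date"]} for r in rows]

def cleanse(rows):
    """Ensure data quality: reject rows whose amount cannot be parsed."""
    out = []
    for r in rows:
        try:
            r["amount"] = float(r["amount"])
            out.append(r)
        except ValueError:
            pass  # in practice bad rows would be logged, not dropped silently
    return out

warehouse = []
def load(rows):
    warehouse.extend(rows)

load(cleanse(transform(extract())))
print(warehouse)  # one clean row for Alice; Bob's bad record was rejected
```

Real staging tools add logging, incremental extraction and error queues, but the pipeline shape is the same.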
A DW project has to be put in business perspective (build a business case for the DW before planning it). A DW project should be tied to solving business problems.
A business case should, however include:
Type of data to be included in DW
Decisions likely to be made using this data
Manner in which these decisions are made without DW
How these decisions will benefit the organization
If benefits cannot be quantified, try to quantify some associated parameters.
Ex: "Optimize an advertising budget of $4 m/yr"; "Minimize discounts from full fare, which are currently $200 m/yr"; "Choose suppliers whose products result in the fewest warranty claims, which currently cost us $25 m/yr to handle".
None of these statements says anything about how much the DW will benefit the organization, but all provide information through which top management can understand and appreciate its benefits.
In the context of a data warehouse, we have two types of data: dimension data and fact data.
Dimension Data: the dimensions along which data can be analysed. (In this example, data can be analysed by salesperson or by type of pet; thus "Salesperson" and "Type of pet" are dimension data. In reality, time is also a dimension in most databases, e.g. to study performance trends over time. Here we have skipped this by saying that the data is for a specific time period.)
Fact Data: Actual data along the dimensions
Data warehouses generally store data using multidimensional databases (or simply dimensional databases). Such a database stores data in the form of a giant hypercube (the extension of a cube to four or more dimensions).
Databases configured for OLAP use a multidimensional data model, allowing complex analytical and ad-hoc queries with rapid execution time. They borrow aspects of navigational databases and hierarchical databases that are faster than relational databases.[6]
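A tiny two-dimensional "cube" over the salesperson and pet-type dimensions can show what an OLAP roll-up does; the names and figures are invented for illustration:

```python
from collections import defaultdict

# Fact data indexed by the two dimensions (salesperson, type of pet).
facts = {
    ("Anu",  "dog"): 5, ("Anu",  "cat"): 3,
    ("Ravi", "dog"): 2, ("Ravi", "cat"): 7,
}

# Roll up along each dimension: total sales per salesperson and per pet type.
by_person = defaultdict(int)
by_pet = defaultdict(int)
for (person, pet), qty in facts.items():
    by_person[person] += qty
    by_pet[pet] += qty

print(dict(by_person))  # {'Anu': 8, 'Ravi': 9}
print(dict(by_pet))     # {'dog': 7, 'cat': 10}
```

A real multidimensional database pre-computes and indexes such aggregates across many dimensions (including time), which is what makes ad-hoc OLAP queries fast.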
Automated Analysis: Data Mining
Definition: "the efficient discovery of valuable, non-obvious information from large collections of data"
(In data mining, special software sifts through large data sets to discover new facts and relationships.)
Advantage: Unexpected relationships may get revealed some of which are