
Data Storage, Retrieval and DBMS

Sep 24, 2015


Vivek Reddy

Transcript
  • SREERAM ACADEMY (FORMERLY SREERAM COACHING POINT)

    DATA STORAGE, RETRIEVAL AND DATA BASE MANAGEMENT SYSTEMS

    Data

    Data are raw facts, observations, or occurrences concerning a physical phenomenon or a business transaction.

    They are objective measurements of the attributes of entities such as people, places, things and events.

    Data is a collection of facts which is unorganized but can be organized into useful information.

    Data should be accurate, but need not be relevant, timely or concise.

    It can exist in different forms, e.g. picture, text, sound, or all of these together.

    CONCEPTS RELATED TO DATA

    Double Precision: Real data values are commonly called single precision data because each real constant is stored in a single memory location. This usually gives seven significant digits for each real value. In many calculations, particularly those involving iteration or long sequences of calculations, single precision is not adequate to express the precision required. To overcome this limitation, many programming languages provide the double precision data type. Each double precision value is stored in two memory locations, thus providing twice as many significant digits.
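    The difference can be seen by round-tripping a value through 32-bit (single) and 64-bit (double) storage. A minimal Python sketch (Python's own `float` is double precision; the `struct` module is used here only to simulate single-precision storage):

```python
import struct

# Round-tripping a value through single precision (one 32-bit word)
# keeps only about 7 significant digits; double precision (two words)
# keeps about 15-16.
value = 3.14159265358979

single = struct.unpack('f', struct.pack('f', value))[0]  # 32-bit storage
double = struct.unpack('d', struct.pack('d', value))[0]  # 64-bit storage

print(single)  # accurate to roughly 7 digits only
print(double)  # the value survives intact
```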

    Logical Data Type: Use the Logical data type when you want an efficient way to store data that has only two values. Logical data is stored as true (.T.) or false (.F.).

    Characters: Choose the Character data type when you want to include letters, numbers, spaces, symbols, and punctuation. Character fields or variables store text information such as names, addresses, and numbers that are not used in mathematical calculations. For example, phone numbers or zip codes, though they consist mostly of digits, are best stored as Character values.

    Strings: A data type consisting of a sequence of contiguous characters that represent the characters themselves rather than their numeric values. A String can include letters, numbers, spaces, and punctuation. The String data type can store fixed-length strings ranging in length from 0 to approximately 63K characters, and dynamic strings ranging in length from 0 to approximately 2 billion characters. The dollar sign ($) type-declaration character represents a String.

    Variable: A variable is something that may change in value, e.g. the number of words on different pages of a book.
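    The point about zip codes can be illustrated in Python (the zip code value below is invented for illustration):

```python
# Why store a zip code as Character rather than as a number: a leading
# zero is part of the data, and arithmetic on a zip code is meaningless.
zip_as_number = int("02134")   # the leading zero is lost
zip_as_text = "02134"          # preserved exactly

print(zip_as_number)  # 2134
print(zip_as_text)    # 02134
```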


    KEY: A key is the relational means of specifying uniqueness. A database key is an attribute used to sort and/or identify data in some manner. Each table has a primary key, which uniquely identifies records. Foreign keys are used to cross-reference data between relational tables.

    The primary key of a relational table uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a single attribute or of multiple attributes in combination.

    Example: Imagine we have a STUDENTS table that contains a record for each student at a university. The student's unique student ID number would be a good choice for a primary key in the STUDENTS table. The student's first and last name would not be a good choice, as there is always the chance that more than one student might have the same name.

    A candidate key is a combination of attributes that can be used to uniquely identify a database record without any extraneous data. Each table may have one or more candidate keys. One of these candidate keys is selected as the table's primary key.
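    The STUDENTS example can be sketched with SQLite in Python; the table and column names below are illustrative, not from the source:

```python
import sqlite3

# The unique student ID is the primary key, so the DBMS itself
# rejects an attempt to insert a duplicate ID.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, "
            "first_name TEXT, last_name TEXT)")
con.execute("INSERT INTO students VALUES (1, 'Asha', 'Rao')")

duplicate_rejected = False
try:
    con.execute("INSERT INTO students VALUES (1, 'Vikram', 'Rao')")
except sqlite3.IntegrityError:
    duplicate_rejected = True  # primary-key uniqueness enforced

print(duplicate_rejected)  # True
```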

    Referential integrity: A feature provided by relational database management systems (RDBMSs) that prevents users or applications from entering inconsistent data. Most RDBMSs have various referential integrity rules that you can apply when you create a relationship between two tables.

    For example, suppose Table B has a foreign key that points to a field in Table A. Referential integrity would prevent you from adding a record to Table B that cannot be linked to Table A. In addition, the referential integrity rules might specify that whenever you delete a record from Table A, any records in Table B that are linked to the deleted record are also deleted. This is called a cascading delete. Finally, the referential integrity rules could specify that whenever you modify the value of a linked field in Table A, all records in Table B that are linked to it are modified accordingly. This is called a cascading update.

    Consider the situation where we have two tables, Employees and Managers. The Employees table has a foreign key attribute entitled Managed By, which points to the record for that employee's manager in the Managers table. Referential integrity enforces the following three rules:

    1. We may not add a record to the Employees table unless the Managed By attribute points to a valid record in the Managers table.

    2. If the primary key for a record in the Managers table changes, all corresponding records in the Employees table must be modified using a cascading update.

    3. If a record in the Managers table is deleted, all corresponding records in the Employees table must be deleted using a cascading delete.
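    Rules 1 and 3 of the Employees/Managers example can be sketched with SQLite in Python (SQLite enforces them once foreign keys are switched on; the table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enable referential integrity
con.execute("CREATE TABLE managers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, "
            "managed_by INTEGER REFERENCES managers(id) ON DELETE CASCADE)")
con.execute("INSERT INTO managers VALUES (1, 'Meena')")
con.execute("INSERT INTO employees VALUES (10, 'Ravi', 1)")

# Rule 1: an employee may not point at a non-existent manager.
orphan_rejected = False
try:
    con.execute("INSERT INTO employees VALUES (11, 'Sita', 99)")
except sqlite3.IntegrityError:
    orphan_rejected = True

# Rule 3: deleting the manager cascades to the linked employees.
con.execute("DELETE FROM managers WHERE id = 1")
remaining = con.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(orphan_rejected, remaining)  # True 0
```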

    Alternate Key: The alternate keys of a table are simply those candidate keys which are not currently selected as the primary key. The alternate keys are the set of all candidate keys minus the primary key.


    Secondary Key: Secondary keys can be defined for each table to optimize data access. They can refer to any column combination, and they help to prevent sequential scans over the table. Like the primary key, a secondary key can consist of multiple columns. A candidate key which is not selected as the primary key is also known as a secondary key.

    Index Fields: Index fields are used to store relevant information along with a document.

    Currency Fields: The currency field accepts data in dollar form by default.

    Date Fields: The date field accepts data entered in date format.

    Integer Fields: The integer field accepts data as a whole number.

    Text Fields: The text field accepts data as an alphanumeric text string.

    Information

    Information is data that has been converted into a meaningful and useful context for specific end users.

    To obtain information, data is aggregated, manipulated and organized, its content is analysed and evaluated, and the result is placed in a proper context for human use.

    Information exists as reports, in a systematic textual format, or as graphics presented in an organized manner.

    Information must be relevant, timely, accurate, concise and complete, and should apply to the current situation.

    It should be condensed into a usable length.

    Data storage hierarchy

    Character: The character is the basic building block of data; it may be a letter, a numeric digit or a special character. Characters are put together to form a FIELD.

    Field: A field is a meaningful collection of related characters. It is the smallest logical data entity that is treated as a single unit in data processing. For example, if we are processing the employee data of a company, we may have:

    1. An employee code field
    2. An employee name field
    3. An hours-worked field
    4. An hourly pay rate field
    5. A tax deduction rate field


    Record: Fields are grouped together to form a record. An employee record would be a collection of the fields of one employee.

    Records can be divided into physical and logical records:

    Meaning:
      Physical Record - A physical record refers to the actual portion of a storage medium on which data is stored.
      Logical Record - A logical record refers to the way a user views a record. It contains all the data related to a single item.

    Independence:
      Physical Record - Portions of the same logical record may be located in different physical records, or several logical records may be located in one physical record.
      Logical Record - A logical record is independent of its physical environment.

    Example:
      Physical Record - A group of pulses recorded on a magnetic tape or disk, or a series of holes punched into paper tape.
      Logical Record - A payroll record for an employee, or a record of all the purchases made by a customer in a departmental store.

    File: A file is a number of related records that are treated as a unit. For example, a collection of employee records for one company would be an employee file.

    [Diagram: a FILE contains employee records (Employee 1, Employee 2, ...); each record contains fields such as Employee No and Salary; each field is made up of characters.]
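    The character-field-record-file hierarchy can be sketched in Python (the field names and values are illustrative):

```python
# One logical record: the fields of one employee. Each field value is
# itself a collection of characters (or a number built from digits).
employee_record = {
    "emp_code": "E101",
    "name": "Anand",
    "hours_worked": 40,
    "hourly_rate": 50.0,
}

# A file: a number of related records treated as a unit.
employee_file = [employee_record]
employee_file.append({"emp_code": "E102", "name": "Bina",
                      "hours_worked": 35, "hourly_rate": 60.0})

print(len(employee_file))  # 2 records in the employee file
```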


    Transaction File and Master File

    Data Life:
      Master File - Contains relatively permanent records used for identification and for summarizing statistical information.
      Transaction File - Contains temporary data which is to be processed in combination with the master file.

    Content:
      Master File - Contains current or nearly current data, which is updated regularly.
      Transaction File - Generally contains information used for updating the master files.

    Data Size:
      Master File - Rarely contains detailed transaction data.
      Transaction File - Contains detailed data.

    Examples:
      Master File - Product files, customer files, employee files, etc.
      Transaction File - Purchase orders, job cards, invoices, etc.

    Access method:
      Master File - Usually maintained on direct access storage devices.
      Transaction File - Usually maintained on sequential as well as direct access storage devices.

    Redundancy:
      Master File - Can never be redundant, as it has to be updated regularly.
      Transaction File - Once a transaction file has been used to update the master file, it is no longer required and can be considered redundant.
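    A master-file update run can be sketched in Python (the records and transaction amounts are invented for illustration):

```python
# The master file holds relatively permanent records; the transaction
# file holds temporary records applied against it in one run.
master = {101: {"name": "Anand", "balance": 500.0},
          102: {"name": "Bina", "balance": 300.0}}

transactions = [(101, -120.0), (102, +75.0), (101, +40.0)]  # (key, amount)

for key, amount in transactions:
    master[key]["balance"] += amount
# After the run, the transaction file is no longer required.

print(master[101]["balance"], master[102]["balance"])  # 420.0 375.0
```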

    File Organization

    I. Serial File Organization

    Records are arranged one after the other in no particular order, other than the chronological order in which the records are added to the file. This type of organization is commonly found with transaction data, where records are created in a file in the order in which the transactions take place.

    II. Sequential File Organization

    1. In a sequential file, records are stored one after another in an ascending or descending order determined by the key field of the records.

    2. In the payroll example, the records of the employee file may be organized sequentially in employee code sequence.


    3. Sequentially organized files that are processed by computer systems are normally stored on storage media such as magnetic tape, punched paper tape, punched cards or magnetic disks.

    4. To access these records, the computer must read the file in sequence from the beginning. The first record is read and processed first, then the second record in the file sequence, and so on. To locate a particular record, the program must read each record in sequence and compare its key field to the one that is needed. The retrieval search ends only when the desired key matches the key field of the currently read record.
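    The retrieval in point 4 can be sketched in Python; the function name and sample records are illustrative:

```python
# Sequential retrieval: records are sorted by key and scanned from the
# start until the desired key matches.
records = [
    {"emp_code": 101, "name": "Anand"},
    {"emp_code": 204, "name": "Bina"},
    {"emp_code": 350, "name": "Chitra"},
]

def sequential_find(records, key):
    reads = 0
    for rec in records:            # must start at the first record
        reads += 1
        if rec["emp_code"] == key:
            return rec, reads
    return None, reads

rec, reads = sequential_find(records, 350)
print(rec["name"], reads)  # Chitra 3
```

    Note that finding the last record costs as many reads as there are records, which is why sequential files suit runs that touch most of the file.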

    Merits:

    Simple to understand.

    Only the record key is required to locate a record.

    Efficient and economical if the activity rate is high, i.e. a large proportion of the file's records are processed in each run.

    Inexpensive I/O devices may be used.

    Reconstruction of files is relatively easy, since a built-in backup is usually available.

    Demerits:

    Even at a low activity rate, the entire file is processed.

    Transactions must be sorted and placed in sequence prior to processing.

    While transactions are accumulated between runs, the timeliness of the data deteriorates.

    High data redundancy, since the same data may be stored in several files sequenced on different keys.

    Applications:

    Payroll systems.

    Electricity billing, or any other billing where each record needs to be accessed.


    III. Direct File Access Organization

    A - Self-Addressing Method: A record's key is used as its relative address. Therefore, we can compute the record's address directly from the record key and the physical address of the first record in the file.
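    A minimal Python sketch of self-addressing, assuming fixed-length records and consecutive keys starting at 1 (the base address and record size are invented for illustration):

```python
BASE_ADDRESS = 4096   # physical address of the first record (assumed)
RECORD_SIZE = 128     # bytes per fixed-length record (assumed)

def record_address(key):
    # With consecutive keys 1, 2, 3, ... the address is computed
    # directly from the key; no search is needed.
    return BASE_ADDRESS + (key - 1) * RECORD_SIZE

print(record_address(1))   # 4096
print(record_address(10))  # 5248
```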

    B - Indexed Sequential File Organization:

    1. A computer provides a better way to store information than the card catalogue; indeed, most public libraries today keep their catalogues on a computer. For each book in the library, a data record is created that contains information gathered from the various card catalogues: for example, the title of the book, the author's name, the physical location of the book, and any other relevant information. A record is generally composed of several fields, with each field used to store a particular piece of information. For example, we might store the author's last name in one field and the first name in a separate field. All the records (one for each book) are collected and stored in a file, typically called the data file.

    2. Indexes are created so that a particular record in the data file can be located quickly. For example, we could create an author index, a title index, and a subject index. The indexes are typically stored in a separate file called the index file.

    3. An index is a collection of "keys", one key for each record in the data file. A key is a subset of the information stored in a record. When an index is created, the key values are extracted from one or more fields of each record. The value of each key determines its order in the index (i.e., the keys are sorted alphabetically or numerically). Each key has an associated pointer that indicates the location in the data file of the corresponding complete record. To find a particular record, a matching key is quickly located in the index, and then the associated pointer is used to locate the complete record.

    4. Consider the problem of locating a particular book in a library containing thousands of books. Public libraries long ago developed the card catalogue as a means to efficiently locate a particular book. Usually there were at least three card catalogues: one with cards arranged in order by the name of the author, another arranged by the title of the book, and a third arranged by subject heading. Each card contained information about the book, most importantly its location in the library. Therefore, by knowing the name of the author, the title of the book, or the appropriate subject heading, you could use the card catalogues to quickly determine the location of a particular book. The card catalogues can be thought of as indexes.

    [Diagram: classification of direct access methods - direct sequential access, comprising (A) the self-addressing method and (B) the index sequential addressing method; and random access, comprising the address generation method and the indexed random method.]

    5. Consider the author index. There is a filing cabinet containing a card for each book in the library, filed in alphabetical order by the author's name. Each drawer in the cabinet is labelled, perhaps "A-E", "F-J", and so on. There are two broad kinds of searches that you might want to perform on the author index.

    6. First, you might want to make a list containing the name of every book in the library. To do this you would start in the first drawer with the first card, and look at each card in order until you reached the last card in the last drawer. This is called a "sequential" search, because you look at each card in the catalogue in sequential order.

    7. Second, you might want to know the names of the books in the library that were written by Thomas Jefferson. Instead of examining every card in the catalogue, you are first guided by the labels on the drawers to the second drawer, the "F-J" drawer. You are then guided by the tabs inside the drawer to the names that start with the letter "J". This is called a "random" search. For any particular card, you can use the labels (or indexes) to go almost directly to the desired card.

    8. Actually locating the Thomas Jefferson card(s) involves both a random and a sequential search. We use random access to go directly to the correct drawer and the correct tab inside the drawer. The labels (or indexes) allow us to very quickly get close to the card of interest. After locating the "J" tab inside the "F-J" drawer, we then use sequential access to locate the particular Thomas Jefferson card(s) of interest.
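    The key-plus-pointer idea of points 2 and 3 can be sketched in Python, using byte offsets into an in-memory data file as the pointers (the record layout and sample books are illustrative):

```python
import io

# A tiny "data file" of fixed-length book records, plus an index that
# maps each author key to the byte offset of the complete record.
data_file = io.BytesIO()
index = {}  # key -> pointer (byte offset) into the data file
for author, title in [("Jefferson", "Notes on Virginia"),
                      ("Austen", "Emma"),
                      ("Dickens", "Hard Times")]:
    index[author] = data_file.tell()
    data_file.write(f"{author:<12}{title:<28}".encode())  # 40-byte record

# Random access: locate the key in the index, then follow the pointer.
data_file.seek(index["Jefferson"])
record = data_file.read(40).decode()
print(record.strip())
```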

    Merits:

    Allows efficient and economical use of sequential processing techniques when the activity rate is high.

    Permits quick access to individual records in a relatively efficient way, when this activity is a small fraction of the total workload.

    Demerits:

    Less efficient in the use of storage space than some other organizations.

    Access to records is slower because the indexes must be consulted, and relatively expensive hardware and software resources are required.

    Applications:

    Inventory control, where sequential access as well as individual inquiry is required.

    Student registration systems.

    C - Random File Organization

    A randomizing procedure is characterised by the fact that records are stored in such a way that there is no relationship between the keys of adjacent records. The technique converts the record key number to a physical location, represented by a disk address, through a computational procedure.

    Transactions can be processed in any order and written at any location throughout the stored file. A desired record can be accessed directly using the randomizing procedure, without accessing all the other records in the file.
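    One common randomizing procedure is division-remainder hashing; a minimal Python sketch (the bucket count and sample keys are invented for illustration):

```python
NUM_BUCKETS = 97  # a prime number of buckets (assumed file capacity)

def bucket_address(key):
    # Division-remainder randomizing: the disk address is computed
    # from the key, not found by searching other records.
    return key % NUM_BUCKETS

for key in (1043, 5821, 77):
    print(key, "->", bucket_address(key))
```

    Note that nearby keys can land in unrelated positions, which is exactly the "no relationship between the keys of adjacent records" property described above.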

    Merits:

    Access to records for inquiry and updating is possible immediately.

    Immediate updating of several files as a result of a single transaction is possible.

    No need for sorting.

    Demerits:

    Records in an on-line file are at risk: loss of accuracy and breaches of security are possible, so special backup and reconstruction procedures must be established.

    Less efficient in the use of storage space than a sequentially organized file.

    Relatively expensive software and hardware resources are required.

    Applications:

    Any type of inquiry system, such as a railway reservation or airline reservation system.

    o The Best File Organization

    File management involves the logical organization of data supplied to a computer in a predetermined way. Data stored together in a particular place is called a FILE, and a file is created using a set of instructions called a PROGRAM. The organization of the data in the file depends on the following factors:

    1. Data Dependence
    2. Data Redundancy
    3. Data Integrity

    File Management Software

    File management software is a software package that helps users to organize data into files, process them, and retrieve information.

    Users can create report formats, enter data into records, search records, sort them, and prepare reports.

    Such packages are designed for microcomputers and are menu-driven, allowing end users to create files by giving easy-to-use instructions.

    The following are the criteria for choosing a file organisation method:

    1. File Volatility

    (i) File volatility is the number of additions and deletions to the file in a given period of time. E.g. the payroll file of a company, where the employee register is constantly changing, is a highly volatile file, and therefore the direct access method is better for it.

    2. File Activity

    (i) File activity refers to the proportion of records accessed in a run relative to the number of records in the file.

    (ii) In the case of real-time files, where each transaction is processed immediately and only one master record is accessed at a time, the direct access method is appropriate.

    (iii) In cases where almost every record is accessed during processing, a sequentially ordered file is appropriate.

    3. File Interrogation

    (i) File interrogation refers to the retrieval of information from a file.

    (ii) If the retrieval of individual records must be fast to support a real-time operation, such as airline reservation, then some kind of direct organization is required.

    (iii) If, on the other hand, requirements for data can be delayed, then all the individual requests for information can be batched and run in a single processing run with a sequential file organization.

    4. File Size

    (i) Large files which require many individual references to records with immediate response must be organized under the direct access method.

    (ii) In the case of small files, it is better to search the entire file sequentially, or with a more efficient binary search, to find an individual record than to maintain complex indexes or complex direct addressing schemes.
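    The binary search mentioned in 4(ii) can be sketched in Python using the standard `bisect` module (the sample keys are illustrative):

```python
import bisect

emp_codes = [101, 204, 350, 407, 512]  # a small file, sorted by key

def binary_find(keys, key):
    # Repeated halving of the search range instead of a full scan.
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1

print(binary_find(emp_codes, 407))  # 3 (position of the record)
print(binary_find(emp_codes, 999))  # -1 (not present)
```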

    Problems of File Processing Systems:

    i. Data Redundancy: The same data is stored in different files, since the data files are independent. This results in a lot of duplicated data, and a separate file maintenance program is needed to update each file.

    ii. Data Dependence: The components of a file processing system depend on one another; therefore, when changes are made in the format and structure of data in a file, changes have to be made in all the programs that use this file.

    iii. Data Integrity: The same data is found in different forms in different files. Checking the validity of data cannot be uniformly implemented, with the result that data in one file may be correct while in another file it is wrong. Special computer programs have to be written to retrieve data from such independent files, which is time-consuming and expensive.

    iv. Data Availability: Since data is scattered across many files, it is necessary to look into many files before relying on a particular piece of data. Due to non-uniformity in file design, the same data may have different identification numbers in different files, and obtaining the necessary data will be difficult.

    v. Management Control: Uniform policies and standards cannot be set, since the data is scattered in different files. It is difficult to relate such files, and difficult to implement a decision, due to non-uniform coding of the data files.

    DATA BASE MANAGEMENT SYSTEMS

    Database

    A database is a collection of related and ordered information, organised in such a way that the information can be accessed quickly and easily. Hence, an organised logical group of related files would constitute a database.

    According to G. M. Scott, "A database is a computer file system that uses a particular file organisation to facilitate rapid updating of individual records, simultaneous updating of related records, easy access to all records by all application programs, and rapid access to all stored data which must be brought together for a particular routine report or inquiry or for a special purpose report or inquiry."

    Types of Databases:

    1. Operational Databases: These databases keep the information needed to support the operations of an organization. These are mainly day-to-day working databases, e.g. customer, employee and inventory databases, etc.

    2. Management Databases: These databases keep selected information and data extracted mainly from operational and external databases.

    3. Information Warehouse Databases: A data warehouse stores the data of current and previous years. It is a central source of data that has been standardized and integrated so that it can be used by managers and other end-user professionals throughout an organization.

    4. Distributed Databases: These are the databases of local work groups and departments at branch offices, manufacturing plants and other work sites, regional offices, etc. The main aim of these databases is to ensure that the organization's database is distributed but updated concurrently.


    Advantages:

    A local computer on the network offers immediate response to local needs.

    Systems can be expanded in a modular fashion as needed.

    Since many small computers are used, the system is not dependent on one large computer whose failure could shut down the whole network.

    Equipment operating and management costs are often lower.

    Microcomputers tend to be less complex than large systems; therefore the system is more usable by local users.

    5. End User Databases: These databases consist of the various data files, Word documents, Excel sheets and databases which end users have generated.

    6. External Databases: These are also known as online databases, provided by various data banks or organizations at a nominal fee.

    7. Text Databases: These are informative databases, available normally on CD-ROM disks at a certain price.

    8. Image Databases: These databases contain image and graphic information. They are available either on the Internet or on CD at a certain price.

    9. Object Oriented Databases: This is a type of database structure developed to suit changing application needs. When integrated database structures were developed, the need for OODBs was felt. Databases with relational qualities that are capable of manipulating text, data, objects, images and audio/video clips are used by organisations. Along with the OODB, OOP has been developed. In OOP (object oriented programming), every object is described by a set of attributes describing what the object is. The behaviour of the object is also included in the program. Objects with similar qualities and behaviour can be grouped together. OOP is more useful in decision making.
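    The attributes-plus-behaviour idea behind OODBs can be sketched in Python (the Employee class and its fields are invented for illustration):

```python
class Employee:
    def __init__(self, code, name, hourly_rate):
        # attributes: the data describing what the object is
        self.code = code
        self.name = name
        self.hourly_rate = hourly_rate

    def gross_pay(self, hours_worked):
        # behaviour: stored together with the object's attributes
        return self.hourly_rate * hours_worked

e = Employee(101, "Anand", 50.0)
print(e.gross_pay(40))  # 2000.0
```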

    10. Partitioned Databases (Partial Distribution): Some databases are centrally managed and some are managed in a decentralised manner. This approach is called a partitioned database. For example, financial, marketing and administrative data can be maintained at headquarters, whereas production data may be maintained at decentralised locations.

    Factors to be addressed in maintaining a database:

    1. Installation of the Database:

    Correct installation of the DBMS product.

    Ensuring that adequate file space is available.

    Allocating the disc space for the database properly.

    Allocating data files in standard sizes for input/output balancing.

    2. Memory Usage:

    How are the buffers being used?

    How does the DBMS use main memory?

    What priority do the programs in main memory have?

    3. Input/Output (I/O) Contention:

    Achieving maximum I/O performance is one of the most important aspects of tuning. Understanding how the data are accessed by end users is critical to managing I/O contention.

    A higher CPU clock speed requires more careful management of I/O.

    Simultaneous or separate use of I/O devices.

    Spooling, buffering, etc. can be used.

    4. CPU Usage:

    Multiprogramming and multiprocessing improve performance in query processing.

    Monitoring the CPU load.

    The mixture of online and background processing needs to be adjusted.

    Mark jobs that can be processed in off-peak periods to unload the machine during peak working hours.


    Components of a Database Environment

    1. Database files: These files contain the data elements, stored in database file organization formats. The database is created in such a way as to balance the data management objectives of speed, multiple access paths, minimum storage, program-data independence and preservation of data integrity.

    2. A Database Management System (DBMS): The DBMS is a set of system software programs that manages the database files. Requests for access to files, updating of records and retrieval of data are handled by the DBMS. The DBMS has the responsibility for data security, which is vital in a database environment, since the database is accessed by many users.

    3. The users: Users consist of both traditional users and application programmers, who are not traditionally considered users. Users interact with the DBMS indirectly via application programs or directly via a simple query language.

    Classification of DBMS Users:

    Nave users who are not aware of the presence of the database system supporting

    the usage.

    Online users, who may communicate with the database either directly through an online terminal or indirectly through a user interface or application programs. They usually acquire some skill and experience in communicating with the database.

    Application programmers who are responsible for developing the application

    programs and user interfaces.

    The DBA, who can exercise centralized control and is responsible for maintaining the database.

    User interaction with the DBMS includes the definition of the logical relationships in the database, and the input, maintenance, change, deletion and manipulation of data.

    4. A host interface system: This is the part of the DBMS that communicates with application programs. The host language interface interprets instructions in high-level-language application programs, such as COBOL and BASIC programs, that request data from files so that the needed data can be retrieved. During this process the OS interacts with the DBMS. Application programs do not contain information about the files; thus a program is independent of the database system.

    5. The application programs: These programs perform the same functions as they do in

    conventional system, but they are independent of the data files and use standard data

    definitions. This independence and standardisation make rapid special purpose program

    development easier and faster.


    6. A Natural Language Interface System: A query language permits online update and inquiry by users who are relatively unsophisticated about computer systems. This language is often termed English-like because its instructions are usually simple commands in English, used to accomplish an inquiry task. A query language also permits online programming of simple routines by managers who wish to interact with the data. The natural language interface may also help managers generate special reports.

    7. The data dictionary: The data dictionary is a centralized repository of information, in computerized form, about the data in the database. It contains the schema of the database, i.e. the name of each item in the database and a description and definition of its attributes, along with the names of the programs that use them, and authorization tables that specify the users and the data and programs authorized for their use. These descriptions and definitions are referred to as the data standards. Maintenance of the data dictionary is the responsibility of the DBA.
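    As a small illustration of the idea, SQLite's built-in `sqlite_master` catalogue behaves like a minimal data dictionary, recording the name, type and definition of every object in the database. The `sales` table below is invented for this sketch.

```python
import sqlite3

# In-memory database used purely for illustration; the sales table is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

# sqlite_master acts as a built-in data dictionary: it records the name,
# type and defining SQL of every object in the database.
for row in conn.execute("SELECT type, name, sql FROM sqlite_master"):
    print(row)
```

A full data dictionary would additionally track which programs and users are authorized to use each item, which SQLite's catalogue does not do.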

    8. Online access and update terminals: These may be adjacent to computer or even

    thousands of miles away. They may be dumb terminals, smart terminals or

    microcomputers.

    9. The output system or report generators: This provides routine job reports, documents and

    special reports. It allows programmers, managers and other users to design output reports

    without writing an application program in a programming language.

    10. File Pointer: A file pointer is placed in the last field of a record and contains the address of another related record, thus establishing a link between records. It directs the computer system to move to that related record.

    11. Linked List: A linked list is a group of data records arranged in an order that is based on embedded pointers. An embedded pointer is a special data field that links one record to another by referring to the other record. The field is embedded in the first record, i.e. it is a data element within the record.
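    The embedded-pointer idea can be sketched in a few lines of Python. The record addresses and name fields below are invented for illustration; each record's last field holds the address of the next related record.

```python
# A minimal sketch of a linked list built from records with an embedded
# pointer field; record addresses and contents are illustrative only.
records = {
    101: {"name": "Anand",  "next": 205},   # pointer field links to record 205
    205: {"name": "Bala",   "next": 309},
    309: {"name": "Chitra", "next": None},  # None marks the end of the list
}

def traverse(records, start):
    """Follow the embedded pointers from the starting record."""
    names, addr = [], start
    while addr is not None:
        rec = records[addr]
        names.append(rec["name"])
        addr = rec["next"]   # move to the related record
    return names

print(traverse(records, 101))  # ['Anand', 'Bala', 'Chitra']
```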

    Factors contributing to the Architecture of a Database:

    1. External View

    It is also known as user view.

    As the name suggests, it includes only that portion of the database with which the user's application programs are concerned.

    It is described by users/ programmers by means of external schema.

    2. Conceptual View

    It is also known as global view.

    It represents the entire database and includes all database entries.


    It is defined by conceptual schema and describes all records, relationships,

    constraints and boundaries.

    3. Internal view

    It is also known as physical view

    It describes the data structure and the access methods

    It is defined by internal schema and indicates how data will be stored

    Of the above three, the external view is USER DEPENDENT and the other two are USER INDEPENDENT.
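    One way to picture the external view sitting above the conceptual view is with a SQL view over a base table. This is a hedged sketch: the base `employee` table stands in for the conceptual schema, and `payroll_view` is one invented external view; a real system would define several such views for different user groups.

```python
import sqlite3

# The base table plays the role of the conceptual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Anand', 50000), (2, 'Bala', 60000)")

# External view: payroll users see only names and salaries; other user
# groups could be given different views over the same conceptual schema.
conn.execute("CREATE VIEW payroll_view AS SELECT name, salary FROM employee")

rows = conn.execute("SELECT * FROM payroll_view ORDER BY name").fetchall()
print(rows)
```

The internal (physical) view, how SQLite lays the rows out on disk, is hidden from both levels above it.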

    Data Independence

    1. Data independence is the ability to modify a schema definition at one level without affecting the schema at the next higher level.

    2. It facilitates logical data independence.

    3. It assures physical data independence.

    Structure of Database

    The logical organizational approach of the database is called the database structure. There are three basic structures available, viz. the Hierarchical, Network and Relational database structures.

    [Figure: Three-level schema architecture. External Schemas 1, 2 and 3 map onto the Conceptual Schema, which in turn maps onto the Physical Schema.]


    Hierarchical Database Structure

    In this type of architecture records are logically arranged into a hierarchy of

    relationships.

    Records are logically arranged in a tree pattern. Hierarchy structure implements one to

    one and one to many relationships. All records in hierarchy are called nodes.

    Each node is related to others in a parent-child relationship: each parent record may have one or more child records, but no child record may have more than one parent record.

    The top parent record in the hierarchy is called the root.

    Features of Hierarchy Database:-

    i. Hierarchically structured databases are less flexible than other database structures because the hierarchy of records must be determined and implemented before a search can be conducted; in other words, the relationships between records are relatively fixed by the structure.

    ii. Managerial use of a query language to solve a problem may require multiple searches, which is very time consuming. Thus analysis and planning activities, which frequently involve ad-hoc management queries of the database, may not be supported as effectively by a hierarchical DBMS as by other database structures.

    iii. Ad-hoc queries made by managers that require relationships other than those already implemented in the database may be difficult or time consuming to accomplish.

    iv. Records are logically structured in an inverted tree pattern.

    v. It implements one to one and one to many relationships.

    vi. Each record or node in hierarchy is related to other records in a parent- child

    relationship.

    vii. Logical structures in which a child has many parents are difficult to process.

    viii. Processing of records grouped by natural relationships can be done faster.
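    The fixed parent-child search path described above can be sketched as a tree that is always searched from the root. The node names are invented; the point is that every query must descend the predetermined hierarchy.

```python
# Sketch of a hierarchical (tree) structure: each node has exactly one
# parent, and searching always starts at the root. Node names are invented.
tree = {
    "company":      ["sales", "accounts"],              # root node
    "sales":        ["north_region", "south_region"],
    "accounts":     [],
    "north_region": [],
    "south_region": [],
}

def find_path(tree, root, target, path=None):
    """Return the root-to-target path, following the fixed hierarchy."""
    path = (path or []) + [root]
    if root == target:
        return path
    for child in tree[root]:
        found = find_path(tree, child, target, path)
        if found:
            return found
    return None

print(find_path(tree, "company", "south_region"))
# ['company', 'sales', 'south_region']
```

Because the relationships are fixed in the structure, an ad-hoc query that needs a different relationship (say, region to account) would require restructuring or multiple searches.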


    Relational Database Structure

    An example of such a situation is the representation of Actors, Movies and Theatres. To know who plays what and where, we need the combination of these three attributes; however, they relate to each other cyclically. To resolve this, we establish linking tables for Actor-Movie, Movie-Theatre and Theatre-Actor, each containing a portion of the primary key of the Actor, Movie and Theatre tables.

    ACTOR          MOVIE            THEATRE
    Kamalhaasan    Manmadhan Ambu   Satyam
    Dhanush        Aadukalam        PVR
    Karthi         Siruthai         INOX
    Trisha         Manmadhan Ambu   Satyam
    Tammanna       Siruthai         PVR

    i. This is a model where more than one data file is compared.

    ii. More than one file is compared at a time with the help of a common key field.

    iii. Each file is converted into a table and the analysis is done on the tables with the

    help of common key field.

    iv. The rows of the table represent records and the columns represent data fields.

    v. It is not necessary to maintain the entire file in a single physical location; it can be maintained at geographically dispersed places.

    vi. This is more suitable for wider analysis of data from different locations.

    [Figure: Linking tables ACTOR-MOVIE, MOVIE-THEATRE and THEATRE-ACTOR connect the ACTOR, MOVIE and THEATRE tables.]


    vii. Queries are easily possible because the software can interact with different records at the same time.
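    The Actor/Movie/Theatre example can be sketched with tables joined on a common key field. The schema below is a simplified assumption (a single `movie_id` key rather than the full set of linking tables), meant only to show how a query compares two tables at once.

```python
import sqlite3

# Simplified sketch of the relational example; schema and key are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT, theatre TEXT);
CREATE TABLE actor (name TEXT, movie_id INTEGER);  -- movie_id is the common key field
INSERT INTO movie VALUES (1, 'Manmadhan Ambu', 'Satyam'), (2, 'Aadukalam', 'PVR');
INSERT INTO actor VALUES ('Kamalhaasan', 1), ('Trisha', 1), ('Dhanush', 2);
""")

# Two tables are compared at a time with the help of the common key field:
rows = conn.execute("""
    SELECT a.name, m.title, m.theatre
    FROM actor a JOIN movie m ON a.movie_id = m.movie_id
    WHERE m.theatre = 'Satyam'
    ORDER BY a.name
""").fetchall()
print(rows)
```

The join answers "who plays what, and where" without the tables needing to live in the same physical location in a distributed relational system.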

    Network Database Structure

    This structure is useful when data must be related in one-to-one as well as many-to-many fashion, i.e. a record may have multiple parents as well as multiple children. This type of structure is found in organizations where online data processing is carried out.

    DBMS (Language)

    I. Data Definition Language:

    DDL defines the conceptual schema, providing a link between the logical and physical structures of the database. The logical structure of a database is called the schema. A subschema is the way a specific application views the data from the database.

    Following are the functions of DDL:

    i. They define the physical characteristics of each record: the fields in the record, each field's type and length, its logical name, and the relationships among the records.

    ii. They describe the schema and subschema.

    iii. They indicate the keys of the records.

    iv. They provide means for associating related records or fields.

    v. They provide for data security measures.

    vi. They provide for logical and physical data independence.

    II. Data manipulation Language

    DML is a Database Language used by database users to retrieve, insert, delete and

    update data in a database.


    Following are the functions of DML:

    They provide the data manipulation techniques like deletion, modification,

    insertion, replacement, retrieval, sorting and display of data or records.

    They facilitate use of relationships between the records

    They enable the user and application programs to be independent of the physical data structure and of database structure maintenance, by allowing data to be processed on a logical and symbolic basis rather than on a physical-location basis.

    They provide for independence of programming languages by supporting several

    high-level procedural languages like COBOL, C++, etc.
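    The DDL/DML split above can be illustrated with a short sketch; the table and column names are invented, and SQLite stands in for a generic DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: defines the record, its fields, types and key.
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")

# DML: insertion, modification, retrieval and deletion of data.
conn.execute("INSERT INTO student VALUES (1, 'Ravi')")
conn.execute("UPDATE student SET name = 'Ravi K' WHERE roll_no = 1")
print(conn.execute("SELECT name FROM student").fetchall())  # [('Ravi K',)]
conn.execute("DELETE FROM student WHERE roll_no = 1")
```

Note that the DML statements refer to the table only by its logical names; they say nothing about how or where the rows are physically stored.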

    STRUCTURE OF DBMS

    I. DDL Compiler

    It converts data definition statements into a set of tables.

    Tables contain meta data (data about the data) concerning the database.

    It gives rise to a format that can be used by other components of database.

    II. Data Manager

    It is the central software component

    It is referred to as the database control system.

    It converts operations in users' queries into operations on the physical file system.

    III. File manager

    It is responsible for file structure

    It is responsible for managing the space

    It is responsible for locating the block containing the required record.

    It is responsible for requesting block from disk manager.

    It is responsible for transmitting required record to data manager.

    IV. Disk Manager

    It is a part of the operating system

    It carries out all physical input/output operations.

    It transfers block/page requested by file manager.


    V. Query manager

    It interprets users' online queries.

    It converts them to an efficient series of operations in a form capable of being sent to the data manager.

    It uses data dictionary to find structure of relevant portion of database.

    It uses information to modify query.

    It prepares an optimal plan to access database for efficient data retrieval.

    VI. Data Dictionary

    It maintains information pertaining to structure and usage of data and meta data.

    It is consulted by database users to learn what each piece of data and the various synonyms of a data field mean.

    DATA BASE ADMINISTRATOR

    A DBA is a person who actually creates and maintains the database and also carries out the policies developed by the DA (Data Administrator). The job of the DBA is a technical one. He is responsible for defining the internal layout of the database and for ensuring that the internal layout optimizes system performance, especially in the main business processing areas.

    Main functions of a DBA are:-

    1. Determining the physical design of the database and specifying the hardware resource requirements for the purpose. This can be done by determining the data requirements, schedule and accuracy requirements, the manner and frequency of data access, search strategies, physical storage requirements of data, the level of security needed and the response time requirements.

    2. Define the contents of the database.

    3. Use the data definition language (DDL) to describe formats, relationships among various data elements and their usage.

    4. Maintain standards and controls for the database.

    5. Specify various rules, which must be adhered to while describing data for a database.

    6. Allow only specified users to access the database by using access controls, thus preventing unauthorised access.

    7. DBA also prepares documentation which includes recording the procedures, standard

    guidelines and data descriptions necessary for the efficient and continuous use of

    database environment.


    8. The DBA ensures that the operating staff performs its database processing related responsibilities, which include loading the database, following maintenance and security procedures, taking backups, scheduling the database for use, and following restart and recovery procedures after a hardware or software failure.

    9. DBA monitors the database environments.

    10. DBA incorporates any enhancements into the database environment, which may include

    new utility program or new system releases.

    Structured Query Language

    SQL is a query language that enables users to create relational databases, which are sets of related information stored in tables.

    It is a set of commands for creating, updating and accessing data in a database. It allows programmers, managers and other users to ask ad-hoc queries of the database interactively without the aid of programmers. It is a set of about 30 English-like commands such as SELECT...FROM...WHERE.

    SQL has the following features:

    a. Simple English like commands

    b. Command syntax is easy.

    c. Can be used by non-programmers.

    d. Can be used with different types of DBMS.

    e. Allows users to create and update databases.

    f. Allows retrieving data from the database without detailed information about the structure of the records and without being concerned about the processes the DBMS uses to retrieve the data.

    g. Has become standard practice for DBMS.

    Since SQL is used in many DBMS, managers who understand SQL are able to use the same set of

    commands regardless of the DBMS software that they may use.

    PROGRAM LIBRARY MANAGEMENT SYSTEM

    Program library management system provides several functional capabilities to facilitate effective

    and efficient management of the data centre software inventory. The inventory may include

    application and system software program code, job control statements that identify resources

    used and processes to be performed and processing parameters which direct processing.

    Some of the capabilities are as follows:


    a. Integrity: each source program is assigned a modification number and version number, and each source statement is associated with a creation date. Security for program libraries, job control language sets and parameter files is provided through the use of passwords, encryption, data compression facilities and automatic backup creation.

    b. Update- Library management systems facilitate the addition, deletion, re-sequencing, and

    editing of library members.

    c. Reporting- With use of its facilities a list of additions, deletions and modifications along

    with library catalogue and library member attributes can be prepared for management

    and auditor review.

    d. Interface- Library software packages may interface with the operating system, job

    scheduling, access control system and online program management.

    Need for Documentation:

    It provides a method to understand the various issues related to software development.

    It provides a means to access details related to the system study, system development, system testing and system operation.

    It provides details associated with further modification of software.

    Four types of documentation are required prior to delivery of customized software to a customer:

    Strategic and application plans

    Application systems and program documentation

    Systems software and utility program documentation

    Database documentation, Operation manual, User manual, Standard

    manual, Backup manual and others.

    DATA WAREHOUSE

    A data warehouse is a computer database that collects, integrates and stores an organisation's data with the aim of producing accurate and timely management information and supporting data analysis. It provides tools to satisfy the information needs of employees at all organizational levels, and not just for complex data queries. It makes it possible to extract archived operational data and to overcome inconsistencies between different legacy data formats.

    A Data Mart is a subset of a data warehouse. Most organizations start by designing a data mart to attend to immediate needs. To keep it simple, consider a data mart as a data reserve that satisfies a certain aspect of the business or just one application (or process). The data warehouse is a superset that engulfs all such mini data marts to form one big reservoir of information.

    Characteristics of Data warehouse

    1. It is subject oriented, meaning data are organized according to subject instead of application. The data organized by subject contain only the information necessary for decision support processing.

    2. Encoding of data is often inconsistent when the data resides in many separate

    applications in the operational environment but when data are moved from the

    operational environment into the data warehouse they assume a consistent coding

    convention.

    3. Data warehouse contains a place for storing historical data to be used for comparison,

    trends and forecasting.

    4. Data are not updated or changed in any way once they enter the data warehouse; they are only loaded and accessed.

    COMPONENTS OF A DATA WAREHOUSE (W.R.T figure)

    Data Sources

    Data sources refer to any electronic repository of information that contains data of interest for management use or analytics. This definition covers mainframe databases (e.g. IBM DB2, ISAM, Adabas, Teradata, etc.),client-server databases (e.g. IBM DB2, Oracle database, Informix, Microsoft SQL Server etc.), PC databases (eg Microsoft Access), spreadsheets (e.g. Microsoft Excel) and any other electronic store of data. Data needs to be passed from these


    systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for offline data warehouses.

    Data Transformation

    The Data Transformation layer receives data from the data sources, cleans and standardises it, and loads it into the data repository. This is often called "staging" data as data often passes through a temporary database whilst it is being transformed. This activity of transforming data can be performed either by manually created code or a specific type of software could be used called an ETL tool. Regardless of the nature of the software used, the following types of activities occur during data transformation:

    Comparing data from different systems to improve data quality (e.g. Date of birth for a customer may be blank in one system but contain valid data in a second system. In this instance, the data warehouse would retain the date of birth field from the second system)

    standardising data and codes (e.g. If one system refers to "Male" and "Female", but a second refers to only "M" and "F", these codes sets would need to be standardised)

    integrating data from different systems (e.g. if one system keeps orders and another stores customers, these data elements need to be linked)

    performing other system housekeeping functions such as determining change (or "delta") files to reduce data load times, generating or finding surrogate keys for data etc.
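    The comparison and standardisation steps described above can be sketched as follows. All system names, field names and code mappings below are invented for illustration: two source systems use different gender code sets, and one record's blank date of birth is filled from the other system.

```python
# Hedged ETL sketch: standardise codes and keep the first non-blank value.
CODE_MAP = {"Male": "M", "Female": "F", "M": "M", "F": "F"}

system_a = {"cust_id": 7, "gender": "Male", "dob": None}           # dob is blank
system_b = {"cust_id": 7, "gender": "M",    "dob": "1980-04-12"}   # dob is valid

def transform(*sources):
    """Merge records for one customer, standardising codes on the way."""
    merged = {}
    for rec in sources:
        for field, value in rec.items():
            if field == "gender":
                value = CODE_MAP[value]       # standardise the code set
            if merged.get(field) is None:     # retain the first non-blank value
                merged[field] = value
    return merged

print(transform(system_a, system_b))
```

A real ETL tool would also handle delta files, surrogate keys and error logging; this sketch shows only the cleaning and integration idea.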

    Data Warehouse

    The data warehouse is a relational database organised to hold information in a structure that best supports reporting and analysis. Most data warehouses hold information for at least one year, and some retain it for as much as half a century, depending on the business/operations data retention requirements. As a result these databases can become very large.

    Reporting

    The data in the data warehouse must be available to the organisation's staff if the data warehouse is to be useful. There are a very large number of software applications that perform this function, or reporting can be custom-developed. Examples of types of reporting tools include:

    Business intelligence tools: These are software applications that simplify the process of development and production of business reports based on data warehouse data.

    Executive information systems: These are software applications that are used to display complex business metrics and information in a graphical way to allow rapid understanding.

    OLAP Tools: OLAP tools form data into logical multi-dimensional structures and allow users to select which dimensions to view data by.

    Data Mining: Data mining tools are software that allows users to perform detailed mathematical and statistical calculations on detailed data warehouse data to detect trends, identify patterns and analyse data.


    Metadata

    Metadata, or "data about data", is used to inform operators and users of the data warehouse about its status and the information held within the data warehouse. Examples of data warehouse metadata include the most recent data load date, the business meaning of a data item and the number of users that are logged in currently.

    Operations

    Data warehouse operations comprise the processes of loading, manipulating and extracting data from the data warehouse. Operations also cover user management, security, capacity management and related functions.

    Optional Components

    In addition, the following components also exist in some data warehouses:

    1. Dependent Data Marts: A dependent data mart is a physical database (either on the same hardware as the data warehouse or on a separate hardware platform) that receives all its information from the data warehouse. The purpose of a Data Mart is to provide a sub-set of the data warehouse's data for a specific purpose or to a specific sub-group of the organisation.

    2. Logical Data Marts: A logical data mart is a filtered view of the main data warehouse that does not physically exist as a separate data copy. This approach to data marts delivers the same benefits but has the additional advantages of not requiring additional (costly) disk space and of always being as current as the main data warehouse.

    3. Operational Data Store: An ODS is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data. ODS's are used in some data warehouse architectures to provide near real time reporting capability in the event that the Data Warehouse's loading time or architecture prevents it being able to provide near real time reporting capability.

    Different methods of storing data in a data warehouse

    All data warehouses store their data grouped together by subject areas that reflect the general usage of the data (Customer, Product, Finance etc.). The general principle used in the majority of data warehouses is that data is stored at its most elemental level for use in reporting and information analysis.

    Within this generic intent, there are two primary approaches to organising the data in a data warehouse.

    The first is using a "dimensional" approach. In this style, information is stored as "facts" which are numeric or text data that capture specific data about a single transaction or event, and "dimensions" which contain reference information that allows each transaction or event to be classified in various ways. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and sales person. The main advantages of a dimensional approach are that the Data Warehouse is easy for business staff with limited information technology


    experience to understand and use. Also, because the data is pre-processed into the dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

    The second approach uses database normalisation. In this style, the data in the data warehouse is stored in third normal form. The main advantage of this approach is that it is quite straightforward to add new information into the database, whilst the primary disadvantage of this approach is that it can be quite slow to produce information and reports.
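    A minimal sketch of the dimensional approach: one fact table holding numeric measures, classified by two dimension tables. The table and column names are invented, and SQLite again stands in for a warehouse database.

```python
import sqlite3

# Star-schema sketch: fact_sales holds measures; the dim_ tables classify them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales   (product_id INTEGER, customer_id INTEGER,
                           qty INTEGER, price REAL);
INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_customer VALUES (10, 'North'), (11, 'South');
INSERT INTO fact_sales   VALUES (1, 10, 3, 9.5), (1, 11, 2, 9.5), (2, 10, 1, 20.0);
""")

# Slice the facts along a chosen dimension (here: customer region).
rows = conn.execute("""
    SELECT c.region, SUM(f.qty * f.price) AS revenue
    FROM fact_sales f JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)
```

Swapping `dim_customer` for `dim_product` in the join gives the same facts sliced by a different dimension, which is exactly the flexibility the dimensional approach is chosen for.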

    The Advantages of using a Data Warehouse are:

    1. Enhanced end-user access to a wide variety of data.

    2. Increased Data consistency

    3. Increased productivity and decreased computational cost.

    4. It is able to combine data from different sources, in one place.

    5. It provides an infrastructure that could support change to data and replication of the

    changed data back into the operational systems.

    Concerns in using a data warehouse

    Extracting, cleaning and loading data could be time consuming.

    The data warehousing project's scope might increase.

    Problems with compatibility with systems already in place, e.g. the transaction processing system.

    Providing training to end-users, who end up not using the data warehouse.

    Security could develop into a serious issue, especially if the data warehouse is web accessible.

    Types of Data Warehouses

    With improvements in technology, as well as innovations in using data warehousing techniques, data warehouses have changed from Offline Operational Databases to include an Online Integrated data warehouse.

    Offline Operational Data Warehouses are data warehouses where data is usually copied from real-time operational systems into an offline system where it can be used. This is usually the simplest and least technical type of data warehouse.

    Offline Data Warehouses are data warehouses that are updated frequently (daily, weekly or monthly), with the data then stored in an integrated structure where others can access it and perform reporting.

    Real Time Data Warehouses are data warehouses that are updated moment by moment with the influx of new data. For instance, a real-time data warehouse might incorporate data from a point-of-sale system and be updated with each sale that is made.


    Integrated Data Warehouses are data warehouses that other operational systems can access. Some integrated data warehouses are also used by other data warehouses, which access them to process reports as well as to look up current data.

    BACKUP AND RECOVERY

    Recovery is a sequence of tasks performed to restore a database to some point-in-time.

    'Disaster recovery' differs from a database recovery scenario because the operating system

    and all related software must be recovered before any database recovery can begin.

    Database files that make up a database: Databases consist of disk files that store data. When you create a database, using either database software or a command-line utility, a main database file or root file is created. This main database file contains database tables, system tables, and indexes. Additional database files expand the size of the database and are called dbspaces. A dbspace contains tables and indexes, but not system tables.

    A transaction log is a file that records database modifications. Database modifications

    consist of inserts, updates, deletes, commits, rollbacks, and database schema changes. A

    transaction log is not required but is recommended. The database engine uses a

    transaction log to apply any changes made between the most recent checkpoint and the

    system failure. The checkpoint ensures that all committed transactions are written to disk.

    During recovery the database engine must find the log file at the specified location. When the transaction log file is not specifically identified, the database engine presumes that the log file is in the same directory as the database file.

    A mirror log is an optional file and has a file extension of .mlg. It is a copy of a transaction

    log and provides additional protection against the loss of data in the event the transaction

    log becomes unusable.

    Online backup, offline backup, and live backup: Database backups can be performed while the database is being actively accessed (online) or when the database is shut down (offline). When a database goes through a normal shutdown process (one that is not cancelled), the database engine commits the data to the database files. An online database backup is performed by executing the command-line backup utility or from the 'Backup Database' utility.

    When an online backup process begins the database engine externalizes all cached data

    pages kept in memory to the database file(s) on disk. This process is called a checkpoint.

    The database engine continues recording activity in the transaction log file while the

    database is being backed up. The log file is backed up after the backup utility finishes

    backing up the database. The log file contains all of the transactions recorded since the last

    database backup. For this reason the log file from an online full backup must be 'applied'

    to the database during recovery. The log file from an offline backup does not have to

    participate in recovery but it may be used in recovery if a prior database backup is used.


    A live backup is carried out by using the backup utility with the appropriate command-line option. A live backup provides a redundant copy of the transaction log for restarting your system on a secondary machine in the event that the primary database server machine becomes unusable.

Full and incremental database backup: A full backup is the starting point for all other types of backup and contains all the data in the folders and files that are selected to be backed up. Because a full backup stores all files and folders, frequent full backups result in faster and simpler restore operations.

    Incremental backup stores all files that have changed since the last FULL, DIFFERENTIAL

    OR INCREMENTAL backup. The advantage of an incremental backup is that it takes the

    least time to complete.

For example, suppose you run a backup on Friday: this first backup is always a full backup by default. Then, after you work with these files on Monday, Leo Backup performs an incremental backup: it transfers only those files that changed since Friday. A Tuesday backup carries only the files that changed since Monday, and so on for the following days.
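The Friday/Monday example above can be sketched as a simple modification-time comparison: an incremental backup selects only the files changed since the previous backup. The file names and the mtime-based change test are illustrative assumptions; real backup tools may use archive bits or checksums instead.

```python
import os
import tempfile
import time

def incremental_candidates(folder, last_backup_time):
    """Return the files modified since the previous (full or incremental) backup."""
    changed = []
    for name in sorted(os.listdir(folder)):
        if os.path.getmtime(os.path.join(folder, name)) > last_backup_time:
            changed.append(name)
    return changed

tmp = tempfile.mkdtemp()
for name in ("report.txt", "notes.txt"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write(name)

friday_full = time.time()  # the first backup is always a full backup
# Simulate editing one file on Monday by bumping its modification time.
monday = friday_full + 60
os.utime(os.path.join(tmp, "notes.txt"), (monday, monday))

monday_incremental = incremental_candidates(tmp, friday_full)  # only the edited file
```

Only the edited file qualifies for Monday's incremental backup, which is why incremental backups take the least time to complete.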

    Core phases in developing a backup and recovery strategy

1. Create backup and recovery commands: The commands should be tested and their output verified to ensure that the desired results are produced.

2. Time estimates from executing the backup and recovery commands help to get a feel for how long these tasks will take. This information helps in identifying which commands will be executed and when.


3. Document the backup commands and create procedures outlining the backups, which are kept in a file. Also identify the naming convention used as well as the kind of backups performed.

4. Incorporate health checks into the backup procedures to ensure that the database is not corrupt. A database health check can be performed prior to backing up a database, or on a copy of the database from the backup.

5. Deployment of backup and recovery consists of setting up the backup procedures on the production server. Verify that the necessary hardware is in place, along with any other supporting software required to perform these tasks. Modify the procedures to reflect any changes from the development environment.

    6. Monitor backup procedures to avoid unexpected errors. Make sure that any changes in

    the process are reflected in the documentation.
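The phases above can be sketched as one small, documented backup routine: a health check runs before the copy (phase 4), a naming convention labels the copy (phase 3), and a checksum comparison monitors the result for unexpected errors (phase 6). Everything here is an illustrative stand-in; a real health check would verify the database's internal consistency, not just its size.

```python
import hashlib
import os
import shutil
import tempfile

def health_check(db_path):
    # Stand-in health check: a real one would validate the database's contents.
    return os.path.exists(db_path) and os.path.getsize(db_path) > 0

def run_backup(db_path, backup_dir):
    if not health_check(db_path):
        raise RuntimeError("health check failed; refusing to back up the database")
    # Naming convention: prefix the copy with the kind of backup performed.
    dest = os.path.join(backup_dir, "full_" + os.path.basename(db_path))
    shutil.copy2(db_path, dest)
    # Monitoring: verify the copy matches the source so errors surface immediately.
    with open(db_path, "rb") as src, open(dest, "rb") as cpy:
        verified = hashlib.sha256(src.read()).digest() == hashlib.sha256(cpy.read()).digest()
    return dest, verified

tmp = tempfile.mkdtemp()
db_file = os.path.join(tmp, "sales.db")
with open(db_file, "w") as f:
    f.write("page1 page2")
backup_path, verified = run_backup(db_file, tmp)
```

Keeping the whole procedure in one script makes it easy to document, time, and deploy unchanged on the production server.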

    Data Centre and the challenges faced by the management of a data

    centre:

    i. A Data centre is a centralized repository for the storage, management and dissemination

    of data and information.

    ii. Data centre is a facility used for housing a large amount of electronic equipment, typically

    computers and communication equipment.

iii. The purpose of a data centre is to provide space and bandwidth connectivity for servers in a reliable, secure and scalable environment.

iv. It also provides facilities like housing websites, providing data serving and other services for companies. Such a data centre may contain a network operations centre (NOC), which is a restricted-access area containing automated systems that constantly monitor server activity, web traffic and network performance, and report even slight irregularities to engineers so that they can stop potential problems before they occur.

    Challenges:

Maintaining Infrastructure - A data centre needs to set up an infrastructure comprising a number of pieces of electronic equipment, typically computers, and bandwidth connectivity for servers in a reliable, secure and scalable environment.

Skilled Human Resources - A data centre needs skilled staff who are expert at network management and have software and hardware operating skills.

    Selection of Technology- A Data centre also faces the challenge of proper selection of

    technology crucial to the operation of the data centre.

Maintaining System Performance - A data centre has to maintain maximum uptime and system performance, while establishing sufficient redundancy and maintaining security.


    DATA MINING

Data mining is the extraction of implicit, previously unknown and potentially useful information from data. It searches for relationships and global patterns that exist in large databases but are hidden among vast amounts of data. These relationships represent valuable knowledge about the database and the objects in it, which can be put to use in areas such as decision support, prediction, forecasting and estimation.

In other words, data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer that is responsible for finding the patterns, by identifying the underlying rules and features in the data.

    Stages in data mining

1. Selection: Selecting or segmenting the data according to some criteria so that subsets of the data can be determined.

2. Pre-processing: This is the data-cleansing stage, where information deemed unnecessary is removed, since it may slow down queries. The data is also re-configured to ensure a consistent format, as inconsistent formats are likely when the data is drawn from several sources.

    3. Transformation: The data is not merely transferred across but transformed in that overlays

    may be added. For example, Demographic overlays are commonly used in market

    research. The data is made usable and navigable.

    4. Data mining: This stage is concerned with the extraction of patterns from the data. A

    pattern can be defined as a given set of facts. One popular example of data mining is using

    past behaviour to rank customers. Such tactics have been employed by financial

    companies for years as a means of deciding whether or not to approve loans and credit

    cards.

5. Integration and Evaluation: The patterns identified by the system are interpreted into knowledge, which can then be used to support human decision-making: for example, prediction and classification tasks, summarising the contents of a database, or explaining observed phenomena.
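The five stages above can be walked through on a tiny customer dataset. The field names, the age-band overlay, and the missed-payments scoring rule are all illustrative assumptions standing in for real demographic overlays and mining algorithms.

```python
# A toy dataset: raw records drawn from several sources, with mixed formats.
raw = [
    {"id": 1, "age": "34", "missed_payments": 0, "note": "vip"},
    {"id": 2, "age": "51", "missed_payments": 3, "note": ""},
    {"id": 3, "age": None, "missed_payments": 1, "note": ""},
]

# 1. Selection: segment the data by a criterion (keep complete records only).
selected = [r for r in raw if r["age"] is not None]

# 2. Pre-processing: drop unnecessary fields and enforce a consistent format.
cleaned = [{"id": r["id"], "age": int(r["age"]),
            "missed_payments": r["missed_payments"]} for r in selected]

# 3. Transformation: add an overlay, like a demographic age band.
for r in cleaned:
    r["age_band"] = "under_40" if r["age"] < 40 else "40_plus"

# 4. Data mining: rank customers by past behaviour (fewer missed payments first).
ranked = sorted(cleaned, key=lambda r: r["missed_payments"])

# 5. Interpretation/evaluation: turn the pattern into a decision rule.
approve_loan = [r["id"] for r in ranked if r["missed_payments"] == 0]
```

Each stage's output feeds the next, which is why the cleansing and transformation steps come before any pattern extraction.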