Lecturer 4: File Handling - WordPress.com

Lecturer 4: File Handling

File Handling

• The logical and physical organisation of files.

• Serial and sequential file handling methods.

• Direct and indexed sequential files.

• Creating, reading, writing and deleting records from a variety of file structures.

• Creating code to carry out the above operations.

Describe the Logical Organization of a File

• A file is logically organised as follows:

• A record is a collection of data that belongs together, e.g. all the data about an individual person.

• A data item is an individual field of a record and usually contains one piece of data, e.g. a date, first name, age.

• These fields are collected together to form records.

• Records are collected together to form a file.

• A file is made up of records containing fields.
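
The file/record/field hierarchy described above can be sketched in Python. This is only an illustration: the names `StudentRecord`, `student_file` and the sample values are assumptions, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    # Each attribute is one field (data item) of the record.
    student_id: int
    first_name: str
    age: int


# A file is modelled here as an ordered collection of records.
student_file = [
    StudentRecord(student_id=1, first_name="Ann", age=19),
    StudentRecord(student_id=2, first_name="Bob", age=21),
]

# Read one field from the first record in the file.
first_name_of_first_record = student_file[0].first_name
```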

Logical Organization

Characteristics of a Sequential Device

• Slow.

• Inexpensive.

• Access time is dependent on current position.

• Example: magnetic tape.

Characteristics of a Random or Direct Access Device

• Fast.

• Expensive.

• Have an almost constant access time.

• Examples: magnetic disk (floppy or hard disk) or optical media (CD-ROM, CD-R, CD-RW, DVD etc.).

Serial Access

• Each record is stored, one after the other, with

no regard to any logical order.

• It is the simplest form of file organisation.

• This type of technique is normally used for

storing records for further processing.

Features of Serial Access

• Easy to implement on magnetic tape.

• Generally slow access.

• Usually used for:

• Further processing (sorting of records).

• Temporary files to store transaction data.

• Suitable for batch servicing of searches, i.e. we can group together a number of requests and process them as a group.

• Not suitable for:

• On-line access, because it is too slow.

• A master file, as the whole file has to be searched for a particular record, starting at the beginning.
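
A minimal sketch of serial access, assuming the file is held as a Python list and each record is a dict with an `id` field (both are illustrative assumptions). Records are appended in arrival order, and a search has to start at the beginning.

```python
def serial_append(file_records, record):
    """Store a record at the end of the file, ignoring any key order."""
    file_records.append(record)


def serial_search(file_records, key):
    """Scan from the beginning; return (record, number of records read)."""
    for reads, record in enumerate(file_records, start=1):
        if record["id"] == key:
            return record, reads
    # Not found: every record in the file had to be read.
    return None, len(file_records)


transactions = []
for rec_id in (42, 7, 19):
    serial_append(transactions, {"id": rec_id})

record, reads = serial_search(transactions, 19)
```

Because there is no ordering, a failed search always reads the whole file, which is why serial files suit batch work rather than on-line lookups.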

Sequential Access

• Records are stored one after the other but are sorted using a key sequence.

• Less flexible and more organised than a serial file.

• Records are kept in some pre-defined order, e.g. names stored alphabetically, or records stored numerically.

• Retrieval is achieved by scanning the entries in the same order, e.g. 001, 002, 003, 004, 005 etc., so if we want record number 200 then records 001 to 199 have to be scanned first.
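
The scan described above can be sketched as follows. The record layout (a dict with an `id` field) is an assumption for illustration. Because the keys are in order, the scan can stop as soon as it passes the place where the key would be.

```python
def sequential_search(sorted_records, key):
    """Scan records held in ascending key order.

    Returns (record or None, number of records read).
    """
    reads = 0
    for record in sorted_records:
        reads += 1
        if record["id"] == key:
            return record, reads
        if record["id"] > key:
            break  # we have passed the position where key would be stored
    return None, reads


records = [{"id": n} for n in range(1, 1000)]
found, reads = sequential_search(records, 200)
```

Here `reads` counts records 001 to 199 plus record 200 itself.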

Features of Sequential Access

• Records stored in pre-defined order.

• Sequential access to successive records.

• Suited to magnetic tape.

• To maintain the sequential order:

• Updating is a complicated and difficult task.

• Records will usually need to be moved by one

place in order to add (slot in) a record in the

proper sequential order.

• Deleting records will usually require that

records be shifted back one place to avoid

gaps in the sequence.
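
The cost of keeping the sequence intact can be sketched with a sorted Python list of keys, using the standard `bisect` module to find the slot. The helper names are hypothetical; each function reports how many existing entries had to move.

```python
import bisect


def insert_in_order(keys, new_key):
    """Insert new_key keeping keys sorted; return how many keys were shifted."""
    position = bisect.bisect_left(keys, new_key)
    keys.insert(position, new_key)  # later entries move up one place
    return len(keys) - 1 - position


def delete_in_order(keys, key):
    """Delete key; return how many keys were shifted back one place."""
    position = bisect.bisect_left(keys, key)
    if position == len(keys) or keys[position] != key:
        raise KeyError(key)
    del keys[position]  # later entries move back one place
    return len(keys) - position


keys = [1, 2, 4, 5]
shifted_on_insert = insert_in_order(keys, 3)
```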

Features of Sequential Access

• Very useful for:

• Transaction processing where the hit rate is

very high e.g. payroll systems.

• When the whole file is processed as this is

quick and efficient.

• Access times are still too slow (no better on

average than serial) to be useful in on-line

applications.

Random or Direct Access

• Records are accessed directly, allowing records

to be read in any order. For example, to read

record 005 you just jump directly to it.

• Data can be read or written anywhere in the file.

• The medium being used must allow a jump to any point in the file (disk storage).

• Is not favoured if the demand is primarily for

sequential processing.
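
A sketch of direct access using fixed-length records and `seek`. The record layout (a 2-byte ID plus a 20-byte surname), the in-memory `io.BytesIO` standing in for a disk file, and the sample surnames are all assumptions made for illustration.

```python
import io
import struct

# Assumed fixed-length layout: unsigned 16-bit ID followed by a 20-byte surname.
RECORD = struct.Struct("<H20s")


def write_record(f, slot, student_id, surname):
    """Jump straight to the slot and write one record there."""
    f.seek(slot * RECORD.size)
    f.write(RECORD.pack(student_id, surname.encode("ascii")))


def read_record(f, slot):
    """Jump straight to the slot and read one record, without scanning."""
    f.seek(slot * RECORD.size)
    student_id, raw = RECORD.unpack(f.read(RECORD.size))
    return student_id, raw.rstrip(b"\x00").decode("ascii")


disk = io.BytesIO()  # stands in for a file on a direct access device
for slot, name in enumerate(["George", "Hugh", "Adams", "Murray", "Sinclair"]):
    write_record(disk, slot, slot + 1, name)

record_005 = read_record(disk, 4)
```

Reading record 005 costs one seek and one read, regardless of how many records precede it.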

Hash Coding

• Enables direct retrieval of desired records without the need to search files or indices.

• The hashing algorithm is first applied to one of the keys of the record (e.g. driving licence number, student number or National Insurance number).

• The algorithm converts the key to an address by mathematical or logical calculations.

• Direct addressing is used when records have to

be searched frequently in an unpredictable

fashion.

Hash Coding - Example

• The sale of goods in shops where details about

individual items have to be made available

simultaneously in a random fashion at many

points (check out lanes in a supermarket).

• One method is to divide the primary key by a

prime number and use the remainder as the

address.

• For example, divide the student number by a prime number, say 97, and use the remainder as the storage location in the file.

Hash Coding

• For example, let's take a student number, 1069, and divide it by 97. We get a remainder of 2, which is the location of that student record. The remainder will be between 0 and 96. This gives us 97 potential locations for records.

• Once records are stored in this fashion, retrieval

simply involves supplying a student number,

which will be used by the hashing algorithm to

locate the desired student record.
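
The division-remainder method can be sketched as below. The function and variable names are hypothetical, and the sketch ignores collisions for now.

```python
PRIME = 97  # divisor used in the lecture's example


def hash_address(student_number, prime=PRIME):
    """Return the storage location: the remainder after dividing by the prime."""
    return student_number % prime


# 97 potential locations, numbered 0 to 96.
locations = [None] * PRIME
locations[hash_address(1069)] = {"student_number": 1069}


def retrieve(student_number):
    """Re-apply the hashing algorithm to find the record again."""
    return locations[hash_address(student_number)]


address_of_1069 = hash_address(1069)
```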


Advantages of hash coding

• Rapid access to records in a direct fashion. It doesn't make use of large index tables and dictionaries and therefore response times are very fast.

Disadvantages of hash coding

• Collisions require the creation of an overflow area. Two keys can sometimes calculate to the same address.

• Example: if there is a student number 3300, division by 97 will produce a remainder of 2. If another record (say 1069) is already in storage location 2, the extra record will have to be kept in an overflow area.

• If hashing produces the same location for more than one record, response time may increase because of the necessity to search the overflow area when the key in the hash address does not match the key we are looking for.
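
One way to picture a collision and the overflow area is the sketch below. It assumes a single shared overflow list that is searched linearly; that is only one of several possible overflow schemes, and the class name `HashedFile` is hypothetical.

```python
PRIME = 97


class HashedFile:
    def __init__(self, prime=PRIME):
        self.prime = prime
        self.home = [None] * prime  # one record per hash address
        self.overflow = []          # records whose hash address was already taken

    def store(self, key, data):
        address = key % self.prime
        if self.home[address] is None:
            self.home[address] = (key, data)
        else:
            self.overflow.append((key, data))  # collision

    def fetch(self, key):
        entry = self.home[key % self.prime]
        if entry is not None and entry[0] == key:
            return entry[1]  # found directly at the hash address
        # Key at the hash address does not match: search the overflow area.
        for stored_key, data in self.overflow:
            if stored_key == key:
                return data
        return None


students = HashedFile()
students.store(1069, "first record")
students.store(3300, "second record")  # 3300 % 97 is also 2
```

Fetching 3300 costs one probe of location 2 plus a scan of the overflow area, which is the extra response time described above.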

• Sometimes storage space can be wasted if there are not enough records to occupy the reserved spaces. For example, if we are using 97 as the prime divisor there should be close to 97 records to go into these predetermined locations. If we choose to divide by 9713 there should be around 9713 records to optimise the use of storage space.



• The table of locations almost always reveals that

records are not kept in sequential order by the

key.


• Indeed the records are kept in a pseudo random

fashion. Therefore sequential processing of such

a file can raise awkward problems. Suppose we

wish to produce a sequential list of student

numbers; then, for efficiency, we have to keep a

separate sequentially sorted copy. Hash coding

is therefore not used in applications that involve

frequent sequential processing of records. A

more suitable technique would be to use an

indexed sequential file organisation.

Indexed Sequential

• Organises the file into sequential order, usually based on a key field, similar in principle to the sequential access file.

• However, it is also possible to directly access

records by using a separate index file.

• An indexed file system consists of a pair of files:

• one holding the data

• one storing an index to that data. The index

file will store the addresses of the records

stored on the main file.

• There may be more than one index created for a data file, e.g. a library may have its books stored on computer with indices on author and subject.

Indexed Sequential

• There are two types of indexed files:

• Fully Indexed

• Indexed Sequential

Indexed Sequential

• An index to a fully indexed file will contain an

entry for every single record stored on the main

file.

• The records will be indexed on some key e.g.

student number. Very large files will have

correspondingly large indices.

• The index to a (large) file may be split into

different index levels.

• When records are added to such a file, the index

(or indices) must also be updated to include their

relative position and change the relative position

of any other records involved.
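
A fully indexed file can be sketched as a pair of structures: a main file and an index with one entry per record. The class name and the use of a Python dict for the index are assumptions for illustration.

```python
class FullyIndexedFile:
    def __init__(self):
        self.data = []   # main file: records in arrival order
        self.index = {}  # index file: key -> relative position in the main file

    def add(self, key, record):
        """Append the record and update the index with its relative position."""
        self.index[key] = len(self.data)
        self.data.append(record)

    def get(self, key):
        """Look the key up in the index, then go directly to the record."""
        position = self.index.get(key)
        return None if position is None else self.data[position]


students = FullyIndexedFile()
students.add(10052329, {"surname": "Sinclair"})
students.add(10041111, {"surname": "Adams"})
```

The index grows with the file, one entry per record, which is why very large files have correspondingly large indices.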

Indexed Sequential

• This is basically a mixture of sequential and

indexed file organisation techniques. Records

are held in sequential order and can be accessed

randomly through an index. Thus, these files

share the merits of both systems enabling

sequential or direct access to the data.

• The index to these files operates by storing the

highest record key in given cylinders and tracks.

• Note how this organisation gives the index a tree

structure.

• Obviously this type of file organisation will

require a direct access device, such as a hard

disk.

Indexed Sequential

• Indexed sequential file organisation is very

useful where records are often retrieved

randomly and are also processed in (sequential)

key order.

• Banks may use this organisation for their auto-

bank machines i.e. customers randomly access

their accounts throughout the day and at the end

of the day the banks can update the whole file

sequentially.

Advantages /Disadvantages of Indexed Sequential

• Advantages :

• Allows records to be accessed directly or

sequentially.

• Direct access ability provides vastly superior

(average) access times.

• Disadvantages:

• Several tables must be stored for the index, which makes for a considerable storage overhead.

• The addition/deletion of records is complex.

Because frequent updating can be very

inefficient, especially for large files, batch

updates are often performed.

Physical file organization

• There are various ways in which a file is

physically stored on a tape or disk.

• The information is initially mapped onto the

physical blocks, and eventually onto the tracks

and sectors of a disk.

• At Hope, we keep records of students and each

student has a unique identification number that

is used as a primary key field, e.g. 10052329.

• For further illustration purposes we will assume

that Hope only has 999 students, catering for a

range of ID’s from 001 to 999, hence the

following file.

Physical file organization

Student_ID_Number    Student_Surname
001                  George
002                  Hugh
003                  Adams
004                  Murray
005                  Sinclair
006                  Patterson
…                    …
999                  Cookson

Sequential file organization

• In order to access record 005, ‘Sinclair’, the R/W (read/write) head, which is positioned at the beginning of the file, would need to read records 001 through to 004 first. If we held 999 student records on file, accessing the last record would take a long time.

• A preferred method would be to implement sequential file organisation on a disk but this is not possible, so direct access would be the preferred method of file storage and retrieval.

Sequential file organization

• Added to the disk is an index, which is loaded into RAM and defines the relationship between the primary key and the corresponding disk address.

• The index tells the disk R/W head where to look for the

data (sector and track).

• The R/W head goes directly to the correct disk track

position, waits for the correct sector to rotate under the

head and then retrieves the student’s record.

• Due to the size of the index (holding in our case pointers

to 999 records and their relevant disk addresses), a

compromise sometimes has to be reached between

direct and sequential file organisation.

Index Sequential file organization

To store such information would require a vast amount of memory. In order to avoid this and reduce our index file size, we could simply omit the last digit of each ID held in the index.

Index Sequential file organization

This time-space compromise would reduce the demand on memory and the time spent processing the data.

If we were to look for record 010 we would have immediate access to it (provided there are no other IDs within the same region, i.e. 011 to 019); otherwise the records would be accessed sequentially through the index until the required record is reached.
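
The "omit the last digit" compromise can be sketched as a partial index: one entry per group of ten IDs, pointing at the first record of the group, followed by a short sequential scan. The sketch assumes the main file is a sorted list of IDs, and the function names are hypothetical.

```python
def build_partial_index(sorted_ids):
    """Index on the ID with its last digit dropped: group -> first position."""
    index = {}
    for position, student_id in enumerate(sorted_ids):
        index.setdefault(student_id // 10, position)
    return index


def lookup(sorted_ids, index, student_id):
    """Jump to the start of the group, then scan sequentially.

    Returns (position or None, number of records read).
    """
    start = index.get(student_id // 10)
    if start is None:
        return None, 0
    reads = 0
    for position in range(start, len(sorted_ids)):
        reads += 1
        if sorted_ids[position] == student_id:
            return position, reads
        if sorted_ids[position] > student_id:
            break  # passed the place where the ID would be
    return None, reads


ids = list(range(1, 1000))  # IDs 001 to 999
index = build_partial_index(ids)
```

With 999 records the index holds 100 entries instead of 999. Record 010 is reached in one read, while record 015 needs six reads within its group.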

Criteria for selecting file organisation

• There are four main criteria to be considered when choosing a file organisation technique:

• File use ratio (hit rate)

• File volatility

• File size

• User requirements

File Ratio (Hit Rate)

• File use ratio = (number of records that are accessed) / (total number of records in the file)

• If the ratio is high it indicates that the majority of records are used regularly, which means sequential/serial file organisation may be the appropriate method.

• If the ratio is low (say 5% to 10%) then the implication is that the ability to retrieve a desired record quickly is crucial and therefore direct file organisation should be recommended.

File Ratio – High Ratio example

• A payroll file is a high-activity file.

• An organisation's production of payroll and payslips is a regular event, which can be either weekly or monthly.

• It requires processing of all or nearly all the employee records and therefore the file-use ratio will be close to or equal to one (100%).

• Thus sequential file organisation is preferred.

File Ratio – Medium ratio example

• Customer accounts in banks: both random and sequential access are required.

• Several customers should be able to withdraw cash from cash dispensers simultaneously and randomly.

• The bank should be able to update all customer accounts periodically by sequential processing.

• Indexed sequential file organisation may therefore be the most suited to this type of application.

File Ratio – low ratio example

Airline ticket reservations: only one record is accessed at a time. This record is required quickly and therefore direct accessing is most appropriate.

Calculating the File Ratio

Examples, for a file that has 8,000 records:

• 250 records are accessed and updated per week. File use ratio = 250 / 8000 = 0.03125 per week (very low).

• 4100 records are accessed per week. File use ratio = 4100 / 8000 = 0.5125 per week (medium).

• All but 400 are accessed weekly, i.e. 7600 accessed per week. File use ratio = 7600 / 8000 = 0.95 per week (very high).
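
The same arithmetic as a small function. The name `file_use_ratio` is hypothetical, and the inputs are the three weekly figures from the examples.

```python
def file_use_ratio(records_accessed, total_records):
    """Hit rate: records accessed divided by total records in the file."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return records_accessed / total_records


TOTAL = 8000
weekly_ratios = {
    accessed: file_use_ratio(accessed, TOTAL)
    for accessed in (250, 4100, 7600)
}
```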

File Volatility

• This indicates how often files require

modification and updating, e.g. insertions and

deletions.

• Highly volatile files are not usually indexed, as

this would entail excessive overheads in too

frequently updating the index and file.

• Indexing is used when the data is fairly stable.

File Size

• When files are large, serial/sequential location techniques give longer access times. Thus large files are usually indexed or direct files.

User Requirements

• The main factor to concern most users is how they access the files:

• Batch access: If they are happy to use batch access then sequential file organisation is likely to be appropriate, providing the file activity is reasonable.

• Interactive access: If the user needs to operate interactively then direct access will be required, which will mean indexed or hash coded files.

Minor Criteria

• The type of storage device available e.g.

magnetic tape will only allow serial/sequential

access.

• The ease (or complexity) of actually

implementing the file organisation technique

with the data concerned.

• Availability/features/cost of software to handle

the organisation technique preferred according to

other factors.

Physical and Relative Addresses

• To retrieve records we must know where they are stored.

• There are two ways of indicating the location in

which they are stored:

1. Physical Addresses

2. Relative Addresses

Physical Addresses

• Tell us the actual physical location

of the record on the storage

medium

• e.g. on a magnetic disk we would

need to know the cylinder, track

and sector which held the record.

Relative Addresses

• Used by modern file organisation techniques.

• The address is provided according to its position in the file and not its physical location on the storage device.

• The 56th record in a file would have an address of 56, independent of its physical location.

• Relative addresses must be converted to physical ones at some point for the computer to find the record.
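
The conversion from a relative to a physical address can be sketched with integer division. The disk geometry below (8 sectors per track, 4 tracks per cylinder, one record per sector, file starting at cylinder 0) is purely an assumption for illustration; real devices and file systems differ.

```python
def relative_to_physical(relative_address, sectors_per_track=8, tracks_per_cylinder=4):
    """Map a 1-based relative record address to (cylinder, track, sector).

    Assumes one record per sector and a file laid out contiguously
    from cylinder 0, track 0, sector 0.
    """
    if relative_address < 1:
        raise ValueError("relative addresses start at 1")
    block = relative_address - 1  # zero-based record number
    track_number, sector = divmod(block, sectors_per_track)
    cylinder, track = divmod(track_number, tracks_per_cylinder)
    return cylinder, track, sector


location_of_56th = relative_to_physical(56)
```

Under these assumed figures the 56th record lands on cylinder 1, track 2, sector 7. If the file were moved on the disk, its relative address of 56 would stay the same and only this mapping would change.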

File Content

• Files will have very different contents according to the work that they are created to assist.

• The number and type of users may also have an effect.

• Private one-user files:

• These are created to be used by one operator (hold data for one job).

• Private database files:

• They store data for a group of related users (e.g. managers in an organisation).

• Several programs may well operate on the same database file(s), e.g. a student file may be used to produce student identity cards, update course/exam results and produce mail shots.

File Content

• Public files (Shared Files)

• These are also called shared files. They are created in order that users of a common computing service can all access each other’s files either in parts or in their entirety, as specified by the producers of the files.

• Public database files

• These are also called databanks and are databases that are open to public enquiry. They usually concentrate on a particular field such as medicine, law, finance etc. Often they are not a free service but charge a subscription/registration fee and/or charge for usage.

File Classification

1. Master File: Contains permanent records that are updated

by adding, deleting or editing data.

2. Transaction File: Contains records of changes, additions

and deletions made to a master file that may be

summarised before storage in the master file.

3. Table File: Contains a table of static data e.g. tax rates

that is referenced by one of the other types of files.

4. Report File: Contains information that has been prepared

by the user for display or spooling to a printer e.g. output

of the maintenance run of a Pascal program.

5. Control File: A small file containing file handling records.

6. History File: Backup files from past runs.
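
How a transaction file is applied to a master file can be sketched as below. The tuple layout `(action, key, fields)`, the action names and the dict-based master file are assumptions for illustration, not a format given in the lecture.

```python
def apply_transactions(master, transactions):
    """Return an updated copy of the master file.

    master: dict mapping record key -> dict of fields.
    transactions: list of (action, key, fields) tuples.
    """
    updated = {key: dict(fields) for key, fields in master.items()}
    # Process in key order, as a sequential batch run would. sorted() is
    # stable, so several transactions on one key keep their original order.
    for action, key, fields in sorted(transactions, key=lambda t: t[1]):
        if action == "add":
            updated[key] = dict(fields)
        elif action == "edit":
            if key in updated:
                updated[key].update(fields)
        elif action == "delete":
            updated.pop(key, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return updated


master_file = {1: {"surname": "George"}, 2: {"surname": "Hugh"}}
transaction_file = [
    ("delete", 2, {}),
    ("add", 3, {"surname": "Adams"}),
    ("edit", 1, {"surname": "Georgeson"}),
]
new_master = apply_transactions(master_file, transaction_file)
```

The old master file is left untouched, so it could be kept as a history (backup) file from the run.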

Batch Processing

• In batch processing, data is stored during working

hours and then copied to a secondary storage medium

such as a magnetic tape or server during the evening or

whenever the computer is idle.

• Batch processing usually requires the use of the

computer or a peripheral device for an extended period

of time.

• Once the batch job begins, it continues until it is done

or until an error occurs.
