Lecture 4: File Handling
File Handling
• The logical and physical organisation of files.
• Serial and sequential file handling methods.
• Direct and index sequential files.
• Creating, reading, writing and deleting records
from a variety of file structures.
• Creating code to carry out the above operations
Logical Organization of a File
• A file is logically organised as follows:
• A record is a collection of data that belongs together,
e.g. all the data about an individual person.
• A data item is an individual field of a record and
usually contains one piece of data, e.g. a date, first
name, age.
• These fields are collected together to form records.
• Records are collected together to form a file.
• A file is made up of records containing fields.
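The hierarchy above (fields make up records, records make up a file) can be sketched in Python. This is only an illustration: the `Person` record type, its field names and the sample values are assumptions, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """One record: all the data about an individual person."""
    first_name: str      # field (data item)
    date_of_birth: str   # field (data item)
    age: int             # field (data item)


# Viewed logically, a file is a collection of records.
people_file = [
    Person("Ada", "1990-01-15", 34),    # invented sample values
    Person("Alan", "1985-06-23", 39),
]

# Each record is made up of individual fields.
first_record = people_file[0]
print(first_record.first_name, first_record.age)
```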
Characteristics of a Sequential Device
• Slow.
• Inexpensive.
• Access time is dependent on current position.
• Example: magnetic tape.
Characteristics of a Random or Direct Access Device
• Fast.
• Expensive.
• Has an almost constant access time.
• Examples: magnetic disk (floppy or hard disk) or optical media (CD-ROM, CD-R, CD-RW, DVD etc.).
Serial Access
• Each record is stored, one after the other, with
no regard to any logical order.
• It is the simplest form of file organisation.
• This type of technique is normally used for
storing records for further processing.
Features of Serial Access
• Easy to implement on magnetic tape.
• Generally slow access.
• Usually used for:
• Further processing (sorting of records).
• Temporary files to store transaction data.
• Suitable for batch search servicing, i.e. we can group
together a number of requests and process them as a
group.
• Not suitable for:
• On-line access, because it is too slow.
• A master file, as the whole file has to be searched for
a particular record, starting at the beginning.
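A minimal sketch of serial access, assuming an in-memory list stands in for the file on tape. The function names and the sample transactions are hypothetical. Records are stored in arrival order, and a search always starts at the beginning.

```python
def append_record(serial_file, record):
    """Store a record at the end of the file, in arrival order."""
    serial_file.append(record)


def find_record(serial_file, key):
    """Search from the beginning; return (record, records_read)."""
    for records_read, record in enumerate(serial_file, start=1):
        if record[0] == key:
            return record, records_read
    return None, len(serial_file)


transactions = []  # stands in for a serial file
for rec in [(305, "sale"), (12, "refund"), (871, "sale")]:
    append_record(transactions, rec)

# The last record stored is only found after reading every earlier one.
record, reads = find_record(transactions, 871)
print(record, reads)
```

The count of records read is returned to show why serial files are too slow for on-line access: the cost of a search grows with the position of the record in the file.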
Sequential Access
• Records are stored one after the other but are sorted
using a key sequence.
• Less flexible and more organised than a serial file.
• Records are kept in some pre-defined order e.g. names
stored alphabetically, or records stored numerically.
• Retrieval is achieved by scanning the entries in the
same order e.g. 001, 002, 003, 004, 005 etc., so if we
want record number 200 then records 001 to 199 have to
be scanned first.
Features of Sequential Access
• Records stored in pre-defined order.
• Sequential access to successive records.
• Suited to magnetic tape.
• To maintain the sequential order:
• Updating is a complicated and difficult task.
• Records will usually need to be moved by one
place in order to add (slot in) a record in the
proper sequential order.
• Deleting records will usually require that
records be shifted back one place to avoid
gaps in the sequence.
• Very useful for:
• Transaction processing where the hit rate is
very high e.g. payroll systems.
• When the whole file is processed as this is
quick and efficient.
• Access times are still too slow (no better on
average than serial) to be useful in on-line
applications.
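The cost of keeping a sequential file in key order can be sketched as follows. This is an illustration only: a Python list stands in for the file, and `insert_record` and `delete_record` are hypothetical names. Inserting shifts later records up one place, and deleting shifts them back.

```python
import bisect


def insert_record(seq_file, record):
    """Slot the record into key order; later records shift up one place."""
    bisect.insort(seq_file, record)


def delete_record(seq_file, key):
    """Remove the record with this key; later records shift back one place."""
    for position, record in enumerate(seq_file):
        if record[0] == key:
            del seq_file[position]
            return True
    return False


# Records are (key, surname) pairs held in key order.
students = [(1, "George"), (2, "Hugh"), (4, "Murray")]
insert_record(students, (3, "Adams"))   # slots in between keys 2 and 4
delete_record(students, 2)              # closes the gap left by key 2
print(students)
```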
Random or Direct Access
• Records are accessed directly, allowing records
to be read in any order. For example, to read
record 005 you just jump directly to it.
• Data can be read or written anywhere in the file.
• The medium being used must allow a jump to
any point in the file (disk storage).
• Is not favoured if the demand is primarily for
sequential processing.
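Direct access can be sketched with fixed-length records and a seek to a computed position. The record layout (a 4-byte ID plus a 20-byte surname) and the function names are assumptions made for this example, and `io.BytesIO` stands in for a file on a direct access device.

```python
import io
import struct

# Assumed fixed-length layout: 4-byte integer ID + 20-byte surname.
RECORD = struct.Struct("<I20s")


def write_record(f, number, student_id, surname):
    """Write a record at slot `number` (1-based) without touching others."""
    f.seek((number - 1) * RECORD.size)
    f.write(RECORD.pack(student_id, surname.encode("ascii")))


def read_record(f, number):
    """Jump directly to slot `number` and read one record."""
    f.seek((number - 1) * RECORD.size)
    student_id, raw = RECORD.unpack(f.read(RECORD.size))
    return student_id, raw.rstrip(b"\x00").decode("ascii")


disk = io.BytesIO()  # stands in for a file on disk
names = ["George", "Hugh", "Adams", "Murray", "Sinclair"]
for n, name in enumerate(names, start=1):
    write_record(disk, n, n, name)

# Record 005 is read without reading records 001 to 004 first.
print(read_record(disk, 5))
```

Fixed-length records are what make the jump possible here: the position of any record can be calculated from its number alone.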
Hash Coding
• Enables direct retrieval of desired records
without the need to search files or indices.
• The hashing algorithm is first applied to one of
the keys of the record (e.g. driving licence
number, student number or National Insurance
number).
• Converts the key to an address, by mathematical
or logical calculations.
• Direct addressing is used when records have to
be searched frequently in an unpredictable
fashion.
Hash Coding - Example
• The sale of goods in shops where details about
individual items have to be made available
simultaneously in a random fashion at many
points (check out lanes in a supermarket).
• One method is to divide the primary key by a
prime number and use the remainder as the
address.
• For student records, divide the student number by a
prime number, say 97, and use the remainder as the
storage location in the file.
• For example, take student number 1069 and divide it
by 97. The remainder is 2, which is the location of
that student's record. The remainder will always be
between 0 and 96, which gives us 97 potential
locations for records.
• Once records are stored in this fashion, retrieval
simply involves supplying a student number,
which will be used by the hashing algorithm to
locate the desired student record.
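The division-remainder method described above can be sketched as follows. The function names and the stored text are hypothetical, a list stands in for the file, and collisions are deliberately ignored in this sketch.

```python
PRIME = 97  # the divisor used in the lecture's example


def hash_address(student_number):
    """Division-remainder hashing: the remainder is the storage location."""
    return student_number % PRIME


locations = [None] * PRIME  # 97 potential locations, numbered 0 to 96


def store(student_number, details):
    """Place the record at its hash address (no collision handling here)."""
    locations[hash_address(student_number)] = (student_number, details)


def retrieve(student_number):
    """Recompute the address from the key and read that location."""
    slot = locations[hash_address(student_number)]
    if slot is not None and slot[0] == student_number:
        return slot[1]
    return None


store(1069, "record for student 1069")
print(hash_address(1069))  # 1069 = 11 * 97 + 2, so this prints 2
print(retrieve(1069))
```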
Advantages of hash coding
• Rapid access to records in a direct fashion. It
doesn't make use of large index tables and
dictionaries and therefore response times are
very fast.
Disadvantages of hash coding
• Collisions require the creation of an overflow area,
because two keys can sometimes calculate to the same
address.
• For example, student number 3300 divided by 97 also
gives a remainder of 2. If storage location 2 already
holds a record (say 1069), the extra record will have
to be kept in an overflow area.
• If hashing produces the same location for more than
one record, response time may increase because
of the necessity to search the overflow area when
the key in the hash address does not match the
key we are looking for.
• Sometimes storage space can be wasted if there
are not enough records to occupy the reserved
spaces. For example, if we are using 97 as the
prime divisor there should be close to 97 records to
go into these predetermined locations. If we
choose to divide by 9713 there should be around
9713 records to optimise the use of storage space.
• The table of locations almost always reveals that
records are not kept in sequential order by the
key.
• Indeed the records are kept in a pseudo random
fashion. Therefore sequential processing of such
a file can raise awkward problems. Suppose we
wish to produce a sequential list of student
numbers; then, for efficiency, we have to keep a
separate sequentially sorted copy. Hash coding
is therefore not used in applications that involve
frequent sequential processing of records. A
more suitable technique would be to use an
indexed sequential file organisation.
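Two of the disadvantages above, collisions and the loss of key order, can be sketched together. This is a simplified illustration: the overflow area is assumed to be a plain list that is searched from the start, and all names and stored values are hypothetical.

```python
PRIME = 97
home_area = [None] * PRIME   # one slot per hash address
overflow_area = []           # records whose home slot was already taken


def store(key, details):
    """Use the home slot if it is free, otherwise the overflow area."""
    address = key % PRIME
    if home_area[address] is None:
        home_area[address] = (key, details)
    else:
        overflow_area.append((key, details))  # collision


def retrieve(key):
    """Return (details, slots_examined)."""
    slot = home_area[key % PRIME]
    if slot is not None and slot[0] == key:
        return slot[1], 1
    # Key mismatch at the hash address: search the overflow area.
    for examined, (k, details) in enumerate(overflow_area, start=2):
        if k == key:
            return details, examined
    return None, 1 + len(overflow_area)


def keys_in_order():
    """A sequential listing needs an explicit sort of all stored keys."""
    stored = [slot for slot in home_area if slot is not None] + overflow_area
    return sorted(key for key, _ in stored)


store(1069, "first student")
store(3300, "second student")  # 3300 % 97 is also 2, so this collides
print(retrieve(1069), retrieve(3300), keys_in_order())
```

The second record costs an extra lookup because the key found at address 2 does not match, which is the response-time penalty described above.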
Indexed Sequential
• Organises the file into sequential order, usually
based on a key field, similar in principle to the
sequential access file.
• However, it is also possible to directly access
records by using a separate index file.
• An indexed file system consists of a pair of files:
• one holding the data
• one storing an index to that data. The index
file will store the addresses of the records
stored on the main file.
• More than one index may be created for a data
file, e.g. a library may have its books stored on
computer with indices on author and subject.
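The pair of files described above can be sketched like this. It is an assumption-laden illustration: `io.BytesIO` stands in for the data file, a dictionary stands in for the index file, addresses are byte offsets, and the function names are hypothetical.

```python
import io

data_file = io.BytesIO()   # stands in for the main data file on disk
index = {}                 # stands in for the index file: key -> address


def add_record(key, text):
    """Append the record to the data file and note its address in the index."""
    data_file.seek(0, io.SEEK_END)
    address = data_file.tell()
    data_file.write(text.encode("utf-8") + b"\n")
    index[key] = address


def fetch(key):
    """Look up the address in the index, then jump straight to the record."""
    data_file.seek(index[key])
    return data_file.readline().rstrip(b"\n").decode("utf-8")


add_record(1, "George")
add_record(2, "Hugh")
add_record(3, "Adams")

print(index)      # each key mapped to the byte offset of its record
print(fetch(3))
```

A second index on another field (for example author or subject in the library case) would be another dictionary pointing into the same data file.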
• There are two types of indexed files:
• Fully Indexed
• Indexed Sequential
Fully Indexed
• An index to a fully indexed file will contain an
entry for every single record stored on the main
file.
• The records will be indexed on some key e.g.
student number. Very large files will have
correspondingly large indices.
• The index to a (large) file may be split into
different index levels.
• When records are added to such a file, the index
(or indices) must also be updated to include their
relative position and change the relative position
of any other records involved.
Indexed Sequential
• This is basically a mixture of sequential and
indexed file organisation techniques. Records
are held in sequential order and can be accessed
randomly through an index. Thus, these files
share the merits of both systems enabling
sequential or direct access to the data.
• The index to these files operates by storing the
highest record key in given cylinders and tracks.
• Note how this organisation gives the index a tree
structure.
• Obviously this type of file organisation will
require a direct access device, such as a hard
disk.
• Indexed sequential file organisation is very
useful where records are often retrieved
randomly and are also processed in (sequential)
key order.
• Banks may use this organisation for their auto-bank
machines, i.e. customers randomly access
their accounts throughout the day and at the end
of the day the banks can update the whole file
sequentially.
Advantages/Disadvantages of Indexed Sequential
• Advantages :
• Allows records to be accessed directly or
sequentially.
• Direct access ability provides vastly superior
(average) access times.
• Disadvantages :
• Several tables must be stored for the index, which
makes for a considerable storage overhead.
• The addition/deletion of records is complex.
Because frequent updating can be very
inefficient, especially for large files, batch
updates are often performed.
Physical file organization
• There are various ways in which a file is
physically stored on a tape or disk.
• The information is initially mapped onto the
physical blocks, and eventually onto the tracks
and sectors of a disk.
• At Hope, we keep records of students and each
student has a unique identification number that
is used as a primary key field, e.g. 10052329.
• For further illustration purposes we will assume
that Hope only has 999 students, catering for a
range of IDs from 001 to 999, hence the
following file.
Student_ID_Number    Student_Surname
001                  George
002                  Hugh
003                  Adams
004                  Murray
005                  Sinclair
006                  Patterson
…                    …
999                  Cookson
Sequential file organization
• In order to access record 005, ‘Sinclair’, the R/W (read/write) head, which is positioned at the beginning of the file, would need to read records 001 through to 004 first. If we held 999 student records on file, accessing the last record would take a long time.
• A preferred method would be to implement sequential file organisation on a disk but this is not possible, so direct access would be the preferred method of file storage and retrieval.
• Added to the disk is an index, which is loaded into RAM
and defines the relationship between the primary key
and the corresponding disk address:
• The index tells the disk R/W head where to look for the
data (sector and track).
• The R/W head goes directly to the correct disk track
position, waits for the correct sector to rotate under the
head and then retrieves the student’s record.
• Due to the size of the index (holding in our case pointers
to 999 records and their relevant disk addresses), a
compromise sometimes has to be reached between
direct and sequential file organisation.
Index Sequential file organization
• To store such information would require a vast amount of memory. In order to avoid this and reduce our index file size, we could simply omit the last digit as shown below:
• This time-space compromise would reduce the
demand on memory and the time spent processing
the data.
• If we were to look for record 010 we would have
immediate access to it (provided there are no other
IDs within the same region, i.e. 011 to 019);
otherwise the records would be accessed
sequentially through the index, until the required
record is reached.
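The idea of omitting the last digit can be sketched as a reduced index with one entry per group of ten IDs. This is only an illustration: the sample records, the use of list positions as addresses and the `find` function are assumptions, not the lecture's own layout.

```python
# Data file held in key order; list positions stand in for disk addresses.
data_file = [(10, "record 010"), (13, "record 013"), (17, "record 017"),
             (20, "record 020"), (42, "record 042")]

# One index entry per group of ten IDs: the ID with its last digit omitted
# maps to the position of the first record in that group.
index = {}
for position, (student_id, _) in enumerate(data_file):
    index.setdefault(student_id // 10, position)


def find(student_id):
    """Jump to the group via the index, then scan sequentially within it."""
    group = student_id // 10
    position = index.get(group)
    if position is None:
        return None
    while position < len(data_file) and data_file[position][0] // 10 == group:
        if data_file[position][0] == student_id:
            return data_file[position][1]
        position += 1
    return None


print(index)      # three entries instead of five
print(find(10))   # found immediately at the start of its group
print(find(17))   # found after a short sequential scan within the group
```

The index shrinks by roughly a factor of ten, at the price of a short sequential scan inside each group.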
Criteria for selecting file organisation
• There are four main criteria to be considered when choosing a file organisation technique:
• File use ratio (hit rate)
• File volatility
• File size
• User requirements
File Ratio (Hit Rate)
• File ratio = (number of records that are accessed) / (total number of records in the file)
• If the ratio is high it indicates that the majority of records are used regularly, which means sequential/serial file organisation may be the appropriate method.
• If the ratio is low (say 5% to 10%) then the implication is that the ability to retrieve a desired record quickly is crucial and therefore direct file organisation should be recommended.
File Ratio – High ratio example
• Payroll production uses a high-activity file.
• An organisation's production of payroll and payslips is a regular event, which can be either weekly or monthly.
• It requires processing of all or nearly all the employee records and therefore the file-use ratio will be close to or equal to one (100%).
• Thus sequential file organisation is preferred.
File Ratio – Medium ratio example
• Customer accounts in banks: both random and sequential access are required.
• Several customers should be able to withdraw cash from cash dispensers simultaneously and randomly.
• The bank should be able to update all customer accounts periodically by sequential processing.
• Indexed sequential file organisation may therefore be the most suited to this type of application.
File Ratio – Low ratio example
• Airline ticket reservations: only one record is accessed at a time. This record is required quickly and therefore direct accessing is most appropriate.
Calculating the File Ratio
Examples, for a file that has 8,000 records:
• 250 records are accessed and updated per week. File use ratio = 250 / 8000 = 0.03125 per week (very low)
• 4100 records are accessed per week. File use ratio = 4100 / 8000 = 0.5125 per week (medium)
• All but 400 are accessed weekly, i.e. 7600 accessed per week. File use ratio = 7600 / 8000 = 0.95 per week (very high)
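The three calculations above can be reproduced with a short function. The name `file_use_ratio` is hypothetical. No cut-off values are coded, because the lecture only describes 5% to 10% as low.

```python
def file_use_ratio(records_accessed, total_records):
    """Hit rate: records accessed divided by total records in the file."""
    return records_accessed / total_records


TOTAL = 8000  # file size used in the worked examples

for accessed in (250, 4100, 7600):
    ratio = file_use_ratio(accessed, TOTAL)
    print(f"{accessed} of {TOTAL} accessed per week: ratio = {ratio}")
```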
File Volatility
• This indicates how often files require
modification and updating, e.g. insertions and
deletions.
• Highly volatile files are not usually indexed, as
this would entail excessive overheads from updating
the index and file too frequently.
• Indexing is used when the data is fairly stable.
File Size
• When files are large, serial/sequential location
techniques give longer access times. Thus large files are
usually indexed or direct files.
User Requirements
• The main factor that concerns most users is how they
access the files:
• Batch access: If they are happy to use batch access
then sequential file organisation is likely to be
appropriate, providing the file activity is reasonable.
• Interactive access: If the user needs to operate
interactively then direct access will be required,
which will mean indexed or hash coded files.
Minor Criteria
• The type of storage device available e.g.
magnetic tape will only allow serial/sequential
access.
• The ease (or complexity) of actually
implementing the file organisation technique
with the data concerned.
• Availability/features/cost of software to handle
the organisation technique preferred according to
other factors.
Physical and Relative Addresses
• To retrieve records we must know where they
are stored.
• There are two ways of indicating the location in
which they are stored:
1. Physical Addresses
2. Relative Addresses
Physical Addresses
• Tell us the actual physical location
of the record on the storage
medium
• e.g. on a magnetic disk we would
need to know the cylinder, track
and sector which held the record.
Relative Addresses
• Used by modern file organisation techniques.
• The address is provided according to its position
in the file and not its physical location on the
storage device.
• The 56th record in a file would have an address
of 56, independent of its physical location.
• Must be converted to physical ones at some point
for the computer to find the record.
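The conversion from a relative address to a physical one can be sketched as below. The disk geometry constants are invented for illustration, and the sketch assumes fixed-length records packed in order. Real file systems perform this mapping in their own way.

```python
# Assumed geometry, for illustration only.
RECORDS_PER_SECTOR = 4
SECTORS_PER_TRACK = 8
TRACKS_PER_CYLINDER = 2


def to_physical(relative_address):
    """Convert a 1-based relative record address to
    (cylinder, track, sector, slot), each counted from 0."""
    n = relative_address - 1
    n, slot = divmod(n, RECORDS_PER_SECTOR)
    n, sector = divmod(n, SECTORS_PER_TRACK)
    cylinder, track = divmod(n, TRACKS_PER_CYLINDER)
    return cylinder, track, sector, slot


# The 56th record: its relative address is simply 56.
print(to_physical(56))
```

With these assumed figures a cylinder holds 4 * 8 * 2 = 64 records, so record 56 still lies on the first cylinder and record 65 would be the first record on the next one.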
File Content
• Files will have very different contents according
to the work that they are created to assist.
• The number and type of users may also have an
effect.
• Private one-user files:
• These are created to be used by one operator (hold data
for one job).
• Private database files:
• They store data for a group of related users (e.g.
managers in an organisation).
• Several programs may well operate on the same
database file(s) e.g. a student file may be used to
produce student identity cards, update course/exam
results and produce mail shots.
• Public files (shared files):
• They are created in order that users of a common
computing service can all access each other’s files,
either in parts or in their entirety, as specified by
the producers of the files.
• Public database files:
• These are also called databanks and are databases that
are open to public enquiry. They usually concentrate
on a particular field such as medicine, law, finance etc.
Often they are not a free service but charge a
subscription/registration fee and/or charge for usage.
File Classification
1. Master File: Contains permanent records that are updated
by adding, deleting or editing data.
2. Transaction File: Contains records of changes, additions
and deletions made to a master file that may be
summarised before storage in the master file.
3. Table File: Contains a table of static data e.g. tax rates
that is referenced by one of the other types of files.
4. Report File: Contains information that has been prepared
by the user for display or spooling to a printer e.g. output
of the maintenance run of a Pascal program.
5. Control File: A small file containing file handling records.
6. History File: Backup files from past runs.
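The relationship between a master file and a transaction file (items 1 and 2 above) can be sketched as a batch update. The transaction format `(action, key, data)` and the function name are assumptions made for this example. The old master is left untouched, so it could be kept as a history file.

```python
def apply_transactions(master, transactions):
    """Return an updated copy of the master file.

    Each transaction is (action, key, data), where action is
    'add', 'edit' or 'delete' (an assumed format).
    """
    updated = dict(master)
    for action, key, data in transactions:
        if action == "delete":
            updated.pop(key, None)
        elif action in ("add", "edit"):
            updated[key] = data
    return updated


master_file = {1: "George", 2: "Hugh", 3: "Adams"}
transaction_file = [
    ("edit", 2, "Hughes"),    # invented changes, for illustration
    ("delete", 3, None),
    ("add", 4, "Murray"),
]

new_master = apply_transactions(master_file, transaction_file)
print(new_master)
```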
Batch Processing
• In batch processing, data is stored during working
hours and then copied to a secondary storage medium
such as a magnetic tape or server during the evening or
whenever the computer is idle.
• Batch processing usually requires the use of the
computer or a peripheral device for an extended period
of time.
• Once the batch job begins, it continues until it is done
or until an error occurs.