Dbms unit-4

Mukesh Kumar, Apr 13, 2017
Page 1: Dbms unit-4

Subject: Database Management System Mukesh Kumar

Subject Code: (NCS-502) Assistant Professor (CSE-Deptt)

UNIT-4

I.T.S Engineering College, Greater Noida

Syllabus:

Transaction Processing Concept: Transaction system, Testing of serializability, serializability of

schedules, conflict & view serializable schedule, recoverability, Recovery from transaction failures, log

based recovery, checkpoints, deadlock handling.

Distributed Database: distributed data storage, concurrency control, directory system.

Transaction: A transaction is a unit of program execution that accesses and possibly updates various

data items. Usually, a transaction is initiated by a user program written in a high-level data-manipulation

language or programming language (for example, SQL, COBOL, C, C++, or Java), where it is delimited

by statements (or function calls) of the form begin transaction and end transaction. The transaction

consists of all operations executed between the begin transaction and end transaction.

ACID Properties:

To ensure integrity of the data, we require that the database system maintain the following properties of

the transactions known as ACID properties.

Atomicity. Either all operations of the transaction are executed properly, or none. There must be no

state in a database where a transaction is left partially completed. States should be defined either

before the execution of the transaction or after the execution/abortion/failure of the transaction.

Consistency. The database must remain in a consistent state after any transaction. No transaction

should have any adverse effect on the data residing in the database. If the database was in a

consistent state before the execution of a transaction, it must remain consistent after the execution of

the transaction as well.

Isolation. Even though multiple transactions may execute concurrently, the system guarantees that,

for every pair of transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti

started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other

transactions executing concurrently in the system.

Durability. After a transaction completes successfully, the changes it has made to the database

persist, even if there are system failures.

Operation on Transactions:

Transactions access data using two operations:

read(X), which transfers the data item X from the database to a local buffer belonging to the

transaction that executed the read operation.

write(X), which transfers the data item X from the local buffer of the transaction that executed

the write back to the database.

Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined

as:

Ti: read(A);

A := A − 50;

write(A);

read(B);

B := B + 50;

write(B).
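The transfer Ti above can be sketched in Python with the standard-library sqlite3 module; the `accounts` table, the starting balances, and the `transfer` helper are illustrative assumptions, not part of the notes.

```python
import sqlite3

# Illustrative in-memory database with two accounts.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 200)])

def transfer(conn, src, dst, amount):
    """Ti: read(A); A := A - amount; write(A); read(B); B := B + amount; write(B)."""
    try:
        conn.execute("BEGIN")                 # begin transaction
        a = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()[0]
        conn.execute("UPDATE accounts SET balance = ? WHERE name = ?", (a - amount, src))
        b = conn.execute("SELECT balance FROM accounts WHERE name = ?", (dst,)).fetchone()[0]
        conn.execute("UPDATE accounts SET balance = ? WHERE name = ?", (b + amount, dst))
        conn.execute("COMMIT")                # end transaction: all writes persist
    except Exception:
        conn.execute("ROLLBACK")              # atomicity: none of the writes persist
        raise

transfer(conn, "A", "B", 50)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'A': 50, 'B': 250}
```

If any statement between BEGIN and COMMIT fails, the ROLLBACK undoes all of them, which is exactly the atomicity property described above.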

Page 2: Dbms unit-4


Transaction State: A transaction must be in one of the following states:

• Active, the initial state; the transaction stays in this state while it is executing

• Partially committed, after the final statement has been executed

• Failed, after the discovery that normal execution can no longer proceed

• Aborted, after the transaction has been rolled back and the database has been restored to its state

prior to the start of the transaction

• Committed, after successful completion.

1. A transaction starts in the active state.

2. When it finishes its final statement, it enters the partially committed state. At this point, the

transaction has completed its execution, but it is still possible that it may have to be aborted,

since the actual output may still be temporarily residing in main memory, and thus a hardware

failure may preclude its successful completion.

3. A transaction enters the failed state after the system determines that the transaction can no longer

proceed with its normal execution (for example, because of hardware or logical errors).

4. A failed transaction must be rolled back; it then enters the aborted state. At this point, the system

has two options:

• It can restart the transaction, but only if the transaction was aborted as a result of some

hardware or software error that was not created through the internal logic of the transaction.

A restarted transaction is considered to be a new transaction.

• It can kill the transaction. It usually does so because of some internal logical error that can be

corrected only by rewriting the application program, or because the input was bad, or because

the desired data were not found in the database.

5. If a transaction executes all its operations successfully, it is said to be committed. All its effects

are now permanently established on the database system.
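The five states and the legal moves between them can be written down as a small transition table; a hedged Python sketch in which the state names follow the list above and the `step` helper is illustrative:

```python
# Allowed state transitions for a transaction (restart creates a *new* transaction,
# so "aborted" and "committed" are terminal here).
TRANSITIONS = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "aborted":             set(),
    "committed":           set(),
}

def step(state, target):
    """Move to `target` if the transition is legal, else raise."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

s = "active"
s = step(s, "partially committed")   # final statement executed
s = step(s, "committed")             # output safely on disk
print(s)  # committed
```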

Implementation of Atomicity and Durability:

Shadow Copy: Shadow copy is a simple, but extremely inefficient scheme to maintain atomicity and

durability of transactions. This scheme, which is based on making copies of the database, called shadow

copies, assumes that only one transaction is active at a time. The scheme also assumes that the database

is simply a file on disk. A pointer called db-pointer is maintained on disk; it points to the current copy of

the database.

In the shadow-copy scheme, a transaction that wants to update the database first creates a complete copy

of the database. All updates are done on the new database copy, leaving the original copy, the shadow

copy, untouched. If at any point the transaction has to be aborted, the system merely deletes the new

copy. The old copy of the database has not been affected.

Page 3: Dbms unit-4


If the transaction completes, it is committed as follows.

1. The operating system is asked to make sure that all pages of the new copy of the database have

been written out to disk.

2. Once the operating system has written all the pages to disk, the database system updates the pointer

db-pointer to point to the new copy of the database;

3. The new copy then becomes the current copy of the database. The old copy of the database is

then deleted.

The transaction is said to have been committed at the point where the updated db-pointer is written to

disk.
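A minimal file-based sketch of the shadow-copy commit, assuming the database is a JSON file and a `db-pointer` file holds the name of the current copy (all file names here are illustrative); `os.replace` stands in for the atomic update of db-pointer on disk:

```python
import json, os, tempfile, uuid

def read_db(dirname):
    """Follow db-pointer to the current copy of the database."""
    with open(os.path.join(dirname, "db-pointer")) as f:
        current = f.read().strip()
    with open(os.path.join(dirname, current)) as f:
        return json.load(f)

def shadow_update(dirname, mutate):
    db = read_db(dirname)
    mutate(db)                                  # all updates go to the new copy
    new_name = f"db-{uuid.uuid4().hex}.json"
    with open(os.path.join(dirname, new_name), "w") as f:
        json.dump(db, f)
        f.flush()
        os.fsync(f.fileno())                    # step 1: force the new copy to disk
    ptr_tmp = os.path.join(dirname, "db-pointer.tmp")
    with open(ptr_tmp, "w") as f:
        f.write(new_name)
        f.flush()
        os.fsync(f.fileno())
    # step 2: atomically swing db-pointer to the new copy; the old copy is untouched
    os.replace(ptr_tmp, os.path.join(dirname, "db-pointer"))

d = tempfile.mkdtemp()
with open(os.path.join(d, "db-v0.json"), "w") as f:
    json.dump({"A": 100}, f)
with open(os.path.join(d, "db-pointer"), "w") as f:
    f.write("db-v0.json")

shadow_update(d, lambda db: db.update(A=50))
print(read_db(d))  # {'A': 50}
```

If the process crashes before `os.replace`, db-pointer still names the old copy, so aborting is just a matter of deleting the unreferenced new file.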

Concurrent Executions

Transaction-processing systems usually allow multiple transactions to run concurrently. Allowing

multiple transactions to update data concurrently causes several complications with consistency.

Ensuring consistency in spite of concurrent execution of transactions requires extra work; it is far easier

to insist that transactions run serially—that is, one at a time, each starting only after the previous one has

completed.

There are two good reasons for allowing concurrency:

Improved throughput and resource utilization: Concurrent transactions increase the

throughput of the system—that is, the number of transactions executed in a given amount of

time.

Reduced waiting time: Concurrent transactions reduce the average response time: the average

time for a transaction to be completed after it has been submitted.

Schedules: The execution sequences that describe the chronological order in which instructions are

executed in the system are called schedules.

Let T1 and T2 be two transactions that transfer funds from one account to another.

Transaction T1 transfers $50 from account A to

account B. It is defined as:

T1: read(A);

A := A − 50;

write(A);

read(B);

B := B + 50;

Transaction T2 transfers 10% of the balance from

account A to account B. It is defined as

T2: read(A);

temp := A * 0.1;

A := A − temp;

write(A);

read(B);

Page 4: Dbms unit-4


write(B).

B := B + temp;

write(B)

A schedule for a set of transactions must consist of all instructions of those transactions, and must

preserve the order in which the instructions appear in each individual transaction.

Serial Schedule: Serial schedule consists of a sequence of instructions from various transactions, where

the instructions belonging to one single transaction appear together in that schedule.

Now refer to the first execution sequence (T1 followed by T2) as schedule 1, and to the second

execution sequence (T2 followed by T1) as schedule 2.

[Figure: Schedule 1 (T1 then T2) and Schedule 2 (T2 then T1), shown as two-column transaction tables]

Serial schedule must be in following order:

Serial schedule 1: r1(A), w1(A), r1(B), w1(B), r2(A), w2(A), r2(B), w2(B)

Serial schedule 2: r2(A), w2(A), r2(B), w2(B), r1(A), w1(A), r1(B), w1(B)
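A quick worked check of the two serial schedules in Python (the initial balances of 1000 and 2000 are assumed for illustration): both orders preserve the sum A + B, although they leave different final values.

```python
def T1(db):            # transfer $50 from A to B
    a = db["A"]; db["A"] = a - 50
    b = db["B"]; db["B"] = b + 50

def T2(db):            # transfer 10% of A's balance from A to B
    a = db["A"]; temp = a * 0.1
    db["A"] = a - temp
    b = db["B"]; db["B"] = b + temp

db1 = {"A": 1000, "B": 2000}
T1(db1); T2(db1)       # serial schedule 1: T1 then T2
db2 = {"A": 1000, "B": 2000}
T2(db2); T1(db2)       # serial schedule 2: T2 then T1
print(db1, db2)        # different final states, same sum A + B = 3000
```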

Concurrent Schedule: A concurrent schedule is also known as a non-serial schedule. A non-serial schedule is

a schedule where the operations of a group of concurrent transactions are interleaved, e.g. Schedule 3.

[Figure: Schedule 3 and Schedule 4, showing interleaved operations of T1 and T2]

Page 5: Dbms unit-4


Serializability: Serializability is the classical concurrency scheme. It ensures

that a schedule for executing concurrent transactions is equivalent to one that

executes the transactions serially in some order. It assumes that all accesses to

the database are done using read and write operations.

To ensure serializability, we consider only two operations: read and write.

Between a read(Q) and a write(Q) instruction on a data item Q, a transaction

may perform an arbitrary sequence of operations on the copy of Q that is

residing in the local buffer of the transaction.

Equivalence Schedules

Equivalence of schedules can be of the following types:

1. Result Equivalence

If two schedules produce the same result after execution, they

are said to be result equivalent. They may yield the same result

for some value and different results for another set of values.

That's why this equivalence is not generally considered

significant.

2. View Equivalence

Two schedules would be view equivalent if the

transactions in both the schedules perform similar

actions in a similar manner.

For example −

If T reads the initial data in S1, then it also

reads the initial data in S2.

If T reads the value written by J in S1, then it

also reads the value written by J in S2.

If T performs the final write on the data value

in S1, then it also performs the final write on

the data value in S2.

3. Conflict Equivalence

Two operations in a schedule are said to be conflicting if they have the following properties:

They belong to different transactions.

They access the same data item.

At least one of them is a "write" operation.

Two schedules S1 and S2 having multiple transactions with conflicting operations are said to be conflict

equivalent if and only if −

Both the schedules contain the same set of Transactions.

The order of conflicting pairs of operations is maintained in both the schedules.
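The three conditions above reduce to a one-line test; a hedged sketch in which an operation is represented as a (transaction, action, item) tuple (this encoding is an assumption for illustration):

```python
def conflicts(op1, op2):
    """Two operations conflict iff: different transactions,
    same data item, and at least one is a write."""
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return t1 != t2 and x1 == x2 and "write" in (a1, a2)

print(conflicts(("T1", "read", "Q"), ("T2", "read", "Q")))    # False: two reads
print(conflicts(("T1", "read", "Q"), ("T2", "write", "Q")))   # True
print(conflicts(("T1", "write", "Q"), ("T2", "write", "Q")))  # True
print(conflicts(("T1", "write", "Q"), ("T2", "write", "R")))  # False: different items
```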

Page 6: Dbms unit-4


Conflict Serializability: Let us consider a schedule S in which there are two consecutive instructions Ii

and Ij, of transactions Ti and Tj , respectively (i ≠ j).

If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any

instruction in the schedule.

If Ii and Ij refer to the same data item Q, then the order of the two steps may matter. Since we are

dealing with only read and write instructions, there are four cases that we need to consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the same value of Q is

read by Ti and Tj , regardless of the order.

2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value of Q that is

written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads the value of Q that is written by

Tj. Thus, the order of Ii and Ij matters.

3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar to those of the

previous case.

4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the order of these

instructions does not affect either Ti or Tj. However, the value obtained by the next read(Q)

instruction of S is affected, since the result of only the latter of the two write instructions is

preserved in the database. If there is no other write(Q) instruction after Ii and Ij in S, then the

order of Ii and Ij directly affects the final value of Q in the database state that results from

schedule S.

Thus, only in case 1, where both Ii and Ij are read instructions, does the relative order of their

execution not matter.

[Figure: Schedule 1, showing only the read and write instructions, and Schedule 2]

Let Ii and Ij be consecutive instructions of a schedule S. If Ii and Ij are instructions of different

transactions and Ii and Ij do not conflict, then we can swap the order of Ii and Ij to produce a new

schedule S′. We expect S to be equivalent to S′, since all instructions appear in the same order in both

schedules except for Ii and Ij, whose order does not matter.

For the above schedule1- We continue to swap non-conflicting instructions:

1. Swap the read(B) instruction of T1 with the read(A) instruction of T2.

2. Swap the write(B) instruction of T1 with the write(A) instruction of T2.

3. Swap the write(B) instruction of T1 with the read(A) instruction of T2.

Page 7: Dbms unit-4


If a schedule S can be transformed into a schedule S′ by a series of swaps of non-conflicting instructions,

we say that S and S′ are conflict equivalent. For schedule 1 above, the final result of these swaps is a

serial schedule.

The concept of conflict equivalence leads to the concept of conflict serializability. We say that a

schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
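The syllabus item "testing of serializability" is usually carried out with a precedence graph: draw an edge Ti → Tj whenever an operation of Ti conflicts with and precedes an operation of Tj, then test the graph for a cycle. A hedged Python sketch (the schedule encoding as tuples is an assumption):

```python
from collections import defaultdict

def conflict_serializable(schedule):
    """schedule: list of (transaction, action, item) in execution order.
    Build the precedence graph and report whether it is acyclic."""
    edges = defaultdict(set)
    txns = {t for t, _, _ in schedule}
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "write" in (ai, aj):
                edges[ti].add(tj)          # Ti's conflicting op precedes Tj's

    # Cycle detection by depth-first search with three colors.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in txns}
    def dfs(t):
        color[t] = GREY
        for u in edges[t]:
            if color[u] == GREY or (color[u] == WHITE and dfs(u)):
                return True                # back edge: cycle found
        color[t] = BLACK
        return False
    return not any(color[t] == WHITE and dfs(t) for t in txns)

# r1(A) w2(A) w1(A): edges T1 -> T2 and T2 -> T1, a cycle
s_bad = [("T1", "read", "A"), ("T2", "write", "A"), ("T1", "write", "A")]
# Serial T1 then T2: only T1 -> T2, acyclic
s_good = [("T1", "read", "A"), ("T1", "write", "A"),
          ("T2", "read", "A"), ("T2", "write", "A")]
print(conflict_serializable(s_bad), conflict_serializable(s_good))  # False True
```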

Recoverability:

1. Recoverable Schedules

A recoverable schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data

item previously written by Ti, the commit operation of Ti appears before the commit operation of Tj.

Consider a schedule in which T9 is a transaction that performs only one

instruction: read(A). Suppose that the system allows T9 to commit

immediately after executing the read(A) instruction. Thus, T9 commits

before T8 does. Now suppose that T8 fails before it commits. Since T9 has

read the value of data item A written by T8, we must abort T9 to ensure

transaction atomicity. However, T9 has already committed and cannot be

aborted. Thus, we have a situation where it is impossible to recover correctly

from the failure of T8.

2. Cascadeless Schedules:

If a schedule is recoverable, to recover correctly from the failure of a transaction Ti, we may have to roll

back several transactions. Such situations occur if transactions have read data written by Ti.

A cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data

item previously written by Ti, the commit operation of Ti appears before the read operation of Tj . It is

easy to verify that every cascadeless schedule is also recoverable.
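Both definitions can be checked mechanically. A hedged sketch, assuming a schedule is a list of (transaction, action, item) tuples that includes `commit` actions (the encoding and the `classify` name are illustrative):

```python
def classify(schedule):
    """schedule: list of (txn, action, item) tuples, with action one of
    'read', 'write', 'commit' (item is None for commit).
    Returns (recoverable, cascadeless)."""
    last_writer = {}              # item -> transaction that wrote it most recently
    committed = set()
    reads_from = []               # (reader, writer) pairs
    recoverable = cascadeless = True
    for t, a, x in schedule:
        if a == "write":
            last_writer[x] = t
        elif a == "read":
            w = last_writer.get(x)
            if w is not None and w != t:
                reads_from.append((t, w))
                if w not in committed:
                    cascadeless = False   # read of uncommitted data
        elif a == "commit":
            for reader, writer in reads_from:
                if reader == t and writer not in committed:
                    recoverable = False   # reader commits before its writer
            committed.add(t)
    return recoverable, cascadeless

# T9 reads A written by uncommitted T8, then commits before T8 (as above):
s_bad = [("T8", "write", "A"), ("T9", "read", "A"),
         ("T9", "commit", None), ("T8", "commit", None)]
# T8 commits before T9 reads: recoverable and cascadeless.
s_ok = [("T8", "write", "A"), ("T8", "commit", None),
        ("T9", "read", "A"), ("T9", "commit", None)]
print(classify(s_bad))  # (False, False)
print(classify(s_ok))   # (True, True)
```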


Consider the partial schedule Transaction T10 writes a value of

A that is read by transaction T11. Transaction T11 writes a

value of A that is read by transaction T12. Suppose that, at this

point, T10 fails. T10 must be rolled back. Since T11 is

dependent on T10, T11 must be rolled back. Since T12 is

dependent on T11, T12 must be rolled back. This phenomenon,

in which a single transaction failure leads to a series of

transaction rollbacks, is called cascading rollback.

Recovery: A recovery scheme is an integral part of a database system that can restore the database to the

consistent state that existed before the failure. The recovery scheme must also provide high availability;

that is, it must minimize the time for which the database is not usable after a crash.

Failure Classification:

There are various types of failure that may occur in a system, each of which needs to be dealt with in a

different manner. The simplest type of failure is one that does not result in the loss of information in the

system. The failures that are more difficult to deal with are those that result in loss of information. We

shall consider only the following types of failure:

Page 8: Dbms unit-4


1. Transaction failure. There are two types of errors that may cause a transaction to fail:

a. Logical error. The transaction can no longer continue with its normal execution because of

some internal condition, such as bad input, data not found, overflow, or resource limit

exceeded.

b. System error. The system has entered an undesirable state (for example, deadlock), as a

result of which a transaction cannot continue with its normal execution. The transaction,

however, can be re-executed at a later time.

2. System crash. There is a hardware malfunction, or a bug in the database software or the operating

system, that causes the loss of the content of volatile storage, and brings transaction processing to a

halt. The content of nonvolatile storage remains intact, and is not corrupted. The assumption that

hardware errors and bugs in the software bring the system to a halt, but do not corrupt the

nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have

numerous internal checks, at the hardware and the software level, that bring the system to a halt

when there is an error. Hence, the fail-stop assumption is a reasonable one.

3. Disk failure. A disk block loses its content as a result of either a head crash or failure during a data

transfer operation. Copies of the data on other disks, or archival backups on tertiary media, such as

tapes, are used to recover from the failure.

Storage Structure

We have already described the storage system. In brief, the storage structure can be divided into two

categories:

Volatile storage − As the name suggests, volatile storage cannot survive system crashes.

Volatile storage devices are placed very close to the CPU; normally they are embedded onto the

chipset itself. Main memory and cache memory are examples of volatile storage.

They are fast but can store only a small amount of information.

Non-volatile storage − These memories are made to survive system crashes. They are huge in

data storage capacity, but slower in accessibility. Examples may include hard-disks, magnetic

tapes, flash memory, and non-volatile (battery backed up) RAM.

Recovery and Atomicity

When a system crashes, it may have several transactions being executed and various files opened for

them to modify the data items. Transactions are made of various operations, which are atomic in nature.

But according to ACID properties of DBMS, atomicity of transactions as a whole must be maintained,

that is, either all the operations are executed or none.

When a DBMS recovers from a crash, it should maintain the following −

It should check the states of all the transactions, which were being executed.

A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of

the transaction in this case.

It should check whether the transaction can be completed now or it needs to be rolled back.

No transaction should be allowed to leave the DBMS in an inconsistent state.

There are two types of techniques, which can help a DBMS in recovering as well as maintaining the

atomicity of a transaction −

Maintaining the logs of each transaction, and writing them onto some stable storage before

actually modifying the database.

Maintaining shadow paging, where the changes are done on a volatile memory, and later, the

actual database is updated.

Page 9: Dbms unit-4


Log-based Recovery

A log is a sequence of records that maintains a record of the actions performed by a transaction. It is

important that the logs are written prior to the actual modification and stored on a stable storage media,

which is failsafe. An update log record describes a single database write. It has these fields:

Transaction identifier is the unique identifier of the transaction that performed the write

operation.

Data-item identifier is the unique identifier of the data item written. Typically, it is the location

on disk of the data item.

Old value is the value of the data item prior to the write.

New value is the value that the data item will have after the write.

Other special log records exist to record significant events during transaction processing, such as the

start of a transaction and the commit or abort of a transaction. We denote the various types of log

records as:

<Ti start>. Transaction Ti has started.

<Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj . Xj had value V1 before the

write, and will have value V2 after the write.

<Ti commit>. Transaction Ti has committed.

<Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created

before the database is modified. Once a log record exists, we can output the modification to the database

if that is desirable. The database can be modified using two approaches:

Deferred database modification − All logs are written on to the stable storage and the database is

updated when a transaction commits.

Immediate database modification − Each log follows an actual database modification. That is, the

database is modified immediately after every operation.

Using the log, the system can handle any failure that does not result in the loss of information in

nonvolatile storage. The recovery scheme uses two recovery procedures:

• undo(Ti) restores the value of all data items updated by transaction Ti to the old values.

• redo(Ti) sets the value of all data items updated by transaction Ti to the new values.

After a failure has occurred, the recovery scheme consults the log to determine which transactions need

to be redone, and which need to be undone:

Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not contain the

record <Ti commit>.

Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti

commit>.
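The two rules can be sketched directly: scan the log once to build the redo and undo sets, undo backwards restoring old values, then redo forwards applying new values. The log records here are tuples, an illustrative stand-in for &lt;Ti start&gt;, &lt;Ti, Xj, V1, V2&gt;, and &lt;Ti commit&gt;:

```python
def classify_log(log):
    """Redo Ti if both <Ti start> and <Ti commit> appear;
    undo Ti if <Ti start> appears but <Ti commit> does not."""
    started, committed = set(), set()
    for rec in log:
        if rec[0] == "start":
            started.add(rec[1])
        elif rec[0] == "commit":
            committed.add(rec[1])
    return started & committed, started - committed   # (redo set, undo set)

def recover(log, db):
    redo, undo = classify_log(log)
    # undo(Ti): scan backwards, restoring old values of items Ti updated
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] in undo:
            _, t, item, old, new = rec
            db[item] = old
    # redo(Ti): scan forwards, reapplying new values of items Ti updated
    for rec in log:
        if rec[0] == "write" and rec[1] in redo:
            _, t, item, old, new = rec
            db[item] = new
    return db

log = [("start", "T1"), ("write", "T1", "A", 1000, 950),
       ("commit", "T1"),
       ("start", "T2"), ("write", "T2", "B", 2000, 2100)]  # T2 never committed
db = {"A": 950, "B": 2100}       # state on disk at crash time
print(recover(log, db))          # {'A': 950, 'B': 2000}: T1 redone, T2 undone
```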

Checkpoints

There are two major difficulties with log-based recovery and the redo/undo operations:

1. The search process is time consuming.

2. Most of the transactions that, according to our algorithm, need to be redone have already written

their updates into the database. Although redoing them will cause no harm, it will nevertheless

cause recovery to take longer.

Page 10: Dbms unit-4


Thus keeping and maintaining logs in real time and in real environment may fill out all the memory

space available in the system. As time passes, the log file may grow too big to be handled at all.

To reduce this overhead, we introduce checkpoints. A checkpoint is a mechanism where all the

previous logs are removed from the system and stored permanently in a storage disk. Checkpoint

declares a point before which the DBMS was in consistent state, and all the transactions were

committed. The system periodically performs checkpoints, which require the following sequence of

actions to take place:

1. Output onto stable storage all log records currently residing in main memory.

2. Output to the disk all modified buffer blocks.

3. Output onto stable storage a log record <checkpoint>.

Recovery

The exact recovery operations to be performed depend on the modification technique being used. Let T be

the set of transactions to be considered during recovery. For the immediate-modification technique, the

recovery operations are:

For all transactions Tk in T that have no <Tk commit> record in the log, execute undo(Tk).

For all transactions Tk in T such that the record <Tk commit> appears in the log, execute redo(Tk).

When a system with concurrent transactions crashes and recovers, it behaves in the following manner −

The recovery system reads the logs backwards

from the end to the last checkpoint.

It maintains two lists, an undo-list and a redo-list.

If the recovery system sees a log with <Tn, Start>

and <Tn, Commit> or just <Tn, Commit>, it puts

the transaction in the redo-list.

If the recovery system sees a log with <Tn, Start>

but no commit or abort log found, it puts the

transaction in undo-list.

All the transactions in the undo-list are then undone and their logs are removed. All the transactions in

the redo-list are then redone using their log records.

Deadlock

A system is in a deadlock state if there exists a set of transactions such that every transaction in the set is

waiting for another transaction in the set. More precisely, there exists a set of waiting transactions {T0,

T1, . . ., Tn} such that T0 is waiting for a data item that T1 holds, and T1 is waiting for a data item that

T2 holds, and . . ., and Tn−1 is waiting for a data item that Tn holds, and Tn is waiting for a data item

that T0 holds. None of the transactions can make progress in such a situation.

Deadlock Handling

There are two principal methods for dealing with the deadlock problem:

1. Deadlock prevention protocol

2. Deadlock detection and deadlock recovery

1. Deadlock Prevention:

Two different deadlock prevention schemes using timestamps have been proposed:

Page 11: Dbms unit-4


(a). The wait–die scheme is a nonpreemptive technique. When transaction Ti requests a data item

currently held by Tj , Ti is allowed to wait only if it has a timestamp smaller than that of Tj (that

is, Ti is older than Tj ). Otherwise, Ti is rolled back (dies).

For example, suppose that transactions T1, T2, and T3 have timestamps 5, 10, and 15,

respectively. If T1 requests a data item held by T3, then T1 will wait. If T3 requests a data item

held by T2, then T3 will be rolled back.

(b). The wound–wait scheme is a preemptive technique. It is a counterpart to the wait–die scheme.

When transaction Ti requests a data item currently held by Tj , Ti is allowed to wait only if it has

a timestamp larger than that of Tj (that is, Ti is younger than Tj ). Otherwise, Tj is rolled back

(Tj is wounded by Ti).

For example, with transactions T1, T2, and T3, if T1 requests a data item held by T2, then the

data item will be preempted from T2, and T2 will be rolled back. If T3 requests a data item held

by T2, then T3 will wait.
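Both rules reduce to a single timestamp comparison; a hedged Python sketch reproducing the examples above (the function names are illustrative):

```python
def wait_die(ts, requester, holder):
    """Nonpreemptive: the requester waits only if it is older (smaller
    timestamp) than the holder; otherwise the requester dies (is rolled back)."""
    return "wait" if ts[requester] < ts[holder] else "rollback requester"

def wound_wait(ts, requester, holder):
    """Preemptive: the requester waits only if it is younger (larger
    timestamp) than the holder; otherwise the holder is wounded (rolled back)."""
    return "wait" if ts[requester] > ts[holder] else "rollback holder"

ts = {"T1": 5, "T2": 10, "T3": 15}
print(wait_die(ts, "T1", "T3"))    # wait: T1 is older than T3
print(wait_die(ts, "T3", "T2"))    # rollback requester: T3 is younger, so it dies
print(wound_wait(ts, "T1", "T2"))  # rollback holder: older T1 wounds T2
print(wound_wait(ts, "T3", "T2"))  # wait: T3 is younger than T2
```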

2. Deadlock Detection and Recovery:

If a system does not employ some protocol that ensures deadlock freedom, then a detection and recovery

scheme must be used. An algorithm that examines the state of the system is invoked periodically to

determine whether a deadlock has occurred.

Deadlock Detection:

A deadlock exists in the system if and only if the wait-for graph contains a cycle. Each transaction

involved in the cycle is said to be deadlocked. To detect deadlocks, the system needs to maintain the

wait-for graph, and periodically to invoke an algorithm that searches for a cycle in the graph.

[Figures: a wait-for graph without a cycle, and a wait-for graph with a cycle]

Wait-for-Graph

Deadlocks can be described precisely in terms of a directed graph called a wait-for graph. This graph

consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices

consists of all the transactions in the system. Each element in the set E of edges is an ordered pair Ti →

Tj. If Ti → Tj is in E, then there is a directed edge from transaction Ti to Tj , implying that transaction

Ti is waiting for transaction Tj to release a data item that it needs.

When transaction Ti requests a data item currently being held by transaction Tj , then the edge Ti → Tj

is inserted in the wait-for graph. This edge is removed only when transaction Tj is no longer holding a

data item needed by transaction Ti.
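Detecting a deadlock therefore reduces to searching the wait-for graph for a cycle. A minimal sketch follows, assuming the graph is kept as a dictionary mapping each transaction to the set of transactions it waits for (the representation is an assumption for illustration):

```python
# Cycle detection in a wait-for graph via depth-first search.
# wait_for: dict mapping a transaction name to the set of
# transactions it is waiting on (edges Ti -> Tj).

def has_cycle(wait_for):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {t: WHITE for t in wait_for}

    def dfs(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:     # back edge: cycle found
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wait_for)

# T1 waits for T2, T2 waits for T3: no deadlock.
print(has_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": set()}))   # False
# Adding the edge T3 -> T1 closes a cycle: deadlock.
print(has_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}))  # True
```

Every transaction on a detected cycle is deadlocked; the system would then apply the recovery steps described next.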

Recovery from Deadlock

When a detection algorithm determines that a deadlock exists, the system must recover from the

deadlock. The most common solution is to roll back one or more transactions to break the deadlock.

Three actions need to be taken:


1. Selection of a victim. Given a set of deadlocked transactions, we must determine which transaction

(or transactions) to roll back to break the deadlock. We should roll back those transactions that will

incur the minimum cost. Many factors may determine the cost of a rollback, including

a. How long the transaction has computed, and how much longer the transaction will compute

before it completes its designated task.

b. How many data items the transaction has used.

c. How many more data items the transaction needs for it to complete.

d. How many transactions will be involved in the rollback.

2. Rollback: Once we have decided that a particular transaction must be rolled back, we must

determine how far this transaction should be rolled back.

(a). Total rollback: Abort the transaction and then restart it. However, it is more effective to roll

back the transaction only as far as necessary to break the deadlock.

(b). Partial rollback requires the system to maintain additional information about the state of all the

running transactions. Specifically, the sequence of lock requests/grants and updates performed

by the transaction needs to be recorded.

3. Starvation. In a system where the selection of victims is based primarily on cost factors, it may

happen that the same transaction is always picked as a victim. As a result, this transaction never

completes its designated task, resulting in starvation. We must ensure that a transaction can be picked

as a victim only a (small) finite number of times. The most common solution is to include the

number of rollbacks in the cost factor.
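The starvation remedy above can be sketched as a victim-selection rule that folds the number of previous rollbacks into the cost. The cost fields and weights here are illustrative assumptions only:

```python
# Sketch of victim selection among deadlocked transactions.
# Each entry: (name, work_done, items_held, rollback_count).
# Prior rollbacks add a penalty, so a repeatedly chosen victim
# eventually stops being the cheapest and cannot starve.

def pick_victim(deadlocked):
    def cost(t):
        name, work_done, items_held, rollbacks = t
        return work_done + items_held + 10 * rollbacks  # illustrative weights
    return min(deadlocked, key=cost)[0]

# T2 has done the least work, but its two prior rollbacks raise its
# cost above T3's, so T3 is chosen instead.
txns = [("T1", 40, 3, 0), ("T2", 5, 1, 2), ("T3", 8, 2, 0)]
print(pick_victim(txns))  # "T3"
```

Without the rollback penalty, T2 would be picked every time; with it, the choice rotates away from a transaction that has already been a victim.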

Distributed Database: A distributed database is a collection of multiple interconnected databases,

which are spread physically across various locations that communicate via a computer network. A

distributed DBMS manages the distributed database in a manner so that it appears as one single database

to users.

Features of Distributed Database Systems:

Databases are logically interrelated and interconnected with each other. Often they represent a single

logical database.

Data is physically stored across multiple sites. Data in each site can be managed by a DBMS

independent of the other sites.

The processors in the sites are connected via a network. They do not have any multiprocessor

configuration.

A distributed database is not a loosely connected file system.

A distributed database incorporates transaction processing, but it is not synonymous with a

transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software system that manages a

distributed database so that it appears as if it were all stored in a single location.

Features

It is used to create, retrieve, update and delete distributed databases.

It synchronizes the database periodically and provides access mechanisms by virtue of which the

distribution becomes transparent to the users.

It ensures that the data modified at any site is universally updated.


It is used in application areas where large volumes of data are processed and accessed by numerous

users simultaneously.

It is designed for heterogeneous database platforms.

It maintains confidentiality and data integrity of the databases.

Advantages of Distributed Databases

Following are the advantages of distributed databases over centralized databases.

1. Modular Development − If the system needs to be expanded to new locations or new units, in

centralized database systems, the action requires substantial efforts and disruption in the existing

functioning. However, in distributed databases, the work simply requires adding new computers and

local data to the new site and finally connecting them to the distributed system, with no interruption

in current functions.

2. More Reliable − In case of database failures, the total system of centralized databases comes to a

halt. However, in distributed systems, when a component fails, the functioning of the system

continues, possibly at reduced performance. Hence, a DDBMS is more reliable.

3. Better Response − If data is distributed in an efficient manner, then user requests can be met from

local data itself, thus providing faster response. On the other hand, in centralized systems, all queries

have to pass through the central computer for processing, which increases the response time.

4. Lower Communication Cost − In distributed database systems, if data is located locally where it is

mostly used, then the communication costs for data manipulation can be minimized. This is not

feasible in centralized systems.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed

database environments, each with further sub-divisions, as shown in the following illustration.

Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its

properties are −

The sites use very similar software.


The sites use identical DBMS or DBMS from the same vendor.

Each site is aware of all other sites and cooperates with other sites to process user requests.

The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −

Autonomous − Each database is independent and functions on its own. The databases are integrated by a

controlling application and use message passing to share data updates.

Non-autonomous − Data is distributed across the homogeneous nodes and a central or master

DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS

products and data models. Its properties are −

Different sites use dissimilar schemas and software.

The system may be composed of a variety of DBMSs like relational, network, hierarchical or

object oriented.

Query processing is complex due to dissimilar schemas.

Transaction processing is complex due to dissimilar software.

A site may not be aware of other sites and so there is limited co-operation in processing user

requests.

Types of Heterogeneous Distributed Databases

Federated − The heterogeneous database systems are independent in nature and integrated

together so that they function as a single database system.

Un-federated − The database systems employ a central coordinating module through which the

databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −

Distribution − It states the physical distribution of data across the different sites.

Autonomy − It indicates the distribution of control of the database system and the degree to

which each constituent DBMS can operate independently.

Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system

components and databases.

Architectural Models

1. Client - Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided into servers and clients. The

server functions primarily encompass data management, query processing, optimization and

transaction management. Client functions mainly provide the user interface, but they also perform

some functions such as consistency checking and transaction management.

The two different client-server architectures are −

Single Server Multiple Client

Multiple Server Multiple Client (shown in the following diagram)


2. Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting database services. The peers

share their resource with other peers and co-ordinate their activities.

This architecture generally has four levels of schemas −

Global Conceptual Schema − Depicts the global logical view of data.

Local Conceptual Schema − Depicts logical data organization at each site.

Local Internal Schema − Depicts physical data organization at each site.

External Schema − Depicts user view of data.


Distributed Data Storage

Consider a relation r that is to be stored in the database. There are two approaches to storing this relation

in the distributed database:

Replication. The system maintains several identical replicas (copies) of the relation, and stores each

replica at a different site. The alternative to replication is to store only one copy of relation r.

Fragmentation. The system partitions the relation into several fragments, and stores each fragment

at a different site.

Fragmentation and replication can be combined: A relation can be partitioned into several fragments and

there may be several replicas of each fragment. In the following subsections, we elaborate on each of

these techniques.

Data Replication:

If relation r is replicated, a copy of relation r is stored in two or more sites. In the most extreme case, we

have full replication, in which a copy is stored in every site in the system.

There are a number of advantages and disadvantages to replication.

Availability. If one of the sites containing relation r fails, then the relation r can be found in another

site. Thus, the system can continue to process queries involving r, despite the failure of one site.

Increased parallelism. If the majority of accesses to relation r are reads, then several sites can

process queries involving r in parallel. The more

replicas of r there are, the greater the chance that the needed data will be found in the site where the

transaction is executing. Hence, data replication minimizes movement of data between sites.

Increased overhead on update. The system must ensure that all replicas of a relation r are

consistent; otherwise, erroneous computations may result. Thus, whenever r is updated, the update

must be propagated to all sites containing replicas. The result is increased overhead.

In general, replication enhances the performance of read operations and increases the availability of data

to read-only transactions. However, update transactions incur greater overhead. Controlling concurrent

updates by several transactions to replicated data is more complex than in centralized systems.

Data Fragmentation:

If relation r is fragmented, r is divided into a number of fragments r1, r2, . . . , rn. These fragments

contain sufficient information to allow reconstruction of the original relation r. There are two different

schemes for fragmenting a relation:

Horizontal fragmentation

Vertical fragmentation.

Horizontal fragmentation:

1. Horizontal fragmentation splits the relation by assigning each tuple of r to one or more

fragments.

2. A relation r is partitioned into a number of subsets, r1, r2, . . . , rn. Each tuple of relation r must

belong to at least one of the fragments, so that the original relation can be reconstructed, if needed.

3. Horizontal fragmentation is usually used to keep tuples at the sites where they are used the most, to

minimize data transfer.

4. A horizontal fragment can be defined as a selection on the global relation r. That is, we use a

predicate Pi to construct fragment ri: ri = σPi(r).

5. We reconstruct the relation r by taking the union of all fragments; that is:

r = r1 ∪ r2 ∪ · · · ∪ rn
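The selection-then-union scheme above can be sketched in a few lines. The `account` relation, its attributes, and the branch predicates are illustrative assumptions:

```python
# Sketch of horizontal fragmentation: each fragment ri = σPi(r),
# and r is reconstructed as the union r1 ∪ r2 ∪ ... ∪ rn.

r = [
    {"branch": "Delhi", "acct": "A-101", "balance": 500},
    {"branch": "Noida", "acct": "A-215", "balance": 700},
    {"branch": "Delhi", "acct": "A-305", "balance": 350},
]

# Predicates P1, P2 assign each tuple to the site where it is used most.
r1 = [t for t in r if t["branch"] == "Delhi"]   # σ(branch = Delhi)(r)
r2 = [t for t in r if t["branch"] == "Noida"]   # σ(branch = Noida)(r)

# Reconstruction: r = r1 ∪ r2 (every tuple belongs to some fragment).
reconstructed = r1 + r2
assert sorted(t["acct"] for t in reconstructed) == sorted(t["acct"] for t in r)
```

Because every tuple satisfies at least one predicate, the union recovers the original relation exactly.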


Vertical fragmentation:

1. Vertical fragmentation splits the relation by decomposing the scheme R of relation r.

2. Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1, R2,

. . . , Rn of the schema R so that: R = R1 ∪ R2 ∪ · · · ∪ Rn

3. Each fragment ri of r is defined by:

ri = ∏Ri(r)

4. The fragmentation should be done in such a way that we can reconstruct relation r from the

fragments by taking the natural join:

r = r1 ⋈ r2 ⋈ r3 ⋈ · · · ⋈ rn

5. One way to ensure that relation r can be reconstructed is to include the primary-key attributes of R

in each Ri. More generally, any superkey can be used. It is often convenient to add a special

attribute, called a tuple-id, to the schema R.
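The projection-then-join scheme above can be sketched as follows. The relation, its attributes, and the key choice are illustrative assumptions:

```python
# Sketch of vertical fragmentation: project r onto attribute sets R1, R2
# that both include the key, then rebuild r with a natural join on the key.

r = [
    {"acct": "A-101", "branch": "Delhi", "balance": 500},
    {"acct": "A-215", "branch": "Noida", "balance": 700},
]

R1 = ("acct", "branch")      # fragment r1 = ∏R1(r)
R2 = ("acct", "balance")     # fragment r2 = ∏R2(r)
r1 = [{a: t[a] for a in R1} for t in r]
r2 = [{a: t[a] for a in R2} for t in r]

# Natural join on the shared key attribute "acct": r = r1 ⋈ r2.
joined = [{**u, **v} for u in r1 for v in r2 if u["acct"] == v["acct"]]
assert joined == r
```

The join reconstructs r losslessly precisely because both attribute sets contain the key `acct`; without a shared superkey, tuples from the two fragments could not be matched up unambiguously.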

Transparency in Distributed Database:

The user of a distributed database system should not be required to know where the data are physically

located nor how the data can be accessed at the specific local site. This characteristic, called data

transparency, can take several forms:

• Fragmentation transparency. Users are not required to know how a relation has been fragmented.

• Replication transparency. Users view each data object as logically unique. The distributed system

may replicate an object to increase either system performance or data availability. Users do not have

to be concerned with what data objects have been replicated, or where replicas have been placed.

• Location transparency. Users are not required to know the physical location of the data. The

distributed database system should be able to find any data as long as the data identifier is supplied

by the user transaction.


Commit Protocols: To ensure atomicity, all the sites in which a transaction T executed must agree on

the final outcome of the execution. T must either commit at all sites, or it must abort at all sites. To

ensure this property, the transaction coordinator of T must execute a commit protocol.

Among the simplest and most widely used commit protocols is the two-phase commit protocol (2PC).

An alternative is the three-phase commit protocol (3PC), which avoids certain disadvantages of the 2PC

protocol but adds to complexity and overhead.

Two Phase Commit

The steps performed in the two phases are as follows:

Phase 1: Prepare Phase

After each slave has locally completed its transaction, it sends a "DONE" message to the controlling

site. When the controlling site has received a "DONE" message from all slaves, it sends a "Prepare"

message to the slaves.

The slaves vote on whether they still want to commit or not. If a slave wants to commit, it sends a

"Ready" message.

A slave that does not want to commit sends a "Not Ready" message. This may happen when the

slave has conflicting concurrent transactions or there is a time-out.

Phase 2: Commit/Abort Phase

After the controlling site has received a "Ready" message from all the slaves:

o The controlling site sends a "Global Commit" message to the slaves.

o The slaves apply the transaction and send a "Commit ACK" message to the controlling site.

o When the controlling site receives a "Commit ACK" message from all the slaves, it considers the

transaction as committed.

After the controlling site has received the first "Not Ready" message from any slave:

o The controlling site sends a "Global Abort" message to the slaves.

o The slaves abort the transaction and send an "Abort ACK" message to the controlling site.

o When the controlling site receives an "Abort ACK" message from all the slaves, it considers the

transaction as aborted.
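The coordinator's decision rule in the commit/abort phase can be sketched in a few lines. The function name and the vote-collection representation are assumptions for illustration; the message names follow the text:

```python
# Sketch of the 2PC coordinator's decision after the prepare phase:
# commit only if every slave voted "Ready"; abort as soon as any
# slave is "Not Ready". `votes` maps slave site name to its vote.

def two_phase_commit(votes):
    if all(v == "Ready" for v in votes.values()):
        return "Global Commit"   # phase 2: every slave applies and ACKs
    return "Global Abort"        # phase 2: every slave aborts and ACKs

print(two_phase_commit({"S1": "Ready", "S2": "Ready"}))      # Global Commit
print(two_phase_commit({"S1": "Ready", "S2": "Not Ready"}))  # Global Abort
```

A single "Not Ready" vote is enough to abort globally, which is what guarantees that the transaction either commits at all sites or at none.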

Distributed Three-phase Commit

Phase 1: Prepare Phase

o The steps are the same as in distributed two-phase commit.

Phase 2: Prepare to Commit Phase

o The controlling site issues an "Enter Prepared State" broadcast message.

o The slave sites vote "OK" in response.

Phase 3: Commit / Abort Phase

o The steps are the same as in two-phase commit, except that the "Commit ACK"/"Abort ACK" message is

not required.
