XII. Distributed Database Management Systems
Overview
Where is the database located? Local or remote storage; reasons for remote storage; options for remote storage.
• Centralized database: single-site processing, single-site data
• Distributed database: multiple-site processing, multiple-site data
Fully Distributed Database Management System
Distributed Database
Potential drawbacks of centralized data:
• Availability (in case of failure)
• Network communications costs
• Bottleneck at the central site
Potential solution: distribute data over remote locations.
Distributed Database Concept
Instead of having one centralized database, data is spread out among various locations on the distributed network, each of which has its own computer and data storage facilities.
All of this distributed data is still considered to be a single logical database.
Distributed DBMS
A distributed database management system is sophisticated software that:
• Manages the transparency features
• Maintains ACID principles in a much more complex environment
• Involves global (distributed) transactions spanning more than one location
• Utilizes the two-phase commit (2PC) protocol
Distributed Database Transparency Features
• Distribution transparency
• Transaction transparency
• Failure transparency
• Performance transparency
• Heterogeneity transparency
Distribution Transparency
A person or process anywhere on the distributed network queries the database. The user just issues the query, and the result is returned.
It is not necessary to know where on the network the data being sought is located.
Transaction Transparency
• Ensures database transactions maintain the distributed database's integrity and consistency.
• Ensures a transaction is completed only when all database sites involved complete their part.
• Distributed database systems require complex mechanisms to manage transactions.
Distributed Concurrency Control
Concurrency control is especially important in a distributed database environment, because multi-site, multiple-process operations can create inconsistencies and deadlocked transactions.
Effect of Premature COMMIT
Two-Phase Commit Protocol (2PC)
Guarantees that if any portion of a transaction operation cannot be committed, all changes made at the other sites will be undone, in order to maintain a consistent database state.
Requires that each node's transaction log entry be written before the database fragment is updated (write-ahead protocol).
Two-Phase Commit Protocol (2PC)
Defines the operations between a coordinator and its subordinates. Phases of implementation:
• Preparation: the coordinator sends a PREPARE TO COMMIT message; the transaction aborts unless all nodes are ready.
• The final COMMIT: the coordinator sends a COMMIT message; the transaction aborts unless all commits are successful.
Performance and Failure Transparency
Performance transparency: allows a DDBMS to perform more efficiently than if it were a centralized database.
Failure transparency: ensures the system will continue to operate in case of a network failure.
Considerations for resolving requests in a distributed data environment: data distribution and data replication.
Replica transparency: the DDBMS's ability to hide the existence of multiple copies of data from the user.
Distributed Database Design
• Data fragmentation: how to partition the database into fragments
• Data replication: which fragments to replicate
• Data allocation: where to locate those fragments and replicas
Distributing the Data
Headquartered in New York, a company's database consists of six large tables: A, B, C, D, E, F.
With a centralized database, all six tables would be located in New York.
Distributing the Data
The company has major sites in Los Angeles, Memphis, New York, Paris, and Tokyo.
The first and simplest idea in distributing the data would be to disperse the six tables among the five sites, perhaps based on the frequency of use of each table.
Distributing the Data
• Tables A and B are kept at New York.
• Table C is moved to Memphis.
• Tables D and E are moved to Tokyo.
• Table F is moved to Paris.
Distributing the Data
Paris employees can now access Table F without incurring the telecommunications costs associated with accessing Table F in New York.
• Local autonomy: Paris employees, for example, can take responsibility for Table F, including its security, backup and recovery, and concurrency control.
• Concurrency control: the old lock mechanisms still work, though deadlock can now be distributed and harder to detect.
• Availability: portions of the database remain available even if one or more of the sites is down.
• Atomicity: ensured by 2PC; all local transactions of a global transaction commit, or none do.
Distributing the Data
Distributed joins are now required.
When the database was centralized at New York, a query issued at any of the sites that required a join of two or more of the tables could be handled in the standard way by the computer at New York. The result would then be sent to the site that issued the query.
In the dispersed approach, a join might require tables located at different sites.
Replicated Tables
Second option: duplicate tables at two or more sites on the network.
Advantages:
• Availability: during a site failure, data can still be accessed at a replicated location.
• Local access: replicate a table at a site requiring frequent access.
• Distributed joins may be simplified if copies of one or more involved tables are local.
Replicated Tables
Disadvantages:
• Security risk.
• Concurrency control: how do you keep data consistent when it is replicated in tables on three continents?
Full Data Replication
The maximum approach: replicate every table at every site.
• Great for availability
• Great for joins
• Minimized transmission time
Full Data Replication
• Worst for concurrency control: every change to every table has to be reflected at every site
• Worst for security
• Takes up a lot of disk space
Partial Replication
Keep a copy of the entire database at headquarters in New York and have each table replicated exactly once at one of the other sites.
Partial Replication
• Improves availability: each table is now at two sites.
• Security and concurrency exposures are limited (in comparison).
• Joins occur at New York.
Partial Replication
New York could tend to become a bottleneck.
If a table is heavily used in both Tokyo and Los Angeles, it can only be placed at one of the two sites (plus the copy of the entire database in New York), leaving the other site with speed and telecom cost problems.
Targeted Replication
Targeted Replication
Place copies of tables at the sites that use them most heavily, in order to minimize telecommunications costs.
Ensure that there are at least two copies of important or frequently used tables, to realize the gains in availability.
Targeted Replication
Limit the number of copies of any one table, to control the security and concurrency issues.
Avoid any one site becoming a bottleneck.
Concurrency control is still an issue.
Concurrency Control in Distributed Database
The "lost update" problem is even worse now.
The locking protections discussed earlier, which handle the problem of concurrent update in a single table, are not adequate for the new, expanded problem in distributed database systems with replication.
Approaches: asynchronous and synchronous.
Asynchronous Approach (Pull Replication)
If retrieved data does not necessarily have to be up-to-the-minute accurate, we can use asynchronous approaches to updating replicated data. Two questions matter:
• How volatile (frequently updated) is the data?
• What is our tolerance for occasionally getting old data?
Asynchronous Schemes
• The site where the data was updated can send an update message to the other sites that contain a copy of the same table.
• One of the sites can be chosen to accumulate all of the updates to all of the tables and transmit the changes regularly.
• Each table can have one of the sites declared the "dominant" site for that table, which periodically transmits updates to the other sites.
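The dominant-site scheme can be sketched in a few lines of Python. Everything here (the Site class, apply_update, sync_replicas) is illustrative, not a real replication API; the point is that replicas serve stale data until the periodic push arrives.

```python
# Sketch of the "dominant site" asynchronous scheme: updates are applied at
# one designated site and pushed to the replicas only periodically.

class Site:
    def __init__(self, name):
        self.name = name
        self.table = {}          # local copy of a replicated table
        self.pending = []        # updates not yet pushed to other sites

def apply_update(dominant, key, value):
    """An update takes effect immediately at the dominant site only."""
    dominant.table[key] = value
    dominant.pending.append((key, value))

def sync_replicas(dominant, replicas):
    """Periodically, the dominant site pushes accumulated updates out."""
    for key, value in dominant.pending:
        for site in replicas:
            site.table[key] = value
    dominant.pending.clear()

ny, paris = Site("New York"), Site("Paris")
apply_update(ny, "A-177", 500)
stale = paris.table.get("A-177")   # None: the replica lags until the next sync
sync_replicas(ny, [paris])
fresh = paris.table["A-177"]       # 500 after the periodic push
```

The window between the update and the sync is exactly the "occasionally getting old data" tolerance the slides ask about.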
Synchronous Approach (Push Replication)
If retrieved data does have to be up-to-the-minute accurate, all data in replicated tables worldwide must always be consistent, accurate, and up-to-date.
Use the two-phase commit protocol (again).
Two-Phase Commit: Prepare Phase
• Each computer on the network has a special log file in addition to its database tables.
• The computer at the initiating site sends the updated data to the other sites that have copies of the table to be updated. This is done in a ready state, just prior to commit.
• The computers at the other sites record the changes in their logs (but not in the actual database tables). These computers attempt to lock the database entities involved in the update.
• If they are successful (the entities are not busy and can be locked), they inform the initiating site.
Two-Phase Commit: Commit Phase
If all of the other sites reported that they were successful in logging the update and locking the entities, the initiating site issues instructions to transfer the update from the logs to the actual database tables.
Two-Phase Commit
Either all of the replicated files are updated or none of them is updated: a form of atomicity.
A complex, costly, and time-consuming process; competing transactions make it more complex still.
The more volatile the data in the database, the less attractive this procedure is for updating replicated tables in the distributed database.
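The prepare and commit phases can be sketched as follows. The Node and two_phase_commit names are illustrative, not from any real DBMS; a production system adds timeouts, crash recovery from the logs, and handling of competing transactions.

```python
# Minimal sketch of two-phase commit: every node logs and locks in phase 1
# (write-ahead), and the change reaches the actual tables only in phase 2,
# and only if every node voted yes.

class Node:
    def __init__(self, name, can_lock=True):
        self.name, self.can_lock = name, can_lock
        self.table, self.log = {}, []

    def prepare(self, update):
        """Phase 1: record the change in the log and lock the entities."""
        if not self.can_lock:
            return False          # entities busy: vote to abort
        self.log.append(update)
        return True               # vote to commit

    def commit(self):
        """Phase 2: transfer the logged change into the actual table."""
        for key, value in self.log:
            self.table[key] = value
        self.log.clear()

    def abort(self):
        self.log.clear()          # undo: nothing reaches the table

def two_phase_commit(update, nodes):
    if all(n.prepare(update) for n in nodes):
        for n in nodes:
            n.commit()
        return "committed"
    for n in nodes:
        n.abort()
    return "aborted"

sites = [Node("NY"), Node("Paris"), Node("Tokyo")]
result = two_phase_commit(("F", "new row"), sites)        # all vote yes
sites_bad = [Node("NY"), Node("Paris", can_lock=False)]
result_bad = two_phase_commit(("F", "new row"), sites_bad)  # one vote no
```

In the failing run, NY has already logged the change, but the abort discards the log before anything touches its table, which is the all-or-nothing guarantee described above.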
Distributed Joins
A distributed join is a query, run from one of the computers in a distributed database system, that requires a join of two or more tables that are not all at the same computer.
The distributed DBMS must have its own built-in expert system that is capable of figuring out an efficient way to handle a request for a distributed join.
Partitioning
The purpose is to have the records or columns of a table resident at the sites that use them most frequently.
• Horizontal partitioning
• Vertical partitioning
Horizontal Partitioning
A relational table can be split up so that some records are located at one site, other records are located at another site, and so on. For example, the partitioning of Table G.
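As a sketch of the idea, assume a hypothetical Table G with a region column used as the routing rule; the column, the sites, and the rule are invented for illustration.

```python
# Horizontal partitioning: each site holds only the rows it is expected to
# query most often, yet the union of the fragments is still one logical table.

TABLE_G = [
    {"id": 1, "region": "Europe",  "amount": 100},
    {"id": 2, "region": "Asia",    "amount": 250},
    {"id": 3, "region": "Europe",  "amount": 75},
    {"id": 4, "region": "America", "amount": 300},
]

# Routing rule: which site stores the rows for which region.
SITE_FOR_REGION = {"Europe": "Paris", "Asia": "Tokyo", "America": "New York"}

def partition_horizontally(rows):
    fragments = {}
    for row in rows:
        site = SITE_FOR_REGION[row["region"]]
        fragments.setdefault(site, []).append(row)
    return fragments

fragments = partition_horizontally(TABLE_G)
# Reassembling every fragment recovers the single logical Table G.
reunited = sorted((r for frag in fragments.values() for r in frag),
                  key=lambda r: r["id"])
```

Queries on European data now run entirely at Paris; only queries spanning regions need to touch more than one fragment.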
Example: Banking System
• Four branches in four different cities. Each branch has its own computer, with its own accounts stored there; one site has information about all branches of the bank.
• Each branch maintains the schema Account(accountnum, branchname, balance).
• The central site maintains the schema Branch(branchname, branchcity, assets).
• Local transaction: add $50 to account A-177 at the Greenville branch, initiated at the Greenville branch.
• Global transaction: transfer $50 from A-177 to A-305, located at the Columbia branch.
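A minimal sketch of the local/global distinction in the banking example; the data layout and helper names (branch_of, classify) are invented for illustration.

```python
# A transaction touching only the initiating branch's fragment is local;
# one that spans fragments (like the A-177 -> A-305 transfer) is global
# and needs a protocol such as 2PC to commit atomically at both sites.

accounts = {   # branchname -> {accountnum: balance}, one fragment per site
    "Greenville": {"A-177": 200},
    "Columbia":   {"A-305": 500},
}

def branch_of(accountnum):
    for branch, accts in accounts.items():
        if accountnum in accts:
            return branch
    raise KeyError(accountnum)

def classify(initiating_branch, touched_accounts):
    branches = {branch_of(a) for a in touched_accounts}
    return "local" if branches == {initiating_branch} else "global"

deposit = classify("Greenville", ["A-177"])             # only Greenville
transfer = classify("Greenville", ["A-177", "A-305"])   # Greenville + Columbia
```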
Vertical Partitioning
The columns of a table are divided up among several cities on the network.
Each such partition must include the primary key attribute(s) of the table.
Makes sense when different sites are responsible for processing different functions involving an entity.
A distributed join is required to bring the data back together.
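A sketch of vertical partitioning and the rejoin, using an invented employee table: each fragment carries the primary key (empid here) plus the columns its site processes.

```python
# Vertical partitioning: every fragment repeats the primary key, so a join
# on that key can rebuild the original rows.

EMPLOYEES = [
    {"empid": 1, "name": "Ada",  "salary": 90000, "office": "Paris"},
    {"empid": 2, "name": "Alan", "salary": 80000, "office": "Tokyo"},
]

def project(rows, columns):
    """A vertical fragment: the primary key plus a subset of the columns."""
    return [{c: r[c] for c in columns} for r in rows]

# The payroll site keeps salary data; the facilities site keeps office data.
payroll    = project(EMPLOYEES, ["empid", "name", "salary"])
facilities = project(EMPLOYEES, ["empid", "office"])

def join_on_key(left, right, key="empid"):
    """The distributed join needed to bring the fragments back together."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

rebuilt = join_on_key(payroll, facilities)   # same rows as EMPLOYEES
```

Dropping empid from either fragment would make the rejoin impossible, which is why each partition must include the primary key attribute(s).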
Major Tradeoff
CAP Theorem: a distributed system can provide only two of these three guarantees:
• Consistency (all transactions see the same data at the same time)
• Availability (all operations eventually receive a response)
• Partition tolerance (the system continues to operate despite network partitions)
In practice, if you have a network that may drop messages, "partition tolerance" means coping by deciding which of consistency or availability to drop.
This leads to the BASE alternative to ACID:
• Basically Available
• Soft state
• Eventually consistent
Availability at the cost of consistency.
NoSQL Databases
• Embrace the BASE model.
• Highly distributed databases with eventual consistency.
• Recall the three V's: volume, velocity, variety.
Key Assumptions of Hadoop Distributed File System
• High volume
• Write-once, read-many
• Streaming access
• Move computations to the data
• Fault tolerance
Hadoop Distributed File System (HDFS)
Distributed Directory Management
A distributed DBMS must include a directory that keeps track of where the database tables, the replicated copies of database tables (if any), and the table partitions (if any) are located.
When a query is presented at any site on the network, the distributed DBMS can automatically use the directory to find out where the required data is located, maintaining location transparency.
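A sketch of such a directory, reusing the running A-F table example; the replication of Table F at two sites and the lookup rule are assumptions for illustration.

```python
# A distributed directory: a mapping from each table to the sites holding a
# copy of it. The DBMS consults it on every query, so users never need to
# know where the data lives (location transparency).

DIRECTORY = {
    "A": ["New York"],
    "B": ["New York"],
    "C": ["Memphis"],
    "D": ["Tokyo"],
    "E": ["Tokyo"],
    "F": ["Paris", "New York"],   # F is replicated at two sites
}

def locate(table, querying_site):
    """Prefer a local copy; otherwise route to any site holding the table."""
    sites = DIRECTORY[table]
    return querying_site if querying_site in sites else sites[0]

local_hit  = locate("F", "Paris")   # served from the local replica
remote_hit = locate("C", "Paris")   # directory routes the query to Memphis
```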
Directory Location
• The entire directory could be stored at only one site.
• Copies of the directory could be stored at several of the sites.
• A copy of the directory could be stored at every site. (This is generally the best solution.) Why?
Distributed Joins
The DBMS evaluates various options for performing a join by considering:
• The number and size of the records from each table involved in the join.
• The distances and costs of transmitting the records from one site to another to execute the join.
• The distance and cost of shipping the result of the join back to the site that issued the query in the first place.
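These considerations can be sketched as a simple cost model: estimate the bytes shipped for each candidate execution site and pick the cheapest. The table sizes and the cost formula are invented for illustration; real optimizers also weigh record counts and per-link transmission costs.

```python
# For a two-table distributed join, data moved = (non-local tables shipped
# to the execution site) + (result shipped back to the querying site).

def join_cost(exec_site, tables, result_size, query_site):
    cost = sum(size for site, size in tables if site != exec_site)
    if exec_site != query_site:
        cost += result_size
    return cost

# (site, size-in-MB) for the two tables; the query is issued at Paris.
tables = [("New York", 500), ("Paris", 20)]
result_size = 5   # the join result is small

candidates = {site: join_cost(site, tables, result_size, "Paris")
              for site, _ in tables}
best_site = min(candidates, key=candidates.get)
```

Here executing at New York means shipping the small Paris table in and the small result back (25 MB total), far cheaper than dragging the 500 MB New York table to Paris; this is the kind of choice the DDBMS's built-in expert system makes.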