XII. Distributed Database Management Systems
Overview
Where is the database located? Local or remote storage; reasons for remote storage; options for remote storage.
• Centralized database: single-site processing, single-site data
• Distributed database: multiple-site processing, multiple-site data
Fully Distributed Database Management System
Distributed Database
Potential drawbacks of centralized data:
• Availability (in case of failure)
• Network communications costs
• Bottleneck at the central site
Potential solution: distribute data over remote locations.
Distributed Database Concept
Instead of having one centralized database, data is spread out among various locations on the distributed network, each of which has its own computer and data storage facilities.
All of this distributed data is still considered to be a single logical database.
Distributed DBMS
A distributed database management system is sophisticated software that:
• Manages the transparency features
• Maintains ACID principles in a much more complex environment
• Involves global (distributed) transactions spanning more than one location
• Utilizes the two-phase commit (2PC) protocol
Distributed Database Transparency Features
• Distribution transparency
• Transaction transparency
• Failure transparency
• Performance transparency
• Heterogeneity transparency
Distribution Transparency
A person or process anywhere on the distributed network queries the database. The user just issues the query, and the result is returned.
It is not necessary to know where on the network the data being sought is located.
Transaction Transparency
• Ensures database transactions maintain the distributed database's integrity and consistency.
• Ensures a transaction is completed only when all database sites involved complete their part.
• Distributed database systems require complex mechanisms to manage transactions.
Distributed Concurrency Control
Concurrency control is especially important in a distributed database environment, because multi-site, multiple-process operations can create inconsistencies and deadlocked transactions.
Effect of Premature COMMIT
Two-Phase Commit Protocol (2PC)
Guarantees that if any portion of a transaction operation cannot be committed, all changes made at the other sites will be undone, in order to maintain a consistent database state.
Requires that each node's transaction log entry be written before the database fragment is updated (write-ahead protocol).
Two-Phase Commit Protocol (2PC)
Defines the operations between a coordinator and its subordinates. Phases of implementation:
• Preparation: the coordinator sends a PREPARE TO COMMIT message; the transaction aborts unless all nodes are ready.
• The final COMMIT: the coordinator sends a COMMIT message; the transaction aborts unless all commits are successful.
Performance and Failure Transparency
Performance transparency: allows a DDBMS to perform more efficiently than if it were a centralized database.
Failure transparency: ensures the system will continue to operate in case of a network failure.
Considerations for resolving requests in a distributed data environment: data distribution and data replication.
Replica transparency: the DDBMS's ability to hide the existence of multiple copies of data from the user.
Distributed Database Design
• Data fragmentation: how to partition the database into fragments
• Data replication: which fragments to replicate
• Data allocation: where to locate those fragments and replicas
Distributing the Data
Headquartered in New York, a company's database consists of six large tables: A, B, C, D, E, F.
With a centralized database, all six tables would be located in New York.
Distributing the Data
The company has major sites in Los Angeles, Memphis, New York, Paris, and Tokyo.
The first and simplest idea in distributing the data would be to disperse the six tables among the five sites, perhaps based on the frequency of use of each table.
Distributing the Data
• Tables A and B are kept at New York.
• Table C is moved to Memphis.
• Tables D and E are moved to Tokyo.
• Table F is moved to Paris.
Distributing the Data
Paris employees can now access Table F without incurring the telecommunications costs associated with accessing Table F in New York.
• Local autonomy: Paris employees, for example, can take responsibility for Table F, including its security, backup and recovery, and concurrency control.
• Concurrency control: the old lock mechanisms still work, though deadlock can now be distributed and harder to detect.
• Availability: portions of the database remain available even if one or more of the sites is down.
• Atomicity: ensured by 2PC; all local transactions of a global transaction commit, or none do.
Distributing the Data
Distributed joins are now required.
When the database was centralized at New York, a query issued at any of the sites that required a join of two or more of the tables could be handled in the standard way by the computer at New York. The result would then be sent to the site that issued the query.
In the dispersed approach, a join might require tables located at different sites.
Replicated Tables
Second option: duplicate tables at two or more sites on the network.
Advantages:
• Availability: during a site failure, data can still be accessed at a replicated location.
• Local access: replicate a table at a site requiring frequent access.
• Distributed joins may be simplified if copies of one or more involved tables are local.
Replicated Tables
Disadvantages:
• Security risk.
• Concurrency control: how do you keep data consistent when it is replicated in tables on three continents?
Full Data Replication
The maximum approach: replicate every table at every site.
• Great for availability
• Great for joins
• Minimized transmission time
Full Data Replication
• Worst for concurrency control: every change to every table has to be reflected at every site
• Worst for security
• Takes up a lot of disk space
Partial Replication
Keep a copy of the entire database at headquarters in New York and have each table replicated exactly once at one of the other sites.
Partial Replication
• Improves availability: each table is now at two sites.
• Security and concurrency exposures are limited (in comparison).
• Joins occur at New York.
Partial Replication
New York could tend to become a bottleneck.
If a table is heavily used in both Tokyo and Los Angeles, it can only be placed at one of the two sites (plus the copy of the entire database in New York), leaving the other site with speed and telecom cost problems.
Targeted Replication
Targeted Replication
Place copies of tables at the sites that use them most heavily, in order to minimize telecommunications costs.
Ensure that there are at least two copies of important or frequently used tables, to realize the gains in availability.
Targeted Replication
Limit the number of copies of any one table, to control the security and concurrency issues.
Avoid any one site becoming a bottleneck.
Concurrency control is still an issue.
Concurrency Control in Distributed Database
The "lost update" problem is even worse now.
The locking protections discussed earlier, which handle the problem of concurrent update in a single table, are not adequate for the new, expanded problem in distributed database systems with replication.
Approaches: asynchronous and synchronous.
Asynchronous Approach (Pull Replication)
If retrieved data does not necessarily have to be up-to-the-minute accurate, we can use asynchronous approaches to updating replicated data. Two questions matter:
• How volatile (frequently updated) is the data?
• What is our tolerance for occasionally getting old data?
Asynchronous Schemes
• The site where the data was updated can send an update message to the other sites that contain a copy of the same table.
• One of the sites can be chosen to accumulate all of the updates to all of the tables and transmit the changes regularly.
• Each table can have one of the sites declared the "dominant" site for that table, which periodically transmits updates to the other sites.
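The dominant-site scheme can be sketched in a few lines of Python. Everything here (the Site class, apply_update, sync_replicas) is illustrative, not a real replication API; the point is that replicas serve stale data until the periodic push arrives.

```python
# Sketch of the "dominant site" asynchronous scheme: updates are applied at
# one designated site and pushed to the replicas only periodically.

class Site:
    def __init__(self, name):
        self.name = name
        self.table = {}          # local copy of a replicated table
        self.pending = []        # updates not yet pushed to other sites

def apply_update(dominant, key, value):
    """An update takes effect immediately at the dominant site only."""
    dominant.table[key] = value
    dominant.pending.append((key, value))

def sync_replicas(dominant, replicas):
    """Periodically, the dominant site pushes accumulated updates out."""
    for key, value in dominant.pending:
        for site in replicas:
            site.table[key] = value
    dominant.pending.clear()

ny, paris = Site("New York"), Site("Paris")
apply_update(ny, "A-177", 500)
stale = paris.table.get("A-177")   # None: the replica lags until the next sync
sync_replicas(ny, [paris])
fresh = paris.table["A-177"]       # 500 after the periodic push
```

The window between the update and the sync is exactly the "occasionally getting old data" tolerance the slides ask about.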
Synchronous Approach (Push Replication)
If retrieved data does have to be up-to-the-minute accurate, all data in replicated tables worldwide must always be consistent, accurate, and up-to-date.
Use the two-phase commit protocol (again).
Two-Phase Commit: Prepare Phase
• Each computer on the network has a special log file in addition to its database tables.
• The computer at the initiating site sends the updated data to the other sites that have copies of the table to be updated. This is done in a ready state, just prior to commit.
• The computers at the other sites record the changes in their logs (but not in the actual database tables). These computers attempt to lock the database entities involved in the update.
• If they are successful (the entities are not busy and can be locked), they inform the initiating site.
Two-Phase Commit: Commit Phase
If all of the other sites reported that they were successful in logging the update and locking the entities, the initiating site issues instructions to transfer the update from the logs to the actual database tables.
Two-Phase Commit
Either all of the replicated files are updated or none of them is updated: a form of atomicity.
A complex, costly, and time-consuming process; competing transactions make it more complex still.
The more volatile the data in the database, the less attractive this procedure is for updating replicated tables in the distributed database.
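The prepare and commit phases can be sketched as follows. The Node and two_phase_commit names are illustrative, not from any real DBMS; a production system adds timeouts, crash recovery from the logs, and handling of competing transactions.

```python
# Minimal sketch of two-phase commit: every node logs and locks in phase 1
# (write-ahead), and the change reaches the actual tables only in phase 2,
# and only if every node voted yes.

class Node:
    def __init__(self, name, can_lock=True):
        self.name, self.can_lock = name, can_lock
        self.table, self.log = {}, []

    def prepare(self, update):
        """Phase 1: record the change in the log and lock the entities."""
        if not self.can_lock:
            return False          # entities busy: vote to abort
        self.log.append(update)
        return True               # vote to commit

    def commit(self):
        """Phase 2: transfer the logged change into the actual table."""
        for key, value in self.log:
            self.table[key] = value
        self.log.clear()

    def abort(self):
        self.log.clear()          # undo: nothing reaches the table

def two_phase_commit(update, nodes):
    if all(n.prepare(update) for n in nodes):
        for n in nodes:
            n.commit()
        return "committed"
    for n in nodes:
        n.abort()
    return "aborted"

sites = [Node("NY"), Node("Paris"), Node("Tokyo")]
result = two_phase_commit(("F", "new row"), sites)        # all vote yes
sites_bad = [Node("NY"), Node("Paris", can_lock=False)]
result_bad = two_phase_commit(("F", "new row"), sites_bad)  # one vote no
```

In the failing run, NY has already logged the change, but the abort discards the log before anything touches its table, which is the all-or-nothing guarantee described above.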
Distributed Joins
A distributed join is a query, run from one of the computers in a distributed database system, that requires a join of two or more tables that are not all at the same computer.
The distributed DBMS must have its own built-in expert system that is capable of figuring out an efficient way to handle a request for a distributed join.
Partitioning
The purpose is to have the records or columns of a table resident at the sites that use them most frequently.
• Horizontal partitioning
• Vertical partitioning
Horizontal Partitioning
A relational table can be split up so that some records are located at one site, other records are located at another site, and so on. For example, the partitioning of Table G.
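As a sketch of the idea, assume a hypothetical Table G with a region column used as the routing rule; the column, the sites, and the rule are invented for illustration.

```python
# Horizontal partitioning: each site holds only the rows it is expected to
# query most often, yet the union of the fragments is still one logical table.

TABLE_G = [
    {"id": 1, "region": "Europe",  "amount": 100},
    {"id": 2, "region": "Asia",    "amount": 250},
    {"id": 3, "region": "Europe",  "amount": 75},
    {"id": 4, "region": "America", "amount": 300},
]

# Routing rule: which site stores the rows for which region.
SITE_FOR_REGION = {"Europe": "Paris", "Asia": "Tokyo", "America": "New York"}

def partition_horizontally(rows):
    fragments = {}
    for row in rows:
        site = SITE_FOR_REGION[row["region"]]
        fragments.setdefault(site, []).append(row)
    return fragments

fragments = partition_horizontally(TABLE_G)
# Reassembling every fragment recovers the single logical Table G.
reunited = sorted((r for frag in fragments.values() for r in frag),
                  key=lambda r: r["id"])
```

Queries on European data now run entirely at Paris; only queries spanning regions need to touch more than one fragment.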
Example: Banking System
• Four branches in four different cities. Each branch has its own computer, with its own accounts stored there; one site has information about all branches of the bank.
• Each branch maintains the schema Account(accountnum, branchname, balance).
• The central site maintains the schema Branch(branchname, branchcity, assets).
• Local transaction: add $50 to account A-177 at the Greenville branch, initiated at the Greenville branch.
• Global transaction: transfer $50 from A-177 to A-305, located at the Columbia branch.
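A minimal sketch of the local/global distinction in the banking example; the data layout and helper names (branch_of, classify) are invented for illustration.

```python
# A transaction touching only the initiating branch's fragment is local;
# one that spans fragments (like the A-177 -> A-305 transfer) is global
# and needs a protocol such as 2PC to commit atomically at both sites.

accounts = {   # branchname -> {accountnum: balance}, one fragment per site
    "Greenville": {"A-177": 200},
    "Columbia":   {"A-305": 500},
}

def branch_of(accountnum):
    for branch, accts in accounts.items():
        if accountnum in accts:
            return branch
    raise KeyError(accountnum)

def classify(initiating_branch, touched_accounts):
    branches = {branch_of(a) for a in touched_accounts}
    return "local" if branches == {initiating_branch} else "global"

deposit = classify("Greenville", ["A-177"])             # only Greenville
transfer = classify("Greenville", ["A-177", "A-305"])   # Greenville + Columbia
```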
Vertical Partitioning
The columns of a table are divided up among several cities on the network.
Each such partition must include the primary key attribute(s) of the table.
Makes sense when different sites are responsible for processing different functions involving an entity.
A distributed join is required to bring the data back together.
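A sketch of vertical partitioning and the rejoin, using an invented employee table: each fragment carries the primary key (empid here) plus the columns its site processes.

```python
# Vertical partitioning: every fragment repeats the primary key, so a join
# on that key can rebuild the original rows.

EMPLOYEES = [
    {"empid": 1, "name": "Ada",  "salary": 90000, "office": "Paris"},
    {"empid": 2, "name": "Alan", "salary": 80000, "office": "Tokyo"},
]

def project(rows, columns):
    """A vertical fragment: the primary key plus a subset of the columns."""
    return [{c: r[c] for c in columns} for r in rows]

# The payroll site keeps salary data; the facilities site keeps office data.
payroll    = project(EMPLOYEES, ["empid", "name", "salary"])
facilities = project(EMPLOYEES, ["empid", "office"])

def join_on_key(left, right, key="empid"):
    """The distributed join needed to bring the fragments back together."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

rebuilt = join_on_key(payroll, facilities)   # same rows as EMPLOYEES
```

Dropping empid from either fragment would make the rejoin impossible, which is why each partition must include the primary key attribute(s).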
Major Tradeoff
CAP Theorem: a distributed system can provide only two of these three guarantees:
• Consistency (all transactions see the same data at the same time)
• Availability (all operations eventually receive a response)
• Partition tolerance (the system continues to operate despite network partitions)
In practice, if you have a network that may drop messages, "partition tolerance" means coping by deciding which of consistency or availability to drop.
This leads to the BASE alternative to ACID:
• Basically Available
• Soft state
• Eventually consistent
Availability at the cost of consistency.
NoSQL Databases
• Embrace the BASE model.
• Highly distributed databases with eventual consistency.
• Recall the three V's: volume, velocity, variety.
Key Assumptions of Hadoop Distributed File System
• High volume
• Write-once, read-many
• Streaming access
• Move computations to the data
• Fault tolerance
Hadoop Distributed File System (HDFS)
Distributed Directory Management
A distributed DBMS must include a directory that keeps track of where the database tables, the replicated copies of database tables (if any), and the table partitions (if any) are located.
When a query is presented at any site on the network, the distributed DBMS can automatically use the directory to find out where the required data is located, maintaining location transparency.
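A sketch of such a directory, reusing the running A-F table example; the replication of Table F at two sites and the lookup rule are assumptions for illustration.

```python
# A distributed directory: a mapping from each table to the sites holding a
# copy of it. The DBMS consults it on every query, so users never need to
# know where the data lives (location transparency).

DIRECTORY = {
    "A": ["New York"],
    "B": ["New York"],
    "C": ["Memphis"],
    "D": ["Tokyo"],
    "E": ["Tokyo"],
    "F": ["Paris", "New York"],   # F is replicated at two sites
}

def locate(table, querying_site):
    """Prefer a local copy; otherwise route to any site holding the table."""
    sites = DIRECTORY[table]
    return querying_site if querying_site in sites else sites[0]

local_hit  = locate("F", "Paris")   # served from the local replica
remote_hit = locate("C", "Paris")   # directory routes the query to Memphis
```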
Directory Location
• The entire directory could be stored at only one site.
• Copies of the directory could be stored at several of the sites.
• A copy of the directory could be stored at every site. (This is generally the best solution.) Why?
Distributed Joins
The DBMS evaluates various options for performing a join by considering:
• The number and size of the records from each table involved in the join.
• The distances and costs of transmitting the records from one site to another to execute the join.
• The distance and cost of shipping the result of the join back to the site that issued the query in the first place.
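These considerations can be sketched as a simple cost model: estimate the bytes shipped for each candidate execution site and pick the cheapest. The table sizes and the cost formula are invented for illustration; real optimizers also weigh record counts and per-link transmission costs.

```python
# For a two-table distributed join, data moved = (non-local tables shipped
# to the execution site) + (result shipped back to the querying site).

def join_cost(exec_site, tables, result_size, query_site):
    cost = sum(size for site, size in tables if site != exec_site)
    if exec_site != query_site:
        cost += result_size
    return cost

# (site, size-in-MB) for the two tables; the query is issued at Paris.
tables = [("New York", 500), ("Paris", 20)]
result_size = 5   # the join result is small

candidates = {site: join_cost(site, tables, result_size, "Paris")
              for site, _ in tables}
best_site = min(candidates, key=candidates.get)
```

Here executing at New York means shipping the small Paris table in and the small result back (25 MB total), far cheaper than dragging the 500 MB New York table to Paris; this is the kind of choice the DDBMS's built-in expert system makes.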