Top Banner
Distributed databases 1 Distributed databases
35

Distributed databases 1. 2 Outline introduction principles / objectives problems.

Mar 28, 2015

Download

Documents

Cole McGarry
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

1

Distributed databases

Page 2: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

2

Outline

introduction principles / objectives problems

Page 3: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

3

Introduction

communication network

server

applicationapplication

application

applicationapplication

applicationapplication

server

serverDBMS in its own right

Page 4: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

4

Introduction

distributed database = collection of connected sites each site is a DB in its own right (1)

• has its own DBMS and its own users

• operations can be performed locally as if the DB was not distributed

the sites collaborate (transparently from the user’s point of view) the union of all DBs = the DB of the whole organisation (institution)

• (oppose to (1))

physical or logical distribution strict homogeneity (assumption)

Page 5: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

5

Motivation

advantages matches the structure of the organisation

• example

efficiency of processing• stored closely to where it is being used

increased accessibility• remote DBs can be accessed

disadvantage complexity

Page 6: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

6

Implementations (systems)

commercial INGRES/STAR (Ask Group Inc. Ingres Division) distributed database option of ORACLE7 (Oracle

Corporation) distributed data facility of DB2 (IBM)

they all provide some sort of features for distributed databases

Page 7: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

7

Fundamental principle

a distributed DB system should look to the user exactly as a non-distributed DB system

Page 8: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

8

Principles / objectives

local autonomy no reliance on central site

continuous operation location independence

fragmentation independence replication independence

distributed query processing distributed transaction management

hardware independence OS independence

network independence DBMS independence

Page 9: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

9

Principles / objectives

not independent from each other not exhaustive sometimes contradicting different degree of importance (for the user)

Page 10: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

10

Local autonomy

all operations at a certain site are fully controlled by that site

not achievable (why?) therefore, autonomy should be achieved to the

maximum extent possible

local data is locally owned and managed local data belongs to the local server even if it is

accessible from other servers security, integrity, ..., are in the responsibility of the local

server

Page 11: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

11

No reliance on a central site

reasons bottle-neck vulnerability

conclusion all sites must be equal

Page 12: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

12

Continuous operation

greater reliability the probability that the system is running at any moment

of time

greater availability the probability that the system is running for a specified

period of time

Page 13: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

13

Location independence / transparency

users should not have to know where data is physically stored

why do you think this is needed?• think of application programs

what does this objective look like?

Page 14: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

14

Data fragmentation

data fragmentation if a relation can be divided into “fragments” for storing purposes motivation: performance - data is stored where it is mostly used

types horizontal or vertical

definition fragment = any subrelation derivable via restriction or projection

restrictions disjoint decompositions non-loss decompositions

Page 15: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

15

FRAGMENT Emp INTOLo_Emp AT SITE ‘London’

WHERE Dept_id = ‘Sales’Le_Emp AT SITE ‘Leeds’

WHERE Dept_id = ‘Dev’ ;

Data fragmentation - example

Page 16: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

16

Fragmentation independence / transparency

users should perceive data as if it were not fragmented

why?

it is the optimiser’s responsibility to determine which fragments need to be physically accessed

similar to views retrieving updating (JOIN and UNION views)

Page 17: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

17

FRAGMENT Emp INTOLo_Emp AT SITE ‘London’ WHERE Dept_id = ‘Sales’Le_Emp AT SITE ‘Leeds’ WHERE Dept_id = ‘Dev’ ;

--looks (and works almost) like a viewSELECT * FROM Emp WHERE Salary > 40 AND Dept_id = ‘Dev’;--is transformed intoSELECT * FROM Lo_emp WHERE Salary > 40 AND Dept_id = ‘Dev’;UNIONSELECT * FROM Le_emp WHERE Salary > 40 AND Dept_id = ‘Dev’ ;

Fragmentation independence - example

Page 18: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

18

Data replication

copies of the same fragment can exist at different sites

reasons better availability better performance

disadvantage update propagation

Page 19: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

19

Replication independence / transparency

users should not have to be aware of data replication

it is the optimiser’s responsibility to choose which replica to use

commercial systems not full support for replication independence (update

problems) - primary copy

Page 20: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

20

Distributed query processing

the system must have set level operators one record at a time - too many messages (traffic) relational - indicated

optimisation particularly relevant! find best way to move data across the network

Page 21: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

21

Distributed transaction management

problems recovery concurrency

transaction = set of agents agent - runs on a certain machine

recovery two-phase commit protocol

concurrency locking

Page 22: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

22

Problems

occur due to network utilisation network messages are costly

aim minimise network utilisation

problems: query processing catalog management update propagation recovery control concurrency control

Page 23: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

23

Query processing

in a distributed environment query execution is distributed query optimisation is distributed

• global optimisation

• local optimisation

example• query on relation R issued at site X

• part of R, say Ry, stored at Y

• part of R, say Rz, stored at Z

• where is the query going to be executed?

Page 24: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

24

Query processing example - initial conditions

Site A: Suppliers ( S_id, City ) 10,000 tuplesContracts ( S_id, P_id ) 1,000,000 tuples

Site B: Parts (P_id, Colour ) 100,000 tuples

SELECT S.S_idFROM Suppliers S, Contracts C, Parts PWHERE S.S_id = C.S_id AND P.P_id = C.P_id AND

City = ‘London’ ANDColour = ‘red’ ;

Page 25: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

25

Query processing example - evaluation

possible evaluation procedures (1) move relation Parts to site A and evaluate the query at A (2) move relations Suppliers and Contracts to B and evaluate at B (3) join Suppliers with Contracts at A, restrict the tuples for suppliers

from London, and for each of these tuples check at site B to see whether the corresponding part is red

(4) join Suppliers with Contracts at A, restrict the tuples for suppliers from London, transfer them B and terminate the processing there

(5) restrict Parts to tuples containing red parts, move the result to A and process there

(6) think of other possibilities … there is an extra dimension added by the site where the query was issued

Page 26: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

26

Query processing example - total time

total_time = delay_time + data_transfer_time = no_messages * 0.1 + data_volume(in bits) / 50000

assumptions:

1. disregard computation time on each server (site)2. estimated cardinality of some intermediate results

red parts …... 10contracts with suppliers from London …... 50,000

3. communication assumptions

date rate …... 50k bits / second

access delay …... 0.1 second

4. size of each tuple ……. 200 bits

Page 27: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

27

Query processing example - total time

procedure messages/ accesses

total datatransferred

time

move all to A 1 access 100,000 * 200 6.67min

move all to B 2 accesses (1,000,000 + 10,000) *200

1.12hours

join S and C at site A, restrictto 'London' and check foreach tuple …

2 * 50,000messages(query +response foreach tuple)

0 2.58hours

restrict P to 'red' and checkfor each tuple whether thereexists a contract …

2 * 10messages(see above)

0 2seconds

restrict P to 'red', transfer tosite A and process at A

1 access 10 * 200 0.1seconds

Page 28: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

28

Catalog management

what ‘other’ data does the catalog include? fragmentation, replication ...

where should the catalogue be stored centralised

• objective: no central site!

fully replicated • loss of autonomy - update propagation!

partitioned • non local operations - very expensive!

combination of first and third

Page 29: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

29

Catalog management

R* - object naming

<creator ID>@<creator site ID>.<local name>@<birth site ID> e.g. [email protected]@PostgresGold

each site maintains a catalog entry for every object born at that site (and the site where it had

migrated, if applicable) every object stored at that site object identification - at most 2 sites need to be accessed

Page 30: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

30

Update propagation

problems because of replication data might become less available

• due to immediate update request

primary copy scheme one copy is designated primary copy (unique) primary copies exist at different sites (distributed) an update is logically complete if the primary copy has been updated

• the site holding the primary copy would have to propagate the updates• this has to be done before COMMIT (preserve - ACID)• commercial systems: update propagation is guaranteed for some future

time

violation of local autonomy

Page 31: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

31

Recovery control

two-phase commit protocol issues

there can be no central site so each site should be able to act as a coordinator

• usually the site where the transaction was initiated

other sites are told by the coordinator what to do • loss of autonomy

there is no protocol (theoretically) that guarantees that a transaction is / is not performed by all agents with respect to any kind of failure

increased number of messages more complex protocols

Page 32: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

32

Concurrency control

locking overhead - increased number of messages

primary copy strategy locking only the primary copy the primary copy’s site will propagate the update loss of autonomy (severely)

global deadlock two interlocked (waiting for each other) sites cannot be detected using the wait-for graph - therefore,

communication overhead

Page 33: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

33

Global deadlock

Transaction 1x

Transaction 1y Transaction 2y

Transaction 2x

holds lock on tx

holds lock on ty

site X

site Y

Page 34: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

34

Gateways

DBMS #1(Ingres)

DBMS #2(Oracle)

GATEWAY

Page 35: Distributed databases 1. 2 Outline introduction principles / objectives problems.

Distributed databases

35

Client / server systems

are a special case of distributed database systems

remember from last term read for extra information