Module 16: Distributed System Structures

Transcript
Page 1: Module 16:  Distributed System Structures

Module 16: Distributed System Structures

Page 2: Module 16:  Distributed System Structures

Chapter Objectives

To provide a high-level overview of distributed systems and the networks that interconnect them

To discuss the general structure of distributed operating systems

Page 3: Module 16:  Distributed System Structures

Motivation

A distributed system is a collection of loosely coupled processors interconnected by a communications network

Processors are variously called nodes, computers, machines, or hosts; the site is the location of the processor

Reasons for distributed systems:

Resource sharing – sharing and printing files at remote sites, processing information in a distributed database, using remote specialized hardware devices

Computation speedup – load sharing

Reliability – detect and recover from site failure, function transfer, reintegrate failed site

Communication – message passing

Page 4: Module 16:  Distributed System Structures

A Distributed System

Page 5: Module 16:  Distributed System Structures

Types of Distributed Operating Systems

Network Operating Systems

Distributed Operating Systems

Page 6: Module 16:  Distributed System Structures

Network Operating Systems

Users are aware of the multiplicity of machines. Access to resources of various machines is done explicitly by:

Remote logging into the appropriate remote machine (telnet, ssh)

Remote Desktop (Microsoft Windows)

Transferring data from remote machines to local machines via the File Transfer Protocol (FTP) mechanism

Page 7: Module 16:  Distributed System Structures

Distributed Operating Systems

Users are not aware of the multiplicity of machines

Access to remote resources is similar to access to local resources

Data Migration – transfer data by transferring the entire file, or transferring only those portions of the file necessary for the immediate task

Computation Migration – transfer the computation, rather than the data, across the system

Page 8: Module 16:  Distributed System Structures

Distributed Operating Systems (Cont.)

Process Migration – execute an entire process, or parts of it, at different sites

Load balancing – distribute processes across the network to even the workload

Computation speedup – subprocesses can run concurrently on different sites

Hardware preference – process execution may require a specialized processor

Software preference – required software may be available at only a particular site

Data access – run the process remotely, rather than transfer all data locally

Page 9: Module 16:  Distributed System Structures

Network Structure

Local-Area Network (LAN) – designed to cover a small geographical area

Multiaccess bus, ring, or star network

Speed: 10 – 100 megabits/second

Broadcast is fast and cheap

Nodes: usually workstations and/or personal computers, plus a few (usually one or two) mainframes

Page 10: Module 16:  Distributed System Structures

Network Types (Cont.)

Wide-Area Network (WAN) – links geographically separated sites

Point-to-point connections over long-haul lines (often leased from a phone company)

Speed: 1.544 – 45 megabits/second

Broadcast usually requires multiple messages

Nodes: usually a high percentage of mainframes

Page 11: Module 16:  Distributed System Structures

Network Conversations

(Figure: a network conversation between two end systems, a requester and a replier)

Page 12: Module 16:  Distributed System Structures

Network Topology

The various topologies are depicted as graphs whose nodes correspond to sites

An edge from node A to node B corresponds to a direct connection between the two sites

Page 13: Module 16:  Distributed System Structures

TCP/IP and OSI Layers

TCP/IP Suite                              OSI Reference
Telnet, FTP, SMTP, HTTP, etc.             Application (FTAM, X.400, etc.)
                                          Presentation (ISO 8823)
                                          Session (ISO 8327)
TCP, UDP (end-to-end transport)           Transport (ISO 8073)
Internet: IP, ICMP (path determination)   Network (ISO 8473)
Network access/link: 802.x MAC, LLC/MAC   Data Link (ISO 8802.x, link-to-link)
802.x physical                            Physical

Page 14: Module 16:  Distributed System Structures

Communication Protocol

The communication network is partitioned into the following multiple layers:

Physical layer – handles the mechanical and electrical details of the physical transmission of a bit stream

Data-link layer – handles the frames, or fixed-length parts of packets, including any error detection and recovery that occurred in the physical layer

Network layer – provides connections and routes packets in the communication network, including handling the address of outgoing packets, decoding the address of incoming packets, and maintaining routing information for proper response to changing load levels

Page 15: Module 16:  Distributed System Structures

Communication Protocol (Cont.)

Transport layer – responsible for low-level network access and for message transfer between clients, including partitioning messages into packets, maintaining packet order, controlling flow, and generating physical addresses

Session layer – implements sessions, or process-to-process communication protocols

Presentation layer – resolves the differences in formats among the various sites in the network, including character conversions and half duplex/full duplex (echoing)

Application layer – interacts directly with users; deals with file transfer, remote-login protocols, and electronic mail, as well as schemas for distributed databases

Page 16: Module 16:  Distributed System Structures

TCP/IP View of Encapsulation

User data is wrapped as it moves down the stack: a TCP header is prepended to form a TCP segment; an IP header is prepended to form the network-layer packet; a link header is prepended at the link layer; and a MAC header and MAC trailer enclose the whole unit in a MAC frame

Page 17: Module 16:  Distributed System Structures

TCP/IP Message Flow

(Figure: peer layers on two hosts exchange protocol data units through the service access point at each interface – HTTP messages at the application layer, TCP segments at the transport layer, IP packets at the network layer, Ethernet frames at the data-link layer, and bits at the physical layer)

Page 18: Module 16:  Distributed System Structures

Up and Down the Layers

(Figure: an HTTP message from a browser on Open System A travels down that host's stack – TCP segment, packet, frame, bits – crosses a relay node (router) that processes traffic only up through the network layer, and travels back up the stack to the server on Open System B)

Page 19: Module 16:  Distributed System Structures

Communication Structure

The design of a communication network must address four basic issues:

Naming and name resolution – How do two processes locate each other to communicate?

Routing strategies – How are messages sent through the network?

Connection strategies – How do two processes send a sequence of messages?

Contention – The network is a shared resource, so how do we resolve conflicting demands for its use?

Page 20: Module 16:  Distributed System Structures

Naming and Name Resolution

Name systems in the network

Address messages with the process-id

Identify processes on remote systems by the <host-name, identifier> pair

Domain name service (DNS) – specifies the naming structure of the hosts, as well as name-to-address resolution (Internet)
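
To make the name-to-address step concrete, here is a minimal sketch in Python that asks the operating system's resolver (which in turn consults DNS) for a host's addresses. The host name is illustrative only.

```python
import socket

# Resolve a host name to its network addresses via the OS resolver (DNS).
# The <host-name, identifier> pair above corresponds to (host, port) here.
def resolve(host: str, port: int = 80) -> list[str]:
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})  # info[4] is the sockaddr

if __name__ == "__main__":
    print(resolve("www.example.com"))
```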

Page 21: Module 16:  Distributed System Structures

Distributed, Hierarchical Database

(Figure: root DNS servers delegate to top-level-domain servers – com, org, edu – which in turn delegate to authoritative servers such as the yahoo.com, amazon.com, pbs.org, poly.edu, and umass.edu DNS servers)

Client wants the IP for www.amazon.com (first approximation):

1. Client queries a root server to find a com DNS server
2. Client queries the com DNS server to get the amazon.com DNS server
3. Client queries the amazon.com DNS server to get the IP address for www.amazon.com

Page 22: Module 16:  Distributed System Structures

DNS: Root name servers

Contacted by a local name server that cannot resolve a name

Root name server:
contacts an authoritative name server if the name mapping is not known
gets the mapping
returns the mapping to the local name server

13 root name servers worldwide:
a – Verisign, Dulles, VA
b – USC-ISI, Marina del Rey, CA
c – Cogent, Herndon, VA (also LA)
d – U Maryland, College Park, MD
e – NASA, Mt View, CA
f – Internet Software Consortium, Palo Alto, CA (and 36 other locations)
g – US DoD, Vienna, VA
h – ARL, Aberdeen, MD
i – Autonomica, Stockholm (plus 28 other locations)
j – Verisign (21 locations)
k – RIPE, London (also 16 other locations)
l – ICANN, Los Angeles, CA
m – WIDE, Tokyo (also Seoul, Paris, SF)

Page 23: Module 16:  Distributed System Structures

Routing Strategies

Fixed routing – a path from A to B is specified in advance; the path changes only if a hardware failure disables it

Since the shortest path is usually chosen, communication costs are minimized

Fixed routing cannot adapt to load changes

Ensures that messages will be delivered in the order in which they were sent

Virtual circuit – a path from A to B is fixed for the duration of one session; different sessions involving messages from A to B may have different paths

Partial remedy to adapting to load changes

Ensures that messages will be delivered in the order in which they were sent

Page 24: Module 16:  Distributed System Structures

Routing Strategies (Cont.)

Dynamic routing – the path used to send a message from site A to site B is chosen only when the message is sent

Usually a site sends a message to another site on the link least used at that particular time

Adapts to load changes by avoiding routing messages on heavily used paths

Messages may arrive out of order; this problem can be remedied by appending a sequence number to each message

Page 25: Module 16:  Distributed System Structures

Connection Strategies

Circuit switching – a permanent physical link is established for the duration of the communication (e.g., the telephone system)

Message switching – a temporary link is established for the duration of one message transfer (e.g., the post-office mailing system)

Packet switching – messages of variable length are divided into fixed-length packets which are sent to the destination

Each packet may take a different path through the network

The packets must be reassembled into messages as they arrive

Circuit switching requires setup time, but incurs less overhead for shipping each message, and may waste network bandwidth; message and packet switching require less setup time, but incur more overhead per message

Page 26: Module 16:  Distributed System Structures

Circuit Switching

Page 27: Module 16:  Distributed System Structures

Packet Switching

Page 28: Module 16:  Distributed System Structures

Contention

Several sites may want to transmit information over a link simultaneously. Techniques to avoid repeated collisions include:

CSMA/CD – carrier sense with multiple access (CSMA) and collision detection (CD)

A site determines whether another message is currently being transmitted over the link; if two or more sites begin transmitting at exactly the same time, they will register a CD and stop transmitting

When the system is very busy, many collisions may occur, and thus performance may be degraded

CSMA/CD is used successfully in the Ethernet system, the most common network system
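
When a collision is detected, Ethernet retransmits after a random, exponentially growing delay. The sketch below shows that backoff rule under stated assumptions: the 51.2 microsecond slot time and the cap of 10 doublings are taken from classic 10 Mbps Ethernet, not from this slide.

```python
import random

SLOT_TIME_US = 51.2  # classic 10 Mbps Ethernet slot time (assumed)

def backoff_delay(collision_count: int) -> float:
    """Return a random delay in microseconds after the n-th successive collision."""
    k = min(collision_count, 10)         # the window stops growing after 10
    slots = random.randint(0, 2**k - 1)  # pick a slot uniformly at random
    return slots * SLOT_TIME_US
```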

Page 29: Module 16:  Distributed System Structures

Failure Detection

Detecting hardware failure is difficult

To detect a link failure, a handshaking protocol can be used

Assume Site A and Site B have established a link; at fixed intervals, each site exchanges an I-am-up message indicating that it is up and running

If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lost

Site A can now send an Are-you-up? message to Site B

If Site A does not receive a reply, it can repeat the message or try an alternate route to Site B
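
A minimal sketch of this handshake from Site A's point of view, assuming a send(site, msg) network primitive and an agreed heartbeat interval (neither is specified on the slide):

```python
import time

INTERVAL = 5.0                      # assumed I-am-up period, in seconds
last_heard: dict[str, float] = {}   # site -> time of last I-am-up message

def on_i_am_up(site: str) -> None:
    last_heard[site] = time.monotonic()

def check(site: str, send) -> None:
    """Run periodically; probes the other site if its heartbeat is late."""
    if time.monotonic() - last_heard.get(site, 0.0) > INTERVAL:
        # Either the site is down or its message was lost - ask explicitly.
        send(site, "Are-you-up?")
```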

Page 30: Module 16:  Distributed System Structures

Failure Detection (Cont.)

If Site A does not ultimately receive a reply from Site B, it concludes some type of failure has occurred

Types of failures:
- Site B is down
- The direct link between A and B is down
- The alternate link from A to B is down
- The message has been lost

However, Site A cannot determine exactly why the failure has occurred

Page 31: Module 16:  Distributed System Structures

Reconfiguration

When Site A determines a failure has occurred, it must reconfigure the system:

1. If the link from A to B has failed, this must be broadcast to every site in the system

2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer available

When the link or the site becomes available again, this information must again be broadcast to all other sites

Page 32: Module 16:  Distributed System Structures

An Ethernet Packet

Page 33: Module 16:  Distributed System Structures

Chapter 17: Distributed File Systems

Page 34: Module 16:  Distributed System Structures

Background

Distributed file system (DFS) – a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources

A DFS manages a set of dispersed storage devices

Overall storage space managed by a DFS is composed of different, remotely located, smaller storage spaces

There is usually a correspondence between constituent storage spaces and sets of files

Page 35: Module 16:  Distributed System Structures

DFS Structure

Service – software entity running on one or more machines and providing a particular type of function to a priori unknown clients

Server – service software running on a single machine

Client – process that can invoke a service using a set of operations that forms its client interface

A client interface for a file service is formed by a set of primitive file operations (create, delete, read, write)

Client interface of a DFS should be transparent, i.e., not distinguish between local and remote files

Page 36: Module 16:  Distributed System Structures

Naming and Transparency

Naming – mapping between logical and physical objects

Multilevel mapping – abstraction of a file that hides the details of how and where on the disk the file is actually stored

A transparent DFS hides the location where in the network the file is stored

For a file being replicated in several sites, the mapping returns a set of the locations of this file’s replicas; both the existence of multiple copies and their location are hidden

Page 37: Module 16:  Distributed System Structures

Naming Structures

Location transparency – file name does not reveal the file’s physical storage location

Location independence – file name does not need to be changed when the file’s physical storage location changes

Page 38: Module 16:  Distributed System Structures

Remote File Access

The remote-service mechanism is one transfer approach

Reduce network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally

If the needed data are not already cached, a copy is brought from the server to the user

Accesses are performed on the cached copy

Files are identified with one master copy residing at the server machine, but copies of (parts of) the file are scattered in different caches

Cache-consistency problem – keeping the cached copies consistent with the master file; could be called network virtual memory

Page 39: Module 16:  Distributed System Structures

Cache Location – Disk vs. Main Memory

Advantages of disk caches:

More reliable

Cached data kept on disk are still there during recovery and don't need to be fetched again

Advantages of main-memory caches:

Permit workstations to be diskless

Data can be accessed more quickly

Performance speedup in bigger memories

Server caches (used to speed up disk I/O) are in main memory regardless of where user caches are located; using main-memory caches on the user machine permits a single caching mechanism for servers and users

Page 40: Module 16:  Distributed System Structures

Cache Update Policy

Write-through – write data through to disk as soon as they are placed on any cache

Reliable, but poor performance

Delayed-write – modifications are written to the cache and then written through to the server later

Write accesses complete quickly; some data may be overwritten before they are written back, and so need never be written at all

Poor reliability; unwritten data will be lost whenever a user machine crashes

Variation – scan the cache at regular intervals and flush blocks that have been modified since the last scan

Variation – write-on-close, writing data back to the server when the file is closed; best for files that are open for long periods and frequently modified
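
A minimal sketch contrasting the two policies, assuming a server_write(block, data) RPC to the file server; the block-number keys and the dirty set are illustrative, not any particular DFS implementation.

```python
cache: dict[int, bytes] = {}  # block number -> cached contents
dirty: set[int] = set()       # blocks modified since the last flush

def write_through(block: int, data: bytes, server_write) -> None:
    cache[block] = data
    server_write(block, data)  # reaches the server immediately: reliable

def delayed_write(block: int, data: bytes) -> None:
    cache[block] = data
    dirty.add(block)           # fast, but lost if this machine crashes

def flush(server_write) -> None:
    """The periodic-scan variant: push everything modified since the last scan."""
    for block in sorted(dirty):
        server_write(block, cache[block])
    dirty.clear()
```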

Page 41: Module 16:  Distributed System Structures

Cachefs and its Use of Caching

Page 42: Module 16:  Distributed System Structures

Consistency

Is the locally cached copy of the data consistent with the master copy?

Client-initiated approach:

Client initiates a validity check

Server checks whether the local data are consistent with the master copy

Server-initiated approach:

Server records, for each client, the (parts of) files it caches

When the server detects a potential inconsistency, it must react

Page 43: Module 16:  Distributed System Structures

Comparing Caching and Remote Service

Page 44: Module 16:  Distributed System Structures

Stateful File Service

Mechanism:

Client opens a file

Server fetches information about the file from its disk, stores it in its memory, and gives the client a connection identifier unique to the client and the open file

The identifier is used for subsequent accesses until the session ends

Server must reclaim the main-memory space used by clients who are no longer active

Increased performance:

Fewer disk accesses

A stateful server knows if a file was opened for sequential access and can thus read ahead the next blocks

Page 45: Module 16:  Distributed System Structures

Stateless File Server

Avoids state information by making each request self-contained

Each request identifies the file and position in the file

No need to establish and terminate a connection by open and close operations
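
A minimal sketch of such a self-contained request; the ReadRequest type and its field names are illustrative, not a real protocol definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReadRequest:
    path: str     # identifies the file - no prior open() is assumed
    offset: int   # position in the file, carried on every request
    length: int   # number of bytes to read

def handle(req: ReadRequest) -> bytes:
    """The server keeps no per-client state between requests."""
    with open(req.path, "rb") as f:
        f.seek(req.offset)
        return f.read(req.length)
```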

Page 46: Module 16:  Distributed System Structures

Distinctions Between Stateful & Stateless Service

Failure recovery:

A stateful server loses all its volatile state in a crash

Restore state by a recovery protocol based on a dialog with clients, or abort operations that were underway when the crash occurred

Server needs to be aware of client failures in order to reclaim space allocated to record the state of crashed client processes (orphan detection and elimination)

With a stateless server, the effects of server failure and recovery are almost unnoticeable

A newly reincarnated server can respond to a self-contained request without any difficulty

Page 47: Module 16:  Distributed System Structures

Distinctions (Cont.)

Penalties for using the robust stateless service:

longer request messages

slower request processing

additional constraints imposed on DFS design

Some environments require stateful service:

A server employing server-initiated cache validation cannot provide stateless service, since it maintains a record of which files are cached by which clients

UNIX use of file descriptors and implicit offsets is inherently stateful; servers must maintain tables to map the file descriptors, and store the current offset within a file

Page 48: Module 16:  Distributed System Structures

File Replication

Replicas of the same file reside on failure-independent machines

Improves availability and can shorten service time

Naming scheme maps a replicated file name to a particular replica

Existence of replicas should be invisible to higher levels

Replicas must be distinguished from one another by different lower-level names

Updates – replicas of a file denote the same logical entity, and thus an update to any replica must be reflected on all other replicas

Demand replication – reading a nonlocal replica causes it to be cached locally, thereby generating a new nonprimary replica

Page 49: Module 16:  Distributed System Structures

Chapter 18: Distributed Coordination

Page 50: Module 16:  Distributed System Structures

Chapter 18 Distributed Coordination

Event Ordering

Mutual Exclusion

Atomicity

Concurrency Control

Deadlock Handling

Election Algorithms

Page 51: Module 16:  Distributed System Structures

Event Ordering

Happened-before relation (denoted by →)

If A and B are events in the same process, and A was executed before B, then A → B

If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B

If A → B and B → C, then A → C

Page 52: Module 16:  Distributed System Structures

Relative Time for Three Concurrent Processes

Page 53: Module 16:  Distributed System Structures

Implementation of →

Associate a timestamp with each system event; require that for every pair of events A and B, if A → B, then the timestamp of A is less than the timestamp of B

Within each process Pi a logical clock LCi is associated

The logical clock can be implemented as a simple counter that is incremented between any two successive events executed within a process

The logical clock is monotonically increasing

A process advances its logical clock when it receives a message whose timestamp is greater than the current value of its logical clock

If the timestamps of two events A and B are the same, then the events are concurrent

We may use the process identity numbers to break ties and to create a total ordering
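
A minimal sketch of this logical clock (Lamport's scheme), with the process id breaking ties as the slide suggests; names are illustrative.

```python
class LogicalClock:
    """Counter-based logical clock; (time, pid) pairs give a total order."""

    def __init__(self, pid: int):
        self.pid = pid
        self.time = 0

    def tick(self) -> tuple[int, int]:
        """Advance for a local or send event; return the event's timestamp."""
        self.time += 1
        return (self.time, self.pid)

    def on_receive(self, msg_time: int) -> tuple[int, int]:
        """Advance past the sender's timestamp when a message arrives."""
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.pid)
```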

Page 54: Module 16:  Distributed System Structures

Distributed Mutual Exclusion (DME)

Assumptions:

The system consists of n processes; each process Pi resides at a different processor

Each process has a critical section that requires mutual exclusion

Requirement:

If Pi is executing in its critical section, then no other process Pj is executing in its critical section

We present two algorithms to ensure the mutually exclusive execution of processes in their critical sections

Page 55: Module 16:  Distributed System Structures

DME: Centralized Approach

One of the processes in the system is chosen to coordinate the entry to the critical section

A process that wants to enter its critical section sends a request message to the coordinator

The coordinator decides which process can enter the critical section next, and it sends that process a reply message

When the process receives a reply message from the coordinator, it enters its critical section

After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution

This scheme requires three messages per critical-section entry: request, reply, release
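
A minimal sketch of the coordinator's side of this scheme, assuming a send(pid, msg) network primitive; granting requests in FIFO order is one reasonable policy the slide leaves open.

```python
from collections import deque

class Coordinator:
    def __init__(self, send):
        self.send = send
        self.waiting: deque[int] = deque()  # pending requesters, FIFO
        self.holder: int | None = None      # pid now in its critical section

    def on_request(self, pid: int) -> None:
        if self.holder is None:
            self.holder = pid
            self.send(pid, "reply")          # grant entry immediately
        else:
            self.waiting.append(pid)         # defer until a release arrives

    def on_release(self, pid: int) -> None:
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.send(self.holder, "reply")  # grant to the next in line
```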

Page 56: Module 16:  Distributed System Structures

DME: Fully Distributed Approach

When process Pi wants to enter its critical section, it generates a new timestamp, TS, and sends the message request (Pi, TS) to all other processes in the system

When process Pj receives a request message, it may reply immediately or it may defer sending a reply back

When process Pi receives a reply message from all other processes in the system, it can enter its critical section

After exiting its critical section, the process sends reply messages to all its deferred requests

Page 57: Module 16:  Distributed System Structures

DME: Fully Distributed Approach (Cont.)

The decision whether process Pj replies immediately to a request(Pi, TS) message or defers its reply is based on three factors:

If Pj is in its critical section, then it defers its reply to Pi

If Pj does not want to enter its critical section, then it sends a reply immediately to Pi

If Pj wants to enter its critical section but has not yet entered it, then it compares its own request timestamp with the timestamp TS; if its own request timestamp is greater than TS, then it sends a reply immediately to Pi (Pi asked first); otherwise, the reply is deferred
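
A minimal sketch of Pj's reply decision (this is the Ricart-Agrawala algorithm); the state names are illustrative, and timestamps are (logical clock, pid) pairs so comparisons are total.

```python
def should_defer(state: str, my_ts: tuple[int, int],
                 req_ts: tuple[int, int]) -> bool:
    """state is 'in_cs', 'idle', or 'wanting'."""
    if state == "in_cs":
        return True        # factor 1: reply only after exiting
    if state == "idle":
        return False       # factor 2: reply immediately
    # factor 3: both want the critical section - the earlier request wins,
    # so defer exactly when our own request is older (smaller timestamp).
    return my_ts < req_ts
```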

Page 58: Module 16:  Distributed System Structures

Desirable Behavior of Fully Distributed Approach

Freedom from deadlock is ensured

Freedom from starvation is ensured, since entry to the critical section is scheduled according to the timestamp ordering

The timestamp ordering ensures that processes are served in a first-come, first-served order

Page 59: Module 16:  Distributed System Structures

Three Undesirable Consequences

The processes need to know the identity of all other processes in the system, which makes the dynamic addition and removal of processes more complex

If one of the processes fails, then the entire scheme collapses

This can be dealt with by continuously monitoring the state of all the processes in the system

Processes that have not entered their critical section must pause frequently to assure other processes that they intend to enter the critical section

This protocol is therefore suited for small, stable sets of cooperating processes

Page 60: Module 16:  Distributed System Structures

Token-Passing Approach

Circulate a token among the processes in the system

The token is a special type of message

Possession of the token entitles the holder to enter the critical section

Processes are logically organized in a ring structure

A unidirectional ring guarantees freedom from starvation

Two types of failures:

Lost token – an election must be called

Failed processes – a new logical ring is established

Page 61: Module 16:  Distributed System Structures

Atomicity

Either all the operations associated with a program unit are executed to completion, or none are performed

Ensuring atomicity in a distributed system requires a transaction coordinator, which is responsible for the following:

Starting the execution of the transaction

Breaking the transaction into a number of subtransactions, and distributing these subtransactions to the appropriate sites for execution

Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites

Page 62: Module 16:  Distributed System Structures

Two-Phase Commit Protocol (2PC)

Assumes fail-stop model

Players: the transaction coordinator and all local sites involved in the transaction

Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached

When the protocol is initiated, the transaction may still be executing at some of the local sites

The protocol involves all the local sites at which the transaction executed

Example: Let T be a transaction initiated at site Si and let the transaction coordinator at Si be Ci

Page 63: Module 16:  Distributed System Structures

Phase 1: Obtaining a Decision

Ci adds a <prepare T> record to the log

Ci sends a <prepare T> message to all sites

When a site receives a <prepare T> message, its transaction manager determines if it can commit the transaction

If no: add a <no T> record to the log and respond to Ci with <abort T>

If yes:
add a <ready T> record to the log
force all log records for T onto stable storage
send a <ready T> message to Ci

Page 64: Module 16:  Distributed System Structures

Phase 1 (Cont.)

Coordinator collects responses:

All respond "ready": decision is commit

At least one response is "abort": decision is abort

At least one participant fails to respond within the time-out period: decision is abort

Page 65: Module 16:  Distributed System Structures

Phase 2: Recording the Decision in the Database

Coordinator adds a decision record, <abort T> or <commit T>, to its log and forces the record onto stable storage

Once that record reaches stable storage it is irrevocable (even if failures occur)

Coordinator sends a message to each participant informing it of the decision (commit or abort)

Participants take appropriate action locally
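
A minimal sketch of the coordinator's side of both phases, assuming log (stable-storage append), send, and collect (response gathering with a time-out) primitives, none of which are specified on these slides.

```python
def two_phase_commit(T, sites, log, send, collect, timeout=5.0) -> str:
    # Phase 1: obtain a decision.
    log(f"<prepare {T}>")
    for s in sites:
        send(s, f"prepare {T}")
    replies = collect(sites, timeout)  # site -> 'ready' / 'abort' / None

    if all(replies.get(s) == "ready" for s in sites):
        decision = "commit"            # every participant voted ready
    else:
        decision = "abort"             # any abort or time-out aborts T

    # Phase 2: record, then broadcast, the decision.
    log(f"<{decision} {T}>")           # irrevocable once on stable storage
    for s in sites:
        send(s, f"{decision} {T}")
    return decision
```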

Page 66: Module 16:  Distributed System Structures

Failure Handling in 2PC – Site Failure

The log contains a <commit T> record: the site executes redo(T)

The log contains an <abort T> record: the site executes undo(T)

The log contains a <ready T> record: consult Ci; if Ci is down, the site sends a query-status T message to the other sites

The log contains no control records concerning T: the site executes undo(T)

Page 67: Module 16:  Distributed System Structures

Failure Handling in 2PC – Coordinator Ci Failure

If an active site contains a <commit T> record in its log, then T must be committed

If an active site contains an <abort T> record in its log, then T must be aborted

If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T; rather than wait for Ci to recover, it is preferable to abort T

If all active sites have a <ready T> record in their logs, but no additional control records, then we must wait for the coordinator to recover

Blocking problem – T is blocked pending the recovery of site Si

Page 68: Module 16:  Distributed System Structures

Concurrency Control

Modify the centralized concurrency schemes to accommodate the distribution of transactions

Transaction manager coordinates execution of transactions (or subtransactions) that access data at local sites

Local transaction only executes at that site

Global transaction executes at several sites

Page 69: Module 16:  Distributed System Structures

Locking Protocols

Can use the two-phase locking protocol in a distributed environment by changing how the lock manager is implemented

Nonreplicated scheme – each site maintains a local lock manager which administers lock and unlock requests for those data items that are stored at that site

Simple implementation involves two message transfers for handling lock requests, and one message transfer for handling unlock requests

Deadlock handling is more complex

Page 70: Module 16:  Distributed System Structures

Single-Coordinator Approach

A single lock manager resides in a single chosen site; all lock and unlock requests are made at that site

Simple implementation

Simple deadlock handling

Possibility of bottleneck

Vulnerable to loss of concurrency controller if single site fails

Multiple-coordinator approach distributes lock-manager function over several sites

Page 71: Module 16:  Distributed System Structures

Majority Protocol

Avoids drawbacks of central control by dealing with replicated data in a decentralized manner

More complicated to implement

Deadlock-handling algorithms must be modified; a deadlock can occur even when only one data item is being locked

Page 72: Module 16:  Distributed System Structures

Primary Copy

One of the sites at which a replica resides is designated as the primary site

A request to lock a data item is made at the primary site of that data item

Concurrency control for replicated data handled in a manner similar to that of unreplicated data

Simple implementation, but if primary site fails, the data item is unavailable, even though other sites may have a replica

Page 73: Module 16:  Distributed System Structures

Timestamping

Each transaction is given a unique timestamp, which is used to decide the serialization order

Generating unique timestamps in a distributed scheme:

Each site generates a unique local timestamp

The global unique timestamp is obtained by concatenating the unique local timestamp with the unique site identifier

Use a logical clock defined within each site to ensure the fair generation of timestamps

Timestamp-ordering scheme – combine the centralized concurrency-control timestamp scheme with the 2PC protocol to obtain a protocol that ensures serializability with no cascading rollbacks
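
A minimal sketch of this generator; representing the "concatenation" as a (local time, site id) pair compared lexicographically is an implementation choice, not mandated by the slide.

```python
import itertools

class TimestampGenerator:
    def __init__(self, site_id: int):
        self.site_id = site_id
        self.counter = itertools.count(1)  # the local logical clock

    def next(self) -> tuple[int, int]:
        return (next(self.counter), self.site_id)

# Pairs order correctly across sites: (5, 2) < (7, 1), and a tie on the
# local part, such as (5, 1) vs (5, 2), is broken by the site identifier.
```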

Page 74: Module 16:  Distributed System Structures

Generation of Unique Timestamps

Page 75: Module 16:  Distributed System Structures

Deadlock Prevention

Resource-ordering deadlock prevention – define a global ordering among the system resources

Assign a unique number to all system resources

A process may request a resource with unique number i only if it is not holding a resource with a unique number greater than i

Simple to implement; requires little overhead

Banker's algorithm – designate one of the processes in the system as the process that maintains the information necessary to carry out the Banker's algorithm

Also implemented easily, but may require too much overhead
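
A minimal sketch of the resource-ordering scheme above: locks are always acquired in increasing resource-number order, so a wait cycle cannot form. The lock table is illustrative.

```python
import threading

locks = {i: threading.Lock() for i in range(4)}  # resources numbered 0..3

def acquire_in_order(resource_ids: list[int]) -> None:
    for rid in sorted(resource_ids):   # never request i while holding j > i
        locks[rid].acquire()

def release_all(resource_ids: list[int]) -> None:
    for rid in sorted(resource_ids, reverse=True):
        locks[rid].release()
```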

Page 76: Module 16:  Distributed System Structures

Timestamped Deadlock-Prevention Scheme

Each process Pi is assigned a unique priority number

Priority numbers are used to decide whether a process Pi should wait for a process Pj; otherwise Pi is rolled back

The scheme prevents deadlocks: for every edge Pi → Pj in the wait-for graph, Pi has a higher priority than Pj, so a cycle cannot exist

Problem – starvation

Page 77: Module 16:  Distributed System Structures

Wait-Die Scheme

Based on a nonpreemptive technique

If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a smaller timestamp than does Pj (Pi is older than Pj); otherwise, Pi is rolled back (dies)

Example: suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15 respectively

If P1 requests a resource held by P2, then P1 will wait

If P3 requests a resource held by P2, then P3 will be rolled back

Page 78: Module 16:  Distributed System Structures

Wound-Wait Scheme

Based on a preemptive technique; counterpart to the wait-die scheme

If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a larger timestamp than does Pj (Pi is younger than Pj); otherwise, Pj is rolled back (Pj is wounded by Pi)

Example: suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15 respectively

If P1 requests a resource held by P2, then the resource will be preempted from P2 and P2 will be rolled back

If P3 requests a resource held by P2, then P3 will wait
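
A minimal sketch of both decision rules, checked against the examples on these two slides; a smaller timestamp means an older process.

```python
from enum import Enum

class Action(Enum):
    WAIT = "requester waits"
    DIE = "requester is rolled back"
    WOUND = "holder is rolled back"

def wait_die(req_ts: int, holder_ts: int) -> Action:
    return Action.WAIT if req_ts < holder_ts else Action.DIE

def wound_wait(req_ts: int, holder_ts: int) -> Action:
    return Action.WOUND if req_ts < holder_ts else Action.WAIT

# The slides' example (timestamps: P1 = 5, P2 = 10, P3 = 15):
assert wait_die(5, 10) is Action.WAIT      # P1 requests from P2 and waits
assert wait_die(15, 10) is Action.DIE      # P3 requests from P2 and dies
assert wound_wait(5, 10) is Action.WOUND   # P1 wounds P2
assert wound_wait(15, 10) is Action.WAIT   # P3 waits
```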

Page 79: Module 16:  Distributed System Structures

Deadlock Detection

Use wait-for graphs

Local wait-for graphs at each local site – the nodes of the graph correspond to all the processes that are currently either holding or requesting any of the resources local to that site

May also use a global wait-for graph – this graph is the union of all local wait-for graphs
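
Detection on either kind of graph reduces to finding a cycle. A minimal sketch, with the graph represented as a mapping from each process to the set of processes it waits for:

```python
def has_cycle(graph: dict[str, set[str]]) -> bool:
    """A deadlock exists iff the wait-for graph contains a cycle."""
    visiting: set[str] = set()
    done: set[str] = set()

    def dfs(node: str) -> bool:
        if node in visiting:
            return True       # back edge found: there is a cycle
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in graph)

# Example: P1 waits for P2 and P2 waits for P1 -> deadlock.
assert has_cycle({"P1": {"P2"}, "P2": {"P1"}})
```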

Page 80: Module 16:  Distributed System Structures

Two Local Wait-For Graphs

Page 81: Module 16:  Distributed System Structures

Global Wait-For Graph

Page 82: Module 16:  Distributed System Structures

Deadlock Detection – Centralized Approach

Each site keeps a local wait-for graph

A global wait-for graph is maintained in a single coordination process

There are three different options (points in time) when the wait-for graph may be constructed:

1. Whenever a new edge is inserted or removed in one of the local wait-for graphs

2. Periodically, when a number of changes have occurred in a wait-for graph

3. Whenever the coordinator needs to invoke the cycle-detection algorithm

Unnecessary rollbacks may occur as a result of false cycles

Page 83: Module 16:  Distributed System Structures

Local and Global Wait-For Graphs

Page 84: Module 16:  Distributed System Structures

Fully Distributed Approach

All controllers share equally the responsibility for detecting deadlock

Every site constructs a wait-for graph that represents a part of the total graph

We add one additional node Pex to each local wait-for graph

If a local wait-for graph contains a cycle that does not involve node Pex, then the system is in a deadlock state

A cycle involving Pex implies only the possibility of a deadlock; to ascertain whether a deadlock does exist, a distributed deadlock-detection algorithm must be invoked

Page 85: Module 16:  Distributed System Structures

Augmented Local Wait-For Graphs

Page 86: Module 16:  Distributed System Structures

Augmented Local Wait-For Graph in Site S2

Page 87: Module 16:  Distributed System Structures

Election Algorithms

Determine where a new copy of the coordinator should be restarted

Assume that a unique priority number is associated with each active process in the system, and that the priority number of process Pi is i

The coordinator is always the process with the largest priority number; when a coordinator fails, the algorithm must elect the active process with the largest priority number

Two algorithms, the bully algorithm and a ring algorithm, can be used to elect a new coordinator in case of failures

Page 88: Module 16:  Distributed System Structures

Bully Algorithm

Applicable to systems where every process can send a message to every other process in the system

If process Pi sends a request that is not answered by the coordinator within a time interval T, assume that the coordinator has failed; Pi tries to elect itself as the new coordinator

Pi sends an election message to every process with a higher priority number; Pi then waits for any of these processes to answer within T

Page 89: Module 16:  Distributed System Structures

Bully Algorithm (Cont.)

If no response arrives within T, assume that all processes with numbers greater than i have failed; Pi elects itself the new coordinator

If an answer is received, Pi begins time interval T′, waiting to receive a message that a process with a higher priority number has been elected

If no message is sent within T′, assume the process with a higher number has failed; Pi should restart the algorithm

If there are no active processes with higher numbers, a recovered process forces all processes with lower numbers to let it become the coordinator process, even if there is a currently active coordinator with a lower number
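
A minimal sketch of this election from one process's point of view, assuming send and wait_for messaging primitives (wait_for returns True if a matching message arrives within the time-out); these primitives and the time-out values are not specified on the slides.

```python
def bully_election(my_id: int, all_ids: list[int], send, wait_for,
                   T: float = 2.0, T_prime: float = 4.0) -> bool:
    """Return True if this process ends up as the coordinator."""
    for p in (p for p in all_ids if p > my_id):
        send(p, ("election", my_id))

    if not wait_for("answer", timeout=T):
        # No higher-numbered process answered: declare ourselves coordinator.
        for p in (p for p in all_ids if p < my_id):
            send(p, ("coordinator", my_id))
        return True

    # Someone higher is alive; wait for it to announce itself within T'.
    if not wait_for("coordinator", timeout=T_prime):
        return bully_election(my_id, all_ids, send, wait_for, T, T_prime)
    return False
```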

Page 90: Module 16:  Distributed System Structures

End of Chapter 18