Module 16: Distributed System Structures
Silberschatz, Galvin and Gagne ©2009, Operating System Concepts – 8th Edition
Feb 24, 2016
Chapter Objectives
To provide a high-level overview of distributed systems and the networks that interconnect them
To discuss the general structure of distributed operating systems
Motivation

A distributed system is a collection of loosely coupled processors interconnected by a communication network. The processors are variously called nodes, computers, machines, or hosts; a site is the location of a processor.

Reasons for distributed systems:
- Resource sharing: sharing and printing files at remote sites, processing information in a distributed database, using remote specialized hardware devices
- Computation speedup: load sharing
- Reliability: detect and recover from site failure, transfer functions, reintegrate the failed site
- Communication: message passing
A Distributed System
Types of Distributed Operating Systems
- Network Operating Systems
- Distributed Operating Systems
Network Operating Systems
Users are aware of the multiplicity of machines. Access to the resources of the various machines is done explicitly by:
- Remote logging into the appropriate remote machine (telnet, ssh)
- Remote Desktop (Microsoft Windows)
- Transferring data from remote machines to local machines via the File Transfer Protocol (FTP) mechanism
Distributed Operating Systems
Users are not aware of the multiplicity of machines; access to remote resources is similar to access to local resources.
- Data migration: transfer data by transferring the entire file, or by transferring only those portions of the file necessary for the immediate task
- Computation migration: transfer the computation, rather than the data, across the system
Distributed Operating Systems (Cont.)
Process migration: execute an entire process, or parts of it, at different sites
- Load balancing: distribute processes across the network to even the workload
- Computation speedup: subprocesses can run concurrently on different sites
- Hardware preference: process execution may require a specialized processor
- Software preference: required software may be available at only a particular site
- Data access: run the process remotely, rather than transfer all the data locally
Network Structure
Local-Area Network (LAN): designed to cover a small geographical area
- Multiaccess bus, ring, or star network
- Speed: 10–100 megabits/second
- Broadcast is fast and cheap
- Nodes: usually workstations and/or personal computers, and a few (usually one or two) mainframes
Network Types (Cont.)
Wide-Area Network (WAN): links geographically separated sites
- Point-to-point connections over long-haul lines (often leased from a phone company)
- Speed: 1.544–45 megabits/second
- Broadcast usually requires multiple messages
- Nodes: usually a high percentage of mainframes
Network Conversations

(figure: a network conversation between two ends, a requester and a replier)
Network Topology

The various topologies are depicted as graphs whose nodes correspond to sites. An edge from node A to node B corresponds to a direct connection between the two sites. (The accompanying figure depicts six network topologies.)
TCP/IP and OSI Layers
TCP/IP Suite                     OSI Reference
-------------------------------  --------------------------------
Telnet, FTP, SMTP, HTTP, etc.    Application (FTAM, X.400, etc.)
                                 Presentation (ISO 8823)
                                 Session (ISO 8327)
TCP, UDP (end-to-end)            Transport (ISO 8073)
IP, ICMP (path determination)    Network (ISO 8473)
Network access/link (802.x MAC)  Data Link (ISO 8802.x, LLC/MAC)
802.x physical                   Physical

(©2005, L.A. DeNoia)
Communication Protocol
The communication network is partitioned into the following multiple layers:
- Physical layer: handles the mechanical and electrical details of the physical transmission of a bit stream
- Data-link layer: handles frames, or fixed-length parts of packets, including any error detection and recovery that occurred in the physical layer
- Network layer: provides connections and routes packets in the communication network, including handling the addresses of outgoing packets, decoding the addresses of incoming packets, and maintaining routing information for proper response to changing load levels
Communication Protocol (Cont.)
- Transport layer: responsible for low-level network access and for message transfer between clients, including partitioning messages into packets, maintaining packet order, controlling flow, and generating physical addresses
- Session layer: implements sessions, or process-to-process communication protocols
- Presentation layer: resolves the differences in formats among the various sites in the network, including character conversions and half-duplex/full-duplex modes (echoing)
- Application layer: interacts directly with users; deals with file transfer, remote-login protocols, and electronic mail, as well as schemas for distributed databases
TCP/IP View of Encapsulation
(figure: user data is wrapped in a TCP header to form a TCP segment, then in an IP header to form a network-layer packet, then in a link header to form a link-layer segment, and finally in a MAC header and trailer to form a MAC frame)
TCP/IP Message Flow
(figure: peer layers on two hosts exchange data through service access points and interfaces: HTTP messages at the application layer, TCP segments at the transport layer, IP packets at the network layer, Ethernet frames at the data-link layer, and bits at the physical layer)
Up and Down the Layers
(figure: an HTTP message from a browser on open system A travels down through the TCP, network, link, and physical layers as a TCP segment, packet, frame, and bits; a router at a relay node processes it up to the network layer and forwards it; the message then travels back up the stack to the server on open system B)
Communication Structure
The design of a communication network must address four basic issues:
- Naming and name resolution: how do two processes locate each other to communicate?
- Routing strategies: how are messages sent through the network?
- Connection strategies: how do two processes send a sequence of messages?
- Contention: the network is a shared resource, so how do we resolve conflicting demands for its use?
Naming and Name Resolution
Name systems in the network:
- Address messages with the process-id
- Identify processes on remote systems by a <host-name, identifier> pair
- Domain name service (DNS): specifies the naming structure of the hosts, as well as name-to-address resolution (Internet)
Root DNS Servers

Distributed, Hierarchical Database

(figure: root DNS servers delegate to com, org, and edu DNS servers; below them sit the yahoo.com, amazon.com, pbs.org, poly.edu, and umass.edu DNS servers)

Client wants the IP for www.amazon.com (first approximation):
- client queries a root server to find a com DNS server
- client queries the com DNS server to get the amazon.com DNS server
- client queries the amazon.com DNS server to get the IP address for www.amazon.com
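The three-step lookup above can be sketched as an iterative resolver over a toy delegation table. This is a minimal illustration only: the server names and the IP address below are invented, not real DNS data.

```python
# Toy model of iterative DNS resolution; all server names and the
# address 192.0.2.10 are invented for illustration.
ROOT = {"com": "com-ns"}                            # root knows the TLD servers
ZONES = {
    "com-ns": {"amazon.com": "amazon-ns"},          # com server delegates the domain
    "amazon-ns": {"www.amazon.com": "192.0.2.10"},  # authoritative answer
}

def resolve(name):
    tld = name.rsplit(".", 1)[-1]             # 1. query root for the TLD server
    server = ROOT[tld]
    domain = ".".join(name.split(".")[-2:])   # 2. query TLD server for the domain's server
    server = ZONES[server][domain]
    return ZONES[server][name]                # 3. query authoritative server for the host
```

A real resolver would also cache each delegation it learns, which is why repeated lookups rarely touch the root.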
DNS: Root name servers
A root name server is contacted by a local name server that cannot resolve a name. The root name server:
- contacts an authoritative name server if the name mapping is not known
- gets the mapping
- returns the mapping to the local name server

There are 13 root name servers worldwide:
- a: Verisign, Dulles, VA
- b: USC-ISI, Marina del Rey, CA
- c: Cogent, Herndon, VA (also LA)
- d: U Maryland, College Park, MD
- e: NASA, Mt View, CA
- f: Internet Software Consortium, Palo Alto, CA (and 36 other locations)
- g: US DoD, Vienna, VA
- h: ARL, Aberdeen, MD
- i: Autonomica, Stockholm (plus 28 other locations)
- j: Verisign (21 locations)
- k: RIPE, London (also 16 other locations)
- l: ICANN, Los Angeles, CA
- m: WIDE, Tokyo (also Seoul, Paris, SF)
Routing Strategies
Fixed routing: a path from A to B is specified in advance; the path changes only if a hardware failure disables it
- Since the shortest path is usually chosen, communication costs are minimized
- Fixed routing cannot adapt to load changes
- Ensures that messages will be delivered in the order in which they were sent

Virtual circuit: a path from A to B is fixed for the duration of one session; different sessions involving messages from A to B may have different paths
- Partial remedy for adapting to load changes
- Ensures that messages will be delivered in the order in which they were sent
Routing Strategies (Cont.)
Dynamic routing: the path used to send a message from site A to site B is chosen only when the message is sent
- Usually a site sends a message to another site on the link least used at that particular time
- Adapts to load changes by avoiding routing messages on heavily used paths
- Messages may arrive out of order; this problem can be remedied by appending a sequence number to each message
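The sequence-number remedy can be sketched in a few lines. This is a hypothetical `deliver_in_order` helper, not part of any protocol stack: given messages tagged with sequence numbers in arrival order, it restores the send order.

```python
def deliver_in_order(received):
    # received: (sequence_number, payload) pairs in arrival order;
    # sorting by sequence number restores the order in which they were sent
    return [payload for _, payload in sorted(received)]
```

Real transports (e.g., TCP) additionally buffer gaps and request retransmission of missing sequence numbers.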
Connection Strategies
Circuit switching: a permanent physical link is established for the duration of the communication (e.g., the telephone system)
Message switching: a temporary link is established for the duration of one message transfer (e.g., the post-office mailing system)
Packet switching: messages of variable length are divided into fixed-length packets which are sent to the destination
- Each packet may take a different path through the network
- The packets must be reassembled into messages as they arrive

Circuit switching requires setup time, but incurs less overhead for shipping each message, and may waste network bandwidth; message and packet switching require less setup time, but incur more overhead per message.
Circuit Switching
(figure: a dedicated circuit established through nodes A, B, C, D, and E)
Packet Switching
(figure: packets routed independently among nodes A, B, C, D, and E)
Contention
Several sites may want to transmit information over a link simultaneously. Techniques to avoid repeated collisions include:

CSMA/CD: carrier sense with multiple access (CSMA) with collision detection (CD)
- A site determines whether another message is currently being transmitted over that link
- If two or more sites begin transmitting at exactly the same time, then they will register a CD and will stop transmitting
- When the system is very busy, many collisions may occur, and thus performance may be degraded
- CSMA/CD is used successfully in the Ethernet system, the most common network system
Failure Detection
Detecting hardware failure is difficult; to detect a link failure, a handshaking protocol can be used. Assume Site A and Site B have established a link:
- At fixed intervals, each site exchanges an I-am-up message indicating that it is up and running
- If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lost
- Site A can now send an Are-you-up? message to Site B
- If Site A does not receive a reply, it can repeat the message or try an alternate route to Site B
Failure Detection (cont)
If Site A does not ultimately receive a reply from Site B, it concludes some type of failure has occurred. Types of failures:
- Site B is down
- The direct link between A and B is down
- The alternate link from A to B is down
- The message has been lost
However, Site A cannot determine exactly why the failure has occurred.
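Site A's side of this handshake can be sketched as follows. This is a minimal sketch, not a real protocol implementation: `send` is a hypothetical transport callable that returns True when a reply arrives before the timeout, and False otherwise.

```python
def probe_site(send, retries=2):
    """Site A's view: ask 'Are-you-up?' up to `retries` times.
    `send` is a hypothetical transport callable (assumption): it
    delivers the message and reports whether a reply arrived in time."""
    for _ in range(retries):
        if send("Are-you-up?"):
            return "up"
    # No reply: B may be down, a link may be down, or the message was
    # lost -- as the slide notes, A cannot distinguish these cases.
    return "unreachable"
```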
Reconfiguration
When Site A determines a failure has occurred, it must reconfigure the system:
1. If the link from A to B has failed, this must be broadcast to every site in the system
2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer available
When the link or the site becomes available again, this information must again be broadcast to all other sites
An Ethernet Packet
Chapter 17: Distributed-File Systems
Background
Distributed file system (DFS) – a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources
A DFS manages a set of dispersed storage devices
Overall storage space managed by a DFS is composed of different, remotely located, smaller storage spaces
There is usually a correspondence between constituent storage spaces and sets of files
DFS Structure
Service – software entity running on one or more machines and providing a particular type of function to a priori unknown clients
Server – service software running on a single machine
Client – process that can invoke a service using a set of operations that forms its client interface
A client interface for a file service is formed by a set of primitive file operations (create, delete, read, write)
Client interface of a DFS should be transparent, i.e., not distinguish between local and remote files
Naming and Transparency
Naming – mapping between logical and physical objects
Multilevel mapping – abstraction of a file that hides the details of how and where on the disk the file is actually stored
A transparent DFS hides the location where in the network the file is stored
For a file being replicated in several sites, the mapping returns a set of the locations of this file’s replicas; both the existence of multiple copies and their location are hidden
Naming Structures
Location transparency – file name does not reveal the file’s physical storage location
Location independence – file name does not need to be changed when the file’s physical storage location changes
Remote File Access
The remote-service mechanism is one transfer approach; caching is another:
- Reduce network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally
- If the needed data are not already cached, a copy of the data is brought from the server to the user
- Accesses are performed on the cached copy
- Files are identified with one master copy residing at the server machine, but copies of (parts of) the file are scattered in different caches
- Cache-consistency problem: keeping the cached copies consistent with the master file
- Could be called network virtual memory
Cache Location – Disk vs. Main Memory
Advantages of disk caches:
- More reliable
- Cached data kept on disk are still there during recovery and don't need to be fetched again

Advantages of main-memory caches:
- Permit workstations to be diskless
- Data can be accessed more quickly
- Performance speedup with bigger memories
- Server caches (used to speed up disk I/O) are in main memory regardless of where user caches are located; using main-memory caches on the user machine permits a single caching mechanism for servers and users
Cache Update Policy
Write-through: write data through to disk as soon as they are placed in any cache
- Reliable, but poor performance

Delayed-write: modifications are written to the cache and then written through to the server later
- Write accesses complete quickly; some data may be overwritten before they are written back, and so need never be written at all
- Poor reliability: unwritten data will be lost whenever a user machine crashes
- Variation: scan the cache at regular intervals and flush blocks that have been modified since the last scan
- Variation: write-on-close, which writes data back to the server when the file is closed; best for files that are open for long periods and frequently modified
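The contrast between the two policies can be sketched with a toy block cache. The class and method names are invented for illustration; a real DFS client would track dirty blocks per file and flush on a timer or at close.

```python
class CachedFile:
    """Toy block cache contrasting the two update policies (names invented)."""
    def __init__(self, policy):
        self.policy = policy                 # "write-through" or "delayed-write"
        self.cache, self.server, self.dirty = {}, {}, set()

    def write(self, block, data):
        self.cache[block] = data
        if self.policy == "write-through":
            self.server[block] = data        # reliable: server updated immediately
        else:
            self.dirty.add(block)            # fast: server updated only at flush time

    def flush(self):                         # models a periodic scan or write-on-close
        for b in self.dirty:
            self.server[b] = self.cache[b]
        self.dirty.clear()
```

The delayed-write version loses the contents of `dirty` if the client crashes before `flush`, which is exactly the reliability trade-off the slide describes.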
Cachefs and its Use of Caching
Consistency
Is the locally cached copy of the data consistent with the master copy?

Client-initiated approach:
- Client initiates a validity check
- Server checks whether the local data are consistent with the master copy

Server-initiated approach:
- Server records, for each client, the (parts of) files it caches
- When the server detects a potential inconsistency, it must react
Comparing Caching and Remote Service
Stateful File Service
Mechanism:
- Client opens a file
- Server fetches information about the file from its disk, stores it in its memory, and gives the client a connection identifier unique to the client and the open file
- The identifier is used for subsequent accesses until the session ends
- Server must reclaim the main-memory space used by clients who are no longer active

Increased performance:
- Fewer disk accesses
- A stateful server knows whether a file was opened for sequential access and can thus read ahead the next blocks
Stateless File Server
Avoids state information by making each request self-contained
Each request identifies the file and position in the file
No need to establish and terminate a connection by open and close operations
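A stateless read can be sketched as a pure function of its arguments. This is a hypothetical helper over an in-memory file table (an assumption for illustration): because the request names the file and position explicitly, the server keeps no per-client state between calls.

```python
def stateless_read(files, path, offset, nbytes):
    # Each request is self-contained: it identifies the file and the
    # position in the file, so no open/close state is needed on the server.
    return files[path][offset:offset + nbytes]
```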
Distinctions Between Stateful & Stateless Service
Failure recovery:
- A stateful server loses all its volatile state in a crash
  - Restore state by a recovery protocol based on a dialog with clients, or abort operations that were underway when the crash occurred
  - The server needs to be aware of client failures in order to reclaim space allocated to record the state of crashed client processes (orphan detection and elimination)
- With a stateless server, the effects of server failure and recovery are almost unnoticeable
  - A newly reincarnated server can respond to a self-contained request without any difficulty
Distinctions (Cont.)
Penalties for using the robust stateless service:
- longer request messages
- slower request processing
- additional constraints imposed on DFS design

Some environments require stateful service:
- A server employing server-initiated cache validation cannot provide stateless service, since it maintains a record of which files are cached by which clients
- UNIX use of file descriptors and implicit offsets is inherently stateful; servers must maintain tables to map the file descriptors and store the current offset within a file
File Replication
Replicas of the same file reside on failure-independent machines
- Improves availability and can shorten service time
- Naming scheme maps a replicated file name to a particular replica
  - The existence of replicas should be invisible to higher levels
  - Replicas must be distinguished from one another by different lower-level names
- Updates: replicas of a file denote the same logical entity, and thus an update to any replica must be reflected on all other replicas
- Demand replication: reading a nonlocal replica causes it to be cached locally, thereby generating a new nonprimary replica
Chapter 18: Distributed Coordination
Chapter 18 Distributed Coordination
- Event Ordering
- Mutual Exclusion
- Atomicity
- Concurrency Control
- Deadlock Handling
- Election Algorithms
Event Ordering
Happened-before relation (denoted by →):
- If A and B are events in the same process, and A was executed before B, then A → B
- If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B
- If A → B and B → C, then A → C
Relative Time for Three Concurrent Processes
Implementation of →

Associate a timestamp with each system event; require that for every pair of events A and B, if A → B, then the timestamp of A is less than the timestamp of B.
- Within each process Pi, a logical clock LCi is associated
- The logical clock can be implemented as a simple counter that is incremented between any two successive events executed within a process
- The logical clock is monotonically increasing
- A process advances its logical clock when it receives a message whose timestamp is greater than the current value of its logical clock
- If the timestamps of two events A and B are the same, then the events are concurrent; we may use the process identity numbers to break ties and to create a total ordering
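One minimal sketch of such a logical clock (a Lamport clock) in Python; the class and method names are invented for illustration:

```python
class LamportClock:
    """Logical clock for one process: tick on local events and sends,
    and advance past incoming timestamps on receives."""
    def __init__(self):
        self.time = 0

    def tick(self):
        self.time += 1            # local event or message send
        return self.time          # timestamp carried on an outgoing message

    def receive(self, msg_time):
        # jump past the sender's timestamp, then tick for the receive event,
        # so the receive is always timestamped after the matching send
        self.time = max(self.time, msg_time) + 1
        return self.time
```

If process P sends a message at logical time 1, the receiver Q stamps the receive event at 2 or later, preserving the happened-before ordering.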
Distributed Mutual Exclusion (DME)
Assumptions:
- The system consists of n processes; each process Pi resides at a different processor
- Each process has a critical section that requires mutual exclusion

Requirement:
- If Pi is executing in its critical section, then no other process Pj is executing in its critical section

We present two algorithms to ensure the mutually exclusive execution of processes in their critical sections.
DME: Centralized Approach
One of the processes in the system is chosen to coordinate the entry to the critical section
A process that wants to enter its critical section sends a request message to the coordinator
The coordinator decides which process can enter the critical section next, and it sends that process a reply message
When the process receives a reply message from the coordinator, it enters its critical section
After exiting its critical section, the process sends a release message to the coordinator and proceeds with its execution
This scheme requires three messages per critical-section entry: request, reply, and release
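A minimal sketch of the coordinator's bookkeeping, assuming a simple FIFO queue of deferred requests (class and return values are invented encodings of the request/reply/release messages):

```python
from collections import deque

class Coordinator:
    """Centralized DME coordinator: one holder at a time, FIFO waiting queue."""
    def __init__(self):
        self.waiting = deque()
        self.holder = None

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return "reply"           # process may enter its critical section now
        self.waiting.append(pid)
        return "deferred"            # reply will be sent later, on release

    def release(self, pid):
        assert pid == self.holder    # only the holder may release
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder           # next process to receive a reply (or None)
```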
DME: Fully Distributed Approach
When process Pi wants to enter its critical section, it generates a new timestamp, TS, and sends the message request (Pi, TS) to all other processes in the system
When process Pj receives a request message, it may reply immediately or it may defer sending a reply back
When process Pi receives a reply message from all other processes in the system, it can enter its critical section
After exiting its critical section, the process sends reply messages to all its deferred requests
DME: Fully Distributed Approach (Cont)
The decision whether process Pj replies immediately to a request(Pi, TS) message or defers its reply is based on three factors:
- If Pj is in its critical section, then it defers its reply to Pi
- If Pj does not want to enter its critical section, then it sends a reply immediately to Pi
- If Pj wants to enter its critical section but has not yet entered it, then it compares its own request timestamp with the timestamp TS
  - If its own request timestamp is greater than TS, then it sends a reply immediately to Pi (Pi asked first)
  - Otherwise, the reply is deferred
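The three-factor decision can be written directly as a function. The state labels are invented encodings of Pj's situation, not part of the algorithm's message format:

```python
def should_reply(pj_state, pj_ts, pi_ts):
    # pj_state: "in_cs" (in its critical section), "idle" (not interested),
    # or "wanting" (requested but not yet entered; own timestamp is pj_ts)
    if pj_state == "in_cs":
        return False              # defer until Pj exits its critical section
    if pj_state == "idle":
        return True               # no competition: reply immediately
    return pj_ts > pi_ts          # both want it: the earlier timestamp wins
```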
Desirable Behavior of Fully Distributed Approach
Freedom from deadlock is ensured.
Freedom from starvation is ensured, since entry to the critical section is scheduled according to the timestamp ordering; the timestamp ordering ensures that processes are served in first-come, first-served order.
Three Undesirable Consequences
- The processes need to know the identity of all other processes in the system, which makes the dynamic addition and removal of processes more complex
- If one of the processes fails, then the entire scheme collapses; this can be dealt with by continuously monitoring the state of all the processes in the system
- Processes that have not entered their critical section must pause frequently to assure other processes that they intend to enter the critical section

This protocol is therefore suited for small, stable sets of cooperating processes.
Token-Passing Approach
- Circulate a token among the processes in the system
  - The token is a special type of message
  - Possession of the token entitles the holder to enter the critical section
- Processes are logically organized in a ring structure
  - A unidirectional ring guarantees freedom from starvation
- Two types of failures:
  - Lost token: an election must be called
  - Failed process: a new logical ring is established
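Token passing around the unidirectional ring can be sketched as follows. This is a hypothetical `pass_token` helper for illustration; in a real system the token is a message forwarded between processes, not a local function call:

```python
def pass_token(ring, holder, wants_cs):
    """Pass the token one way around `ring` from `holder` to the next
    process that wants to enter its critical section."""
    i = ring.index(holder)
    for step in range(1, len(ring) + 1):
        candidate = ring[(i + step) % len(ring)]
        if candidate in wants_cs:
            return candidate
    return holder    # nobody wants the critical section; token keeps circulating
```

Because the token travels in one direction only, every interested process is reached within one full circuit, which is why starvation cannot occur.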
Atomicity
Either all the operations associated with a program unit are executed to completion, or none are performed.
Ensuring atomicity in a distributed system requires a transaction coordinator, which is responsible for the following:
- Starting the execution of the transaction
- Breaking the transaction into a number of subtransactions, and distributing these subtransactions to the appropriate sites for execution
- Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites
Two-Phase Commit Protocol (2PC)
- Assumes the fail-stop model
- Players: the transaction coordinator and all local sites involved in the transaction
- Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached
- When the protocol is initiated, the transaction may still be executing at some of the local sites
- The protocol involves all the local sites at which the transaction executed
- Example: let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci
Phase 1: Obtaining a Decision
- Ci adds a <prepare T> record to the log
- Ci sends a <prepare T> message to all sites
- When a site receives a <prepare T> message, the transaction manager determines if it can commit the transaction
  - If no: add a <no T> record to the log and respond to Ci with <abort T>
  - If yes:
    - add a <ready T> record to the log
    - force all log records for T onto stable storage
    - send a <ready T> message to Ci
Phase 1 (Cont)
The coordinator collects responses:
- All respond "ready": decision is commit
- At least one response is "abort": decision is abort
- At least one participant fails to respond within the timeout period: decision is abort
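The coordinator's phase-1 tally reduces to a small function. This is a hypothetical `tally_votes` helper where `None` models a participant that timed out:

```python
def tally_votes(votes):
    """Coordinator's phase-1 decision. `votes` maps site -> "ready",
    "abort", or None (no response before the timeout)."""
    if all(v == "ready" for v in votes.values()):
        return "commit"
    return "abort"    # any abort vote or timeout forces a global abort
```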
Phase 2: Recording Decision in the Database
- The coordinator adds a decision record, <abort T> or <commit T>, to its log and forces the record onto stable storage
- Once that record reaches stable storage, it is irrevocable (even if failures occur)
- The coordinator sends a message to each participant informing it of the decision (commit or abort)
- Participants take the appropriate action locally
Failure Handling in 2PC – Site Failure
- The log contains a <commit T> record: the site executes redo(T)
- The log contains an <abort T> record: the site executes undo(T)
- The log contains a <ready T> record: consult Ci
  - If Ci is down, the site sends a query-status T message to the other sites
- The log contains no control records concerning T: the site executes undo(T)
Failure Handling in 2PC – Coordinator Ci Failure
- If an active site contains a <commit T> record in its log, then T must be committed
- If an active site contains an <abort T> record in its log, then T must be aborted
- If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T; rather than wait for Ci to recover, it is preferable to abort T
- If all active sites have a <ready T> record in their logs, but no additional control records, then we must wait for the coordinator to recover
  - Blocking problem: T is blocked pending the recovery of site Si
Concurrency Control
Modify the centralized concurrency schemes to accommodate the distribution of transactions
Transaction manager coordinates execution of transactions (or subtransactions) that access data at local sites
- Local transaction: executes only at that site
- Global transaction: executes at several sites
Locking Protocols
Can use the two-phase locking protocol in a distributed environment by changing how the lock manager is implemented.
Nonreplicated scheme: each site maintains a local lock manager which administers lock and unlock requests for those data items that are stored at that site
- Simple implementation involves two message transfers for handling lock requests, and one message transfer for handling unlock requests
- Deadlock handling is more complex
Single-Coordinator Approach
A single lock manager resides at a single chosen site; all lock and unlock requests are made at that site
Simple implementation
Simple deadlock handling
Possibility of bottleneck
Vulnerable to loss of concurrency controller if single site fails
Multiple-coordinator approach distributes lock-manager function over several sites
Majority Protocol
Avoids drawbacks of central control by dealing with replicated data in a decentralized manner
More complicated to implement
Deadlock-handling algorithms must be modified; deadlock can occur even when locking only one data item
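The decentralized idea behind the majority protocol is that a transaction locks a data item replicated at n sites by obtaining locks at more than n/2 of them. A minimal sketch, assuming each replica site's lock table is just a dict (all names illustrative):

```python
# Sketch of the majority protocol's grant rule. site_locks holds one
# dict per replica site, mapping data item -> holding transaction.
def acquire_majority_lock(item, txn, site_locks):
    """Lock `item` for `txn` iff a strict majority of replica sites grant."""
    granted = []
    for locks in site_locks:
        if locks.get(item) in (None, txn):
            locks[item] = txn
            granted.append(locks)
    if len(granted) > len(site_locks) // 2:
        return True
    for locks in granted:         # back off: release the minority of grants
        del locks[item]
    return False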
Primary Copy
One of the sites at which a replica resides is designated as the primary site
A request to lock a data item is made at the primary site of that data item
Concurrency control for replicated data handled in a manner similar to that of unreplicated data
Simple implementation, but if primary site fails, the data item is unavailable, even though other sites may have a replica
Timestamping
Each transaction is given a unique timestamp, which is used to decide the serialization order.
Generate unique timestamps in distributed scheme:
Each site generates a unique local timestamp
The global unique timestamp is obtained by concatenating the unique local timestamp with the unique site identifier
Use a logical clock defined within each site to ensure the fair generation of timestamps
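The concatenation scheme can be sketched directly: a timestamp is a pair (local logical time, site id), compared lexicographically, so timestamps from different sites can never collide. Class and method names here are illustrative.

```python
import itertools

# Sketch of distributed unique-timestamp generation: each site pairs a
# monotonically increasing local logical clock with its unique site id.
class TimestampGenerator:
    def __init__(self, site_id):
        self.site_id = site_id
        self.clock = itertools.count(1)   # local logical clock

    def next_timestamp(self):
        # (local time, site id): local time is the major component,
        # the site id breaks ties, giving a global total order
        return (next(self.clock), self.site_id)
```

Fairness is why the logical clock matters: if a slow site's clock were never advanced toward faster sites' clocks, its transactions would always look "old" and would dominate the serialization order.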
Timestamp-ordering scheme – combine the centralized concurrency control timestamp scheme with the 2PC protocol to obtain a protocol that ensures serializability with no cascading rollbacks
Generation of Unique Timestamps
Deadlock Prevention
Resource-ordering deadlock prevention – define a global ordering among the system resources
Assign a unique number to all system resources
A process may request a resource with unique number i only if it is not holding a resource with a unique number greater than i
Simple to implement; requires little overhead
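The request rule is a one-line check against the numbers of the resources a process already holds. A sketch with an illustrative function name:

```python
# Sketch of the resource-ordering rule: a process may request resource
# number `requested` only if it holds no resource numbered above it.
def may_request(held_numbers, requested):
    """held_numbers: unique numbers of resources the process holds."""
    return not any(h > requested for h in held_numbers)
```

Since every process acquires resources in increasing number order, no cycle of waits can form, which is why the scheme prevents deadlock.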
Banker’s algorithm – designate one of the processes in the system as the process that maintains the information necessary to carry out the Banker’s algorithm
Also implemented easily, but may require too much overhead
Timestamped Deadlock-Prevention Scheme
Each process Pi is assigned a unique priority number
Priority numbers are used to decide whether a process Pi should wait for a process Pj; otherwise Pi is rolled back
The scheme prevents deadlocks: for every edge Pi → Pj in the wait-for graph, Pi has a higher priority than Pj
Thus a cycle cannot exist
Problem - Starvation
Wait-Die Scheme
Based on a nonpreemptive technique
If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a smaller timestamp than does Pj (Pi is older than Pj)
Otherwise, Pi is rolled back (dies)
Example: Suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15, respectively
If P1 requests a resource held by P2, then P1 will wait
If P3 requests a resource held by P2, then P3 will be rolled back
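The wait-die decision reduces to a single timestamp comparison. A sketch (function name illustrative):

```python
# Sketch of the wait-die rule: the requester waits only if it is older
# (smaller timestamp) than the holder; otherwise it dies (rolls back).
def wait_die(ts_requester, ts_holder):
    """Return the fate of the requesting process: 'wait' or 'die'."""
    return "wait" if ts_requester < ts_holder else "die"
```

Note the rule is nonpreemptive: the holder is never disturbed; only the requester ever rolls back.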
Wound-Wait Scheme
Based on a preemptive technique; counterpart to the wait-die scheme
If Pi requests a resource currently held by Pj, Pi is allowed to wait only if it has a larger timestamp than does Pj (Pi is younger than Pj). Otherwise Pj is rolled back (Pj is wounded by Pi)
Example: Suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15, respectively
If P1 requests a resource held by P2, then the resource will be preempted from P2 and P2 will be rolled back
If P3 requests a resource held by P2, then P3 will wait
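Wound-wait is the mirror image of wait-die: the same comparison, but an older requester now preempts (wounds) the holder instead of waiting. A sketch (function name illustrative):

```python
# Sketch of the wound-wait rule: an older requester (smaller timestamp)
# wounds the holder, which is rolled back; a younger requester waits.
def wound_wait(ts_requester, ts_holder):
    """Return 'wound' (holder Pj is rolled back) or 'wait' (Pi waits)."""
    return "wound" if ts_requester < ts_holder else "wait"
```

In both schemes the older process never rolls back, so with timestamps reused after rollback, every process eventually becomes the oldest and starvation is avoided.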
Deadlock Detection
Use wait-for graphs
Local wait-for graphs at each local site. The nodes of the graph correspond to all the processes that are currently either holding or requesting any of the resources local to that site
May also use a global wait-for graph. This graph is the union of all local wait-for graphs
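Taking the union of the local graphs and testing it for a cycle can be sketched as below; graphs are plain adjacency dicts and all names are illustrative.

```python
# Sketch: merge local wait-for graphs into the global graph, then
# detect deadlock as a cycle found by depth-first search.
def union_graphs(local_graphs):
    g = {}
    for lg in local_graphs:
        for p, waits_for in lg.items():
            g.setdefault(p, set()).update(waits_for)
    return g

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {p: WHITE for p in graph}
    def dfs(p):
        color[p] = GRAY
        for q in graph.get(p, ()):
            c = color.get(q, WHITE)
            if c == GRAY or (c == WHITE and dfs(q)):
                return True               # back edge: cycle found
        color[p] = BLACK
        return False
    return any(color[p] == WHITE and dfs(p) for p in graph)
```

A deadlock exists in the system if and only if the global graph has a cycle, even when no single local graph does.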
Two Local Wait-For Graphs
Global Wait-For Graph
Deadlock Detection – Centralized Approach
Each site keeps a local wait-for graph
A global wait-for graph is maintained in a single coordinator process
There are three different options (points in time) when the wait-for graph may be constructed:
1. Whenever a new edge is inserted or removed in one of the local wait-for graphs
2. Periodically, when a number of changes have occurred in a wait-for graph
3. Whenever the coordinator needs to invoke the cycle-detection algorithm
Unnecessary rollbacks may occur as a result of false cycles
Local and Global Wait-For Graphs
Fully Distributed Approach
All controllers share equally the responsibility for detecting deadlock
Every site constructs a wait-for graph that represents a part of the total graph
We add one additional node Pex to each local wait-for graph
If a local wait-for graph contains a cycle that does not involve node Pex, then the system is in a deadlock state
A cycle involving Pex implies only the possibility of a deadlock; to ascertain whether a deadlock does exist, a distributed deadlock-detection algorithm must be invoked
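The local test at each site therefore distinguishes three outcomes. A sketch over a local augmented graph, with "Pex" as the literal node name and all function names illustrative:

```python
# Sketch of the local check in the fully distributed approach.
def _has_cycle(graph):
    seen, stack = set(), set()
    def dfs(p):
        stack.add(p); seen.add(p)
        for q in graph.get(p, ()):
            if q in stack or (q not in seen and dfs(q)):
                return True
        stack.discard(p)
        return False
    return any(p not in seen and dfs(p) for p in graph)

def classify(local_graph):
    """'deadlock' if a cycle avoids Pex, 'possible' if the only cycles
    pass through Pex, 'none' otherwise."""
    without_pex = {p: {q for q in qs if q != "Pex"}
                   for p, qs in local_graph.items() if p != "Pex"}
    if _has_cycle(without_pex):
        return "deadlock"          # cycle entirely local: real deadlock
    if _has_cycle(local_graph):
        return "possible"          # cycle through Pex: consult other sites
    return "none"
```

Only the "possible" outcome triggers the distributed detection algorithm, which forwards the suspected cycle to the next involved site.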
Augmented Local Wait-For Graphs
Augmented Local Wait-For Graph in Site S2
Election Algorithms
Determine where a new copy of the coordinator should be restarted
Assume that a unique priority number is associated with each active process in the system, and assume that the priority number of process Pi is i
The coordinator is always the process with the largest priority number. When a coordinator fails, the algorithm must elect that active process with the largest priority number
Two algorithms, the bully algorithm and a ring algorithm, can be used to elect a new coordinator in case of failures
Bully Algorithm
Applicable to systems where every process can send a message to every other process in the system
If process Pi sends a request that is not answered by the coordinator within a time interval T, assume that the coordinator has failed; Pi tries to elect itself as the new coordinator
Pi sends an election message to every process with a higher priority number; Pi then waits for any of these processes to answer within T
Bully Algorithm (Cont)
If no response within T, assume that all processes with numbers greater than i have failed; Pi elects itself the new coordinator
If answer is received, Pi begins time interval T´, waiting to receive a message that a process with a higher priority number has been elected
If no message is sent within T´, assume the process with a higher number has failed; Pi should restart the algorithm
If there are no active processes with higher numbers, the recovered process forces all processes with lower number to let it become the coordinator process, even if there is a currently active coordinator with a lower number
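Because every election message escalates to strictly higher-numbered processes, the highest-numbered live process always wins. The message-and-timeout protocol above can be compressed into a synchronous simulation sketch (names illustrative; a real implementation uses the timers T and T´ rather than a known set of live processes):

```python
# Sketch simulating one bully election: the initiator contacts all
# higher-numbered processes; any live one takes over the election.
def bully_elect(initiator, alive):
    """alive: set of priority numbers of live processes (initiator included)."""
    higher = {p for p in alive if p > initiator}
    if not higher:
        return initiator           # nobody higher answered within T
    # a higher process answered; it restarts the election itself
    return bully_elect(min(higher), alive)
```

The recursion mirrors the takeover chain: each responder bullies the processes above it until the maximum live number declares itself coordinator.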
End of Chapter 16