MC0085: Advanced Operating Systems (Distributed Systems)

Unit 1 Introduction

Structure:

1.0 Objectives

1.1 Distributed Computing Systems

1.2 Distributed Computing System Models

1.3 Advantages of Distributed Systems

1.4 Distributed Operating Systems

1.5 Issues in Designing Distributed Operating Systems

1.6 Distributed Computing Environment

1.7 Summary

1.8 Terminal Questions

1.0 Objectives

After studying this unit, you will be familiar with:

Fundamentals of distributed computing systems

Distributed design models

Distributed operating systems and their design issues

Distributed computing environment

1.1 Distributed Computing Systems

Over the past two decades, advancements in microelectronic technology

have resulted in the availability of fast, inexpensive processors, and

advancements in communication technology have resulted in the availability

of cost-effective and highly efficient computer networks. The advancements

in these two technologies favour the use of interconnected, multiple

processors in place of a single, high-speed processor.

Computer architectures consisting of interconnected, multiple processors

are basically of two types:

In tightly coupled systems, there is a single system wide primary

memory (address space) that is shared by all the processors (Fig. 1.1).

If any processor writes, for example, the value 100 to the memory

location x, any other processor subsequently reading from location x will

get the value 100. Therefore, in these systems, any communication

between the processors usually takes place through the shared

memory.

In loosely coupled systems, the processors do not share memory, and

each processor has its own local memory (Fig. 1.2). If a processor writes

the value 100 to the memory location x, this write operation will only

change the contents of its local memory and will not affect the contents

of the memory of any other processor. Hence, if another processor

reads the memory location x, it will get whatever value was there before

in that location of its own local memory. In these systems, all physical

communication between the processors is done by passing messages

across the network that interconnects the processors.

Usually, tightly coupled systems are referred to as parallel processing

systems, and loosely coupled systems are referred to as distributed

computing systems, or simply distributed systems. In contrast to the

tightly coupled systems, the processors of distributed computing

systems can be located far from each other to cover a wider

geographical area. Furthermore, in tightly coupled systems, the number

of processors that can be usefully deployed is usually small and limited

by the bandwidth of the shared memory. This is not the case with

distributed computing systems that are more freely expandable and can

have an almost unlimited number of processors.

Fig. 1.1: Tightly Coupled Multiprocessor Systems

Fig. 1.2: Loosely Coupled Multiprocessor Systems

Hence, a distributed computing system is basically a collection of

processors interconnected by a communication network in which each

processor has its own local memory and other peripherals, and the

communication between any two processors of the system takes place by

message passing over the communication network. For a particular

processor, its own resources are local, whereas the other processors and

their resources are remote. Together, a processor and its resources are

usually referred to as a node or site or machine of the distributed computing

system.
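
The contrast between communication through shared memory in tightly coupled systems and communication by message passing in loosely coupled systems can be illustrated with a small sketch. The following Python example is purely illustrative and simulates both styles with processes on a single machine (so it is not a real distributed system): a shared value visible to all workers stands in for system-wide primary memory, while a queue of explicit messages stands in for the interconnection network; all names are chosen for the example.

```python
from multiprocessing import Process, Value, Queue

def shared_memory_writer(x):
    # "Tightly coupled" analogue: a write to the shared location x
    # is immediately visible to every other process.
    x.value = 100

def message_passing_node(inbox, outbox):
    # "Loosely coupled" analogue: this node has only its own local
    # variable; it learns of remote updates solely via messages.
    local_x = 0
    msg = inbox.get()             # receive ("write", 100) from the network
    if msg[0] == "write":
        local_x = msg[1]
    outbox.put(("ack", local_x))  # reply over the network

if __name__ == "__main__":
    # Shared-memory style.
    x = Value("i", 0)
    p = Process(target=shared_memory_writer, args=(x,))
    p.start(); p.join()
    print("shared memory: x =", x.value)    # 100, seen directly

    # Message-passing style.
    inbox, outbox = Queue(), Queue()
    node = Process(target=message_passing_node, args=(inbox, outbox))
    node.start()
    inbox.put(("write", 100))               # send a message instead of writing memory
    print("message passing reply:", outbox.get())
    node.join()
```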

1.2 Distributed Computing System Models

Distributed computing system models can be broadly classified into five categories:

Minicomputer model

Workstation model

Workstation – server model

Processor – pool model

Hybrid model

Minicomputer Model

The minicomputer model (Fig. 1.3) is a simple extension of the centralized

time-sharing system. A distributed computing system based on this model

consists of a few minicomputers (they may be large supercomputers as

well) interconnected by a communication network. Each minicomputer

usually has multiple users simultaneously logged on to it. For this, several

interactive terminals are connected to each minicomputer. Each user is

logged on to one specific minicomputer, with remote access to other

minicomputers. The network allows a user to access remote resources that

are available on some machine other than the one to which the user is currently logged on.

The minicomputer model may be used when resource sharing (such as

sharing of information databases of different types, with each type of

database located on a different machine) with remote users is desired.

The early ARPAnet is an example of a distributed computing system based

on the minicomputer model.

Fig. 1.3: A Distributed Computing System based on Minicomputer Model

Workstation Model

A distributed computing system based on the workstation model (Fig. 1.4)

consists of several workstations interconnected by a communication

network. An organization may have several workstations located throughout

a building or campus, each workstation equipped with its own disk and

serving as a single-user computer. It has often been found that in such an

environment, at any one time a significant proportion of the workstations are

idle (not being used), resulting in the waste of large amounts of CPU time.

Therefore, the idea of the workstation model is to interconnect all these

workstations by a high-speed LAN so that idle workstations may be used to

process jobs of users who are logged onto other workstations and do not

have sufficient processing power at their own workstations to get their jobs

processed efficiently.

Fig. 1.4: A Distributed Computing System based on Workstation Model

In this model, a user logs onto one of the workstations called his or her

"home" workstation and submits jobs for execution. When the system finds

that the user's workstation does not have sufficient processing power for

executing the processes of the submitted jobs efficiently, it transfers one or

more of the processes from the user's workstation to some other workstation

that is currently idle and gets the process executed there, and finally the

result of execution is returned to the user's workstation.

This model is not as simple to implement as it might appear at first sight, because several issues must be resolved. Tanenbaum summarizes these issues as follows; they must be handled carefully to achieve maximum efficiency:

1. How does the system find an idle workstation?

2. How is a process transferred from one workstation to get it executed on another workstation?

3. What happens to a remote process if a user logs onto a workstation that was idle until now and was being used to execute a process of another workstation?
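
As a rough illustration of how the first question might be approached, the sketch below keeps a central registry of workstation load: a workstation is treated as idle when its reported load falls below a threshold, and a remote process is placed on the least-loaded idle machine. The class, workstation names, and the load threshold are all hypothetical choices for this example; a real system must also handle the transfer itself and the return of an owner to a previously idle workstation (questions 2 and 3).

```python
from dataclasses import dataclass

IDLE_THRESHOLD = 0.1   # hypothetical: below 10% CPU utilization counts as idle

@dataclass
class Workstation:
    name: str
    load: float          # most recently reported CPU utilization (0.0 - 1.0)

class IdleRegistry:
    """Toy registry that workstations periodically report their load to."""
    def __init__(self):
        self.stations = {}

    def report_load(self, name, load):
        self.stations[name] = Workstation(name, load)

    def find_idle_workstation(self):
        idle = [w for w in self.stations.values() if w.load < IDLE_THRESHOLD]
        if not idle:
            return None
        return min(idle, key=lambda w: w.load)   # pick the least-loaded idle machine

registry = IdleRegistry()
registry.report_load("ws-alice", 0.85)   # busy home workstation
registry.report_load("ws-bob", 0.02)     # idle
registry.report_load("ws-carol", 0.04)   # idle

target = registry.find_idle_workstation()
print("run remote process on:", target.name if target else "no idle workstation")
```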

Workstation – Server Model

The workstation model is a network of personal workstations, each with its

own disk and a local file system. A workstation with its own local disk is

usually called a diskful workstation and a workstation without a local disk is

called a diskless workstation. With the proliferation of high-speed networks,

diskless workstations have become more popular in network environments

than diskful workstations, making the workstation-server model more

popular than the workstation model for building distributed computing

systems.

A distributed computing system based on the workstation-server model

(Fig. 1.5) consists of a few minicomputers and several workstations (most of

which are diskless, but a few of which may be diskful) interconnected by a

communication network.

Fig. 1.5: A Distributed Computing System based on Workstation-server Model

Note that when diskless workstations are used on a network, the file system

to be used by these workstations must be implemented either by a diskful

workstation or by a minicomputer equipped with a disk for file storage. One

or more of the minicomputers are used for implementing the file system.

Other minicomputers may be used for providing other types of services,

such as database service and print service. Therefore, each minicomputer is

used as a server machine to provide one or more types of services.

Therefore, in the workstation-server model, in addition to the workstations,

there are specialized machines (may be specialized workstations) for

running server processes (called servers) for managing and providing

access to shared resources.

For a number of reasons, such as higher reliability and better scalability,

multiple servers are often used for managing the resources of a particular

type in a distributed computing system. For example, there may be multiple

file servers, each running on a separate minicomputer and cooperating via

the network, for managing the files of all the users in the system. Due to this

reason, a distinction is often made between the services that are provided to

clients and the servers that provide them. That is, a service is an abstract

entity that is provided by one or more servers. For example, one or more file

servers may be used in a distributed computing system to provide file

service to the users.

In this model, a user logs onto a workstation called his or her home

workstation. Normal computation activities required by the user's processes

are performed at the user's home workstation, but requests for services

provided by special servers (such as a file server or a database server) are

sent to a server providing that type of service, which performs the user's

requested activity and returns the result of request processing to the user's

workstation. Therefore, in this model, the user's processes need not be

migrated to the server machines for getting the work done by those

machines.

For better overall system performance, the local disk of a diskful workstation

is normally used for such purposes as storage of temporary files, storage of

unshared files, storage of shared files that are rarely changed, paging

activity in virtual-memory management, and caching of remotely accessed

data.

Compared to the workstation model, the workstation-server model has

several advantages:

1. In general, it is much cheaper to use a few minicomputers equipped with

large, fast disks that are accessed over the network than a large number

of diskful workstations, with each workstation having a small, slow disk.

2. Diskless workstations are also preferred to diskful workstations from a

system maintenance point of view. Backup and hardware maintenance

are easier to perform with a few large disks than with many small disks

scattered all over a building or campus. Furthermore, installing new

releases of software (such as a file server with new functionalities) is

easier when the software is to be installed on a few file server machines

than on every workstation.

3. In the workstation-server model, since all files are managed by the file

servers, users have the flexibility to use any workstation and access the

files in the same manner irrespective of which workstation the user is

currently logged on. Note that this is not true with the workstation model,

in which each workstation has its local file system, because different

mechanisms are needed to access local and remote files.

4. In the workstation-server model, the request-response protocol

described above is mainly used to access the services of the server

machines. Therefore, unlike the workstation model, this model does not

need a process migration facility, which is difficult to implement.

The request-response protocol is known as the client-server model of

communication. In this model, a client process (which in this case

resides on a workstation) sends a request to a server process (which in

this case resides on a minicomputer) for getting some service such as

reading a block of a file. The server executes the request and sends

back a reply to the client that contains the result of request processing (a minimal sketch of this exchange appears after this list).

The client-server model provides an effective general-purpose approach

to the sharing of information and resources in distributed computing

systems. It is not only meant for use with the workstation-server model

but also can be implemented in a variety of hardware and software

environments. The computers used to run the client and server

processes need not necessarily be workstations and minicomputers.

They can be of many types and there is no need to distinguish between

them. It is even possible for both the client and server processes to be

run on the same computer. Moreover, some processes are both client

and server processes. That is, a server process may use the services of

another server, appearing as a client to the latter.

5. A user has guaranteed response time because workstations are not

used for executing remote processes. However, the model does not

utilize the processing capability of idle workstations.

The V-System proposed by Cheriton in 1988 is an example of a

distributed computing system that is based on the workstation-server

model.
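
The request-response (client-server) exchange described in point 4 above can be sketched with ordinary TCP sockets. The sketch below is a minimal, illustrative file server that returns a requested block of a file, together with a client that asks for one; the one-line request format, the block size, the port, and the file name are all assumptions made for this example, not part of any standard protocol.

```python
import socket
import threading
import time

BLOCK_SIZE = 512                  # hypothetical block size
HOST, PORT = "127.0.0.1", 5050    # hypothetical server address

def file_server():
    """Server process: waits for 'READ <filename> <block#>' and replies with that block."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            _, filename, block_no = conn.recv(1024).decode().split()
            with open(filename, "rb") as f:
                f.seek(int(block_no) * BLOCK_SIZE)
                conn.sendall(f.read(BLOCK_SIZE))    # the reply carries the result

def client():
    """Client process: sends a request and waits for the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as c:
        c.connect((HOST, PORT))
        c.sendall(b"READ demo.txt 0")               # request block 0 of demo.txt
        reply = c.recv(BLOCK_SIZE)
        print("client received:", reply.decode(errors="replace"))

if __name__ == "__main__":
    with open("demo.txt", "w") as f:                # create a file for the demo
        f.write("hello from the file server")
    threading.Thread(target=file_server, daemon=True).start()
    time.sleep(0.2)                                 # give the server time to start listening
    client()
```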

Processor – Pool Model

The processor-pool model is based on the observation that most of the time

a user does not need any computing power but once in a while the user may

need a very large amount of computing power for a short time (e.g., when

recompiling a program consisting of a large number of files after changing a

basic shared declaration). Therefore, unlike the workstation-server model in

which a processor is allocated to each user, in the processor-pool model the

processors are pooled together to be shared by the users as needed. The

pool of processors consists of a large number of microcomputers and

minicomputers attached to the network. Each processor in the pool has its

own memory to load and run a system program or an application program of

the distributed computing system.

In the pure processor-pool model (Fig. 1.6), the processors in the pool have

no terminals attached directly to them, and users access the system from

terminals that are attached to the network via special devices. These

terminals are either small diskless workstations or graphic terminals, such

as X terminals. A special server (called a run server) manages and allocates

the processors in the pool to different users on a demand basis. When a

user submits a job for computation, an appropriate number of processors

are temporarily assigned to his or her job by the run server. For example, if

the user's computation job is the compilation of a program having n

segments, in which each of the segments can be compiled independently to

produce separate relocatable object files, n processors from the pool can be

allocated to this job to compile all the n segments in parallel. When the

computation is completed, the processors are returned to the pool for use by

other users.
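
The run server's allocation of pool processors to the independently compilable segments of a job can be mimicked, purely as an illustration, with a local process pool: one worker per segment compiles the segments concurrently, and the workers are released when the job completes. The compile_segment function and segment names are hypothetical stand-ins for real compilation work on real pool processors.

```python
from concurrent.futures import ProcessPoolExecutor

def compile_segment(segment):
    # Hypothetical stand-in for compiling one independently compilable
    # segment into a relocatable object file.
    return segment.replace(".c", ".o")

if __name__ == "__main__":
    segments = [f"segment{i}.c" for i in range(1, 6)]   # a job with n = 5 segments

    # The "run server": temporarily assign one pool processor per segment,
    # compile all segments in parallel, then return the processors to the pool.
    with ProcessPoolExecutor(max_workers=len(segments)) as pool:
        object_files = list(pool.map(compile_segment, segments))

    print("object files:", object_files)
```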

In the processor-pool model there is no concept of a home machine. That is,

a user does not log onto a particular machine but to the system as a whole.

This is in contrast to other models in which each user has a home machine

(e.g., a workstation or minicomputer) onto which he or she logs and where most of his or her programs run by default.

Fig. 1.6: A distributed computing system based on the processor-pool model

As compared to the workstation-server model, the processor-pool model

allows better utilization of the available processing power of a distributed

computing system. This is because in the processor-pool model, the entire

processing power of the system is available for use by the currently logged-

on users, whereas this is not true for the workstation-server model in which

several workstations may be idle at a particular time but they cannot be

used for processing the jobs of other users. Furthermore, the processor-pool

model provides greater flexibility than the workstation-server model in the

sense that the system's services can be easily expanded without the need

to install any more computers; the processors in the pool can be allocated to

act as extra servers to carry any additional load arising from an increased

user population or to provide new services. However, the processor-pool

model is usually considered to be unsuitable for high-performance

interactive applications, especially those using graphics or window systems.

This is mainly because of the slow speed of communication between the

computer on which the application program of a user is being executed and

the terminal via which the user is interacting with the system. The

workstation-server model is generally considered to be more suitable for

such applications.

Amoeba, proposed by Mullender et al. in 1990, is an example of a distributed computing system based on the processor-pool model.

Hybrid Model

Out of the four models described above, the workstation-server model is the most widely used model for building distributed computing systems. This is because a large number of computer users only perform simple interactive tasks such as editing jobs, sending electronic mail, and

executing small programs. The workstation-server model is ideal for such

simple usage. However, in a working environment that has groups of users

who often perform jobs needing massive computation, the processor-pool

model is more attractive and suitable.

To combine the advantages of both the workstation-server and processor-

pool models, a hybrid model may be used to build a distributed computing

system. The hybrid model is based on the workstation-server model but with

the addition of a pool of processors. The processors in the pool can be

allocated dynamically for computations that are too large for workstations or

that require several computers concurrently for efficient execution. In

addition to efficient execution of computation-intensive jobs, the hybrid

model gives guaranteed response to interactive jobs by allowing them to be

processed on local workstations of the users. However, the hybrid model is

more expensive to implement than the workstation-server model or the

processor-pool model.

1.3 Advantages of Distributed Systems

From the models of distributed computing systems presented above, it is

obvious that distributed computing systems are much more complex and

difficult to build than traditional centralized systems (those consisting of a

single CPU, its memory, peripherals, and one or more terminals). The

increased complexity is mainly due to the fact that in addition to being

capable of effectively using and managing a very large number of distributed

resources, the system software of a distributed computing system should

also be capable of handling the communication and security problems that

are very different from those of centralized systems. For example, the

performance and reliability of a distributed computing system depends to a

great extent on the performance and reliability of the underlying

communication network. Special software is usually needed to handle loss

of messages during transmission across the network or to prevent

overloading of the network, which degrades the performance and

responsiveness to the users. Similarly, special software security measures

are needed to protect the widely distributed shared resources and services

against intentional or accidental violation of access control and privacy

constraints.

Despite the increased complexity and the difficulty of building distributed

computing systems, the installation and use of distributed computing

systems are rapidly increasing. This is mainly because the advantages of

distributed computing systems outweigh their disadvantages. The technical

needs, the economic pressures, and the major advantages that have led to

the emergence and popularity of distributed computing systems are

described next.

Inherently Distributed Applications

Distributed computing systems come into existence in some very natural

ways. For example, several applications are inherently distributed in nature

and require a distributed computing system for their realization. For

instance, in an employee database of a nationwide organization, the data

pertaining to a particular employee are generated at the employee's branch

office, and in addition to the global need to view the entire database, there is

a local need for frequent and immediate access to locally generated data at

each branch office. Applications such as these require that some processing

power be available at the many distributed locations for collecting,

preprocessing, and accessing data, resulting in the need for distributed

computing systems. Some other examples of inherently distributed

applications are a computerized worldwide airline reservation system, a

computerized banking system in which a customer can deposit/withdraw

money from his or her account from any branch of the bank, and a factory

automation system controlling robots and machines all along an assembly

line.

Information Sharing among Distributed Users

Another reason for the emergence of distributed computing systems was the desire for an efficient person-to-person communication facility for sharing information over great distances. In a distributed computing system,

information generated by one of the users can be easily and efficiently

shared by the users working at other nodes of the system. This facility may

be useful in many ways. For example, a project can be performed by two or

more users who are geographically far off from each other but whose

computers are a part of the same distributed computing system. In this

case, although the users are geographically separated from each other, they

can work in cooperation, for example, by transferring the files of the project,

logging onto each other's remote computers to run programs, and

exchanging messages by electronic mail to coordinate the work.

Resource Sharing

Information is not the only thing that can be shared in a distributed

computing system. Sharing of software resources such as software libraries

and databases as well as hardware resources such as printers, hard disks,

and plotters can also be done in a very effective way among all the

computers and the users of a single distributed computing system. For

example, we saw that in a distributed computing system based on the

workstation-server model the workstations may have no disk or only a small

disk (10-20 megabytes) for temporary storage, and access to permanent

files on a large disk can be provided to all the workstations by a single file

server.

Better Price-Performance Ratio

This is one of the most important reasons for the growing popularity of

distributed computing systems. With the rapidly increasing power and

reduction in the price of microprocessors, combined with the increasing

speed of communication networks, distributed computing systems

potentially have a much better price-performance ratio than a single large

centralized system. For example, we saw how a small number of CPUs in a

distributed computing system based on the processor-pool model can be

effectively used by a large number of users from inexpensive terminals,

giving a fairly high price-performance ratio as compared to either a

centralized time-sharing system or a personal computer. Another reason for

distributed computing systems to be more cost-effective than centralized

systems is that they facilitate resource sharing among multiple computers.

For example, a single unit of expensive peripheral devices such as color

laser printers, high-speed storage devices, and plotters can be shared

among all the computers of the same distributed computing system. If these

computers are not linked together with a communication network, each

computer must have its own peripherals, resulting in higher cost.

Shorter Response Times and Higher Throughput

Due to multiplicity of processors, distributed computing systems are

expected to have better performance than single-processor centralized

systems. The two most commonly used performance metrics are response

time and throughput of user processes. That is, the multiple processors of a

distributed computing system can be utilized properly for providing shorter

response times and higher throughput than a single-processor centralized

system. For example, if there are two different programs to be run, two

processors are evidently more powerful than one because the programs can

be simultaneously run on different processors. Furthermore, if a particular

computation can be partitioned into a number of subcomputations that can

run concurrently, in a distributed computing system all the subcomputations

can be simultaneously run with each one on a different processor.

Distributed computing systems with very fast communication networks are

increasingly being used as parallel computers to solve single complex

problems rapidly. Another method often used in distributed computing

systems for achieving better overall performance is to distribute the load

more evenly among the multiple processors by moving jobs from currently

overloaded processors to lightly loaded ones. For example, in a distributed

computing system based on the workstation model, if a user currently has

two processes to run, out of which one is an interactive process and the

other is a process that can be run in the background, it may be

advantageous to run the interactive process on the home node of the user

and the other one on a remote idle node (if any node is idle).

Higher Reliability

Reliability refers to the degree of tolerance against errors and component

failures in a system. A reliable system prevents loss of information even in

the event of component failures. The multiplicity of storage devices and

processors in a distributed computing system allows the maintenance of

multiple copies of critical information within the system and the execution of

important computations redundantly to protect them against catastrophic

failures. With this approach, if one of the processors fails, the computation

can be successfully completed at the other processor, and if one of the

storage devices fails, the information can still be used from the other storage

device. Furthermore, the geographical distribution of the processors and

other resources in a distributed computing system limits the scope of

failures caused by natural disasters.

An important aspect of reliability is availability, which refers to the fraction of

time for which a system is available for use. In comparison to a centralized

system, a distributed computing system also enjoys the advantage of

increased availability. For example, if the processor of a centralized system

fails (assuming that it is a single-processor centralized system), the entire

system breaks down and no useful work can be performed. However, in the

case of a distributed computing system, a few parts of the system can be

down without interrupting the jobs of the users who are using the other parts

of the system. For example, if a workstation of a distributed computing

system that is based on the workstation-server model fails, only the user of

that workstation is affected. Other users of the system are not affected by

this failure. Similarly, in a distributed computing system based on the

processor-pool model, if some of the processors in the pool are down at any

moment, the system can continue to function normally, simply with some

loss in performance that is proportional to the number of processors that are

down. In this case, none of the users are affected, and they may not even notice that some of the processors are down.

The advantage of higher reliability is an important reason for the use of

distributed computing systems for critical applications whose failure may be

disastrous. However, often reliability comes at the cost of performance.

Therefore, it is necessary to maintain a balance between the two.

Extensibility and Incremental Growth

Another major advantage of distributed computing systems is that they are

capable of incremental growth. That is, it is possible to gradually extend the

power and functionality of a distributed computing system by simply adding

additional resources (both hardware and software) to the system as and

when the need arises. For example, additional processors can be easily

added to the system to handle the increased workload of an organization

that might have resulted from its expansion. Incremental growth is a very

attractive feature because for most existing and proposed applications it is

practically impossible to predict future demands of the system. Extensibility

is also easier in a distributed computing system because addition of new

resources to an existing system can be performed without significant

disruption of the normal functioning of the system. Properly designed

distributed computing systems that have the property of extensibility and

incremental growth are called open distributed systems.

Better Flexibility in Meeting Users’ Needs

Different types of computers are usually more suitable for performing

different types of computations. For example, computers with ordinary

power are suitable for ordinary data processing jobs, whereas high-

performance computers are more suitable for complex mathematical

computations. In a centralized system, the users have to perform all types of

computations on the only available computer. However, a distributed

computing system may have a pool of different types of computers, in which

case the most appropriate one can be selected for processing a user's job

depending on the nature of the job. For instance, we saw that in a

distributed computing system that is based on the hybrid model, interactive

jobs can be processed at a user's own workstation and the processors in

the pool may be used to process noninteractive, computation-intensive jobs.

1.4 Distributed Operating Systems

Tanenbaum and Van Renesse define an operating system as a program

that controls the resources of a computer system and provides its users with

an interface or virtual machine that is more convenient to use than the bare

machine. According to this definition, the two primary tasks of an operating

system are as follows:

1. To present users with a virtual machine that is easier to program than

the underlying hardware.

2. To manage the various resources of the system. This involves

performing such tasks as keeping track of who is using which resource,

granting resource requests, accounting for resource usage, and

mediating conflicting requests from different programs and users.

Therefore, the users' view of a computer system, the manner in which the

users access the various resources of the computer system, and the ways

in which the resource requests are granted depend to a great extent on the

operating system of the computer system. The operating systems commonly

used for distributed computing systems can be broadly classified into two

types – network operating systems and distributed operating systems. The

three most important features commonly used to differentiate between these

two types of operating systems are system image, autonomy, and fault

tolerance capability. These features are given below:

System image: Under a network OS, the user views the distributed system as a collection of machines connected by a communication subsystem; that is, the user is aware of the fact that multiple computers are used. A

distributed OS hides the existence of multiple computers and provides a

single system image to the users.

Autonomy: A network OS is built on a set of existing centralized OSs and

handles the interfacing and coordination of remote operations and

communications between these OSs. So, in this case, each machine has its

own OS. With a distributed OS, there is a single system-wide OS and each

computer runs part of this global OS.

Fault tolerance capability: A network operating system provides little or no

fault tolerance capability in the sense that if 10% of the machines of the

entire distributed computing system are down at any moment, at least 10%

of the users are unable to continue with their work. On the other hand, with

a distributed operating system, most of the users are normally unaffected by

the failed machines and can continue to perform their work normally, with

only a 10% loss in performance of the entire distributed computing system.

Therefore, the fault tolerance capability of a distributed operating system is

usually very high as compared to that of a network operating system.

1.5 Issues in Designing Distributed Operating Systems

In general, designing a distributed operating system is more difficult than

designing a centralized operating system for several reasons. In the design

of a centralized operating system, it is assumed that the operating system

has access to complete and accurate information about the environment in

which it is functioning. For example, a centralized operating system can

request status information, being assured that the interrogated component

will not change state while awaiting a decision based on that status

information, since only the single operating system asking the question may

give commands. However, a distributed operating system must be designed

with the assumption that complete information about the system

environment will never be available. In a distributed system, the resources

are physically separated, there is no common clock among the multiple

processors, delivery of messages is delayed, and messages could even be

lost. Due to all these reasons, a distributed operating system does not have

up-to-date, consistent knowledge about the state of the various components

of the underlying distributed system. Obviously, lack of up-to-date and

consistent information makes many things (such as management of

resources and synchronization of cooperating activities) much harder in the

design of a distributed operating system. For example, it is hard to schedule

the processors optimally if the operating system is not sure how many of

them are up at the moment.

Despite these complexities and difficulties, a distributed operating system

must be designed to provide all the advantages of a distributed system to its

users. That is, the users should be able to view a distributed system as a

virtual centralized system that is flexible, efficient, reliable, secure, and easy

to use. To meet this challenge, the designers of a distributed operating

system must deal with several design issues. Some of the key design issues

are described below.

Transparency

One of the main goals of a distributed operating system is to make the

existence of multiple computers invisible (transparent) and provide a single

system image to its users. That is, a distributed operating system must be

designed in such a way that a collection of distinct machines connected by a

communication subsystem appears to its users as a virtual uniprocessor.

Achieving complete transparency is a difficult task and requires that several

different aspects of transparency be supported by the distributed operating

system. The eight forms of transparency identified by the International

Standards Organization's Reference Model for Open Distributed Processing

[ISO 1992] are access transparency, location transparency, replication

transparency, failure transparency, migration transparency, concurrency

transparency, performance transparency, and scaling transparency.

Access Transparency

Access transparency means that users should not need or be able to

recognize whether a resource (hardware or software) is remote or local.

This implies that the distributed operating system should allow users to

access remote resources in the same way as local resources. That is, the

user interface, which takes the form of a set of system calls, should not

distinguish between local and remote resources, and it should be the

responsibility of the distributed operating system to locate the resources and

to arrange for servicing user requests in a user-transparent manner.

This requirement leads to the development and deployment of a well-

designed set of system calls that are meaningful in both centralized and

distributed environments and a global resource naming facility. Due to the

need to handle communication failures in distributed systems, it is not

possible to design system calls that provide complete access transparency.

However, the area of designing a global resource naming facility has been

well researched with considerable success. The distributed shared memory

mechanism is also meant to provide a uniform set of system calls for

accessing both local and remote memory objects. Although this mechanism

is quite useful in providing access transparency, it is suitable only for limited

types of distributed applications due to its performance limitation.

Location Transparency

The two main aspects of location transparency are as follows:

Name transparency refers to the fact that the name of a resource

(hardware or software) should not reveal any hint as to the physical

location of the resource. That is, the name of a resource should be

independent of the physical connectivity or topology of the system or the

current location of the resource. Furthermore, such resources, which are

capable of being moved from one node to another in a distributed

system (such as a file), must be allowed to move without having their

names changed. Therefore, resource names must be unique

systemwide.

User mobility refers to the fact that no matter which machine a user is

logged onto, he or she should be able to access a resource with the

same name. That is, the user should not be required to use different

names to access the same resource from two different nodes of the

system. In a distributed system that supports user mobility, users can

freely log on to any machine in the system and access any resource

without making any extra effort.

Both name transparency and user mobility requirements call for a

system-wide, global resource naming facility.
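
A global resource naming facility of the kind both requirements call for can be pictured as a simple name service: clients look up a location-independent name and the service returns the node currently holding the resource, so neither the name nor the machine the user is logged onto encodes a physical location. The sketch below is only a toy, in-memory illustration; the class, resource names, and node names are hypothetical.

```python
class NameService:
    """Toy system-wide name service: maps location-independent names to node addresses."""
    def __init__(self):
        self._directory = {}

    def register(self, resource_name, node):
        self._directory[resource_name] = node

    def move(self, resource_name, new_node):
        # The resource can migrate without its name changing.
        self._directory[resource_name] = new_node

    def lookup(self, resource_name):
        # The same name resolves correctly from any machine the user is logged onto.
        return self._directory[resource_name]

ns = NameService()
ns.register("/projects/report.txt", "node-3")
print(ns.lookup("/projects/report.txt"))   # node-3

ns.move("/projects/report.txt", "node-7")  # resource migrates; its name is unchanged
print(ns.lookup("/projects/report.txt"))   # node-7
```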

Replication Transparency

For better performance and reliability, almost all distributed operating

systems have the provision to create replicas (additional copies) of files and

other resources on different nodes of the distributed system. In these

systems, both the existence of multiple copies of a replicated resource and

the replication activity should be transparent to the users. That is, two

important issues related to replication transparency are naming of replicas

and replication control. It is the responsibility of the system to name the

various copies of a resource and to map a user-supplied name of the

resource to an appropriate replica of the resource. Furthermore, replication

control decisions such as how many copies of the resource should be

created, where should each copy be placed, and when should a copy be

created/deleted should be made entirely automatically by the system in a

user-transparent manner.

Failure Transparency

Failure transparency deals with masking partial failures in the system from the users, such as a communication link failure, a machine failure, or a

storage device crash. A distributed operating system having failure

transparency property will continue to function, perhaps in a degraded form,

in the face of partial failures. For example, suppose the file service of a

distributed operating system is to be made failure transparent. This can be

done by implementing it as a group of file servers that closely cooperate

with each other to manage the files of the system and that function in such a

manner that the users can utilize the file service even if only one of the file

servers is up and working. In this case, the users cannot notice the failure of

one or more file servers, except for slower performance of file access

operations. Any type of service can be implemented in this way for failure

transparency. However, in this type of design, care should be taken to

ensure that the cooperation among multiple servers does not add too much

overhead to the system.
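
One simple way a client library can hide the failure of individual file servers in such a group is to try the replicas in turn and return the first successful reply, so that users see, at worst, slower access rather than a visible failure. The sketch below only illustrates this retry idea with hypothetical stub objects; it ignores replica consistency, which a real group of cooperating file servers must also handle.

```python
class ServerDown(Exception):
    pass

class FileServerStub:
    """Hypothetical client-side stub for one replica of the file service."""
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive
        self.files = {"notes.txt": b"replicated file contents"}

    def read(self, filename):
        if not self.alive:
            raise ServerDown(self.name)
        return self.files[filename]

def transparent_read(replicas, filename):
    # Try each cooperating file server in turn; mask individual failures.
    for server in replicas:
        try:
            return server.read(filename)
        except ServerDown:
            continue                      # this replica is down; fall through to the next
    raise RuntimeError("file service unavailable: all replicas failed")

replicas = [FileServerStub("fs-1", alive=False),   # failed server
            FileServerStub("fs-2", alive=True),
            FileServerStub("fs-3", alive=True)]

print(transparent_read(replicas, "notes.txt"))     # succeeds despite fs-1 being down
```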

Complete failure transparency is not achievable with the current state of the

art in distributed operating systems because all types of failures cannot be

handled in a user-transparent manner. For example, failure of the

communication network of a distributed system normally disrupts the work of

its users and is noticeable by the users. Moreover, an attempt to design a

completely failure-transparent distributed system will result in a very slow

and highly expensive system due to the large amount of redundancy

required for tolerating all types of failures. The design of such a distributed

system, although theoretically possible, is not practically justified.

Migration Transparency

For performance, reliability, and security reasons, an object that is

capable of being moved (such as a process or a file) is often migrated from

one node to another in a distributed system. The aim of migration

transparency is to ensure that the movement of the object is handled

automatically by the system in a user-transparent manner. Three important

issues in achieving this goal are as follows:

i) Migration decisions such as which object is to be moved from where

to where should be made automatically by the system.

ii) Migration of an object from one node to another should not require

any change in its name.

iii) When the migrating object is a process, the interprocess

communication mechanism should ensure that a message sent to the

migrating process reaches it without the need for the sender process

to resend it if the receiver process moves to another node before the

message is received.
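
Issue (iii) is often handled by leaving a forwarding reference behind when a process migrates, so that messages sent to its old node are redirected to its new location without the sender having to resend them. The sketch below illustrates only that forwarding idea, using in-memory objects with hypothetical names rather than a real interprocess communication mechanism.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.local_processes = {}     # pid -> list of delivered messages
        self.forwarding = {}          # pid -> node the process migrated to

    def deliver(self, pid, message):
        if pid in self.local_processes:
            self.local_processes[pid].append(message)    # process is here
        elif pid in self.forwarding:
            self.forwarding[pid].deliver(pid, message)    # follow the forwarding reference
        else:
            raise KeyError(f"unknown process {pid} on node {self.name}")

def migrate(pid, source, destination):
    mailbox = source.local_processes.pop(pid)
    destination.local_processes[pid] = mailbox
    source.forwarding[pid] = destination      # leave a forwarding reference behind

node_a, node_b = Node("A"), Node("B")
node_a.local_processes["p1"] = []

migrate("p1", node_a, node_b)                 # p1 moves from node A to node B
node_a.deliver("p1", "hello")                 # sender still targets node A
print(node_b.local_processes["p1"])           # ['hello'] - the message reached p1 on B
```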

Concurrency Transparency

In a distributed system, multiple users who are spatially separated use the

system concurrently. In such a situation, it is economical to share the

system resources (hardware or software) among the concurrently executing

user processes. However, since the number of available resources in a

computing system is restricted, one user process must necessarily influence

the action of other concurrently executing user processes, as it competes for

resources. For example, concurrent update to the same file by two different

processes should be prevented. Concurrency transparency means that

each user has a feeling that he or she is the sole user of the system and

other users do not exist in the system. For providing concurrency

transparency, the resource sharing mechanisms of the distributed operating

system must have the following four properties:

i) An event-ordering property ensures that all access requests to various

system resources are properly ordered to provide a consistent view to

all users of the system.

ii) A mutual-exclusion property ensures that at any time at most one

process accesses a shared resource, which must not be used

simultaneously by multiple processes if program operation is to be

correct.

iii) A no-starvation property ensures that if every process that is granted

a resource, which must not be used simultaneously by multiple

processes, eventually releases it, every request for that resource is

eventually granted.

iv) A no-deadlock property ensures that a situation will never occur in

which competing processes prevent their mutual progress even

though no single one requests more resources than available in the

system.
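
Properties (i) to (iii) can be pictured with a very small centralized lock server: requests for a shared resource are queued in arrival order (event ordering), at most one holder is granted the resource at a time (mutual exclusion), and because every holder eventually releases it and the queue is FIFO, every request is eventually granted (no starvation). The sketch is deliberately centralized, handles a single resource, and ignores deadlock across multiple resources; the names are hypothetical.

```python
from collections import deque

class LockServer:
    """Toy centralized lock server for one shared resource."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()    # FIFO queue => event ordering and no starvation

    def request(self, process):
        if self.holder is None:
            self.holder = process            # granted immediately
        else:
            self.waiting.append(process)     # queued behind earlier requests
        return self.holder == process

    def release(self, process):
        assert self.holder == process, "only the current holder may release"
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder                   # next process granted the resource, if any

lock = LockServer()
print(lock.request("P1"))    # True  - P1 holds the resource (mutual exclusion)
print(lock.request("P2"))    # False - P2 must wait its turn
print(lock.release("P1"))    # P2    - P2 is granted next (no starvation)
```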

Performance Transparency

The aim of performance transparency is to allow the system to be

automatically reconfigured to improve performance, as loads vary

dynamically in the system. As far as practicable, a situation in which one

processor of the system is overloaded with jobs while another processor is

idle should not be allowed to occur. That is, the processing capability of the

system should be uniformly distributed among the currently available jobs in

the system. This requirement calls for the support of intelligent resource

allocation and process migration facilities in distributed operating systems.

Scaling Transparency

The aim of scaling transparency is to allow the system to expand in scale

without disrupting the activities of the users. This requirement demands an open-system architecture and the use of scalable algorithms for designing

the distributed operating system components.

Reliability

In general, distributed systems are expected to be more reliable than

centralized systems due to the existence of multiple instances of resources.

However, the existence of multiple instances of the resources alone cannot

increase the system's reliability. Rather, the distributed operating system,

which manages these resources, must be designed properly to increase the

system's reliability by taking full advantage of this characteristic feature of a

distributed system.

A fault is a mechanical or algorithmic defect that may generate an error. A

fault in a system may cause system failure. Depending on the manner in which

a failed system behaves, system failures are of two types – fail-stop and

Byzantine. In the case of fail-stop failure, the system stops functioning after

changing to a state in which its failure can be detected. On the other hand,

in the case of Byzantine failure, the system continues to function but

produces wrong results. Undetected software bugs often cause Byzantine

failure of a system. Obviously, Byzantine failures are much more difficult to

deal with than fail-stop failures.

For higher reliability, the fault-handling mechanisms of a distributed

operating system must be designed properly to avoid faults, to tolerate

faults, and to detect and recover from faults. Commonly used methods for

dealing with these issues are:

Fault Avoidance: Fault avoidance deals with designing the components of

the system in such a way that the occurrence of faults is minimized.

Conservative design practices such as using high-reliability components are

often employed for improving the system's reliability based on the idea of

fault avoidance. Although a distributed operating system often has little or

no role to play in improving the fault avoidance capability of a hardware

component, the designers of the various software components of the

distributed operating system must test them thoroughly to make these

components highly reliable.

Fault Tolerance: Fault tolerance is the ability of a system to continue

functioning in the event of partial system failure. The performance of the

system might be degraded due to partial failure, but otherwise the system

functions properly.

Fault Detection and Recovery: The fault detection and recovery method of

improving reliability deals with the use of hardware and software

mechanisms to determine the occurrence of a failure and then to correct the

system to a state acceptable for continued operation.

Flexibility

Another important issue in the design of distributed operating systems is

flexibility. Flexibility is the most important feature for open distributed

systems. The design of a distributed operating system should be flexible

due to the following reasons:

Ease of modification. From the experience of system designers, it has

been found that some parts of the design often need to be replaced/

modified either because some bug is detected in the design or because

the design is no longer suitable for the changed system environment or

new user requirements. Therefore, it should be easy to incorporate

changes in the system in a user-transparent manner or with minimum

interruption caused to the users.

Ease of enhancement. In every system, new functionalities have to be

added from time to time to make it more powerful and easy to use.

Therefore, it should be easy to add new services to the system.

Furthermore, if a group of users do not like the style in which a particular

service is provided by the operating system, they should have the

flexibility to add and use their own service that works in the style with

which the users of that group are more familiar and feel more

comfortable.

The most important design factor that influences the flexibility of a

distributed operating system is the model used for designing its kernel. The

kernel of an operating system is its central controlling part that provides

basic system facilities. It operates in a separate address space that is

inaccessible to user processes. It is the only part of an operating system

that a user cannot replace or modify. In the case of a distributed operating

system, identical kernels are run on all the nodes of a distributed system.

Performance

If a distributed system is to be used, its performance must be at least as

good as that of a centralized system. That is, when a particular application is run

on a distributed system, its overall performance should be better than or at

least equal to that of running the same application on a single-processor

system. However, to achieve this goal, it is important that the various

components of the operating system of a distributed system be designed

properly; otherwise, the overall performance of the distributed system may

turn out to be worse than a centralized system. Some design principles

considered useful for better performance are as follows:

Batch if possible. Batching often helps in improving performance

greatly. For example, transfer of data across the network in large chunks

rather than as individual pages is much more efficient. Similarly,

piggybacking of acknowledgment of previous messages with the next

message during a series of messages exchanged between two

communicating entities also improves performance.

Cache whenever possible. Caching of data at clients' sites frequently

improves overall system performance because it makes data available

wherever it is being currently used, thus saving a large amount of

computing time and network bandwidth. In addition, caching reduces

contention on centralized resources (a minimal client-side caching sketch appears after this list of principles).

Minimize copying of data. Data copying adds a substantial CPU cost to many operations. For example, while being transferred from its sender to its receiver, message data may take the

following path on the sending side:

a. From sender's stack to its message buffer

b. From the message buffer in the sender's address space to the

message buffer in the kernel's address space

c. Finally, from the kernel to the network interface board

On the receiving side, the data probably takes a similar path in the

reverse direction. Therefore, in this case, a total of six copy operations

are involved in the message transfer operation. Similarly, in several


systems, the data copying overhead is also large for read and write

operations on block I/O devices. Therefore, for better performance, it is

desirable to avoid copying of data, although this is not always simple to

achieve. Making optimal use of memory management often helps in

eliminating much data movement between the kernel, block I/O devices,

clients, and servers.

Minimize network traffic. System performance may also be improved

by reducing internode communication costs. For example, accesses to

remote resources require communication, possibly through intermediate

nodes. Therefore, migrating a process closer to the resources it is using

most heavily may be helpful in reducing network traffic in the system if

the decreased cost of accessing its favorite resource offsets the possible

increased cost of accessing its less favored ones. Another way to

reduce network traffic is to use the process migration facility to cluster

two or more processes that frequently communicate with each other on

the same node of the system. Avoiding the collection of global state

information for making some decision also helps in reducing network

traffic.

Take advantage of fine-grain parallelism for multiprocessing.

Performance can also be improved by taking advantage of fine-grain

parallelism for multiprocessing. For example, threads (described in

Chapter 8) are often used for structuring server processes. Servers

structured as a group of threads can operate efficiently because they

can simultaneously service requests from several clients. Fine-grained

concurrency control of simultaneous accesses by multiple processes to

a shared resource is another example of application of this principle for

better performance.
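As a concrete illustration of the "cache whenever possible" principle above, the following minimal Python sketch keeps a small client-side cache of remote data. The fetch_from_server function and the 30-second validity period are hypothetical stand-ins, not part of any particular system:

    import time

    CACHE_TTL = 30.0     # assumed validity period for a cached entry, in seconds
    _cache = {}          # name -> (value, time of fetch)

    def fetch_from_server(name):
        # Placeholder for a real remote read; in practice this is a network call.
        return "data for " + name

    def cached_read(name):
        entry = _cache.get(name)
        if entry is not None and time.time() - entry[1] < CACHE_TTL:
            return entry[0]                     # served locally: no network traffic
        value = fetch_from_server(name)         # cache miss: contact the server
        _cache[name] = (value, time.time())
        return value

Repeated calls to cached_read for the same name within the validity period are served from local memory, reducing both network traffic and contention on the server.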


Scalability

Scalability refers to the capability of a system to adapt to increased service

load. It is inevitable that a distributed system will grow with time since it is

very common to add new machines or an entire subnetwork to the system to

take care of increased workload or organizational changes in a company.

Therefore, a distributed operating system should be designed to easily cope

with the growth of nodes and users in the system. That is, such growth

should not cause serious disruption of service or significant loss of

performance to users. Some guiding principles for designing scalable

distributed systems are as follows:

Avoid centralized entities, such as a single file server.
Avoid centralized algorithms.
Perform most operations on client workstations, since servers are shared by several clients and can become bottlenecks.

Heterogeneity

A heterogeneous distributed system consists of interconnected sets of

dissimilar hardware or software systems. Because of the diversity, designing

heterogenous distributed systems is far more difficult than designing

homogeneous distributed systems in which each system is based on the

same, or closely related, hardware and software. However, as a

consequence of large scale, heterogeneity is often inevitable in distributed

systems. Furthermore, heterogeneity is often preferred by many users because heterogeneous distributed systems give their users the flexibility of employing different computer platforms for different applications. For example,

a user may have the flexibility of a supercomputer for simulations, a

Macintosh for document processing, and a UNIX workstation for program

development.


Incompatibilities in a heterogeneous distributed system may be of different

types. For example, the internal formatting schemes of different

communication and host processors may be different; or when several

networks are interconnected via gateways, the communication protocols

and topologies of different networks may be different; or the servers

operating at different nodes of the system may be different. For instance,

some hosts use 32-bit word lengths while others use word lengths of 16 or

64 bits. Byte ordering within these data constructs can vary as well,

requiring special converters to enable data sharing between incompatible

hosts.

In a heterogeneous distributed system, some form of data translation is

necessary for interaction between two incompatible nodes. Some earlier

systems left this translation to the users, but this is no longer acceptable.

The data translation job may be performed either at the sender's node or at

the receiver's node. Suppose this job is performed at the receiver's node.

With this approach, at every node there must be a translator to convert each

format in the system to the format used on the receiving node. Therefore, if

there are n different formats, n - 1 pieces of translation software must be

supported at each node, resulting in a total of n (n - 1) pieces of translation

software in the system. This is undesirable, as adding a new type of format

becomes a more difficult task over time. Performing the translation job at the

sender's node instead of the receiver's node also suffers from the same

drawback.

The software complexity of this translation process can be greatly reduced

by using an intermediate standard data format. In this method, an

intermediate standard data format is declared, and each node only requires

a translation software for converting from its own format to the standard

format and from the standard format to its own format. In this case, when two incompatible nodes interact, the data to be sent is first converted to the standard format at the sender node, moved across the network in the standard format, and finally converted from the standard format to the receiver's own format at the receiver node. By choosing the standard format to

be the most common format in the system, the number of conversions can

be reduced.
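The following minimal Python sketch illustrates the intermediate-standard-format idea under the assumption that the agreed standard on-the-wire form for integers is big-endian 32 bits; each node then needs only its own two converters, regardless of how many other machine formats exist in the system:

    import struct

    # Sender-side converter: native integer -> standard form (big-endian, 4 bytes).
    def to_standard(value):
        return struct.pack(">i", value)

    # Receiver-side converter: standard form -> native integer.
    def from_standard(data):
        return struct.unpack(">i", data)[0]

    wire_bytes = to_standard(1234)          # performed at the sender node
    received = from_standard(wire_bytes)    # performed at the receiver node
    assert received == 1234                 # the value survives the translation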

Security

In order that the users can trust the system and rely on it, the various

resources of a computer system must be protected against destruction and

unauthorized access. Enforcing security in a distributed system is more

difficult than in a centralized system because of the lack of a single point of

control and the use of insecure networks for data communication. In a

centralized system, all users are authenticated by the system at login time,

and the system can easily check whether a user is authorized to perform the

requested operation on an accessed resource. In a distributed system,

however, since the client-server model is often used for requesting and

providing services, when a client sends a request message to a server, the

server must have some way of knowing who the client is. This is not so

simple as it might appear because any client identification field in the

message cannot be trusted. This is because an intruder (a person or

program trying to obtain unauthorized access to system resources) may

pretend to be an authorized client or may change the message contents

during transmission. Therefore, as compared to a centralized system,

enforcement of security in a distributed system has the following additional

requirements:

It should be possible for the sender of a message to know that the

message was received by the intended receiver.

It should be possible for the receiver of a message to know that the

message was sent by the genuine sender.


It should be possible for both the sender and receiver of a message to

be guaranteed that the contents of the message were not changed while

it was in transfer.

Cryptography is the only known practical method for dealing with these

security aspects of a distributed system. In this method, comprehension of

private information is prevented by encrypting the information, which can

then be decrypted only by authorized users.

Another guiding principle for security is that a system whose security

depends on the integrity of the fewest possible entities is more likely to

remain secure as it grows. For example, it is much simpler to ensure

security based on the integrity of the much smaller number of servers rather

than trusting thousands of clients. In this case, it is sufficient to only ensure

the physical security of these servers and the software they run.

1.6 Introduction to Distributed Computing Environment (DCE)

A vendor-independent distributed computing environment, DCE was defined

by the Open Software Foundation (OSF), a consortium of computer

manufacturers, including IBM, DEC, and Hewlett-Packard. It is not an

operating system, nor is it an application. Rather, it is an integrated set of

services and tools that can be installed as a coherent environment on top of

existing operating systems and serve as a platform for building and running

distributed applications.

A primary goal of DCE is vendor independence. It runs on many different

kinds of computers, operating systems, and networks produced by different

vendors. For example, some operating systems to which DCE can be easily

ported include OSF/1, AIX, DOMAIN OS, ULTRIX, HP-UX, SINIX, SunOS,

UNIX System V, VMS, WINDOWS, and OS/2. On the other hand, it can be

used with any network hardware and transport software, including TCP/IP,

X.25, as well as other similar products.


As shown in Figure 1.7, DCE is a middleware software layered between the

DCE applications layer and the operating system and networking layer. The

basic idea is to take a collection of existing machines (possibly from different

vendors), interconnect them by a communication network, add the DCE

software platform on top of the native operating systems of the machines,

and then be able to build and run distributed applications. Each machine

has its own local operating system, which may be different from that of other

machines. The DCE software layer on top of the operating system and

networking layer hides the differences between machines by automatically

performing data-type conversions when necessary. Therefore, the

heterogeneous nature of the system is transparent to the applications

programmers, making their job of writing distributed applications much

simpler.

Fig. 1.7: Position of DCE Software in a DCE-based Distributed System (layers, from top to bottom: DCE applications; DCE software; operating systems and networking)

DCE Components

DCE is a blend of various technologies developed independently and nicely

integrated by OSF. Each of these technologies forms a component of DCE.

The main components of DCE are as follows:

Threads package: It provides a simple programming model for building

concurrent applications. It includes operations to create and control

multiple threads of execution in a single process and to synchronize

access to global data within an application.

Remote Procedure Call (RPC) facility: It provides programmers with a

number of powerful tools necessary to build client-server applications. In


fact, the DCE RPC facility is the basis for all communication in DCE

because the programming model underlying all of DCE is the client-

server model. It is easy to use, network-independent and protocol-

independent, provides secure communication between a client and a

server, and hides differences in data requirements by automatically

converting data to the appropriate forms needed by clients and servers.

Distributed Time Service (DTS): It closely synchronizes the clocks of

all the computers in the system. It also permits the use of time values

from external time sources to synchronize the clocks of the computers in

the system with external time. This facility can also be used to

synchronize the clocks of the computers of one distributed environment

with the clocks of the computers of another distributed environment.

Name services: The name services of DCE include the Cell Directory

Service (CDS), the Global Directory Service (GDS), and the Global

Directory Agent (GDA). These services allow resources such as servers,

files, devices, and so on, to be uniquely named and accessed in a

location-transparent manner.

Security Service: It provides the tools needed for authentication and

authorization to protect system resources against illegitimate access.

Distributed File Service (DFS): It provides a systemwide file system

that has such characteristics as location transparency, high

performance, and high availability. A unique feature of DCE DFS is that

it can also provide file services to clients of other file systems.

DCE Cells

The DCE system is highly scalable in the sense that a system running DCE

can have thousands of computers and millions of users spread over a

worldwide geographic area. To accommodate such large systems, DCE

uses the concept of cells. This concept helps break down a large system

into smaller, manageable units called cells.


In a DCE system, a cell is a group of users, machines, or other resources

that typically have a common purpose and share common DCE services.

The minimum cell configuration requires a cell directory server, a security

server, a distributed time server, and one or more client machines. Each

DCE client machine has client processes for security service, cell directory

service, distributed time service, RPC facility, and threads facility. A DCE

client machine may also have a process for distributed file service if a cell

configuration has a DCE distributed file server. Due to the use of the method

of intersection for clock synchronization, it is recommended that each cell in

a DCE system should have at least three distributed time servers.

An important decision to be made while setting up a DCE system is to

decide the cell boundaries. The following four factors should be taken into

consideration for making this decision.

i) Purpose: The machines of users working on a common goal should be

put in the same cell, as they need easy access to a common set of

system resources. That is, users of machines in the same cell have

closer interaction with each other than with users of machines in

different cells. For example, if a company manufactures and sells

various types of products, depending on the manner in which the

company functions, either a product-oriented or a function-oriented

approach may be taken to decide cell boundaries. In the product-

oriented approach, separate cells are formed for each product, with the

users of the machines belonging to the same cell being responsible for

all types of activities (design, manufacturing, marketing, and support

services) related to one particular product. On the other hand, in the

function-oriented approach, separate cells are formed for each type of

activity, with the users belonging to the same cell being responsible for

a particular activity, such as design, of all types of products.


ii) Administration: Each system needs an administrator to register new

users in the system and to decide their access rights to the system's

resources. To perform his or her job properly, an administrator must

know the users and the resources of the system. Therefore, to simplify

administration jobs, all the machines and their users that are known to

and manageable by an administrator should be put in a single cell. For

example, all machines belonging to the same department of a company

or a university can belong to a single cell. From an administration point

of view, each cell has a different administrator.

iii) Security: Machines of those users who have greater trust in each

other should be put in the same cell. That is, users of machines of a

cell trust each other more than they trust the users of machines of other

cells. In such a design, cell boundaries act like firewalls in the sense

that accessing a resource that belongs to another cell requires more

sophisticated authentication than accessing a resource that belongs to

a user's own cell.

iv) Overhead: Several DCE operations, such as name resolution and user

authentication, incur more overhead when they are performed between

cells than when they are performed within the same cell. Therefore,

machines of users who frequently interact with each other and the

resources frequently accessed by them should be placed in the same cell. The need to access a resource of another cell should arise only infrequently for better overall system performance.

1.7 Summary

A distributed computing system is a collection of processors interconnected

by a communication network in which each processor has its own local

memory and other peripherals and communication between any two


processors of the system takes place by message passing over the

communication network.

The existing models for distributed computing systems can be broadly

classified into five models: minicomputer, workstation, workstation-server,

processor-pool, and hybrid.

Distributed computing systems are much more complex and difficult to build

than the traditional centralized systems. Despite the increased complexity

and the difficulty of building, the installation and use of distributed computing

systems are rapidly increasing. This is mainly because the advantages of

distributed computing systems outweigh their disadvantages. The main

advantages of distributed computing systems are (a) suitability for inherently

distributed applications, (b) sharing of information among distributed users,

(c) sharing of resources, (d) better price-performance ratio, (e) shorter

response times and higher throughput, (f) higher reliability, (g) extensibility

and incremental growth, and (h) better flexibility in meeting users' needs.

The operating systems commonly used for distributed computing systems

can be broadly classified into two types: network operating systems and

distributed operating systems. As compared to a network operating system,

a distributed operating system has better transparency and fault tolerance

capability and provides the image of a virtual uniprocessor to the users.

The main issues involved in the design of a distributed operating system are

transparency, reliability, flexibility, performance, scalability, heterogeneity,

security, and emulation of existing operating systems.

DCE is an integrated set of services and tools that can be installed as a

coherent environment on top of existing operating systems and serve as a

platform for building and running distributed applications. A primary goal of

DCE is vendor independence. It runs on many different kinds of computers,

operating systems, and networks produced by different vendors.


1.8 Terminal Questions

1. Discuss the relative advantages and disadvantages of the various

commonly used models for configuring distributed computing systems.

2. What are the main differences between a network operating system and

a distributed operating system?

3. What are the major issues in designing a distributed operating system?

4. Why is scalability an important feature in the design of a distributed

system?

5. What are the main components of DCE?


Unit 2 Message Passing

Structure:

2.1 Introduction

Objectives

2.2 Features of Message Passing

2.3 Issues in IPC by Message Passing

2.4 Synchronization

2.5 Buffering

2.6 Process Addressing

2.7 Failure Handling

2.8 Group Communication

2.9 Terminal Questions

2.1 Introduction

A process is a program in execution. When we say that two computers of a

distributed system are communicating with each other, we mean that two

processes, one running on each computer, are in communication with each

other. A distributed operating system needs to provide interprocess

communication (IPC) mechanisms to facilitate such communication

activities. A message passing system is a subsystem of the distributed

operating system which shields the details of complex network protocols

from the programmer. It enables processes to communicate by exchanging

messages and allows programs to be written by using simple

communication primitives such as send and receive. Interprocess

communication basically requires information sharing among two or more

processes. The two basic methods for information sharing are as follows:


i) Original Sharing or Shared Data approach

Message is placed in a common memory area that is accessible

to all processes.

This is not possible in a distributed system, unless it is a

distributed shared memory system (DSM).

ii) Copy Sharing or Message Passing approach

Message is physically copied from sender’s address space to the

receiver’s address space.

This is the basic IPC mechanism in distributed systems.

In the shared data approach, the information to be shared is placed in a

common memory area that is accessible to all the processes involved in an

IPC. The shared data paradigm gives the conceptual communication pattern

illustrated in figure 2.1 below:

Figure 2.1: Communications in Shared Data Paradigm

In the method of message passing, the information to be shared is

physically copied from the sender process’s address space to the address

space of all the receiver processes, and this is done by transmitting the data

to be copied in the form of messages (a message is a block of information).

The message passing paradigm gives the conceptual communication



pattern as shown in figure 2.2 below. In this case the communicating

processes interact directly with each other.

Figure 2.2: Communication in Message Passing Paradigm

Since computers in a network do not share memory, processes in a

distributed system normally communicate by exchanging messages among themselves. Therefore, message passing is the basic IPC mechanism in distributed systems.

2.2 Features of a Message Passing System

Desirable features of a good message passing system are:

Simplicity

Efficiency

Reliability

Correctness

Flexibility

Security

Portability

Simplicity

The message passing system should be

– easy to use

– easy to develop new applications that communicate with the existing

ones

– able to hide the details of underlying network protocols used

Efficiency

– Should reduce the number of message exchanges (e.g., acknowledgements)


– Avoid the costs of establishing and terminating connections between

the same pair of processes for each and every message

– Piggyback acknowledgments with the normal messages

– Send acknowledgments selectively

Reliability

– Should handle node and link failures

– Normally handled by acknowledgments, timeouts and

retransmissions.

– Should handle duplicate messages that arise due to retransmissions

(generally sequence numbers of the messages are used for this

purpose).

Correctness

– Atomicity: messages sent to a group of processes will be delivered

to all of them or none of them.

– Ordered delivery: Messages are received by all receivers in an

order acceptable to the application.

– Survivability: Guarantees messages will be delivered correctly in

spite of failures.

Flexibility

– IPC protocols should be flexible enough to cater to the various needs of different applications (e.g., some may not require atomicity, while others may not require ordered delivery)

– IPC primitives should be flexible to permit any kind of control flow

between cooperating processes, including synchronous and

asynchronous send and receive.

Security

– Message passing system should be capable of providing secure

end-to-end communication.


– Support mechanisms for authentication of the receivers of a

message by a sender.

– Support mechanisms for authentication of the sender by its receivers

– Support encryption of a message before sending it over the network.

Portability: There are two different aspects of portability in a message-

passing system:

1. The message-passing system should itself be portable. It should be

possible to easily construct a new IPC facility on another system by

reusing the basic design of the existing message-passing system.

2. The applications written by using the primitives of the IPC protocols

of the message-passing system should be made portable. This

requires that heterogeneity must be considered while designing a

message-passing system. This may require the use of an external data representation format for the communication taking place

between two or more processes running on computers of different

architectures.

2.3 Issues in IPC (Inter-process Communication) by Message

Passing

A message is a meaningful formatted block of information sent by the

sender process to the receiver process. The message block consists of a

fixed length header followed by a variable size collection of typed data

objects.

The header block of a message may have the following elements:

Address: A set of characters that uniquely identify both the sender and

receiver.

Sequence Number: A message identifier used to detect duplicate and lost messages in the event of system failures.


Structural Information: This has two parts. The type part specifies whether the data to be sent to the receiver is included within the message or whether the message only contains a pointer to the data. The second part specifies the length of the variable-size message.
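As an illustrative sketch only (the exact layout is protocol specific), the header elements listed above could be represented in Python as follows:

    from dataclasses import dataclass

    @dataclass
    class MessageHeader:
        sender_address: str     # uniquely identifies the sending process
        receiver_address: str   # uniquely identifies the receiving process
        sequence_number: int    # used to detect duplicate and lost messages
        data_in_message: bool   # type part: data included, or only a pointer to it
        data_length: int        # length of the variable-size message body

    header = MessageHeader("node1@17", "node2@42", sequence_number=7,
                           data_in_message=True, data_length=128)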

In a message-oriented IPC protocol, the users are fully aware of the

message formats used in the communication process and the mechanisms

used to send and receive messages.

The following are some important issues to be considered for the design of

an IPC protocol based message passing system:

The Sender’s Identity

The Receiver’s Identity

Number of Receivers

Guaranteed acceptance of sent messages by the receiver

Acknowledgement by the sender

Handling system crashes or link failures

Handling of buffers

Order of delivery of messages

The above issues are addressed by the semantics of the communication primitives provided by the IPC protocol. A general description of the various ways in which these issues are addressed by message-oriented IPC protocols is presented below.

2.4 Synchronization

A major issue in communication is the synchronization imposed on the

communicating processes by the communication primitives. A communication primitive may have one of two types of semantics: blocking semantics or non-blocking semantics.


Blocking Semantics: A communication primitive is said to have

blocking semantics if its invocation blocks the execution of its invoker

(for example in the case of send, the sender blocks until it receives an

acknowledgement from the receiver.)

Non-blocking Semantics: A communication primitive is said to have

non-blocking semantics if its invocation does not block the execution of

its invoker.

The synchronization imposed on the communicating processes basically

depends on one of the two types of semantics used for the send and receive

primitives.

Blocking Primitives

Blocking Send Primitive: In this case, after execution of the send

statement, the sending process is blocked until it receives an

acknowledgement from the receiver that the message has been received.

Non-Blocking Send Primitive: In this case, after execution of the send

statement, the sending process is allowed to proceed with its execution as

soon as the message is copied to the buffer.

Blocking Receive Primitive: In this case, after execution of the receive

statement, the receiving process is blocked until it receives a message.

Non-Blocking Receive Primitive: In this case, the receiving process

proceeds with its execution after the execution of receive statement, which

returns the control almost immediately just after telling the kernel where the

message buffer is.

Handling non-blocking receives: The following are the two ways of doing

this:

– Polling: a test primitive is used by the receiver to check the buffer status


– Interrupt: When a message arrives in the buffer, a software interrupt is used to notify the receiver. However, user-level interrupts make programming difficult.

Handling blocking receives: A timeout value may be used with a blocking

receive primitive to prevent a receiving process from getting blocked

indefinitely if the sender has failed.

Synchronous Vs Asynchronous Communication

When both send and receive primitives of a communication between two

processes use blocking semantics, the communication is said to be

synchronous. If one or both of the primitives is non-blocking, then the

communication is said to be asynchronous.

Synchronous communication is easy to implement and contributes to the reliable delivery of messages; however, it limits concurrency and is prone to communication deadlocks.
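The blocking, non-blocking, and timeout-based receive variants described above can be sketched with Python's thread-safe queues, used here as an in-process stand-in for a real message channel:

    import queue

    channel = queue.Queue()            # stand-in for the kernel's message buffer

    def blocking_receive():
        return channel.get()           # blocks until a message arrives

    def nonblocking_receive():
        try:
            return channel.get_nowait()       # returns control immediately
        except queue.Empty:
            return None                        # caller must poll (test) again later

    def receive_with_timeout(seconds):
        try:
            return channel.get(timeout=seconds)  # avoids blocking indefinitely
        except queue.Empty:
            return None                           # e.g., the sender may have failed

    channel.put("hello")               # a non-blocking send into the buffer
    print(blocking_receive())          # prints: hello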

2.5 Buffering

The transmission of messages from one process to another can be done by

copying the body of the message from the sender’s address space to the

receiver’s address space. In some cases, the receiving process may not be

ready to receive the message but it wants the operating system to save that

message for later reception. In such cases, the operating system relies on buffer space at the receiver in which the transmitted messages can be stored until the receiving process executes the corresponding receive primitive.

The synchronous and asynchronous modes of communication correspond

to the two extremes of buffering: a null buffer, or no buffering, and a buffer

with unbounded capacity. Two other commonly used buffering strategies are


single-message and finite-bound, or multiple message buffers. These four

types of buffering strategies are given below:

No buffering: In this case, message remains in the sender’s address

space until the receiver executes the corresponding receive.

Single message buffer: A buffer to hold a single message at the

receiver side is used. It is used for implementing synchronous

communication because in this case an application can have only one

outstanding message at any given time.

Unbounded-capacity buffer: Convenient for supporting asynchronous communication. However, an unbounded buffer is impossible to support in practice.

Finite-Bound Buffer: Used for supporting asynchronous

communication.

Buffer overflow can be handled in one of the following ways:

Unsuccessful communication: send returns an error message to the

sending process, indicating that the message could not be delivered to

the receiver because the buffer is full.

Flow-controlled communication: The sender is blocked until the receiver accepts some messages. This violates the semantics of an asynchronous send and may also result in communication deadlocks.
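Both overflow policies can be sketched with a bounded queue standing in for a finite-bound buffer; which policy a message-passing system adopts is a design choice:

    import queue

    buffer = queue.Queue(maxsize=4)    # finite-bound (multiple-message) buffer

    def send_unsuccessful_on_overflow(msg):
        try:
            buffer.put_nowait(msg)
            return True
        except queue.Full:
            return False               # unsuccessful communication: report an error

    def send_flow_controlled(msg):
        buffer.put(msg)                # blocks the sender until space is available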

The message data should be meaningful to the receiving process. Ideally, this implies that the structure of the program objects should be preserved while they are being transmitted from the address space of the sending process to the address space of the receiving process. This is not possible in heterogeneous systems, in which the sending and receiving processes are on computers of different architectures. Even in homogeneous systems, it is very difficult to achieve this goal, mainly for two reasons:


1. An absolute pointer value has no meaning (more on this when we talk

about RPC). For example, a pointer to a tree or linked list. So, proper

encoding mechanisms should be adopted to pass such objects.

2. Different program objects, such as integers, long integers, short

integers, and character strings occupy different storage space. So, from

the encoding of these objects, the receiver should be able to identify the

type and size of the objects.

One of the following two representations may be used for the encoding and

decoding of a message data:

1. Tagged representation: The type of each program object as well as its

value is encoded in the message. In this method, it is a simple matter for

the receiving process to check the type of each program object in the

message because of the self-describing nature of the coded data format.

2. Untagged representation: The message contains only the program objects; no information is included in the message about the type of each program object. In this method, the receiving process should have prior knowledge of how to decode the received data because the coded

data format is not self-describing.

The untagged representation is used in SUN's XDR format, and the tagged representation is used in the Mach distributed operating system.
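The difference between the two representations can be sketched with a toy encoding in which integers travel as 4 bytes and strings as length-prefixed UTF-8; a real standard such as XDR defines this far more completely:

    import struct

    # Untagged: the receiver must already know it will get one integer
    # followed by one string.
    def encode_untagged(number, text):
        body = text.encode("utf-8")
        return struct.pack(">i", number) + struct.pack(">i", len(body)) + body

    # Tagged: every object is preceded by a one-byte type tag, so the stream
    # is self-describing and the receiver can check types while decoding.
    TAG_INT, TAG_STR = 1, 2

    def encode_tagged(number, text):
        body = text.encode("utf-8")
        return (bytes([TAG_INT]) + struct.pack(">i", number) +
                bytes([TAG_STR]) + struct.pack(">i", len(body)) + body)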

2.6 Process Addressing

A message passing system generally supports two types of addressing:

Explicit Addressing: The process with which communication is desired

is explicitly specified as a parameter in the communication primitive. e.g.

send (pid, msg), receive (pid, msg).


Implicit Addressing: A process does not explicitly name a process for

communication. For example, a process can specify a service instead of

a process. e.g. send any (service id, msg), receive any (pid, msg)

Methods for process addressing:

machine id@local id: UNIX uses this form of addressing (IP address,

port number).

Advantages: No global coordination needed for process addressing.

Disadvantages: Does not allow process migration.

machine id1@local id@machine id2: machine id1 identifies the node

on which the process is created. local id is generated by the node on

which the process is created.

machine id2 identifies the last known location of the process. When a

process migrates to another node, the link information (the machine id of the node to which the process migrates) is left behind on the machine it leaves. This information is used for forwarding messages to migrated processes.

Disadvantages:

– Overhead involved in locating a process may be large.

– If the node on which the process was executing is down, it may not

be possible to locate the process.

2.7 Failure Handling

While a distributed system may offer potential for parallelism, it is also prone

to partial failures such as a node crash or a communication link failure.

During Interprocess communication, such failures may lead to the following

problems:

Loss of request message: This can be due to a link failure or because the receiver node is down.


Loss of response message: This may be due to a link failure or because the sender node is down when the response reaches it.

Unsuccessful execution of request: This may be due to the receiver node crashing while processing the request.

These problems may be overcome using one of the following reliable IPC methods:

1. A four message reliable IPC:

In this method, four messages are involved: a request and an acknowledgement sent by the client machine, and a reply and an acknowledgement sent by the server machine. The kernels of both the client and the server continue to retransmit after a timeout until an acknowledgement is received; that is, the client machine sends a request message to the

server machine and waits for an acknowledgement from the server. If the

acknowledgement is not received within the specified timeout period, the

client retransmits its request to the server and waits for an

acknowledgement. This process continues till an acknowledgement is

received. The same process occurs even at the server side.

The server sends a reply message to the client and waits for the

acknowledgement until the specified timeout period. On non-receipt of the

acknowledgement within the timeout period, it resends the reply back to the

client machine and the process continues till the client responds with an

acknowledgement.

2. Three message reliable IPC:

The scenario here differs slightly from method 1 above in that the client machine does not wait for a separate acknowledgement of its request from the server machine. The client machine just sends the

request to the specified server. But here the server machine expects an

acknowledgement from the client machine when it responds to the client’s


request message. The server now waits for an acknowledgement from the

client and on non-receipt of the acknowledgement within the specified time

period, it retransmits the reply message to the client and this cycle continues

until the client responds with an acknowledgement.

In this method, the server may use piggybacking, wherein the acknowledgement of the client's request is carried by the reply message itself.

3. Two message reliable IPC:

In this method there is no requirement either from the client or the server for

receiving acknowledgements from each other. They just exchange the

messages in the form of requests and replies, each assuming that its messages are delivered successfully (an ideal scenario), which may be impractical in real situations.
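The client-side retransmission logic common to the reliable IPC methods above can be sketched as follows; the server address, timeout, and retry count are assumptions, and the reply itself serves as the acknowledgement of the request (as in the three-message scheme):

    import socket

    SERVER = ("127.0.0.1", 9000)      # assumed server address
    TIMEOUT = 2.0                     # seconds to wait for a reply
    MAX_RETRIES = 5

    def reliable_request(payload):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(TIMEOUT)
        try:
            for _ in range(MAX_RETRIES):
                sock.sendto(payload, SERVER)            # (re)transmit the request
                try:
                    reply, _addr = sock.recvfrom(4096)  # reply doubles as the ack
                    return reply
                except socket.timeout:
                    continue                            # request or reply lost: retry
            raise TimeoutError("server did not respond")
        finally:
            sock.close()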

Idempotency and handling of duplicate request messages

Idempotency basically means "repeatability"; i.e., an idempotent operation produces the same result, without any side effects, no matter how many times it is performed with the same arguments. For example, consider a

sqrt procedure for calculating the square root of a given number; sqrt (64)

always returns 8.

On the other hand, operations that do not necessarily produce the same

results when executed repeatedly with the same arguments are said to be

non-idempotent. For example a debit operation on a bank account.

An idempotent operation produces the same result without any side

effect no matter how many times it is executed.

Not all operations are idempotent

So, if requests can be retransmitted, then care should be taken to

implement its reply as an idempotent operation.


Even if the same request is retransmitted several times, the server

should execute the request only once; or if it executes several times, the

net result should be equivalent to the result of exactly one execution.

This is called exactly once semantics. Primitives based on exactly-once

semantics are desirable but difficult to implement.

Implementation of exactly-once semantics:

– each request has a unique sequence number

– Kernel makes sure request is forwarded to server only once

– After receiving the reply from the server, Kernel caches a copy of the

reply and retransmits it when it receives the same request from client
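A minimal sketch of this idea on the server side follows: requests carry a sequence number, and the last reply for each (client, sequence number) pair is cached, so a retransmitted duplicate is answered from the cache instead of being re-executed. All names here are illustrative:

    reply_cache = {}        # (client_id, sequence_number) -> cached reply

    def handle_request(client_id, sequence_number, operation, *args):
        key = (client_id, sequence_number)
        if key in reply_cache:
            return reply_cache[key]    # duplicate: resend old reply, do not re-execute
        result = operation(*args)      # executed exactly once per sequence number
        reply_cache[key] = result
        return result

    # A non-idempotent debit executed safely even if the request is retransmitted.
    balance = {"acct": 100}

    def debit(account, amount):
        balance[account] -= amount
        return balance[account]

    print(handle_request("clientA", 1, debit, "acct", 10))   # 90
    print(handle_request("clientA", 1, debit, "acct", 10))   # duplicate: still 90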

2.8 Group Communication

The most elementary form of message-based interaction is one-to-one

communication in which a single-sender process sends a message to a

single receiver process. However, for performance and ease of

programming several highly parallel distributed applications require that a

message passing system should also provide group communication facility.

Depending on single or multiple senders and receivers, the following three

types of group communication are possible:

1. One-to-many (Single sender and multiple receivers)

2. Many-to-one (multiple senders and single receiver)

3. Many-to-many (multiple senders and multiple receivers)

The following issues need to be addressed in one-to-many (multicast) group communication:

i) Group Management:

In case of one-to-many communication, receiver processes of a message

form a group. Such groups are of two types – closed and open. A closed

group is one in which only the members of the group can send a message


to the group. An outside process cannot send a message to the group as a

whole, although it may send a message to an individual member of the

group. On the other hand an open group is one in which any process in the

system can send a message to the group as a whole.

Whether to use a closed group or an open group is application dependent. A

message passing system with a group communication facility provides the flexibility to create and delete groups dynamically and to allow a process to join or leave a group at any time.

ii) Group Addressing:

A two-level naming scheme is normally used for group addressing. The high-level group name is an ASCII string that is independent of the location information of the processes in the group. The low-level group name

depends to a large extent on the underlying hardware. For example, on

some networks it is possible to create a special network address to which

multiple machines can listen.

Multicast address: A special network address, called a multicast address, is created. A packet sent to a multicast address is delivered to all machines that have subscribed to that group.

For example, on the Internet, class D IP addresses are used for

multicast. The format of class D IP addresses for IP multicasting:

--------------------------------------

|1|1|1|0| Group identification|

--------------------------------------

The first four bits contain 1110 and identify the address as multicast.

The remaining 28 bits specify a specific multicast group.

Broadcast address: A certain address is declared as a broadcast

address and packets sent to that address are delivered to all in the

network.


If there is no facility to create multicast or broadcast addresses, then

the underlying unicast facility is used. The disadvantage is that a separate copy of each packet needs to be sent to each member.
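For example, whether an IPv4 address belongs to class D can be checked from its top four bits, and a datagram can be sent to a multicast group with ordinary sockets; the group address 224.0.0.251 and port 5007 below are arbitrary choices for illustration:

    import socket

    def is_class_d(ip):
        first_octet = int(ip.split(".")[0])
        return (first_octet & 0xF0) == 0xE0      # top four bits are 1110

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)  # stay local
    if is_class_d("224.0.0.251"):
        sock.sendto(b"hello group", ("224.0.0.251", 5007))  # one send, all members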

iii) Message Delivery Approach: The following are the two possible

approaches for message delivery.

Centralized approach: A centralized group server maintains

information about the groups and their members.

Decentralized approach: No central server keeps the information.

Buffered or Unbuffered: A multicast packet can be buffered until the

receiver is ready to receive. If unbuffered, packets could be lost. Multicast

send is inherently asynchronous:

It is unrealistic to expect sending process to wait until all the receiving

processes that belong to the multicast group are ready to receive.

The sending process may not be aware of all the receiving processes

Flexible Reliability in Multicast Communication: Different levels of reliability

0-reliable: No response is expected from any of the receivers.

1-reliable: Sender expects response from one receiver (may be the

multicast server can take the responsibility).

m-out-of-n-reliable: The sender expects response from m out of

n receivers.

All-reliable: The sender expects response from all receivers.

Atomic Multicast: A multicast message is received by all the members of

the group or none.

Different Implementation methods:

• The kernel of the sender is responsible for retransmitting until every member receives the message. This method works only if the sender's machine does not fail and none of the receiver processes fail.


• Each receiver of the multicast message performs an atomic multicast of

the same message to the same group. This method ensures all

surviving processes will receive the message even if some receivers fail

after receiving the message or the sender machine fails after sending

the message.

iv) Many-to-One Communication: In this type of communication, multiple

senders send messages to a single receiver. For example,

• A buffer process may receive messages from several consumers

and producers.

• Multicast recipients may be sending acknowledgements to the

sender.

• A database server may be receiving requests from several clients

v) Many-to-Many Communication: In this type of communication, multiple

senders send messages to multiple receivers. An important issue here is

that of ordered delivery of messages. Ordered delivery ensures that all

messages are delivered to all receivers in an order acceptable to the

application.

The following are the various message ordering semantics followed in case

of a Many-to-Many communication:

i) Absolute Ordering: In this type, all messages are delivered to all

processes in the exact order in which they were sent.

• Not possible to implement in the absence of a global clock.

• Moreover, absolute ordering is not required by many applications.

ii) Consistent Ordering: In this type, all messages are received by all

processes in the same order.

iii) Causal Ordering: For some applications, consistent-ordering

semantics is not necessary and even weaker semantics is acceptable.

An application can have better performance if the message-passing


system used supports a weaker ordering semantics that is acceptable

to the application. One such weak ordering semantics that is

acceptable to many applications is the causal ordering semantics.

This semantics ensures that if the event of sending one message is causally

related to the event of sending another message, the two messages are

delivered to all receivers in the correct order. Two message sending events

are said to be causally related if they are correlated by the happened-before

relation. i.e. two message sending events are causally related if there is any

possibility of the second one being influenced in any way by the first one.

The basic idea behind causal ordering semantics is that when it matters,

messages are always delivered in proper order, but when it does not matter,

they may be delivered in any arbitrary order.
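One well-known way of realizing causal ordering is to attach vector timestamps to messages and to delay delivery of a message until all causally preceding messages have been delivered. The following Python sketch shows only that delivery rule, not a complete multicast protocol:

    class CausalReceiver:
        def __init__(self, n_processes, my_id):
            self.vc = [0] * n_processes    # messages delivered so far, per sender
            self.my_id = my_id
            self.pending = []              # messages buffered until deliverable

        def _deliverable(self, sender, msg_vc):
            if msg_vc[sender] != self.vc[sender] + 1:
                return False               # an earlier message from this sender is missing
            return all(msg_vc[k] <= self.vc[k]
                       for k in range(len(self.vc)) if k != sender)

        def receive(self, sender, msg_vc, payload):
            self.pending.append((sender, msg_vc, payload))
            delivered, progress = [], True
            while progress:                # deliver everything that has become eligible
                progress = False
                for item in list(self.pending):
                    s, vc, data = item
                    if self._deliverable(s, vc):
                        self.vc[s] += 1
                        delivered.append(data)
                        self.pending.remove(item)
                        progress = True
            return delivered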

2.9 Terminal Questions

1. What is a message passing system? Discuss the desirable features of a

message passing system.

2. Discuss the synchronization issues in a message passing system.

3. Discuss the issues of buffering and process addressing.

4. Discuss about group communication mechanisms in Message Passing.


Unit 3 Remote Procedure Calls

Structure:

3.1 Introduction

Objectives

3.2 The RPC Model

3.3 Transparency of RPC

3.4 Implementation of RPC Mechanism

3.5 STUB Generation

3.6 RPC Messages

3.7 Marshaling Arguments and Results

3.8 Server Management

3.9 Parameter Passing, Call Semantics

3.10 Communication Protocol for RPCs

3.11 Complicated RPC

3.12 Client-Server Binding

3.13 Security

3.14 Terminal Questions

3.1 Introduction

Many distributed systems have been based on explicit message exchange

between processes. However, the procedures send and receive do not

conceal communication, which is important to achieve access transparency

in distributed systems. This problem has long been known, but little was

done about it until a paper by Birrell and Nelson (1984) introduced a

completely different way of handling communication. Although the idea is

refreshingly simple (once someone has thought of it), the implications are

often subtle. In this section we will examine the concept, its implementation,

its strengths, and its weaknesses.


In a nutshell, what Birrell and Nelson suggested was allowing programs to

call procedures located on other machines. When a process on machine A

calls a procedure on machine B, the calling process on A is suspended, and

execution of the called procedure takes place on B. Information can be

transported from the caller to the callee in the parameters and can come

back in the procedure result. No message passing at all is visible to the

programmer. This method is known as Remote Procedure Call, or often

just RPC.

While the basic idea sounds simple and elegant, subtle problems exist. To

start with, because the calling and called procedures run on different

machines, they execute in different address spaces, which causes

complications. Parameters and results also have to be passed, which can

be complicated, especially if the machines are not identical. Finally, both

machines can crash and each of the possible failures causes different

problems. Still, most of these can be dealt with, and RPC is a widely-used

technique that underlies many distributed systems.

Objectives:

This unit deals with the remote procedure calling mechanisms in a

distributed system, where in the caller and callee are separated. It starts

introducing the RPC mechanism and discusses the RPC model in a

distributed environment. It discusses various implementation issues

concerned with RPC. It discusses the issues like Stub generation, Server

Management, Parameter Passing mechanisms, Communication protocols,

Client – Server Binding etc.

3.2 The RPC Model

The RPC mechanism is an extension of a normal procedure call

mechanism. It enables a call to be made to a procedure that does not reside

in the address space of the calling process. The called procedure may be on


a remote machine or on the same machine. The caller and callee have separate address spaces, so the called procedure has no access to the caller's environment.

The RPC model is used for transfer of control and data within a program in

the following manner:

1. For making a procedure call, the caller places arguments to the

procedure in some well-specified location.

2. Control is then transferred to the sequence of instructions that

constitutes the body of the procedure.

3. The procedure body is executed in a newly created execution

environment that includes copies of the arguments given in the calling

instruction.

4. After the procedure’s execution is over, control returns to the calling

point, possibly returning a result.

When a remote procedure call is made, the caller and the callee processes

interact in the following manner:

The caller (also known as the client process) sends a call (request)

message to the callee (also known as the server process) and waits (blocks)

for a reply message. The server executes the procedure and returns the

result of the procedure execution to the client. After extracting the result of

the procedure execution, the client resumes execution. In the above model,

RPC calls are synchronous; however, an implementation may choose to make RPC calls asynchronous to allow parallelism. Also, the server can create a thread to process each request, so that the server can receive other requests while a request is being processed.


Figure 3.1: A Model of Remote Procedure Call
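As a concrete, library-level illustration of this request/reply interaction, Python's standard xmlrpc package lets a client invoke a procedure exported by a server as if it were a local call; the address, port, and procedure below are arbitrary examples:

    # --- on the server machine ---
    from xmlrpc.server import SimpleXMLRPCServer

    def add(a, b):
        return a + b

    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(add, "add")
    server.serve_forever()        # wait for call requests, execute, send replies

    # --- on the client machine ---
    import xmlrpc.client

    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))        # the caller blocks until the reply arrives -> 5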

3.3 Transparency of RPC

A major issue in the design of an RPC facility is its transparency property. A

transparent RPC mechanism is one in which local procedures and remote

procedures are indistinguishable to programmers. This requires the

following two types of transparencies:

Syntactic Transparency: A remote procedure call should have the same

syntax as a local procedure call, which is not very difficult to achieve.

Semantic Transparency: Semantics of remote procedure calls are identical

to those of local procedure calls.

Achieving semantic transparency is not easy because:

Unlike local procedure calls, the called procedure is executed in an

address space that is disjoint from the calling program’s address space.

– called procedure has no access to the local environment.



– Passing addresses (pointers) as arguments is meaningless.

– So, passing pointers as parameters is not attractive. An alternative

may be to send a copy of the value pointed.

– Call by reference can be replaced by copy in/copy out but at the cost

of slightly different semantics.

Remote procedure calls are more vulnerable to failure than local

procedure calls

– Programs that make use of RPC must have the capability to handle

this type of error.

– This makes it more difficult to make RPCs transparent.

RPCs consume much more time (100 to 1000 times more) than local procedure calls, due to the involvement of the communication network.

So, achieving semantic transparency is not easy.

3.4 Implementation of RPC Mechanism

To achieve the goal of semantic transparency, the implementation of RPC is

based on the concept of stubs. Stubs provide a perfectly normal local procedure call abstraction and conceal from programs the interface to the underlying RPC system. A separate stub procedure is associated with each of the client side and the server side. To hide the existence and functional

details of the underlying network, an RPC communication package (called

RPC runtime) is used in both the client and server sides.

Thus implementation of an RPC mechanism involves the following five

elements:

1. The Client

2. The Client stub

3. The RPC Runtime

4. The server stub, and

5. The server


Figure 3.2: Implementation of RPC Mechanism

The job of each of these elements is described below:

1. Client: To invoke a remote procedure, the client makes a perfectly normal local call that invokes the corresponding procedure in the client stub.

2. Client Stub: The client stub is responsible for performing the following

tasks:

On receipt of a call request from the client, it packs the specification

of the target procedure and the arguments into a message and asks

the local runtime system to send it to the server stub.

On receipt of the result of procedure execution, it unpacks the result

and passes it to the client.



3. RPCRuntime:

The RPC runtime handles the transmission of the messages across the

network between client and server machines. It is responsible for

retransmissions, acknowledgements, and encryption.

On the client side, it receives the call request from the client stub and

sends it to the server machine. It also receives reply message (result of

procedure execution) from the server machine and passes it to the client

stub.

On the server side, it receives the results of the procedure execution

from the server stub and sends it to the client machine. It also receives

the request message from the client machine and passes it to the server

stub.

4. Server Stub: The functions of the server stub are similar to those of the client stub. It performs the following two tasks:

The server stub unpacks the call request message received from the local RPCRuntime and makes a perfectly normal local call to invoke the appropriate procedure in the server.

The server stub packs the results of the procedure execution

received from the server, and asks the local RPCRuntime to send them to

the client stub.

5. Server: On receiving the call request from the server stub, the server

executes the appropriate procedure and returns the result to the server

stub.
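To make the interplay of these five elements concrete, the following is a minimal sketch in Python, with JSON over a TCP socket standing in for the RPCRuntime's transport. The procedure add, the port number, and all function names are assumptions made only for this illustration; it is not the implementation of any real RPC package.

import json
import socket
import threading
import time

PORT = 50007                       # hypothetical port chosen for this sketch

def add(a, b):                     # the remote procedure implemented by the server
    return a + b

PROCEDURES = {"add": add}

def server_loop():
    # Server side: the RPC runtime receives the call packet, the server stub
    # unpacks it and invokes the procedure, then packs and sends back the result.
    with socket.socket() as s:
        s.bind(("localhost", PORT))
        s.listen(1)
        conn, _ = s.accept()
        with conn:
            request = json.loads(conn.recv(4096).decode())          # runtime: receive
            result = PROCEDURES[request["proc"]](*request["args"])  # stub unpacks, server executes
            conn.sendall(json.dumps({"result": result}).encode())   # stub packs, runtime sends

def call_remote(proc_name, *args):
    # Client stub: pack the procedure name and arguments into a call message,
    # hand it to the runtime (socket), wait for the reply, and unpack the result.
    with socket.socket() as s:
        s.connect(("localhost", PORT))
        s.sendall(json.dumps({"proc": proc_name, "args": list(args)}).encode())
        return json.loads(s.recv(4096).decode())["result"]

if __name__ == "__main__":
    threading.Thread(target=server_loop, daemon=True).start()
    time.sleep(0.2)                              # give the server time to start listening
    print(call_remote("add", 2, 3))              # the client sees an ordinary local call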


3.5 Stub Generation

The stubs can be generated in the following two ways:

Manual Stub Generation: The RPC implementer provides a set of translation functions from which a user can construct his or her own stubs. It is simple to

implement and can handle complex parameters.

Automatic Stub Generation: This is the most commonly used technique

for stub generation. It uses an Interface Definition Language (IDL), for

defining the interface between the client and server. An interface definition is

mainly a list of procedure names supported by the interface, together with

the types of their arguments and results, which helps the client and server to

perform compile-time type checking and generate appropriate calling

sequences. An interface definition also contains information to indicate

whether each argument is an input, an output, or both. This helps in avoiding unnecessary copying: an input argument needs to be copied only from client to server, and an output argument needs to be copied only from server to client. It also contains information about type definitions, enumerated types, and defined constants, so the clients do not have to store this information.

A server program that implements procedures in an interface is said to

export the interface. A client program that calls the procedures is said to

import the interface. When writing a distributed application, a programmer

first writes the interface definition using IDL, then can write a server program

that exports the interface and a client program that imports the interface.

The interface definition is processed using an IDL compiler (the IDL

compiler in Sun RPC is called rpcgen) to generate components that can be

combined with both client and server programs, without making changes to

the existing compilers. In particular, an IDL compiler generates a client stub

procedure and a server stub procedure for each procedure in the interface.

It generates the appropriate marshaling and un-marshaling operations in


each stub procedure. It also generates a header file that supports the data

types in the interface definition to be included in the source files of both

client and server. The client stubs are compiled and linked with the client

program, and the server stubs are compiled and linked with the server program.
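As a rough illustration of automatic stub generation, the sketch below lets a small interface description (playing the role of an IDL file) drive the generation of one type-checking client-stub function per procedure. The interface format and make_client_stubs are inventions of this sketch; a real IDL compiler such as rpcgen emits compilable stub source code instead.

INTERFACE = {
    "name": "file_server",
    "version": 1,
    "procedures": {
        # name: (argument types, result type, in/out direction of each argument)
        "read":  ((str, int, int), bytes, ("in", "in", "in")),
        "write": ((str, int, bytes), int, ("in", "in", "in")),
    },
}

def make_client_stubs(interface, send_call):
    # Return one stub function per procedure; each stub type-checks its
    # arguments (the analogue of compile-time checking) and hands the call
    # to the runtime represented here by send_call.
    stubs = {}
    for proc, (arg_types, _result_type, _dirs) in interface["procedures"].items():
        def stub(*args, _proc=proc, _types=arg_types):
            if len(args) != len(_types) or any(
                    not isinstance(a, t) for a, t in zip(args, _types)):
                raise TypeError(f"bad arguments for {_proc}")
            return send_call(interface["name"], interface["version"], _proc, args)
        stubs[proc] = stub
    return stubs

# Example: plug in a fake runtime that just echoes the call it would transmit.
stubs = make_client_stubs(INTERFACE, lambda *call: call)
print(stubs["read"]("notes.txt", 0, 128))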

3.6 RPC Messages

Any remote procedure call involves a client process and a server process

that are possibly located on different computers. The mode of interaction

between the client and server is that the client asks the server to execute a

remote procedure and the server returns the result of execution of the

concerned procedure to the client. Based on this mode of interaction, the

two types of messages involved in the implementation of an RPC system

are as follows:

i) Call messages sent by the client to the server for requesting execution of a particular remote procedure.

Components of a call message:

Since a call message is used to request execution of a particular remote

procedure, the basic components in a call message are as follows:

identification information of the remote procedure to be executed – such

as program number, version number, and procedure number

arguments necessary for the execution of the procedure

a message identification field that consists of a sequence number

a message type to distinguish call and reply messages

a client identification field

ii) Reply messages sent by the server to the client for returning the result.

When the server of an RPC receives a call message from a client, it could

be faced with one of the following conditions:


The message is not intelligible to it, possibly because the call message violates the RPC protocol. The server needs to discard such calls.

If the server finds the client is not authorized to use the service, the

requested service is not available, or an exception condition such as

division by zero occurs, then it will return an appropriate unsuccessful reply.

If the specified remote procedure is executed successfully, then the server sends a reply message containing the result of the procedure execution.
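The sketch below simply writes the components listed above down as two Python dataclasses so that the structure of the two message types is easy to see; the field names and the example program, version, and procedure numbers are illustrative and do not follow any particular RPC protocol's wire format.

from dataclasses import dataclass
from typing import Any

@dataclass
class CallMessage:
    message_id: int        # sequence number identifying this call
    message_type: str      # "CALL", to distinguish call and reply messages
    client_id: str         # identification of the calling client
    program: int           # identification of the remote procedure:
    version: int           #   program number, version number,
    procedure: int         #   and procedure number
    arguments: tuple       # arguments necessary for executing the procedure

@dataclass
class ReplyMessage:
    message_id: int        # copied from the call so the client can match it
    message_type: str      # "REPLY"
    success: bool          # False for a protocol violation, missing authorization,
    result: Any            #   unavailable service, or an exception such as
                           #   division by zero; True means result carries the answer

call = CallMessage(1, "CALL", "client-42", program=0x20000001,
                   version=1, procedure=3, arguments=(10, 2))
reply = ReplyMessage(call.message_id, "REPLY", True, result=5)
print(call, reply, sep="\n")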

3.7 Marshaling Arguments and Results

Implementation of Remote Procedure calls involves the transfer of

arguments from the client process to the server process and the transfer of

results from the server process to the client process. These arguments and

results are basically language-level data structures (program objects),

which are transferred in the form of message data between the two

computers involved in the call. The transfer of message data between two

computers requires encoding and decoding of the message data. In the case of RPCs this operation is known as marshalling and involves the following

actions:

1. Taking the arguments (of a client process) or the result (of a server

process) that will form the message data to be sent to the remote

process.

2. Encoding the message data of step 1 on the sender’s computer. This

encoding process involves the conversion of program objects into a

stream form that is suitable for transmission and placing them into a

message buffer.

3. Decoding of the message data on the receiver’s computer. This

decoding process involves the reconstruction of program objects from

the message data that was received in the stream form.


In order that encoding and decoding of an RPC message can be performed

successfully, the order and the representation method used to marshal

arguments and results must be known to both the client and the server of

the RPC. This provides a degree of type safety between a client and a

server because the server will not accept a call from a client until the client

uses the same interface definition as the server.

The marshalling process must reflect the structure of all types of program

objects used in the concerned language.
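A minimal sketch of these encoding and decoding steps, assuming (purely for illustration) that the client and the server have agreed on JSON preceded by a four-byte length field as the stream representation:

import json
import struct

def marshal(args):
    # Encode the program objects into a stream form suitable for transmission.
    body = json.dumps(args).encode("utf-8")
    return struct.pack("!I", len(body)) + body   # 4-byte length prefix, then payload

def unmarshal(stream):
    # Reconstruct the program objects from the byte stream received in a message.
    (length,) = struct.unpack("!I", stream[:4])
    return json.loads(stream[4:4 + length].decode("utf-8"))

message = marshal({"filename": "notes.txt", "position": 0, "n": 128})
print(unmarshal(message))   # -> {'filename': 'notes.txt', 'position': 0, 'n': 128}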

3.8 Server Management

In RPC based applications, two important issues that need to be considered

for server management are server implementation and server creation.

i) Server Implementation: Based on the style of implementation used, servers may be of two types: stateful and stateless.

Stateful servers: A stateful server maintains client’s state information from

one remote procedure call to the next. For example, let us consider a server

that supports the following operations for files:

Open (filename, mode): used to open filename in specified mode. When

the server executes this operation, it creates an entry for this file in a file-

table that is used for maintaining state information.

Read (fid, n, buffer): This operation returns n bytes of file data starting from

the byte currently addressed by the read-write pointer and then increments

the pointer by n.

Write (fid, n, buffer): The server takes n bytes of data from the buffer and writes them to the file identified by fid at the location currently addressed by the read-write pointer.

Seek (fid, position): causes the server to change the value of the read-write pointer of the file fid to position.


Close (fid): causes the server to delete the file state information from the

file-table.

The file server mentioned is stateful because it maintains the current state

information of a file that has been opened for use by a client.

Stateless Server: A stateless server does not maintain any client state

information. So every request must be accompanied by all the necessary parameters.

Some operations that a stateless file server can support are:

Read (filename, position, n, buffer): Read n bytes of the file filename starting at position into buffer.

Write (filename, position, n, buffer): Write n bytes from buffer to the file filename starting at position.
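The contrast between the two styles can be seen in a small sketch of the read operation under each of them; the class and method names are invented for the illustration, and error handling as well as the remaining operations are omitted.

class StatefulFileServer:
    def __init__(self):
        self.file_table = {}            # fid -> [open file object, read-write pointer]
        self.next_fid = 0

    def open(self, filename, mode):
        fid, self.next_fid = self.next_fid, self.next_fid + 1
        self.file_table[fid] = [open(filename, mode), 0]     # state kept on the server
        return fid

    def read(self, fid, n):
        f, pos = self.file_table[fid]
        f.seek(pos)
        data = f.read(n)
        self.file_table[fid][1] = pos + n                    # server advances the pointer
        return data

class StatelessFileServer:
    def read(self, filename, position, n):
        # every request carries all the parameters; nothing survives between calls
        with open(filename, "rb") as f:
            f.seek(position)
            return f.read(n)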

Merits of a stateful server: A stateful server provides an easier

programming paradigm. It is typically more efficient than stateless servers.

Demerits of stateful server: If the server crashes and restarts, the state

information it was holding may be lost and the client may produce

inconsistent results. If the client process crashes and restarts, the server will

have inconsistent information about the client.

Handling failures under a stateless server: When the server crashes and restarts, it does not result in any inconsistencies. Likewise, when the client crashes and restarts, it does not lead to any inconsistencies. Which approach to take while designing servers depends on the application.

Server Creation Semantics

Based on the time duration for which the RPC servers survive, they can be

classified as follows:


Instance-per-call servers: They exist only for the duration of a single call. Such a server is created by the RPCRuntime when the call arrives and is deleted when the call has been executed. This is not a commonly used semantics because:

– These servers are stateless; any state that has to be preserved across calls must be handled by the OS.

– The overhead involved in the creation and destruction of servers is expensive, especially if it is for the same type of service.

Instance-per-session servers: Servers belonging to this category exist for the entire session for which the client and server interact. These servers can maintain state information across calls, and the overhead of creating a server for each call does not exist. Under this approach:

– There is a server manager for each type of service.

– All the server managers register with the binding agent (discussed later).

– Client first contacts the binding agent with the type of service needed

– The binding agent returns to the client the address of the server

manager that provides that type of service

– Client contacts the server manager to create a server for it

– Server manager spawns a server and returns the address of the

server to the client

– Client then interacts with this server for the entire session

– The server is destroyed when the client informs the corresponding server manager that the server is no longer needed.

Persistent Servers: This type of server generally remains in existence

indefinitely. It is shared by many clients. Servers of this type are created and

installed before the clients use them. Each server independently exports its

service by registering itself with the binding agent. When a client contacts


the binding agent for a particular service, the binding agent selects a server

of that type and returns its address to the client. The client then interacts

with the server.

An advantage of this approach is that it can improve performance, since the server interleaves requests from several clients. Care should be taken in designing

the procedures so that interleaved concurrent requests from different clients

do not interfere with each other.

3.9 Parameter Passing, Call Semantics

The choice of parameter passing semantics is crucial to the design of an

RPC mechanism. The two choices are call by value and call by reference.

i) Call-by-Value: All parameters are copied into a message that is

transmitted. This does not pose a problem for simple data types such as

integers, small arrays and so on. Passing large data-types like multi-

dimensional arrays, trees, etc. can consume much time for transmission of

data that may not be used.

ii) Call-by-Reference: This is possible only in a distributed shared memory system. It is also possible in object-based systems, because in this case the client needs to pass only the names of objects, which act as references. In object-based systems it is called call-by-object-reference. A remote

invocation operation may cause another remote invocation, etc. To avoid

many remote references, another parameter-passing mode, called call-by-

move was proposed; in this approach, the object to which a reference is

made is moved to the site of the callee first and the call is then executed there.

As we saw earlier, the following types of failures can occur

The call message gets lost

The response message gets lost


The callee node crashes and is restarted

The caller node crashes and is restarted

Mechanisms for handling such failures are described below:

The RPCRuntime should be designed to provide flexibility to the

application programmers to select from different possible call semantics

supported by an RPC system.

Possibly or May-be Call Semantics: This is the weakest semantics and is not really appropriate for RPC. The caller waits for a predetermined amount of time and then continues with its execution whether or not a reply has arrived. It is suitable in an environment that has a high probability of successful transmission of messages.

Last-One Call Semantics: The calling of remote procedure by the caller,

execution of procedure by callee, and return of the result to the caller will

eventually be repeated until the result of the procedure execution is received

by the caller. i.e., the results of the last executed call are used by the caller.

It is easy to achieve if only two nodes are involved. For example, a process

P1 on N1 calls F1 on N2 and F1 calls F2 on N3; N1 fails and restarts; P1’s

call to F1 will be repeated which in turn will call F2 again; N3 is unaware of

N1’s failure; so N3 may send the result of the two executions in any order,

violating last-one semantics.

The above problem occurs due to orphan calls. An orphan call is one whose parent is dead due to a node crash. To achieve last-one semantics, these orphan calls must be terminated before restarting.

Last-of-Many Call Semantics: Similar to last-one semantics, except

orphan calls are neglected.

– Call identifiers are used to uniquely identify each call.

– When a call is repeated, it is assigned a new identifier

– Each response message has the corresponding call identifier


– A caller accepts a response only if its call identifier matches that of the most recently repeated call (a small sketch of this bookkeeping follows).
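The following is a small sketch of the caller-side bookkeeping implied by the points above; the class, the fake transport, and the message format are assumptions made only for illustration.

import itertools

class LastOfManyCaller:
    def __init__(self):
        self._ids = itertools.count(1)
        self.current_call_id = None

    def send_or_retransmit(self, transport, proc, args):
        self.current_call_id = next(self._ids)        # a fresh identifier on every attempt
        transport.send({"call_id": self.current_call_id, "proc": proc, "args": args})

    def on_reply(self, reply):
        if reply["call_id"] != self.current_call_id:  # reply to an earlier (orphan) call
            return None                               # neglect it
        return reply["result"]

class FakeTransport:                  # stands in for the RPCRuntime in this sketch
    def send(self, message):
        self.last_sent = message

caller, transport = LastOfManyCaller(), FakeTransport()
caller.send_or_retransmit(transport, "add", (1, 2))
caller.send_or_retransmit(transport, "add", (1, 2))       # timeout elapsed: retransmit
print(caller.on_reply({"call_id": 1, "result": 3}))       # None: late reply to call 1
print(caller.on_reply({"call_id": 2, "result": 3}))       # 3: matches the latest call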

At-Least-Once Call Semantics: This is weaker than last-of-many semantics. It just guarantees that the call is executed at least once (possibly more than once), but does not specify which result will be returned to the caller.

Exactly-Once Call Semantics: The strongest and most desirable

semantics. This eliminates the possibility of a procedure being executed

more than once no matter how many times a call is retransmitted.

3.10 Communication Protocol for RPCs

Different systems developed on the basis of remote procedure calls have

different IPC requirements. Based on the needs of different systems,

several communication protocols have been proposed for RPCs. A brief

description of these protocols is given below:

i) The Request Protocol: Also known as the R protocol. It is useful for

RPCs in which the called procedure has nothing to return and the client

does not require confirmation for the procedure having been executed.

An RPC protocol that uses R protocol is also called asynchronous

RPC. For asynchronous RPC, the RPCRuntime does not take

responsibility for retrying a request in case of communication failure.

So, if an unreliable transport protocol such as UDP is used, then

request messages could be lost. Asynchronous RPCs with unreliable

transport protocols are generally useful for implementing periodic

updates. For example, a time server node in a distributed system, may

send synchronization messages every T seconds.

ii) Request/Reply Protocol (RR protocol): The basic idea of this protocol is to eliminate explicit acknowledgement messages.

A server’s reply message is regarded as an acknowledgment of the

client’s request. A subsequent call message is regarded as an


acknowledgement for the server’s reply. The RR protocol does not

possess failure-handling capabilities. A timeout and retry is normally

used along with RR protocol, for taking care of lost messages. If

duplicate messages are not filtered, the RR protocol provides at-least-once semantics. Servers can support exactly-once semantics by keeping records of replies in a reply cache; an open issue is how long the replies need to be kept (a sketch of such a reply cache follows).
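The sketch below shows one way a server could keep such a reply cache so that duplicate requests are answered from the cache instead of being re-executed; the class and message fields are invented for the illustration, and the open question of when cache entries may be discarded is simply ignored (entries are kept forever).

class ReplyCachingServer:
    def __init__(self, procedures):
        self.procedures = procedures
        self.reply_cache = {}                      # (client_id, call_id) -> cached reply

    def handle(self, request):
        key = (request["client_id"], request["call_id"])
        if key in self.reply_cache:                # duplicate request: do not re-execute
            return self.reply_cache[key]
        result = self.procedures[request["proc"]](*request["args"])
        reply = {"call_id": request["call_id"], "result": result}
        self.reply_cache[key] = reply              # remember the reply for retransmissions
        return reply

server = ReplyCachingServer({"add": lambda a, b: a + b})
req = {"client_id": "c1", "call_id": 7, "proc": "add", "args": (2, 3)}
print(server.handle(req), server.handle(req))      # second call is served from the cache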

iii) The Request/Reply/Acknowledge-Reply Protocol (RRA): It is useful

for the design of systems involving simple RPCs. The server needs to

keep a copy of the reply only until it receives the acknowledgement for

reply from client. Exactly-once semantics can be implemented easily

using this protocol. In this protocol, the server's reply message implicitly acknowledges the client's request message, and the client's explicit acknowledgement of the reply tells the server that the stored copy of that reply can be discarded.

3.11 Complicated RPC

Birrell and Nelson categorized the following two types of complicated RPCs

and the methods to handle them.

i) RPCs involving long-duration calls or large gaps between calls. How to

handle such calls?

Periodic probing of the server by client: After a client sends a request, it

periodically sends a probe packet which the server acknowledges. It helps

the client to detect a server crash or a communication link failure. The acknowledgement to a probe message can also contain information about a lost request, in which case the client can retransmit it.


Periodic generation of acknowledgements by the server: The server itself periodically generates acknowledgements and sends them to the client before sending the reply; the longer the server takes to send the reply, the more acknowledgements it generates.

ii) RPCs involving arguments and/or results that are too large to fit in a

single datagram packet.

How to handle such calls?

– Use several physical RPCs for one logical RPC

– Use multi-datagram messages. i.e., RPC argument is fragmented and

transmitted in multiple packets.

– For example, Sun RPC is limited to 8 KB, so RPCs involving arguments larger than the allowed limit must be handled by breaking them into several physical RPCs.

3.12 Client-Server Binding

It is necessary for a client (a client stub) to know the location of the server

before a remote procedure call can take place. The process by which a

client becomes associated with a server so that calls can take place is

known as binding.

The Client-server binding involves handling of several issues:

– How does a client specify a server to which it wants to get bound?

– How does the binding process locate the specified server?

– When is it proper to bind a client to server?

– Is it possible for a client to change a binding during execution?

– Can a client be simultaneously bound to multiple servers that provide

the same service?

Server Naming: Birrell and Nelson’s proposal

The specification by a client of a server with which it wants to communicate

is primarily a naming issue. An interface name has two parts - a type and an


instance. Type specifies the interface itself, and instance specifies a server

providing the services within that interface. For example, there may be an

interface type file server, and there may be many instances of servers

providing file service. The type part also generally has a version-number field to distinguish between old and new versions of the interface (which may provide different sets of services). Interface names are created by users. The RPC

package only dictates the means by which an importer uses the interface

name to locate an exporter.

Server Locating:

The interface name of a server is its unique identifier. When the client

specifies the interface name of a server for making a remote procedure call,

the server must be located before the client’s request message can be sent

to it. This is primarily a locating issue and any locating mechanism can be

used for this purpose. The most common methods used for locating are

described below:

i) Broadcasting: A broadcast message is sent to locate the server. The

first server responding to this message is used by the client. This works well only for small networks.

ii) Binding Agent: A binding agent is basically a name server used to

bind a client to a server by providing information about the desired

server. The binding agent maintains a binding table which is a mapping

of the server’s interface name to its locations. All servers register

themselves with the binding agent as a part of their initialization

process.

To register, the server gives the binder its identification information and a handle used to locate it, for example an IP address. The server can deregister when

it is no longer prepared to offer this service. The binding agent’s location is

known to all nodes. The binding agent interface has three primitives:

register, deregister, and lookup (used by the client); a small sketch of such a binding agent follows.
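The sketch shows the three primitives operating on a toy in-memory binding table; the data structures and the (type, version) form of the interface name are assumptions made only for this illustration.

class BindingAgent:
    def __init__(self):
        self.binding_table = {}            # interface name -> list of server handles

    def register(self, interface, handle):
        # Called by a server exporting an interface; the handle could be an IP
        # address and port at which the server can be reached.
        self.binding_table.setdefault(interface, []).append(handle)

    def deregister(self, interface, handle):
        # Called by a server that is no longer prepared to offer the service.
        self.binding_table[interface].remove(handle)

    def lookup(self, interface):
        # Called by a client; returns the handle of one server of that type, if any.
        servers = self.binding_table.get(interface, [])
        return servers[0] if servers else None

agent = BindingAgent()
agent.register(("file_server", 2), ("10.0.0.5", 6000))   # server exports its interface
print(agent.lookup(("file_server", 2)))                   # client imports the interface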


The time when a client can be bound to a server is called the Binding Time. If the client and server

modules are programmed as if they were linked together, it is known as

Binding at Compile Time.

For example a server’s network address can be compiled into client’s code.

This scheme is very inflexible because if the server moves or the server is

replicated or the interface changes, all client programs need to be

recompiled. However, it is useful in an application whose configuration is

expected to last for a fairly long time.

iii) Binding at Link Time: A server exports its service by registering with

the binding agent as part of the initialization process

A client then makes an import request to the binding agent before

making a call

The binding agent binds the client and server by returning the

server’s handle.

The server’s handle is cached by client to avoid contacting the

binding agent.

iv) Binding at Call Time: A client is bound to a server at the time when it

calls the server for the first time during execution.

v) Indirect Call Method: When a client calls the server for the first time, it

passes the server’s interface name and the arguments of the RPC call

to the binding agent. The binding agent looks up the location of the target server and forwards the RPC message to it. When the target

server returns the results to the binding agent, the binding agent

returns the result along with the handle of the target server to the client.

The client can subsequently call the target server directly.


3.13 Security

Some implementations of RPC include facilities for client and server

authentication as well as for providing encryption-based security for calls.

The encryption techniques provide protection from eavesdropping and

detect attempts at modifications, replay, or creation of calls.

In other implementations of RPC that do not include security facilities, the

arguments and results of RPC are readable by anyone monitoring

communication between the caller and the callee. In this case if security is

desired, the user must implement his or her own authentication and data

encryption mechanisms.

The following security issues need to be addressed when the user designs a

security system for communication:

Is the authentication of the server by the client required?

Is the authentication of client by server required?

Is it alright if the arguments and results are accessible to users other

than the caller and the callee?

3.14 Terminal Questions

1. How does an RPC facility make the job of a distributed application

programmer easier? Mention the similarities and differences between

RPC model and ordinary procedure call.

2. What is a stub? How are stubs generated? Explain how the use of stubs helps in making an RPC mechanism transparent.

3. Describe the following:

Parameter Passing Semantics

Communication protocols for RPCs

Client Server Binding


Unit 4 Distributed Shared Memory

Structure:

4.1 Introduction

Objectives

4.2 Distributed Shared Memory Systems (DSM)

4.3 DSM – Design and Implementation Issues

4.4 Granularity – Block Size

4.5 Structure of Shared Memory Space in a DSM System

4.6 Memory Coherence (Consistency) Models

4.7 Memory Consistency models

4.8 Implementing Sequential Consistency

4.9 Centralized – Server Algorithm

4.10 Fixed Distributed – Server Algorithm

4.11 Dynamic Distributed Server Algorithm

4.12 Implementing under RNMBs Strategy

4.13 Thrashing

4.14 Terminal Questions

4.1 Introduction

Practice shows that programming multi-computers is much harder than

programming multiprocessors. The difference is caused by the fact that

expressing communication in terms of processes accessing shared data

and using simple synchronization primitives like semaphores and monitors

is much easier than having only message-passing facilities available. Issues

like buffering, blocking, and reliable communication only make things worse.

For this reason, there has been considerable research in emulating shared

memory on multi-computers. The goal is to provide a virtual shared memory

machine, running on a multicomputer, for which applications can be written


using the shared memory model even though this is not present. The

multicomputer operating system plays a crucial role here.

One approach is to use the virtual memory capabilities of each individual

node to support a large virtual address space. This leads to what is called a

page based distributed shared memory (DSM). The principle of page-

based distributed shared memory is as follows. In a DSM system, the

address space is divided up into pages (typically 4 KB or 8 KB), with the

pages being spread over all the processors in the system. When a

processor references an address that is not present locally, a trap occurs,

and the operating system fetches the page containing the address and

restarts the faulting instruction, which now completes successfully. This

concept is illustrated in Fig. 4.1(a) for an address space with 16 pages and

four processors. It is essentially normal paging, except that remote RAM is

being used as the backing store instead of the local disk.

Figure 4.1 (a) Pages of address space distributed among four machines. (b) Situation after CPU 1 references page 10.

(c) Situation if page 10 is read only and replication is used.

In this example, if processor 1 references instructions or data in pages 0, 2,

5, or 9, the references are done locally. References to other pages cause


traps. For example, a reference to an address in page 10 will cause a trap to

the operating system, which then moves page 10 from machine 2 to

machine 1, as shown in Fig. 4.1(b).

One improvement to the basic system that can frequently improve

performance considerably is to replicate pages that are read only, for

example, pages that contain program text, read-only constants, or other

read-only data structures. For example, if page 10 in Fig. 4.1 is a section of

program text, its use by processor 1 can result in a copy being sent to

processor 1, without the original in processor 2’s memory being disturbed,

as shown in Fig. 4.1(c). In this way, processors 1 and 2 can both reference

page 10 as often as needed without causing traps to fetch missing memory.

Another possibility is to replicate not only read-only pages, but all pages. As

long as reads are being done, there is effectively no difference between

replicating a read-only page and replicating a read-write page. However, if a

replicated page is suddenly modified, special action has to be taken to

prevent having multiple, inconsistent copies in existence. Typically all copies

but one are invalidated before allowing the write to proceed.

Further performance improvements can be made if we let go of strict

consistency between replicated pages. In other words, we allow a copy to

be temporarily different from the others. Practice has shown that this

approach may indeed help, but unfortunately, can also make life much

harder for the programmer as he has to be aware of such inconsistencies.

Considering that ease of programming was an important reason for

developing DSM systems in the first place, weakening consistency may not

be a real alternative.

Another issue in designing efficient DSM systems is deciding how large

pages should be. Here, we are faced with similar trade-offs as in deciding

on the size of pages in uni-processor virtual memory systems. For example,


the cost of transferring a page across a network is primarily determined by

the cost of setting up the transfer and not by the amount of data that is

transferred. Consequently, having large pages may possibly reduce the total

number of transfers when large portions of contiguous data need to be

accessed. On the other hand, if a page contains data of two independent

processes on different processors, the operating system may need to

repeatedly transfer the page between those two processors, as shown in

Fig. 4.2. Having data belonging to two independent processes in the same

page is called false sharing.

After almost 15 years of research on distributed shared memory, DSM

researchers are still struggling to combine efficiency and programmability.

To attain high performance on large-scale multi-computers, programmers

resort to message passing despite its higher complexity compared to

programming (virtual) shared memory systems. It seems therefore justified

to conclude that DSM for high-performance parallel programming cannot

fulfill its initial expectations.

Figure 4.2: False sharing of a page between two independent processes

Objectives:

This unit discusses the memory aspects of a Distributed System, wherein

sharing of the memory is done between the nodes of the system. It provides

an architectural specification of the DSM Memory Structure, and also


discusses the Design and Implementation issues. It describes the Memory

Coherence (Consistency) models. It also describes various Server based

algorithms.

4.2 Distributed Shared Memory Systems (DSM)

This is also called DSVM (Distributed Shared Virtual Memory). It is a

loosely coupled distributed-memory system that has implemented a

software layer on top of the message passing system to provide a shared

memory abstraction for the programmers. The software layer can be

implemented in the OS kernel or in runtime library routines with proper

kernel support. It is an abstraction that integrates local memory of different

machines in a network environment into a single logical entity shared by

cooperating processes executing on multiple sites. Shared memory exists

only virtually.

DSM Systems: A comparison between message passing and tightly

coupled multiprocessor systems

DSM provides a simpler abstraction than the message passing model. It

relieves the programmer of the burden of explicitly using communication primitives in programs.

In message passing systems, passing complex data structures between two

different processes is difficult. Moreover, passing data structures containing

pointers is generally expensive in message passing model.

Distributed Shared Memory takes advantage of the locality of reference

exhibited by programs and improves efficiency.

Distributed Shared Memory systems are cheaper to build than tightly

coupled multiprocessor systems.

The large physical memory available facilitates running programs requiring

large memory efficiently.


DSM can scale well when compared to tightly coupled multiprocessor

systems.

A message passing system allows processes to communicate with each other while being protected from one another by having private address spaces, whereas in DSM one process can cause another to fail by erroneously altering shared data.

When message passing is used between heterogeneous computers, marshaling of data takes care of differences in data representation; it is not obvious, however, how memory can be shared between computers with different integer representations.

DSM can be made persistent, i.e. processes communicating via DSM may execute with non-overlapping lifetimes: a process can leave information in an agreed location for another process to read later. Processes communicating via message passing, in contrast, must execute at the same time.

Which is better? Message passing or Distributed Shared Memory?

Distributed Shared Memory appears to be a promising tool if it can be

implemented efficiently.

Distributed Shared Memory Architecture


As shown in the above figure, the DSM provides a virtual address space

shared among processes on loosely coupled processors. DSM is basically

an abstraction that integrates the local memory of different machines in a network environment into a single logical entity shared by cooperating

processes executing on multiple sites. The shared memory itself exists only

virtually. The application programs can use it in the same way as traditional

virtual memory, except that processes using it can run on different machines

in parallel.

Architectural Components:

Each node in a distributed system consists of one or more CPUs and a

memory unit. The nodes are connected by a communication network. A

simple message-passing system allows processes on different nodes to

exchange messages with each other. DSM abstraction presents a single

large shared memory space to the processors of all nodes. Shared memory

of DSM exists only virtually. A memory-mapping manager running at each node

maps the local memory onto the shared virtual memory. To facilitate this

mapping, shared-memory space is partitioned into blocks. Data caching is

used to reduce network latency. When a memory block accessed by a

process is not resident in local memory:

a block fault is generated and control goes to the OS.

the OS gets this block from the remote node and maps it to the

application’s address space and the faulting instruction is restarted.

Thus data keeps migrating from one node to another node but no

communication is visible to the user processes.

Network traffic is highly reduced if applications show a high degree of

locality of data accesses.


Variations of this general approach are used for different implementations

depending on whether the DSM allows replication and/or migration of

shared memory.

4.3 DSM – Design and Implementation Issues

The important issues involved in the design and implementation of DSM

systems are as follows:

Granularity: It refers to the block size of the DSM system, i.e. to the units of

sharing and the unit of data transfer across the network when a network

block fault occurs. Possible units are a few words, a page, or a few pages.

Structure of Shared Memory Space: The structure refers to the layout of

the shared data in memory. It is dependent on the type of applications that

the DSM system is intended to support.

Memory coherence and access synchronization: Coherence

(consistency) refers to the memory coherence problem that deals with the

consistency of shared data that lies in the main memory of two or more

nodes. Synchronization refers to synchronization of concurrent access to

shared data using synchronization primitives such as semaphores.

Data Location and Access: A DSM system must implement mechanisms

to locate data blocks in order to service the network data block faults to

meet the requirements of the memory coherence semantics being used.

Block Replacement Policy: If the local memory of a node is full, a cache

miss at that node implies not only a fetch of the accessed data block from a

remote node but also a replacement. i.e. a data block of the local memory

must be replaced by the new data block. Therefore a block replacement

policy is also necessary in the design of a DSM system.


Thrashing: In a DSM system, data blocks migrate between nodes on

demand. If two nodes compete for write access to a single data item, the

corresponding data block may be transferred back and forth at such a high

rate that no real work can get done. A DSM system must use a policy to

avoid this situation (known as Thrashing).

Heterogeneity: DSM systems built for homogeneous systems need not address the heterogeneity issue. However, if the underlying system

environment is heterogeneous, the DSM system must be designed to take

care of heterogeneity so that it functions properly with machines having

different architectures.

4.4 Granularity – Block Size

Choosing an appropriate block size should take the following into

consideration:

Paging overhead: A large block size would minimize the paging overhead since it takes advantage of locality of reference.

Directory size: The larger the block size, the smaller the directory size. A smaller directory size reduces the directory-management overhead.

Thrashing: Data items in the same data block may be updated by multiple nodes at the same time, causing a large number of block transfers. Thrashing is more likely with large blocks.

False Sharing: Two different processes access two unrelated variables

that reside in the same data block. This can lead to thrashing.

Why not use the page size of the virtual memory system as the block size? Some advantages of such an approach are:

• It allows the use of existing page-fault schemes to trigger DSM page-

fault.


• If page size can fit into a packet, page size does not impose undue

communication overhead.

4.5 Structure of Shared Memory Space in a DSM System

Three commonly used approaches for structuring:

1. No structuring: The shared memory space is simply a linear array of words. The DSM system IVY uses this approach.

2. Structuring by data type: The shared memory space is structured as a collection of objects or as a collection of variables in a source language. Since the sizes of objects and variables vary, one has to use a variable grain size, which complicates the design and implementation.

3. Structuring as a database: Structure the shared memory as a

database.

• Shared memory space is ordered as an associative memory, called

a tuple space, which is a collection of tuples with data items in their

fields.

4.6 Memory Coherence (Consistency) Models

What is a memory Consistency Model?

• A set of rules that the applications must obey if they want the DSM

system to provide the degree of consistency guaranteed by the

consistency model.

• The weaker the consistency model, the better the concurrency.

• Researchers try to invent new consistency models which are weaker

than the existing ones in such a way that a set of applications will

function correctly under the new consistency model.


• Note that an application written for a DSM that implements a stronger

consistency model may not work correctly under a DSM that implements

a weaker consistency model.

4.7 Memory Consistency models

i) Strict consistency: Each read operation returns the most recently

written value. This is possible to implement only in systems with a notion of global time, which distributed systems lack. So, this model is practically impossible to implement. Hence,

DSM systems based on underlying distributed systems have to use

weaker consistency models.

ii) Sequential consistency: Proposed by Lamport (1979). All

processes in the system observe the same order of all memory access

operations on the shared memory, i.e., if three operations read(r1), write(w1), and read(r2) are performed on a memory address in that order, then any one of the six possible orderings (r1, w1, r2), (r2, w1, r1), (w1, r2, r1), ... is acceptable, provided all processes see the same ordering. It

can be implemented by serializing all requests on a central server

node. This model is weaker than the strict consistency model. This

model provides one-copy/single-copy semantics because all processes

sharing a memory location always see exactly the same contents

stored in it. Sequential consistency is the most intuitively expected

semantics for memory coherence. So, sequential consistency is

acceptable for most applications.
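As a toy illustration of the central-server idea mentioned above, the sketch below funnels every read and write request through a single sequencer thread, so that all memory operations are applied, and therefore observed, in one agreed order. The class and its queue-based interface are assumptions of the sketch, not a real DSM implementation.

import queue
import threading

class SequencedMemory:
    def __init__(self):
        self.store = {}
        self.requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:                              # one operation at a time, in arrival order
            op, addr, value, reply = self.requests.get()
            if op == "write":
                self.store[addr] = value
            reply.put(self.store.get(addr))      # every request gets the serialized result

    def access(self, op, addr, value=None):
        reply = queue.Queue(maxsize=1)
        self.requests.put((op, addr, value, reply))
        return reply.get()                       # caller blocks until its turn in the order

mem = SequencedMemory()
mem.access("write", "x", 100)
print(mem.access("read", "x"))    # -> 100, and every process would observe the same order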

iii) Causal consistency model: Proposed by Hutto and Ahamad (1990).

In this model, all write operations that are potentially causally related

are seen by all processes in the same (correct) order. For example, if a

process did a read operation and then performed a write operation,

then the value written may have depended in some way on the value

read. A write operation performed by one process P1 is not causally


related to the write operation performed by another process P2 if P1

has read neither the value written by P2 nor any memory variable that was directly or indirectly derived from the value written by P2, and vice

versa. For implementing DSMs that support causal consistency one

has to keep track of which memory operation is dependent on which

other operation.

This model is weaker than the sequential consistency model.

iv) Pipelined Random - Access Memory (PRAM) consistency model

This model was proposed by Lipton and Sandberg (1988). In this

model, all write operations performed by a single process are seen by

all other processes in the order in which they were performed. This

model can be implemented easily by sequencing the write operations

performed by each node independently.

This model is weaker than all the above consistency models.

v) Processor Consistency Model: Proposed by Goodman (1989).

In addition to PRAM consistency, for any memory location, all

processes agree on the same order of all write operations to that

location.

vi) Weak Consistency Model: Proposed by Dubois et al. (1988).

This model distinguishes between ordinary accesses and

synchronization accesses. It requires that memory become consistent

only on synchronization accesses. A DSM that supports weak

consistency model uses a special variable, called synchronization

variable. The operations on it are used to synchronize memory. For

supporting weak consistency, the following should be satisfied:

All accesses to synchronization variables must obey sequential

consistency semantics.


All previous write operations must be completed everywhere before

an access to synchronization variable is allowed.

All previous accesses to synchronization variables must be completed before access to a non-synchronization variable is allowed.

vii) Release Consistency Model: In the weak consistency model, the

entire shared memory is synchronized when a synchronization variable

is accessed by a process i.e.

• All changes made to the memory are propagated to other nodes.

• All changes made to the memory by other processes are propagated

from other nodes to the process’s node.

This is not really necessary because the first operation needs to be

performed only when a process exits from critical section and the

second operation needs to be performed only when the process enters

critical section. So, instead of one synchronization variable, two

synchronization variables, called acquire and release have been

proposed.

– Acquire is used by a process to tell the system that it is about to

enter a critical section.

– Release is used by a process to tell the system that it has just exited a critical section.

If processes use the appropriate synchronization accesses properly, a release-consistent DSM system will produce the same results for an application as it would produce if the application were executed on a sequentially consistent DSM system.

viii) Lazy Release consistency model: It is a variation of release

consistency model. In this approach, when a process does a release

access, the contents of all the modifications are not immediately sent to

other nodes but they are sent only on demand. i.e. When a process


does an acquire access, all modifications of other nodes are acquired

by the process’s node. It minimizes network traffic.

4.8 Implementing Sequential Consistency

Sequential consistency supports the intuitively expected semantics. So, this

is the most preferred choice for designers of DSM systems. The replication

and migration strategies for DSM design include:

i) Non-replicated, non-migrating blocks (NRNMBs)

ii) Non-replicated, migrating blocks (NRMBs)

iii) Replicated, migrating blocks (RMBs)

iv) Replicated, non-migrating blocks (RNMBs)

i) Implementing under NRNMBs strategy:

Under this strategy, only one copy of each block of the shared memory is in

the system and its location is fixed. All requests for a block are sent to the

owner node of the block. Upon receiving a request from a client node, the

memory management unit (MMU) and the operating system of the owner

node perform the access request and return the result. Sequential

consistency can be trivially enforced, because the owner node only needs to process all requests on a block in the order in which it receives them.

Disadvantages: The serialization of data access creates a bottleneck.

Parallelism is not possible in this strategy.

Locating data in the NRNMB strategy: A mapping between blocks and nodes needs to be maintained at each node.

ii) Implementing under NRMBs strategy

Under this strategy, only the processes executing on one node can read or

write a given data item at any time, so sequential consistency is ensured.

The advantages of this strategy include:

– No communication cost for local data access.


– Allows applications to take advantage of data access locality

The disadvantages of this strategy include:

– Prone to thrashing

– Parallelism cannot be achieved in this method also

Locating a block in the NRMB strategy:

1. Broadcasting: Under this approach:

Each node maintains an owned-blocks table.

– When a block fault occurs, the fault handler broadcasts a request on the

network.

– The node that currently owns the block responds by transferring the

block.

– This approach does not scale well.

2. Centralized Server Algorithm: A central server maintains a block table

that contains the location information for all blocks in the shared memory

space

– When a block fault occurs, the fault handler sends a request to the

central server.

– The central server forwards the request to the node holding the block

and updates its block table.

– Upon receiving the request, the owner transfers the block to the

requesting node.

– Drawbacks:

Central node is a bottleneck.

If the central node fails, the DSM stops functioning.

3. Fixed Distributed – Server Algorithm: Under this scheme:

• Several nodes have block managers, each block manager manages a

predetermined set of blocks

• Each node maintains a mapping from data blocks to block managers


• When a block fault occurs, the fault handler sends a request to the

corresponding block manager

• The block manager forwards the request to the corresponding node

and updates its table to reflect the new owner (the node requesting

the block)

• Upon receiving the request, the owner transfers the block to the

requesting node.

4. Dynamic Distributed Server Algorithm: Under this approach there is

no block manager. Each node maintains information about the probable

owner of each block. When a block fault occurs, the fault handler sends

a request to the probable owner of the block. Upon receiving the

request, if the receiving node is the owner of the block, it updates its

block table and transfers the block to the requesting node; otherwise, it

forwards the request to the probable owner of the block as indicated by

its block table.

Implementing under RMBs strategy

A major disadvantage of non replication strategies is lack of parallelism

because only the processes on one node can access data contained in any

given block at any given time. To increase parallelism, virtually all DSM

systems replicate blocks. With replicated blocks, read operations can be

carried out in parallel at multiple nodes by accessing the local copy of the

data. Therefore the average cost of read operations is reduced because no

communication overhead is involved if a replica of the data exists at the

local node. However, replication tends to increase the cost of write

operations because for a write to a block all its replicas must be invalidated

or updated to maintain consistency.

The two basic protocols that may be used for ensuring sequential

consistency in this case are as follows:


1. Write – Invalidate: In this scheme, all copies of a piece of data except

one are invalidated before a write can be performed on it. Therefore, when a

write fault occurs at a node, its fault handler copies the accessed block from

one of the block’s current nodes to its own node, invalidates all other copies

of the block by sending an invalidate message containing the block address

to the nodes having a copy of the block, changes the access of the local

copy of the block to write, and returns to the faulting instruction.

After returning, the node “owns” that block and can proceed with the write

operation and other read/write operations until the block ownership is

relinquished to some other node.

Protocols for implementing Sequential Consistency

i) Write-Invalidate Protocol: All copies of a data block except one are

invalidated before a write can be performed on it. If one of the nodes that

had a copy of the block before invalidation tries to perform a memory access

operation on the block after invalidation, a block fault will occur and the fault

handler will fetch the block again from a node having a valid copy, thus

achieving sequential consistency.
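A small sketch of the write-fault handler under the write-invalidate protocol is given below; message passing between nodes is reduced to direct method calls on hypothetical Node objects purely for illustration.

class DSMNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block id -> {"data": ..., "access": "read" or "write"}

    def invalidate(self, block):
        self.blocks.pop(block, None)

    def write_fault(self, block, holder_nodes):
        valid_copy = next(n for n in holder_nodes if block in n.blocks)
        data = valid_copy.blocks[block]["data"]          # copy the accessed block here
        for n in holder_nodes:
            if n is not self:
                n.invalidate(block)                      # invalidate all other copies
        self.blocks[block] = {"data": data, "access": "write"}
        # ...the faulting instruction is now restarted and the write proceeds locally

n1, n2 = DSMNode("N1"), DSMNode("N2")
n2.blocks[7] = {"data": bytearray(b"old"), "access": "read"}
n1.write_fault(7, [n1, n2])
print(7 in n2.blocks, n1.blocks[7]["access"])   # -> False write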

ii) Write-Update Protocol: Under this scheme, a write operation is carried

out by updating all copies of the data on which the write is performed. When

a write fault occurs at a node, the fault handler copies the accessed block

from a node having a valid copy, updates all copies and the local copy and

then returns to the faulting instruction. In this method, sequential

consistency can be achieved by using a mechanism to totally order the write

operations of all the nodes. One way to accomplish this is through a global

sequencer. The set of reads that take place between any two writes is well

defined and their order is immaterial to sequential consistency.


Demerit: This protocol is very expensive for use with loosely coupled systems because every write operation requires network access.

Locating a block in the RMB strategy:

iii) Broadcasting: Under this approach, each node maintains an owned-blocks table. Each entry in the table has a copy-set field containing the list of nodes that have a valid copy of the corresponding block. When a read fault for a block occurs at node N, the fault handler at node N broadcasts a read request for the block. Upon receiving the request, the node that currently owns the block adds N to the copy-set field and transfers the block to node N.

When a write fault for a block occurs at node N, the fault handler at node N broadcasts a write request for the block. The node that currently owns the block relinquishes its ownership to node N and transfers the block to node N along with its copy-set. Node N, upon receiving the block, sends an invalidation message to all nodes in the copy-set, adds an entry for the block in its local owned-blocks table to reflect that N is the new owner, and initializes the copy-set to {N}. This approach does not scale well.

4.9 Centralized-Server Algorithm

A central server maintains a block table containing owner-node and copy-

set information for each block. When a read/write fault for a block occurs at

node N, the fault handler at node N sends a read/write request to the central

server.

Upon receiving the request, the central-server does the following:

If it is a read request:

• adds N to the copy-set field and

• sends the owner node information to node N


• upon receiving this information, N sends a request for the block to

the owner node.

• upon receiving this request, the owner returns a copy of the block

to N.

If it is a write request:

It sends the copy-set and owner information of the block to node N

and initializes copy-set to {N}

Node N sends a request for the block to the owner node and an invalidation message to all nodes in the copy-set.

Upon receiving this request, the owner sends the block to node N
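The bookkeeping performed by the central server can be sketched as follows; block transfers and invalidation messages are represented simply by returned values, and all names are illustrative only.

class CentralServer:
    def __init__(self, initial_owner, blocks):
        # block id -> {"owner": owning node, "copy_set": nodes holding a read copy}
        self.block_table = {b: {"owner": initial_owner, "copy_set": set()} for b in blocks}

    def read_fault(self, block, node_n):
        entry = self.block_table[block]
        entry["copy_set"].add(node_n)                 # N will hold a read copy
        return entry["owner"]                         # N then asks the owner for the block

    def write_fault(self, block, node_n):
        entry = self.block_table[block]
        old_owner, old_copy_set = entry["owner"], set(entry["copy_set"])
        entry["owner"], entry["copy_set"] = node_n, {node_n}   # N becomes the sole holder
        # N uses this information to fetch the block from the old owner and to send
        # invalidation messages to every node in the old copy-set.
        return old_owner, old_copy_set

server = CentralServer(initial_owner="node0", blocks=[0, 1, 2])
print(server.read_fault(0, "node1"))     # -> node0
print(server.write_fault(0, "node2"))    # -> ('node0', {'node1'})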

4.10 Fixed Distributed-Server Algorithm

Under this scheme

Several nodes have block managers, each block manager manages a

predetermined set of blocks

When a read/write fault occurs, a request for the block is sent to the corresponding block manager.

• Upon receiving this request, the actions taken by the block manager are similar to those of the central server in the centralized-server approach.

4.11 Dynamic Distributed Server Algorithm

Under this approach, there is no block manager. Each node maintains

information about the probable owner of each block, and also the copy-set information for each block for which it is an owner. When a block fault occurs,

the fault handler sends a request to the probable owner of the block.

Upon receiving the request

if the receiving node is not the owner, it forwards the request to the

probable owner of the block according to its table.


if the receiving node is the owner, then

If the request is a read request, it adds the entry N to the copy-set

field of the entry corresponding to the block and sends a copy of the

block to node N.

If the request is a write request, it sends the block and copy-set

information to the node N and deletes the entry corresponding to the

block from its block table.

Node N, upon receiving the block, sends invalidation request to all

nodes in the copy-set, and updates its block table to reflect the fact

that it is the owner of the block
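The probable-owner chain can be sketched as follows. This is a local simulation; the names DsmNode, prob_owner, and request are hypothetical, and real message passing is collapsed into direct calls.

class DsmNode:
    def __init__(self, node_id, nodes):
        self.id = node_id
        self.nodes = nodes
        self.prob_owner = {}          # block_id -> probable owner's node id
        self.copy_set = {}            # block_id -> copy-set (only for blocks this node owns)

    def request(self, block_id, kind, requester=None):
        requester = self.id if requester is None else requester
        if block_id not in self.copy_set:
            nxt = self.prob_owner[block_id]       # forward along the probable-owner chain
            return self.nodes[nxt].request(block_id, kind, requester)
        if kind == "read":
            self.copy_set[block_id].add(requester)
            return ("copy", self.id)
        copies = self.copy_set.pop(block_id)      # write: hand over block and copy-set
        self.prob_owner[block_id] = requester     # remember who the owner is now
        return ("ownership", copies)

# Three nodes; node 0 owns block "b"; nodes 1 and 2 only hold probable-owner hints.
nodes = {}
for i in range(3):
    nodes[i] = DsmNode(i, nodes)
nodes[0].copy_set["b"] = {0}
nodes[1].prob_owner["b"] = 0
nodes[2].prob_owner["b"] = 1          # stale hint: the request travels 2 -> 1 -> 0
print(nodes[2].request("b", "write")) # -> ('ownership', {0})
# Node 2 would now record itself as owner and invalidate the returned copy-set.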

4.12 Implementing under the RNMB Strategy

Under this strategy:

• Blocks are replicated, and blocks do not migrate to other nodes.

• Replicas can be kept consistent by using a write-update protocol.

• Sequential consistency can be achieved by using a global sequencer.

• For locating data, each node should have a block table containing
information about the location of the blocks.
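A small sketch of how a global sequencer can totally order write-updates across non-migrating replicas is given below. The class names and the in-process "multicast" loop are simplifications assumed for illustration.

import itertools

class GlobalSequencer:
    def __init__(self):
        self._next = itertools.count(1)

    def next_seq(self):
        return next(self._next)          # totally orders all write operations

class Replica:
    def __init__(self):
        self.memory = {}
        self.last_applied = 0
        self.pending = {}                # seq -> (address, value)

    def deliver(self, seq, address, value):
        # Buffer updates and apply them strictly in sequence-number order so
        # that every replica sees the same (sequentially consistent) history.
        self.pending[seq] = (address, value)
        while self.last_applied + 1 in self.pending:
            self.last_applied += 1
            addr, val = self.pending.pop(self.last_applied)
            self.memory[addr] = val

sequencer = GlobalSequencer()
replicas = [Replica(), Replica()]
for addr, val in [("x", 100), ("y", 7)]:
    seq = sequencer.next_seq()
    for r in replicas:                   # write-update: the new value goes to every replica
        r.deliver(seq, addr, val)
print(replicas[0].memory == replicas[1].memory)   # True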

Block Replacement Policy

The following are different approaches that may be used for block

replacement:

1. Usage-Based Replacement Policy: Least recently used (LRU), Most

recently used (MRU).

2. Non-Usage-Based Replacement: These policies do not take usage into
consideration; examples are First In First Out (FIFO) and Random.

3. Fixed-Space versus Variable Space Approach: Fixed space

algorithms assume cache size is fixed; under variable space

replacement, cache size can change.


Which approach is suitable for DSM systems? Variable space algorithms

are not suitable for a DSM system because the portion of each node’s
memory that acts as the cache is of fixed size.

IVY uses a priority-based scheme for block replacement.

The following are the two most commonly used approaches of placing a

block to be replaced:

1. Using Secondary Store: The block is transferred to a local disk.

2. Using Memory Space at Other Nodes: The block is transferred to

another node which has free memory space. The node needs to know which

nodes have free memory space.

4.13 Thrashing

Thrashing is said to occur when the system spends a large amount of time

transferring shared data blocks from one node to another, compared to the

time spent doing the useful work of executing application processes. It is a

serious performance problem with DSM systems that allow data blocks to

migrate from one node to another. Thrashing may occur due to the following

reasons:

Interleaved data access by two or more processes on different nodes

that causes a data block to move back and forth from one node to

another in quick succession. (Ping-Pong Effect).

Blocks with read-only permissions are repeatedly invalidated soon after

they are replicated.

Such situations indicate poor (node) locality in references. If not properly

handled, thrashing degrades system performance considerably.


The following are some of the proposed solutions for handling thrashing:

1. Application-Controlled Locks: Applications are allowed to lock data
for a short period of time. An application-controlled lock can be
associated with each data block to implement this method.

2. Nailing a block to a node for a minimum amount of time: In this
method, a block is not allowed to be taken away from a node until a
minimum amount of time, say t, elapses after its allocation to that node.
How is t determined?

The time t can be fixed statically or dynamically on the basis of access

patterns.

3. Tailor coherence algorithm to the shared-data usage patterns: Use

different coherence protocols for shared data with different

characteristics.

4.14 Terminal Questions

1. Explain the general architecture of a DSM system

2. Discuss the design and implementation issues of a DSM system

3. Discuss the following:

Consistency Models

Thrashing


Unit 5 Synchronization

Structure:

5.1 Introduction

Objectives

5.2 Clock Synchronization

5.3 Clock Synchronization Algorithms

5.4 Distributed Algorithms

5.5 Event Ordering

5.6 Mutual Exclusion

5.7 Deadlock

5.8 Election Algorithms

5.9 Terminal Questions

5.1 Introduction

A Distributed System is a collection of distinct processes which are spatially

separated and run concurrently. In systems with multiple concurrent

processes, it is economical to share the system resources among the

concurrently executing processes. The sharing of resources may be

cooperative or competitive. Since the number of available resources in a

computing system is restricted, one process must necessarily influence the

action of other concurrently running processes as it competes for resources.

Sometimes, concurrent processes must cooperate either to achieve the

desired performance of the computing system or due to the nature of the

computation being performed. For example, a client process and a server

process must cooperate when performing file access operations. Both

cooperative and competitive sharing require adherence to certain rules of

behavior that guarantee that correct interaction occurs. The rules for

enforcing correct interactions are implemented in the form of


synchronization mechanisms. This unit focuses on synchronization

mechanisms that are suitable for distributed systems.

Objectives:

This unit introduces the synchronization of the disparate systems present
on a distributed network during message transfers. It describes the various
ways of synchronizing the clocks of the sender and receiver machines,
along with algorithms showing how such synchronization is implemented.
It also covers event ordering when multiple messages are sent from
multiple senders to multiple receivers, discusses the deadlocks that can
occur when resources are shared among distributed systems, and
describes the election algorithms used to elect a coordinating process or
node for message sending and receiving.

5.2 Clock Synchronization

Time is an important concept when dealing with synchronisation and

coordination. In particular, it is often important to know when events occurred

and in what order they occurred. In a non-distributed system dealing with

time is trivial as there is a single shared clock. All processes see the same

time. In a distributed system, on the other hand, each computer has its own

clock. Because no clock is perfect, each of these clocks has its own skew,

which causes clocks on different computers to drift and eventually become

out of sync.

There are several notions of time that are relevant in a distributed system.

First of all, internally a computer clock simply keeps track of ticks that can

be translated into physical time (hours, minutes, seconds, etc.). This

physical time can be global or local. Global time is a universal time that is

the same for everyone and is generally based on some form of absolute

time. Currently, Coordinated Universal Time (UTC), which is based on

oscillations of the Cesium-133 atom, is the most accurate global time.


Besides global time, processes can also consider local time. In this case the

time is only relevant to the processes taking part in the distributed system

(or algorithm). This time may be based on physical or logical clocks.

Physical Clocks

Physical clocks keep track of physical time. In distributed systems that rely

on actual time it is necessary to keep individual computer clocks

synchronized. The clocks can be synchronized to global time (external

synchronization), or to each other (internal synchronization). Cristian’s

algorithm and the Network Time Protocol (NTP) are examples of algorithms

developed to synchronize clocks to an external global time source (usually

UTC). The Berkeley Algorithm is an example of an algorithm that allows

clocks to be synchronized internally.

Cristian’s algorithm requires clients to periodically synchronize with a central

time server (typically a server with a UTC receiver). One of the problems

encountered when synchronizing clocks in a distributed system is that

unpredictable communication latencies can affect the synchronization. For

example, when a client requests the current time from the time server, by

the time the server’s reply reaches the client the time will have changed.

The client must, therefore, determine what the communication latency was

and adjust the server’s response accordingly. Cristian’s algorithm deals with

this problem by attempting to calculate the communication delay based on

the time elapsed between sending a request and receiving a reply.
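A minimal sketch of this adjustment is given below; get_server_time is a hypothetical stand-in for the request/reply exchange with the time server.

import time

def cristian_adjust(get_server_time, local_clock=time.time):
    t0 = local_clock()                 # local time when the request is sent
    server_time = get_server_time()    # the "time = T" reply from the time server
    t1 = local_clock()                 # local time when the reply arrives
    one_way_delay = (t1 - t0) / 2.0    # best estimate of the propagation delay
    return server_time + one_way_delay # value the local clock is adjusted to

# Example with a fake server whose clock is 5 seconds ahead of the local clock.
estimate = cristian_adjust(lambda: time.time() + 5.0)
print(round(estimate - time.time()))   # roughly 5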

The Network Time Protocol is similar to Cristian’s algorithm in that

synchronization is also performed using time servers and an attempt is

made to correct for communication latencies.

Unlike Cristian’s algorithm, however, NTP is not centralised and is designed

to work on a wide area scale. As such, the calculation of delay is somewhat

more complicated. Furthermore, NTP provides a hierarchy of time servers,


with only the top layer containing UTC clocks. The NTP algorithm allows

client-server and peer-to-peer (mostly between time servers)

synchronization. It also allows clients and servers to determine the most

reliable servers to synchronize with. NTP typically provides accuracies

between 1 and 50 msec depending on whether communication is over a

LAN or WAN.

Unlike the previous two algorithms, the Berkeley algorithm does not

synchronize to a global time. Instead, in this algorithm, a time server polls

the clients to determine the average of everyone’s time. The server then

instructs all clients to set their clocks to this new average time. Note that in

all the above algorithms a clock should never be set backward. If time needs

to be adjusted backward, clocks are simply slowed down until time

’catches up’.

Logical Clocks

For many applications, the relative ordering of events is more important than

actual physical time. In a single process the ordering of events (e.g., state

changes) is trivial. In a distributed system, however, besides local ordering

of events, all processes must also agree on ordering of causally related

events (e.g., sending and receiving of a single message). Given a system

consisting of N processes pi, i ∈ {1, . . . , N}, we define the local event
ordering →i as a binary relation, such that, if pi observes e before e′, we
have e →i e′. Based on this local ordering, we define a global ordering as a

happened before relation →, as proposed by Lamport [Lam78]: The relation

→ is the smallest relation, such that

1. e →i e′ implies e → e′,

2. for every message m, send(m) → receive(m), and

3. e → e′ and e′ → e′′ implies e → e′′ (transitivity).


The relation → is almost a partial order (it lacks reflexivity). If a → b, then we

say a causally affects b. We consider events that are unordered by → to be
concurrent; i.e., a ↛ b and b ↛ a implies a ∥ b.

As an example, consider Figure 5.1. We have the following causal relations:

E11 → E12, E13, E14, E23, E24, . . .

E21 → E22, E23, E24, E13, E14, . . .

Figure 5.1: Example of Event Ordering

Moreover, the following events are concurrent: E11 ∥ E21, E12 ∥ E22,
E13 ∥ E23, E11 ∥ E22, E13 ∥ E24, E14 ∥ E23, and so on.

How are Computer Clocks Implemented?

A computer clock usually consists of three components – a quartz crystal

that oscillates at a well-defined frequency, a counter register, and a constant

register. The constant register is used to store a constant value that is

decided based on the frequency of oscillation of the quartz crystal. The

counter register is used to keep track of the oscillations of the quartz crystal.

i.e. the value in the counter register is decremented by 1 for each oscillation

of the quartz crystal. When the value of the counter register becomes zero,

an interrupt is generated and its value is reinitialized to the value in the

constant register. Each interrupt is called a clock tick.


Clock Synchronization Issues

No two clocks can be perfectly synchronized. Two clocks are said to be
synchronized at a particular instant of time if the difference in the time values
of the two clocks is less than some specified constant δ. The difference in
time values of two clocks is called clock skew. Therefore, a set of clocks is
said to be synchronized if the clock skew of any two clocks in this set is
less than δ.

Clock synchronization requires each node to read other nodes’ clock values.

Regardless of the clock reading mechanism, a node can obtain only an

approximate view of its clock skew with respect to other nodes’ clocks in the

system.

Errors occur mainly because of unpredictable communication delays during

message passing used to deliver a clock signal or a clock message from

one node to another.

An important issue in clock synchronization is that time must never run

backward because this could cause serious problems, such as the repetition

of certain operations that may be hazardous in certain cases. We know that

during synchronization a fast clock has to be slowed down. But if the time of

a fast clock is readjusted to the actual time all at once, it may lead to running

the time backward for that clock. Therefore, clock synchronizing algorithms

are normally designed to gradually introduce such a change in the fast

running clock instead of readjusting it to the correct time all at once.

5.3 Clock Synchronization Algorithms

Clock synchronization algorithms may be broadly classified as Centralized

and Distributed:

Centralized Algorithms

In centralized clock synchronization algorithms one node has a real-time

receiver. The clock time of this node, called the time server node, is


regarded as correct and used as the reference time. The goal of these

algorithms is to keep the clocks of all other nodes synchronized with the

clock time of the time server node. Depending on the role of the time server

node, centralized clock synchronization algorithms are again of two types –

Passive Time Server and Active Time Server.

1. Passive Time Server Centralized Algorithm: In this method, each
node periodically sends a message (“time = ?”) to the time server. When
the time server receives the message, it quickly responds with a message
(“time = T”), where T is the current time in the clock of the time server node.

Assume that when the client node sends the “time = ?” message, its

clock time is T0, and when it receives the “time = T” message, its clock

time is T1. Since T0 and T1 are measured using the same clock, in the

absence of any other information, the best estimate of the time required

for the propagation of the message “time = T” from the time server node

to the client’s node is (T1-T0)/2. Therefore, when the reply is received at

the client’s node, its clock is readjusted to T + (T1-T0)/2.

2. Active Time Server Centralized Algorithm: In this approach, the time

server periodically broadcasts its clock time (“time = T”). The other

nodes receive the broadcast message and use the clock time in the

message for correcting their own clocks. Each node has a priori

knowledge of the approximate time (Ta) required for the propagation of

the message “time = T” from the time server node to its own node.
Therefore, when a broadcast message is received at a node, the node’s

clock is readjusted to the time T+Ta. A major drawback of this method is

that it is not fault tolerant. If the broadcast message reaches a node too late
due to some communication fault, the clock of that node will be

readjusted to an incorrect value. Another disadvantage of this approach

is that it requires broadcast facility to be supported by the network.


Another active time server algorithm that overcomes the drawbacks of the

above algorithm is the Berkeley algorithm proposed by Gusella and Zatti for

internal synchronization of clocks of a group of computers running the

Berkeley UNIX. In this algorithm, the time server periodically sends a

message (“time = ?”) to all the computers in the group. On receiving this

message, each computer sends back its clock value to the time server. The

time server has a priori knowledge of the approximate time required for the

propagation of a message from each node to its own node. Based on this

knowledge, it first readjusts the clock values in the reply messages. It then

takes a fault-tolerant average of the clock values of all the computers

(including its own). To take the fault tolerant average, the time server

chooses a subset of all clock values that do not differ from one another by

more than a specified amount, and the average is taken only for the clock

values in this subset. This approach eliminates readings from unreliable

clocks whose clock values could have a significant adverse effect if an

ordinary average was taken.

The calculated average is the current time to which all the clocks should be
readjusted. The time server readjusts its own clock to this value. Instead of
sending the calculated current time back to the other computers, the time
server sends the amount by which each individual computer’s clock requires
adjustment. This can be a positive or negative value and is calculated based

on the knowledge the time server has about the approximate time required

for the propagation of a message from each node to its own node.
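The master's step in this scheme can be sketched as follows. The grouping rule used for the fault-tolerant average is one plausible reading of "values that do not differ from one another by more than a specified amount", and the function name is hypothetical.

def berkeley_corrections(readings, threshold):
    """readings: node -> clock value, already adjusted for message delay."""
    values = sorted(readings.values())
    # Fault-tolerant average: take the largest group of readings that differ
    # from one another by at most `threshold`, and average only those.
    best = []
    for i in range(len(values)):
        group = [v for v in values[i:] if v - values[i] <= threshold]
        if len(group) > len(best):
            best = group
    target = sum(best) / len(best)
    # Each node is told how much to adjust (positive or negative), not the time itself.
    return {node: target - value for node, value in readings.items()}

print(berkeley_corrections({"A": 10.0, "B": 12.0, "C": 11.0, "D": 55.0},
                           threshold=5.0))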

Centralized clock synchronization algorithms suffer from two major

drawbacks:

1. They are subject to single-point failure. If the time server node fails,
the clock synchronization operation cannot be performed. This makes
the system unreliable. Ideally, a distributed system should be more


reliable than its individual nodes. If one goes down, the rest should

continue to function correctly.

2. From a scalability point of view it is generally not acceptable to get all

the time requests serviced by a single time server. In a large system,

such a solution puts a heavy burden on that one process.

Distributed algorithms overcome these drawbacks:

5.4 Distributed Algorithms

We know that externally synchronized clocks are also internally

synchronized. That is, if each node’s clock is independently synchronized

with real time, all the clocks of the system remain mutually synchronized.

Therefore, a simple method for clock synchronization may be to equip each

node of the system with a real time receiver so that each node’s clock can

be independently synchronized with real time. Multiple real time clocks (one

for each node) are normally used for this purpose.

Theoretically, internal synchronization of clocks is not required in this

approach. However, in practice, due to inherent inaccuracy of real-time

clocks, different real-time clocks produce slightly different times. Therefore, internal

synchronization is normally performed for better accuracy. One of the

following two approaches is used for internal synchronization in this case.

1. Global Averaging Distributed Algorithms: In this approach, the clock

process at each node broadcasts its local clock time in the form of a

special “resync” message when its local time equals T0 + iR for some
integer i, where T0 is a fixed time in the past agreed upon by all nodes

and R is a system parameter that depends on such factors as the total

number of nodes in the system, the maximum allowable drift rate, and

so on. That is, a resync message is broadcast from each node at the
beginning of every fixed-length resynchronization interval. However,


since the clocks of different nodes run at slightly different rates, these

broadcasts will not happen simultaneously from all nodes.

After broadcasting the clock value, the clock process of a node waits for

time T, where T is a parameter to be determined by the algorithm. During

this waiting period, the clock process records the time, according to its own
clock, at which each resync message was received. At the end of the waiting
period, the clock process estimates the skew of its clock with respect to each
of the other nodes on the basis of the times at which it received the resync
messages. It then computes a fault-tolerant average of the estimated skews
and uses it to correct the local clock before the start of the next
resynchronization interval.

The global averaging algorithms differ mainly in the manner in which the

fault-tolerant average of the estimated skews is calculated. Two commonly

used algorithms are:

1. The simplest algorithm is to take the average of the estimated skews

and use it as the correction for the local clock. However, to limit the

impact of faulty clocks on the average value, the estimated skew with

respect to each node is compared against a threshold, and skews

greater than the threshold are set to zero before computing the average

of the estimated skews.

2. In another algorithm, each node limits the impact of faulty clocks by first

discarding the m highest and m lowest estimated skews and then

calculating the average of the remaining skews, which is then used as

the correction for the local clock. The value of m is usually decided

based on the total number of clocks (nodes).
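Both fault-tolerant averages can be sketched in a few lines. The input skews would be the estimates obtained from the received resync messages; the function names are illustrative only.

def average_with_threshold(skews, threshold):
    # Variant 1: estimated skews larger than the threshold are treated as 0.
    adjusted = [s if abs(s) <= threshold else 0.0 for s in skews]
    return sum(adjusted) / len(adjusted)

def trimmed_average(skews, m):
    # Variant 2: discard the m highest and m lowest estimates, average the rest.
    kept = sorted(skews)[m:len(skews) - m]
    return sum(kept) / len(kept)

skews = [0.02, -0.01, 0.03, 4.0, -0.02]        # one clearly faulty estimate (4.0)
print(average_with_threshold(skews, threshold=0.5))
print(trimmed_average(skews, m=1))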

2. Localized Averaging Distributed Algorithms: In this approach, the nodes

of a distributed system are logically arranged in some kind of pattern, such

as a ring or a grid. Periodically, each node exchanges its clock time with its

neighbors in the ring, grid, or other structure and then sets its clock time to


the average of its own clock time and the clock times of its neighbors.

5.5 Event Ordering

Lamport observed that for most applications it is not necessary to keep the

clocks in a distributed system synchronized. Rather, it is sufficient to ensure

that all events that occur in a distributed system be totally ordered in a

manner that is consistent with an observed behavior.

For partial ordering of events, Lamport defined a new relation called

happened-before and introduced the concept of logical clocks for ordering of

events based on the happened-before relation. He then gave a distributed

algorithm extending his idea of partial ordering to a consistent total ordering

of all the events in a distributed system. His idea is given below:

Happened – Before Relation

The happened before relation (denoted by →) on a set of events satisfies

the following conditions:

1. If a and b are the events in the same process and a occurs before b,

then a → b.

2. If a is the event of sending a message by one process and b is the event

of the receipt of the same message by another process, then a → b.

This condition holds by the law of causality because a receiver cannot

receive a message until the sender sends it, and the time taken to

propagate a message from its sender to its receiver is always positive.

3. If a → b and b → c, then a → c. i.e. happened – before is a transitive

relation.

In a happened – before relation, two events a and b are said to be

concurrent if they are not related by the happened – before relation.

i.e. neither a → b nor b → a is true. This is possible if the two events occur


in different processes that do not exchange messages either directly or

indirectly via other processes. i.e. two events are concurrent if neither can

causally affect the other.

Given a system consisting of N processes pi, i ∈ {1, . . . , N}, we define the
local event ordering →i as a binary relation, such that, if pi observes e
before e′, we have e →i e′. Based on this local ordering, we define a global

ordering as a happened before relation →, as proposed by Lamport

[Lam78]: The relation → is the smallest relation, such that

1. e →i e′ implies e → e′,

2. for every message m, send(m) → receive(m), and

3. e → e′ and e′ → e′′ implies e → e′′ (transitivity).

The relation → is almost a partial order (it lacks reflexivity). If a → b, then we
say a causally affects b. We consider events that are unordered by → to be
concurrent; i.e., a ↛ b and b ↛ a implies a ∥ b.

As an example, consider Figure 5.2. We have the following causal relations:

E11 → E12, E13, E14, E23, E24, . . .

E21 → E22, E23, E24, E13, E14, . . .

Figure 5.2: Example of Event Ordering

Moreover, the following events are concurrent:

E11 ∥ E21, E12 ∥ E22, E13 ∥ E23, and so on.

Lamport Clocks

Lamport’s logical clocks can be implemented as software counters that
assign timestamps consistent with the happened-before relation →. This means that each

process pi maintains a logical clock Li. Given such a clock, Li(e) denotes a

Lamport timestamp of event e at pi and L(e) denotes a timestamp of event e

at the process it occurred at. Processes now proceed as follows:

1. Before time stamping a local event, a process pi executes Li:= Li + 1.

2. Whenever a message m is sent from pi to pj :

Process pi executes Li := Li + 1 and sends the new Li with m.

Process pj receives Li with m and executes Lj := max(Lj, Li) + 1.

receive(m) is annotated with the new Lj.

In this scheme, a → b implies L(a) < L(b), but L(a) < L(b) does not

necessarily imply a → b. As an example, consider Figure 5.3. In this figure
E12 → E23 and L1(E12) < L2(E23) (i.e., 2 < 3); however, we also have
L1(E13) < L2(E24) (i.e., 3 < 4) even though E13 ↛ E24.

Figure 5.3: Example of the use of Lamport’s Clocks


In some situations (e.g., to implement distributed locks), a partial ordering

on events is not sufficient and a total ordering is required. In these cases,

the partial ordering can be completed to total ordering by including process

identifiers. Given local time stamps Li(e) and Lj(e′), we define global time
stamps ⟨Li(e), i⟩ and ⟨Lj(e′), j⟩. We then use standard lexicographical
ordering, where ⟨Li(e), i⟩ < ⟨Lj(e′), j⟩ iff Li(e) < Lj(e′), or Li(e) = Lj(e′) and i < j.
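A minimal sketch of these rules, including the ⟨timestamp, process id⟩ tie-break, might look as follows; the class and method names are illustrative.

class LamportClock:
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def local_event(self):
        self.time += 1
        return (self.time, self.pid)          # globally comparable timestamp

    def send(self):
        self.time += 1
        return self.time                      # value piggy-backed on the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.pid)

p1, p2 = LamportClock(1), LamportClock(2)
a = p1.local_event()                          # (1, 1)
t = p1.send()                                 # the message carries timestamp 2
b = p2.receive(t)                             # (3, 2), so send(m) -> receive(m) is respected
print(a < b)                                  # True under the lexicographic total order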

Vector Clocks

Figure 5.4: Example of the lack of causality with Lamport’s clocks

The main shortcoming of Lamport’s clocks is that L(a) < L(b) does not imply

a → b; hence, we cannot deduce causal dependencies from time stamps.

For example, in Figure 5.3, we have L1(E11) < L3(E33), but E11 → E33. The

root of the problem is that clocks advance independently or via messages,

but there is no history as to where advance comes from.

This problem can be solved by moving from scalar clocks to vector clocks,

where each process maintains a vector clock Vi. Vi is a vector of size N,

where N is the number of processes. The component Vi[j] contains the

process pi’s knowledge about pj’s clock. Initially, we have Vi[j] := 0 for all
i, j ∈ {1, . . . , N}. Clocks are advanced as follows:

1. Before pi timestamps an event, it executes Vi[i] := Vi[i] + 1.


2. Whenever a message m is sent from pi to pj :

Process pi executes Vi[i] := Vi[i] + 1 and sends Vi with m.

Process pj receives Vi with m and merges the vector clocks Vi and Vj

as follows:

Vj[k] := max(Vj[k], Vi[k]) + 1, if j = k (as in scalar clocks)

Vj[k] := max(Vj[k], Vi[k]), otherwise.

This last part ensures that everything that subsequently happens at pj is

now causally related to everything that previously happened at pi.

Under this scheme, we have, for all i, j, Vi[i] ≥ Vj [i] (i.e., pi always has the

most up-to-date version of its own clock); moreover, a → b iff V (a) < V (b),

where

• V = V′ iff V[i] = V′[i] for all i ∈ {1, . . . , N},

• V ≥ V′ iff V[i] ≥ V′[i] for all i ∈ {1, . . . , N},

• V > V′ iff V ≥ V′ and V ≠ V′; and

• V ∥ V′ iff V ≯ V′ and V′ ≯ V.
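The rules above can be sketched as follows; happened_before implements the componentwise test V(a) < V(b), and the names are illustrative.

class VectorClock:
    def __init__(self, pid, n):
        self.pid, self.v = pid, [0] * n

    def tick(self):                        # before time-stamping a local event
        self.v[self.pid] += 1
        return list(self.v)

    def send(self):
        self.v[self.pid] += 1
        return list(self.v)                # vector piggy-backed on the message

    def receive(self, other):
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.pid] += 1              # the receive event itself
        return list(self.v)

def happened_before(a, b):                 # V(a) < V(b), componentwise
    return all(x <= y for x, y in zip(a, b)) and a != b

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
e1 = p0.send()                             # [1, 0]
e2 = p1.receive(e1)                        # [1, 1]
e3 = p0.tick()                             # [2, 0], concurrent with e2
print(happened_before(e1, e2))             # True
print(happened_before(e2, e3) or happened_before(e3, e2))   # False: e2 and e3 are concurrent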

5.6 Mutual Exclusion

There are several resources within a system that must not be used

simultaneously by multiple processes if program operation is to be correct.

For example, a file must not be simultaneously updated by multiple

processes. Exclusive access to shared resources by a process must be

ensured. This exclusiveness of access is called Mutual Exclusion between

processes. The sections of a program that need exclusive access to shared

resources are referred to as critical sections. For mutual exclusion, means

are introduced to prevent processes from executing concurrently within their

associated critical sections.


Requirements for Mutual Exclusion

Any facility or capability that is to provide support for mutual exclusion

should meet the following requirements:

1. Mutual exclusion must be enforced: Only one process at a time is

allowed into its critical section, among all processes that have critical

sections for the same resource or shared object.

2. A process that halts in its non-critical section must do so without

interfering with other processes.

3. It must not be possible for a process requiring access to a critical section

to be delayed indefinitely: no deadlock or starvation.

4. When no process is in a critical section, any process that requests entry

to its critical section must be permitted to enter without delay.

5. No assumptions are made about relative process speeds or number of

processors.

6. A process remains inside its critical section for a finite time only.

There are a number of ways in which the requirements for mutual exclusion

can be satisfied. One way is to leave the responsibility with the processes

that wish to execute concurrently. Thus processes, whether they are system

programs or application programs, would be required to coordinate with one

another to enforce mutual exclusion, with no support from the programming

language or the OS. We can refer to these as software approaches.

Although this approach is prone to high processing overhead and bugs, it is

nevertheless useful to examine such approaches to gain a better

understanding of the complexity of concurrent processing.

An algorithm for implementing mutual exclusion must satisfy the following

requirements:

1. Mutual Exclusion: Given a shared resource accessed by multiple

concurrent processes, at any time only one process should access the


resource. i.e. a process that has been granted the resource must

release it before it can be granted to another process.

2. No Starvation: If every process that is granted the resource eventually

releases it, every request must be eventually granted.

In uni-processor systems, mutual exclusion is implemented using

semaphores, monitors, and similar constructs. The three basic approaches

used by different algorithms for implementing mutual exclusion in distributed

systems are described below:

1. Centralized Approach:

In this approach, one of the processes in the system is elected as the

coordinator and coordinates the entry to the critical sections. Each process

that wants to enter a critical section must first seek permission from the

coordinator. If no other process is currently in that critical section, the

coordinator can immediately grant the permission to the requesting process.

If two or more processes concurrently ask for permission to enter the same

critical section, the coordinator grants permission to only one process at a

time in accordance with some scheduling algorithm.

After executing a critical section, when a process exits the critical section, it

must notify the coordinator so that the coordinator can grant permission to

another process (if any) that has also asked permission to enter the same

critical section.

2. Distributed Approach:

In this approach, the decision making for mutual exclusion is distributed

across the entire system. i.e. all processes that want to enter the critical

section cooperate with each other before reaching a decision on which

process will enter the critical section next. The first such algorithm was

presented by Lamport [1978] based on his event – ordering scheme.


When a process wants to enter a critical section, it sends a request

message to all other processes. The message contains the following

information:

1. The process identifier of the process.

2. The name of the critical section that the process wants to enter.

3. A unique timestamp generated by the process for the request message.

On receiving a request message, a process either immediately sends back

a reply message to the sender or defers sending a reply based on the

following rules:

1. If the receiver process is itself currently executing in the critical section, it

simply queues the request message and defers sending a reply.

2. If the receiver process is currently not executing in the critical section but

is waiting for its turn to enter the critical section, it compares the

timestamp in the received request message with the timestamp in its

own request message that it has sent to other processes. If the

timestamp of the received request message is lower, it means that the

sender process made a request before the receiver process to enter the

critical section. Therefore, the receiver process immediately sends back

a reply message to the sender. On the other hand, if the receiver

process’s own request message has a lower timestamp, the receiver

queues the received request message and defers sending a reply

message.

3. If the receiver process is neither in the critical section nor is waiting for

its turn to enter the critical section, it immediately sends back a reply

message.

A process that sends out a request message keeps waiting for reply

messages from other processes. It enters the critical section as soon as it

has received reply messages from all processes. After it finishes executing


in the critical section, it sends reply messages to all processes in its queue

and deletes them from its queue.
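The reply-or-defer rules of this distributed approach can be sketched as follows. Message transport is abstracted away, each call represents the handling of one received request, and the names (Process, on_request, deferred) are assumptions of the sketch.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.in_cs = False
        self.own_request = None           # (timestamp, pid) while waiting for the CS
        self.deferred = []                # requests queued until the CS is released

    def on_request(self, req):
        """req = (timestamp, pid); returns True if a reply is sent immediately."""
        if self.in_cs:
            self.deferred.append(req)     # rule 1: currently inside the CS, defer
            return False
        if self.own_request is not None and self.own_request < req:
            self.deferred.append(req)     # rule 2: our own request is older, defer
            return False
        return True                       # rules 2/3: reply immediately

    def on_exit_cs(self):
        self.in_cs = False
        replies, self.deferred = self.deferred, []
        return replies                    # reply to everything queued meanwhile

p = Process(pid=2)
p.own_request = (10, 2)                   # p asked for the CS at timestamp 10
print(p.on_request((8, 1)))               # True: the other request is older
print(p.on_request((12, 3)))              # False: our request is older, so defer

The (timestamp, process id) pairs are compared lexicographically, so ties in timestamps are broken by process identifier.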

3. Token – Passing Approach:

In this method, mutual exclusion is achieved by using a single token that is

circulated among the processes in the system. A token is a special type of

message that entitles its holder to enter a critical section. For fairness, the

processes in the system are logically organized in a ring structure, and the

token is circulated from one process to another around the ring always in

the same direction (clockwise or anticlockwise).

The algorithm works as follows. When a process receives the token, it

checks if it wants to enter a critical section and acts as follows:

If it wants to enter a critical section, it keeps the token, enters the critical

section, and exits from the critical section after finishing its work in the

critical section. It then passes the token along the ring to its neighbor

process. Note that the process can enter only one critical section when it

receives the token. If it wants to enter another critical section, it must

wait until it gets the token again.

If it does not want to enter a critical section, it just passes the token

along the ring to its neighbor process. Therefore, if none of the

processes is interested in entering a critical section, the token simply

keeps circulating around the ring.

5.7 Deadlock

There are several resources in a system for which the resource allocation

policy must ensure exclusive access by a process. Since a system consists

of a finite number of units of each resource type, multiple concurrent
processes normally compete for the use of these resources.


Principles of Deadlock

Deadlock can be defined as the permanent blocking of a set of processes

that either compete for system resources or communicate with each other. A

set of processes is deadlocked when each process in the set is blocked

awaiting an event (typically the freeing up of some requested resource) that

can only be triggered by another blocked process in the set. Deadlock is

permanent because none of the events is ever triggered. Unlike other

problems in concurrent process management, there is no efficient solution in

the general case. All deadlocks involve conflicting needs for resources by

two or more processes.

Let us now look at a depiction of deadlock involving processes and

computer resources. Figure 5.5 below, which we refer to as a joint

progress diagram, illustrates the progress of two processes competing for

two resources. Each process needs exclusive use of both resources for a

certain period of time.

Figure 5.5: Example of a Deadlock


Two processes, P and Q, have the following general form:

Process P          Process Q
  . . .              . . .
  Get A              Get B
  . . .              . . .
  Get B              Get A
  . . .              . . .
  Release A          Release B
  . . .              . . .
  Release B          Release A
  . . .              . . .

In Figure 5.5, the x-axis represents progress in the execution of P and the

y-axis represents progress in the execution of Q. The joint progress of the

two processes is therefore represented by a path that progresses from the

origin in a northeasterly direction. For a uniprocessor system, only one

process at a time may execute, and the path consists of alternating

horizontal and vertical segments, with a horizontal segment representing a

period when P executes and Q waits and a vertical segment representing a

period when Q executes and P waits. Figure 5.5 indicates areas in

which both P and Q require resource A (upward slanted lines); both P and Q

require resource B (downward slanted lines); and both P and Q require both

resources.

Because we assume that each process requires exclusive control of any

resource, these are all forbidden regions; that is, it is impossible for any path

representing the joint execution progress of P and Q to enter these regions.

The figure shows six different execution paths. These can be summarized

as follows:


1. Q acquires B and then A and then releases B and A. When P resumes

execution, it will be able to acquire both resources.

2. Q acquires B and then A. P executes and blocks on a request for A. Q

releases B and A. When P resumes execution, it will be able to acquire

both resources.

3. Q acquires B and then P acquires A. Deadlock is inevitable, because as

execution proceeds, Q will block on A and P will block on B.

4. P acquires A and then Q acquires B. Deadlock is inevitable, because as

execution proceeds, Q will block on A and P will block on B.

5. P acquires A and then B. Q executes and blocks on a request for B. P

releases A and B. When Q resumes execution, it will be able to acquire

both resources.

6. P acquires A and then B and then releases A and B. When Q resumes

execution, it will be able to acquire both resources.

The gray-shaded area of Figure 5.5, which can be referred to as a fatal

region, applies to the commentary on paths 3 and 4. If an execution path

enters this fatal region, then deadlock is inevitable. Note that the existence

of a fatal region depends on the logic of the two processes. However,

deadlock is only inevitable if the joint progress of the two processes creates

a path that enters the fatal region.

Whether or not deadlock occurs depends on both the dynamics of the

execution and on the details of the application. For example, suppose that

P does not need both resources at the same time so that the two processes

have the following form:

Process P          Process Q
  . . .              . . .
  Get A              Get B
  . . .              . . .
  Release A          Get A
  . . .              . . .
  Get B              Release B
  . . .              . . .
  Release B          Release A
  . . .              . . .

This situation is reflected in Figure 5.6 below. Some thought should

convince you that regardless of the relative timing of the two processes,

deadlock cannot occur. As shown, the joint progress diagram can be used

to record the execution history of two processes that share resources. In

cases where more than two processes may compete for the same resource,

a higher-dimensional diagram would be required. The principles concerning

fatal regions and deadlock would remain the same.

Figure 5.6: Example of No Deadlock


Resource Allocation Graphs

A useful tool in characterizing the allocation of resources to processes is the

resource allocation graph, introduced by Holt [HOLT72]. The resource

allocation graph is a directed graph that depicts a state of the system of

resources and processes, with each process and each resource

represented by a node.

A graph edge directed from a process to a resource indicates a resource

that has been requested by the process but not yet granted (Figure 5.7 (a)).

Within a resource node, a dot is shown for each instance of that resource.

Examples of resource types that may have multiple instances are

I/O devices that are allocated by a resource management module in the OS.

A graph edge directed from a reusable resource node dot to a process

indicates a request that has been granted (Figure 5.7 (b)); that is, the process

has been assigned one unit of that resource. A graph edge directed from a

consumable resource node dot to a process indicates that the process is the

producer of that resource.

Figure 5.7(c) shows an example deadlock. There is only one unit each of
resources Ra and Rb. Process P1 holds Rb and requests Ra, while P2
holds Ra but requests Rb. Figure 5.7(d) has the same topology as
Figure 5.7(c), but there is no deadlock because multiple units of each
resource are available.

The resource allocation graph of Figure 5.7 corresponds to a deadlock

situation. Note that in this case, we do not have a simple situation in which

two processes each have one resource the other needs. Rather, in this

case, there is a circular chain of processes and resources that results in

deadlock.


Table 5.1: Summary of Deadlock Detection, Prevention, and Avoidance

Approaches for Operating Systems

Figure 5.7: Examples of Resource Allocation Graphs


The Conditions for Deadlock

Three conditions of policy must be present for a deadlock to be possible:

1. Mutual exclusion: Only one process may use a resource at a time. No

process may access a resource unit that has been allocated to another

process.

2. Hold and wait: A process may hold allocated resources while awaiting

assignment of other resources.

3. No preemption: No resource can be forcibly removed from a process
holding it.

In many ways these conditions are quite desirable. For example, mutual
exclusion is needed to ensure consistency of results and the integrity of a
database. Similarly, preemption should not be done arbitrarily. For example,
when data resources are involved, preemption must be supported by a
rollback recovery mechanism, which restores a process and its resources
to a suitable previous state from which the process can eventually repeat
its actions. The first three conditions are necessary but not sufficient for a
deadlock to exist. For deadlock to actually take place, a fourth condition is
required.

4. Circular wait: A closed chain of processes exists, such that each

process holds at least one resource needed by the next process in the

chain (e.g., Figure 5.7 (c)).

The fourth condition is, actually, a potential consequence of the first three.

That is, given that the first three conditions exist, a sequence of events may

occur that lead to an un-resolvable circular wait. The un-resolvable circular

wait is in fact the definition of deadlock. The circular wait listed as condition

4 is un-resolvable because the first three conditions hold. Thus, the four

conditions, taken together, constitute necessary and sufficient conditions for

deadlock. Recall that we defined a fatal region as on such that once the

processes have progressed into that region, those processes will deadlock.


A fatal region exists only if all of the first three conditions listed above are

met. If one or more of these conditions are not met, there is no fatal region

and deadlock cannot occur. Thus, these are necessary conditions for

deadlock. For deadlock to occur, there must not only be a fatal region, but

also a sequence of resource requests that has led into the fatal region. If a

circular wait condition occurs, then in fact the fatal region has been entered.

Thus, all four conditions listed above are sufficient for deadlock. To
summarize: the first three conditions make deadlock possible, while all four
conditions together make deadlock actually occur.

Three general approaches exist for dealing with deadlock. First, one can

prevent deadlock by adopting a policy that eliminates one of the conditions

(conditions 1 through 4). Second, one can avoid deadlock by making the

appropriate dynamic choices based on the current state of resource

allocation. Third, one can attempt to detect the presence of deadlock

(conditions 1 through 4 hold) and take action to recover.

Deadlock Prevention

The strategy of deadlock prevention is, simply put, to design a system in

such a way that the possibility of deadlock is excluded. We can view

deadlock prevention methods as falling into two classes. An indirect method

of deadlock prevention is to prevent the occurrence of one of the three

necessary conditions listed previously (items 1 through 3). A direct method

of deadlock prevention is to prevent the occurrence of a circular wait

(item 4). We now examine techniques related to each of the four conditions.


Mutual Exclusion

In general, the first of the four listed conditions cannot be disallowed. If

access to a resource requires mutual exclusion, then mutual exclusion must

be supported by the OS. Some resources, such as files, may allow multiple

accesses for reads but only exclusive access for writes. Even in this case,

deadlock can occur if more than one process requires write permission.

Hold and Wait

The hold-and-wait condition can be prevented by requiring that a process

request all of its required resources at one time and blocking the process

until all requests can be granted simultaneously. This approach is inefficient

in two ways.

First, a process may be held up for a long time waiting for all of its resource

requests to be filled, when in fact it could have proceeded with only some of

the resources.

Second, resources allocated to a process may remain unused for a

considerable period, during which time they are denied to other processes.

Another problem is that a process may not know in advance all of the

resources that it will require.

There is also the practical problem created by the use of modular

programming or a multithreaded structure for an application. An application

would need to be aware of all resources that will be requested at all levels or

in all modules to make the simultaneous request.

No Preemption

This condition can be prevented in several ways. First, if a process holding

certain resources is denied a further request, that process must release its

original resources and, if necessary, request them again together with the

additional resource.


Alternatively, if a process requests a resource that is currently held by

another process, the OS may preempt the second process and require it to

release its resources. This latter scheme would prevent deadlock only if no

two processes possessed the same priority. This approach is practical only

when applied to resources whose state can be easily saved and restored

later, as is the case with a processor.

Circular Wait

The circular-wait condition can be prevented by defining a linear ordering of

resource types. If a process has been allocated resources of type R, then it

may subsequently request only those resources of types following R in the

ordering. To see that this strategy works, let us associate an index with each

resource type. Then resource Ri precedes Rj in the ordering if i < j. Now

suppose that two processes, A and B, are deadlocked because A has

acquired Ri and requested Rj, and B has acquired Rj and requested Ri. This

condition is impossible because it implies i < j and j < i.

As with hold-and-wait prevention, circular-wait prevention may be inefficient,

slowing down processes and denying resource access unnecessarily.
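A small sketch of the linear-ordering rule follows. The allocator only checks the ordering discipline and abstracts away actual blocking and exclusivity; the names are illustrative.

class OrderedAllocator:
    def __init__(self):
        self.held = {}                       # pid -> set of resource indices held

    def request(self, pid, resource_index):
        held = self.held.setdefault(pid, set())
        if held and resource_index <= max(held):
            raise ValueError(
                f"process {pid} must request resources in increasing order")
        held.add(resource_index)             # ordering respected; grant (or wait for) it

alloc = OrderedAllocator()
alloc.request("A", 1)          # A acquires R1, then R3
alloc.request("A", 3)
alloc.request("B", 2)          # B acquires R2
try:
    alloc.request("B", 1)      # B asking for R1 after R2 violates the ordering
except ValueError as e:
    print(e)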

Deadlock Avoidance

An approach to solving the deadlock problem that differs subtly from

deadlock prevention is deadlock avoidance. In deadlock prevention, we

constrain resource requests to prevent at least one of the four conditions of

deadlock. This is either done indirectly, by preventing one of the three

necessary policy conditions (mutual exclusion, hold and wait, no

preemption), or directly by preventing circular wait. This leads to inefficient

use of resources and inefficient execution of processes. Deadlock

avoidance, on the other hand, allows the three necessary conditions but

makes judicious choices to assure that the deadlock point is never reached.

As such, avoidance allows more concurrency than prevention. With


deadlock avoidance, a decision is made dynamically whether the current

resource allocation request will, if granted, potentially lead to a deadlock.

Deadlock avoidance thus requires knowledge of future process resource

requests.

In this section, we describe two approaches to deadlock avoidance:

Do not start a process if its demands might lead to deadlock.

Do not grant an incremental resource request to a process if this

allocation might lead to deadlock.

Process Initiation Denial

Consider a system of n processes and m different types of resources. Let us

define the following vectors and matrices:

Table 5.2: Vector and Matrix Representations

The matrix Claim gives the maximum requirement of each process for each

resource, with one row dedicated to each process. This information must be

declared in advance by a process for deadlock avoidance to work. Similarly,

the matrix Allocation gives the current allocation to each process. The

following relationships hold:

1. Rj = Vj + Σ(i=1 to n) Aij, for all j. All resources are either available or allocated.


2. Cij ≤ Rj, for all i, j. No process can claim more than the total amount of
resources in the system.

3. Aij ≤ Cij, for all i, j. No process is allocated more resources of any type
than the process originally claimed to need.

With these quantities defined, we can define a deadlock avoidance policy

that refuses to start a new process if its resource requirements might lead to

deadlock. Start a new process Pn+1 only if

Rj ≥ C(n+1)j + Σ(i=1 to n) Cij, for all j

That is, a process is only started if the maximum claim of all current

processes plus those of the new process can be met. This strategy is hardly

optimal, because it assumes the worst: that all processes will make their

maximum claims together.
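The test can be sketched directly from the inequality above; the names and the example numbers are purely illustrative.

def can_start(resource, claims, new_claim):
    """resource[j]: total units of type j; claims: max claims of running processes."""
    for j, r_j in enumerate(resource):
        if new_claim[j] + sum(c[j] for c in claims) > r_j:
            return False                  # the combined maximum claims cannot be met
    return True

resource = [6, 4]                         # two resource types
claims = [[3, 2], [2, 1]]                 # maximum claims of two running processes
print(can_start(resource, claims, [1, 1]))   # True:  3+2+1 <= 6 and 2+1+1 <= 4
print(can_start(resource, claims, [2, 1]))   # False: 3+2+2 = 7 > 6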

Resource Allocation Denial

The strategy of resource allocation denial, referred to as the banker’s

algorithm, was first proposed in [DIJK65]. Let us begin by defining the

concepts of state and safe state. Consider a system with a fixed number of

processes and a fixed number of resources. At any time a process may

have zero or more resources allocated to it. The state of the system reflects

the current allocation of resources to processes. Thus, the state consists of

the two vectors, Resource and Available, and the two matrices, Claim and

Allocation, defined earlier. A safe state is one in which there is at least one

sequence of resource allocations to processes that does not result in a

deadlock (i.e., all of the processes can be run to completion). An unsafe

state is, of course, a state that is not safe.

Deadlock Detection

Deadlock prevention strategies are very conservative; they solve the

problem of deadlock by limiting access to resources and by imposing


restrictions on processes. At the opposite extreme, deadlock detection

strategies do not limit resource access or restrict process actions. With

deadlock detection, requested resources are granted to processes

whenever possible. Periodically, the OS performs an algorithm that allows it

to detect the circular wait condition.

5.8 Election Algorithms

Several distributed algorithms require that there be a coordinator process in

the entire system that performs some type of coordination activity needed

for the smooth running of other processes in the system. Two examples of

such coordinator processes encountered in this unit are the coordinator in

the centralized algorithm for mutual exclusion and the central coordinator in

the centralized deadlock algorithm. Since all other processes in the system

have to interact with the coordinator, they all must unanimously agree on

who the coordinator is. Furthermore, if the coordinator process fails due to

the failure of the site on which it is located, a new coordinator process must

be elected to take up the job of the failed coordinator. Election algorithms

are meant for electing a coordinator process from among the currently

running processes in such a manner that at any instance of time there is a

single coordinator for all processes in the system.

Election algorithms are based on the following assumptions:

1. Each process in the system has a unique priority number.

2. Whenever an election is held, the process having the highest priority

number among the currently active processes is elected as the

coordinator.

3. On recovery, a failed process can take appropriate actions to rejoin the set of

active processes.


Therefore, whenever initiated, an election algorithm basically finds out which

of the currently active processes has the highest priority number and then

informs this to all the active processes.

(i) The Bully Algorithm

This algorithm was proposed by Garcia-Molina. In this algorithm it is

assumed that every process knows the priority number of every other

process in the system. The algorithm works as follows:

When a process (say Pi) sends a request message to the coordinator and

does not receive a reply within a fixed timeout period, it assumes that the

coordinator has failed. It then initiates an election by sending an election

message to every process with a higher priority number than itself. If Pi does

not receive any response to its election message within a fixed timeout

period, it assumes that among the currently active processes it has the

highest priority number. Therefore it takes up the job of the coordinator and

sends a message (call it the coordinator message) to all processes having

lower priority numbers than itself, informing that from now on it is the new

coordinator. On the other hand, if Pi receives a response for its election

message, it means that some other process with a higher priority number is alive. Therefore, Pi does not take any further action and just waits to

receive the final result (a coordinator message from the new coordinator) of

the election it initiated.

When a process (say Pj) receives an election message, it sends a response

message to the sender informing that it is alive and will take over the

election activity. Now Pj holds an election if it is not already holding one. In

this way, the election activity gradually moves on to the process that has the highest priority number among the currently active processes, which eventually wins the election and becomes the new coordinator.
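The following is a schematic sketch of the decision a single process makes under the bully algorithm. The message-passing helpers (send, wait_for_replies, wait_for_coordinator) and the timeout value are assumed placeholders for an underlying communication layer; this is an illustration of the rule described above, not a complete protocol implementation.

```python
# Schematic sketch of the bully algorithm from the point of view of process
# my_id.  send() and the two wait_* callables stand in for the messaging
# layer and are assumed, not defined here.

def start_election(my_id, all_ids, send, wait_for_replies,
                   wait_for_coordinator, timeout=2.0):
    higher = [p for p in all_ids if p > my_id]

    # Challenge every process with a higher priority number.
    for p in higher:
        send(p, ("ELECTION", my_id))

    if not wait_for_replies(timeout):
        # Nobody higher answered: become the coordinator and announce it
        # to all processes with lower priority numbers.
        for p in all_ids:
            if p < my_id:
                send(p, ("COORDINATOR", my_id))
        return my_id

    # A higher-priority process is alive; wait for its coordinator message.
    return wait_for_coordinator()
```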


(ii) A Ring Algorithm

This algorithm assumes that all the processes in the system are organized

in a logical ring. The ring is unidirectional in the sense that all the messages

related to the election algorithm are always passed only in one direction

(clockwise / anticlockwise). Every process in the system knows the structure

of the ring, so that while trying to circulate a message over the ring, if the

successor of the sender process is down, the sender can skip over the

successor, or the one after that, until an active member is located. The

algorithm works as follows:

When a process (say Pi) sends a request message to the current

coordinator and does not receive a reply within a fixed timeout period, it

assumes that the coordinator has crashed. Therefore it initiates an election

by sending an election message to its successor (actually to the first

successor that is currently active). This message contains the priority

number of process Pi. On receiving the election message, the successor

appends its own priority number to the message and passes it on to the next

active member in the ring. This member appends its own priority number to

the message and forwards it to its own successor. In this manner, the

election message circulates over the ring from one active process to another

and eventually returns to process Pi. Process Pi recognizes the message as its own election message by noting that the first priority number in the list held within the message is its own.

Note that when process Pi receives its own election message, the message

contains the list of priority numbers of all processes that are currently active.

Therefore, of the processes in this list, it elects the process having the

highest priority number as the new coordinator. It then circulates a

coordinator message over the ring to inform all the other active processes

who the new coordinator is. When the coordinator message comes back to


process Pi after completing its one round along the ring, it is removed by

process Pi. At this point all the active processes know who the current

coordinator is.

When a process (say Pj) recovers after a failure, it creates an inquiry message and sends it to its successor. The message contains the identity of process Pj. If the successor is not the current coordinator, it simply forwards the inquiry message to its own successor. In this way, the inquiry message

moves forward along the ring until it reaches the current coordinator. On

receiving an inquiry message, the current coordinator sends a reply to

process Pj informing that it is the current coordinator.

Notice that in this algorithm two or more processes may almost

simultaneously discover that the coordinator has crashed and then each one

may circulate an election message over the ring. Although this results in a

little waste of network bandwidth, it does not cause any problem because

every process that initiated an election will receive the same list of active

processes, and all of them will choose the same process as the new

coordinator.
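A small simulation of one election round makes the circulation of the message concrete. The ring is modelled as a plain list of the currently active processes' priority numbers; the data structures and names are illustrative, and a real implementation would pass actual messages and then circulate a coordinator message as described above.

```python
# Sketch of one round of the ring election.  `alive` holds the priority
# numbers of the active processes in ring order; the initiator appends its
# number and the message visits every other active process in turn.

def ring_election(alive, initiator_index):
    n = len(alive)
    message = [alive[initiator_index]]            # election message starts here
    i = (initiator_index + 1) % n
    while i != initiator_index:                   # circulate once around the ring
        message.append(alive[i])                  # each member appends its number
        i = (i + 1) % n
    coordinator = max(message)                    # highest priority number wins
    # A real implementation would now circulate a coordinator message.
    return coordinator

print(ring_election([4, 7, 2, 9, 5], initiator_index=2))   # -> 9
```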

5.9 Terminal Questions

1. Discuss Clock synchronization issues in a DSM system.

2. Discuss the following synchronization issues in a DSM system:

Event Ordering

Mutual Exclusion

Deadlocks

3. Discuss the Election Algorithms in a Synchronized DSM system.


Unit 6 Resource Management

Structure:

6.1 Introduction

Objectives

6.2 Desirable Features of a Good Global Scheduling Algorithm

6.3 Task Assignment Approach

6.4 Load-Balancing Approach

6.5 Load-Sharing Approach

6.6 Terminal Questions

6.1 Introduction

Every distributed system consists of a number of resources interconnected

by a network. Besides providing communication facilities, a network

facilitates resource sharing by migrating a local process and executing it at a

remote node of the network. A process may be migrated because the local

node does not have the required resources or the local node has to be shut

down. A process may also be executed remotely if the expected turnaround

time will be better. From a user’s point of view the set of available resources

in a distributed system acts like a single virtual system.

A resource can be logical, such as a shared file, or physical, such as a CPU.

For this unit, we consider a resource to be a processor of the system and

assume that each processor forms a node of the distributed system.


Figure 6.1: A Distributed System Connected by a Local Area Network

A resource manager schedules the processes in a distributed system to

make use of the system resources in such a manner that resource usage,

response time, network congestion, and scheduling overhead are optimized.

The following are different approaches for Process Scheduling:

1. Task Assignment Approach: Each process is viewed as a collection of

tasks. These tasks are scheduled to suitable processors to improve

performance. This is not a widely used approach because

It requires characteristics of all the processes to be known in

advance.

This approach does not take into consideration the dynamically

changing state of the system.

2. Load Balancing Approach: Processes are distributed among nodes to

equalize the load among all nodes.

3. Load-Sharing Approach: No node is allowed to remain idle while processes are waiting to be served at other nodes. This requires knowledge of the load across the system.


Objectives:

This unit discusses the management of various resources present at

different locations on a distributed network. For effective utilization of

resources, there should be proper management of these resources which

could be done through scheduling. The various scheduling algorithms for

resource management are discussed here. The topics of Task Assignment,

Load Balancing, and Load Sharing are discussed in detail.

6.2 Desirable Features of a Good Global Scheduling Algorithm

i) No a priori Knowledge about the process: A good process

scheduling algorithm should operate with absolutely no a priori

knowledge about the processes.

ii) Dynamic in Nature: It is intended that a good process-scheduling

algorithm should be able to take care of the dynamically changing load

at various nodes. The process assignment decisions should be based

on the current load of the system and not on some fixed static policy.

iii) Quick Decision Making: A good process scheduling algorithm must

be capable of taking quick decisions regarding node assignment for

processes.

iv) Scheduling overhead: The general observation is that as overhead is

increased in an attempt to obtain more information regarding the

global state of the system, the usefulness of the information is

decreased due to both the aging of the information gathered and the

low scheduling frequency as a result of the cost of gathering and

processing that information. Hence algorithms that provide near

optimal system performance with a minimum of global state

information gathering overhead are desirable.


v) Stability: The algorithm should be stable: i.e., the system should not

enter a state in which nodes spend all their time migrating processes

or exchanging control messages without doing any useful work.

vi) Scalable: The algorithm should be scalable i.e. the system should be

able to handle small and large networked systems. A simple

approach to make an algorithm scalable is to probe only m of

N nodes for selecting a host. The value of m can be dynamically

adjusted depending on the value of N.

vii) Fault Tolerance: The algorithm should not be affected by the crash of one or more nodes in the system. At any instant of time, it should continue functioning for the nodes that are up at that time. Algorithms that have a decentralized decision-making capability and consider only available nodes in their decision-making have better fault tolerance.

viii) Fairness of service: How fairly service is allocated is a common concern. For example, two users simultaneously initiating equivalent processes should receive roughly the same quality of service. What is desirable is a fair strategy that improves the response time of heavily loaded nodes without unduly affecting that of lightly loaded ones. For this, the concept of load balancing has to be replaced by load sharing, i.e., a node shares some of its resources as long as its own users are not significantly affected.

6.3 Task Assignment Approach

In this approach, a process is considered to be composed of multiple tasks

and the goal is to find an optimal assignment policy for the tasks of an

individual process. The following are typical assumptions for the task

assignment approach:

A process is already split into pieces, called tasks


The amount of computation required for each task and the speed of the

processors are known

Cost of processing each task at every node is known

The interprocess communication cost between every pair of tasks is known

The resource requirements of each task are known

Reassignment of tasks is generally not possible

Some of the goals of a good task assignment algorithm are:

Minimize IPC costs (this problem can be modeled using a network flow model)

Efficient resource utilization

Quick turnaround time

A high degree of parallelism

Why do we need Load Balancing or Load Sharing?

Consider a set of N identical servers (i.e., servers with the same task arrival rate and the same service rate).

Let ρ = the utilization of each server

Let P0 = 1 − ρ, the probability that a server is idle

Let P = the probability that at least one task is waiting for service while at least one server is idle

Then

P = Σ (i = 1 to N − 1) C(N, i) Qi HN−i

where C(N, i) is the binomial coefficient "N choose i" and

Qi = the probability that a given set of i servers is idle
   = P0^i, from the independence of the servers

HN−i = the probability that a given set of N − i servers is not idle and a task is waiting for service at one or more of them
   = {probability that all N − i servers have at least one task} − {probability that all N − i servers have exactly one task}
   = (1 − P0)^(N−i) − [(1 − P0) P0]^(N−i)

Expanding the sum with the Binomial Theorem, (a + b)^N = Σ (i = 0 to N) C(N, i) a^i b^(N−i), and substituting P0 = 1 − ρ gives

P = 1 − ρ^N [1 − (1 − ρ)^N] − (1 − ρ²)^N
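Evaluating this closed form for a few utilizations shows that, for even moderately loaded systems, there is a substantial probability that some server sits idle while work queues elsewhere, which is exactly the waste that load balancing and load sharing try to remove. The short script below is an illustrative calculation only, not part of the source text.

```python
# Evaluate P = 1 - rho^N * (1 - (1 - rho)^N) - (1 - rho^2)^N for a few
# utilisations rho and N = 10 servers.  Purely illustrative.

def wasted_capacity_probability(rho, n):
    return 1 - rho**n * (1 - (1 - rho)**n) - (1 - rho**2)**n

for rho in (0.2, 0.5, 0.8):
    print(rho, round(wasted_capacity_probability(rho, n=10), 3))
```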

6.4 Load-Balancing Approach

The scheduling algorithms that use this approach are known as Load

Balancing or Load-Leveling Algorithms. These algorithms are based on the

intuition that for better resource utilization, it is desirable for the load in a

distributed system to be balanced evenly. Thus a load balancing algorithm

tries to balance the total system load by transparently transferring the

workload from heavily loaded nodes to lightly loaded nodes in an attempt to

ensure good overall performance relative to some specific metric of system

performance.

We can have the following categories of load balancing algorithms:

1. Static: Ignore the current state of the system. For example, if a node is heavily loaded, it picks a task randomly and transfers it to a random node.

These algorithms are simpler to implement but performance may not be

good.

2. Dynamic: Use current state information for load balancing. Although there is an overhead involved in collecting state information periodically, these algorithms generally perform better than static algorithms.

3. Deterministic: Algorithms in this class use the processor and process

characteristics to allocate processes to nodes.

4. Probabilistic: Algorithms in this class use information regarding static

attributes of the system such as number of nodes, processing capability,

etc.


5. Centralized: System state information is collected by a single node.

This node makes all scheduling decisions.

6. Distributed: The most desirable approach. Each node is equally responsible for making scheduling decisions based on its local state and the state information received from other sites.

7. Cooperative: A distributed dynamic scheduling algorithm. In these

algorithms, the distributed entities cooperate with each other to make

scheduling decisions. Therefore they are more complex and involve

larger overhead than non-cooperative ones. But the stability of a

cooperative algorithm is better than that of a non-cooperative one.

8. Non-cooperative: A distributed dynamic scheduling algorithm. In these

algorithms, individual entities act as autonomous entities and make

scheduling decisions independently of the action of other entities.

Load Estimation Policy: This policy makes an effort to measure the load at

a particular node in a distributed system according to the following criteria:

The number of processes running at a node as a measure of the load at

the node.

The CPU utilization as a measure of load

None of the above fully captures the load at a node; other parameters, such as the resource demands of these processes, the architecture and speed of the processor, the total remaining execution time of the processes, etc., should be taken into consideration as well.

Process Transfer Policy: The strategy of load balancing algorithms is

based on the idea of transferring some processes from the heavily loaded

nodes to lightly loaded nodes. To facilitate this, it is necessary to devise a

policy to decide whether or not a node is lightly or heavily loaded. The

threshold value of a node is the limiting value of its workload and is used to

decide whether a node is lightly or heavily loaded.


The threshold value of a node may be determined by any of the following

methods:

1. Static Policy: Each node has a predefined threshold value. If the number of processes exceeds this threshold, a process is transferred. This can cause process thrashing under heavy load, leading to instability.

2. Dynamic Policy: In this method, the threshold value is dynamically

calculated. It is increased under heavy load and decreased under light

load. Thus process thrashing does not occur.

3. High-low Policy: Each node has two threshold values, high and low. The state of a node is overloaded, under-loaded, or normal, depending on whether its number of processes is greater than the high threshold, less than the low threshold, or in between, as sketched below.
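The following is a minimal sketch of the high-low policy just described. The threshold values and the use of the process count as the load measure are illustrative assumptions.

```python
# Classify a node's state against two thresholds (high-low transfer policy).

def node_state(num_processes, low=2, high=6):
    if num_processes > high:
        return "overloaded"        # candidate sender
    if num_processes < low:
        return "under-loaded"      # candidate receiver
    return "normal"

print(node_state(8), node_state(1), node_state(4))
# -> overloaded under-loaded normal
```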

Location Policies:

Once a decision has been made through the transfer policy to transfer a

process from a node, the next step is to select the destination node for that

process’ execution. This selection is made by the location policy of a

scheduling algorithm. The main location policies proposed are as follows:

1. Threshold: A random node is polled to check its state, and the task is transferred if the polled node will not become overloaded; polling continues until a suitable node is found or a threshold number of nodes have been polled. Experiments show that polling 3 to 5 nodes performs about as well as polling a large number of nodes (say, 20), and gives a substantial performance improvement over no load balancing at all (see the sketch after this list).

2. Shortest: A predetermined number of nodes are polled and the node

with minimum load among these is picked for the task transfer; if that

node is overloaded the task is executed locally.

3. Bidding: In this method, each node acts both as a manager (the one that tries to transfer a task) and as a contractor (the one that is able to accept a new task). The manager broadcasts a request-for-bids to all the nodes. A contractor returns a bid (a quoted price based on its processor capability, memory size, resource availability, etc.). The manager chooses the best bidder for transferring the task. Problems that can arise when two or more managers broadcast concurrently need to be addressed.

4. Pairing: This approach tries to reduce the variance in load between

pairs of nodes. In this approach, two nodes that differ greatly in load are

paired with each other so they can exchange tasks. Each node asks a

randomly picked node if it will pair with it. After a pairing is formed, one or more processes are transferred from the heavily loaded node to the lightly loaded node.
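The sketch below illustrates the threshold and shortest location policies from this list. The probe() and overloaded() callables and the poll limit are assumed placeholders for whatever load-probing mechanism a real system would use.

```python
# Sketch of the threshold and shortest location policies.  probe(node) is
# assumed to return the polled node's current load; overloaded(load) tests
# it against the transfer policy's threshold.  Illustrative only.

import random

def threshold_policy(nodes, probe, overloaded, poll_limit=5):
    """Poll random nodes; transfer to the first one that is not overloaded."""
    for node in random.sample(nodes, min(poll_limit, len(nodes))):
        if not overloaded(probe(node)):
            return node                 # destination found
    return None                         # give up: execute the task locally

def shortest_policy(nodes, probe, overloaded, poll_limit=5):
    """Poll a fixed number of nodes and pick the least loaded one."""
    polled = random.sample(nodes, min(poll_limit, len(nodes)))
    best = min(polled, key=probe)
    return None if overloaded(probe(best)) else best
```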

State Information Exchange Policies:

The dynamic policies require frequent exchange of state information among the nodes of the system. In fact, a dynamic load-balancing algorithm faces a

transmission dilemma because of the two opposing impacts the

transmission of a message has on the overall performance of the system.

On one hand, transmission improves the ability of the algorithm to balance

the load. On the other hand, it raises the expected queuing time of

messages because of the increase in the utilization of the communication

channel. Thus proper selection of the state information exchange policy is

essential. The proposed load balancing algorithms use one of the following

policies for the purpose:

1. Periodic Broadcast: Each node broadcasts its state information

periodically, say every t time units. It does not scale well and causes

heavy network traffic. May result in fruitless messages.

2. Broadcast When State Changes: This avoids fruitless messages. A

node broadcasts its state only when its state changes. For example,

when the state changes from normal to low or normal to high, etc.


3. On-Demand Exchange: Under this approach

A node broadcasts a state information request when its state

changes from normal load region to high or low load.

Upon receiving this request, other nodes send their current state

information to the requesting node.

If the requesting node includes its own state information in the request, then only those nodes that can cooperate with the requesting node need to send a reply.

4. Exchange by Polling: In this approach the state information is

exchanged with a polled node only. Polling stops after a predetermined number of polls or after a suitable partner is found, whichever happens first.

5. Priority Assignment Policies: One of the following priority assignment

rules may be used to assign priorities to local and remote processes

(i.e. processes that have migrated from other nodes):

i) Selfish: Local processes are given higher priority than remote

processes.

Studies show that this approach yields the worst response time of the three policies.

This approach penalizes processes that arrive at a busy node

because they will be transferred and hence will execute as low

priority processes. It favors the processes that arrive at lightly

loaded nodes.

ii) Altruistic: Remote processes are given higher priority than local

processes

Studies show that this approach yields the best response time of the three approaches.

Under this approach, remote processes incur lower delays than

local processes.


iii) Intermediate: If local processes outnumber remote processes, local processes get higher priority; otherwise, remote processes get higher priority.

Studies show that the overall response time under this policy is much closer to that of the altruistic policy.

Under this policy, local processes are treated better than the

remote processes for a wide range of loads.

iv) Migration-Limiting Policies: This policy decides the total number of times a process should be allowed to migrate.

Uncontrolled: A remote process is treated like a local process, so there is no limit on the number of times it can migrate.

Controlled: Most systems use a controlled policy to overcome the instability problem.

Migrating a partially executed process is expensive; so, many

systems limit the number of migrations to 1. For long running

processes, it might be beneficial to migrate more than once.

6.5 Load-Sharing Approach

Several researchers believe that load balancing, with its implication of

attempting to equalize workload on all the nodes of the system, is not an

appropriate objective. This is because the overhead involved in gathering

the state information to achieve this objective is normally very large,

especially in distributed systems having a large number of nodes. In fact, for

the proper utilization of resources of a distributed system, it is not required

to balance the load on all the nodes. It is necessary and sufficient to prevent

the nodes from being idle while some other nodes have more than two

processes. This approach is therefore called dynamic load sharing rather than dynamic load balancing.


Issues in Load-Sharing Algorithms:

The design of a load sharing algorithm requires that proper decisions be

made regarding load estimation policy, process transfer policy, state

information exchange policy, priority assignment policy, and migration

limiting policy. It is simpler to decide about most of these policies in case of

load sharing, because load sharing algorithms do not attempt to balance the

average workload of all the nodes of the system. Rather, they only attempt

to ensure that no node is idle when a node is heavily loaded. The priority

assignment policies and the migration limiting policies for load-sharing

algorithms are the same as those of load-balancing algorithms.

1. Load Estimation Policies: In this an attempt is made to ensure that no

node is idle while processes wait for service at some other node. In general,

the following two approaches are used for estimation:

Use number of processes at a node as a measure of load

Use the CPU utilization as a measure of load

Process Transfer Policies: Load sharing algorithms are interested in busy

or idle states only and most of them employ the all-or-nothing strategy given

below:

All or Nothing Strategy: It uses a single threshold policy. A node becomes

a candidate to accept tasks from remote nodes only when it becomes idle. A

node becomes a candidate for transferring a task as soon as it has more

than one task. Under this approach, a node that has just become idle is not able to acquire a task immediately, thus wasting processing power. To avoid this, the threshold value can be set to 2 instead of 1.
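The all-or-nothing strategy reduces to two trivial tests, sketched below; the threshold parameter and the use of the queue length as the load measure are illustrative.

```python
# All-or-nothing transfer policy: with threshold t = 1 a node is a receiver
# only when idle and a sender as soon as it holds more than one task;
# setting t = 2 lets a node ask for work slightly before it runs dry.

def is_receiver(queue_length, threshold=1):
    return queue_length < threshold          # idle (or nearly idle) node

def is_sender(queue_length, threshold=1):
    return queue_length > threshold          # more work than it can run at once

print(is_receiver(0), is_sender(3))          # True True
```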

Location Policies: Location Policy decides the sender node or the receiver

node of a process that is to be moved within the system for load sharing.

Depending on the type of node that takes the initiative to globally search for

a suitable node for the process, the location policies are of the following

types:


1. Sender-Initiated Policy: Under this policy, heavily loaded nodes search for lightly loaded nodes to which tasks may be transferred. The search can be done by sending a broadcast message or by probing randomly picked nodes.

An advantage of this approach is that the sender can transfer freshly arrived tasks, so no preemptive task transfers occur.

A disadvantage of this approach is that it can cause system instability under high system load.

2. Receiver-Initiated Location Policy: Under this policy, lightly loaded

nodes search for heavily loaded nodes from which tasks may be

transferred

The search for a sender can be done by sending a broadcast

message or by probing randomly picked nodes.

A disadvantage of this approach is that it may result in preemptive task transfers, because the sender may not have any freshly arrived tasks.

An advantage is that it does not cause system instability: under high system load a receiver will quickly find a sender, and under low system load the processing of some additional control messages is acceptable.

3. Symmetrically Initiated Location Policy: Under this approach, both

senders and receivers search for receivers and senders respectively.

4. State Information Exchange Policies: Since it is not necessary to

equalize load at all nodes under load sharing, state information is

exchanged only when the state changes.

5. Broadcast When State Changes: A node broadcasts a state

information request message when it becomes under-loaded or

overloaded.

In the sender-initiated approach a node broadcasts this message

only when it is overloaded.


In the receiver-initiated approach, a node broadcasts this message

only when it is under-loaded.

6. Poll When State Changes: When a node’s state changes,

it randomly polls other nodes one by one and exchanges state

information with the polled nodes.

Polling stops when a suitable node is found or a threshold number of

nodes have been polled.

Under sender initiated policy, sender polls to find suitable receiver.

Under receiver initiated policy, receiver polls to find suitable sender.

The Above-Average algorithm by Krueger and Finkel (a dynamic load-balancing algorithm) tries to maintain the load at each node within an acceptable range of the system average.

7. Transfer Policy: A threshold policy that uses two adaptive thresholds,

the upper threshold and the lower threshold.

A node with load lower than the lower threshold is considered a receiver.

A node with load higher than the upper threshold is considered a sender.

A node’s estimated average load is supposed to lie in the middle of

the lower and upper thresholds.

6.6 Terminal Questions

1. Discuss the desirable features of a good global scheduling algorithm.

2. Discuss the Task Assignment approach.

3. Discuss the Load Sharing approach.


Unit 7 Process Management

Structure:

7.1 Introduction

Objectives

7.2 Process Migration

7.3 Threads

7.4 Terminal Questions

7.1 Introduction

The notion of a process is central to the understanding of operating

systems. There are quite a few definitions presented in the literature, but no

"perfect" definition has yet appeared.

Definition

The term "process" was first used by the designers of MULTICS in the 1960s. Since then, the term process has been used somewhat interchangeably

with 'task' or 'job'. The process has been given many definitions, for

instance:

A program in Execution.

An asynchronous activity.

The 'animated spirit' of a procedure in execution.

The entity to which processors are assigned.

The 'dispatchable' unit.

and many more definitions have been given. As we can see from the above, there is no universally agreed-upon definition, but "a program in execution" seems to be the most frequently used, and this is the concept adopted in the present study of operating systems.


Now that we have agreed upon the definition of a process, the question is: what is the relation between a process and a program? Is it the same beast with a different name, called a program when it is sleeping (not executing) and a process when it is executing? To be precise, a process is not the same as a program. In the following discussion we point out some of the differences between a process and a program.

A process is more than the program code: it is an 'active' entity, as opposed to a program, which is a 'passive' entity. As we all know, a program is an algorithm expressed in some suitable notation (e.g., a programming language). Being passive, a program is only a part of a process. A process, on the other hand, includes:

The current value of the program counter (PC)

The contents of the processor's registers

The values of the variables

The process stack (SP), which typically contains temporary data such as subroutine parameters, return addresses, and temporary variables.

A data section that contains global variables.

A process is the unit of work in a system.

In the process model, all software on the computer is organized into a number of sequential processes. A process includes a PC, registers, and variables.

Conceptually, each process has its own virtual CPU. In reality, the CPU

switches back and forth among processes. (The rapid switching back and

forth is called multiprogramming).

Process Management

In a conventional (or centralized) operating system, process management

deals with mechanisms and policies for sharing the processor of the system

among all processes. In a Distributed Operating system, the main goal of


process management is to make the best possible use of the processing

resources of the entire system by sharing them among all the processes.

Three important concepts are used in distributed operating systems to

achieve this goal:

1. Processor Allocation: It deals with the process of deciding which

process should be assigned to which processor.

2. Process Migration: It deals with the movement of a process from its

current location to the processor to which it has been assigned.

3. Threads: They deal with fine-grained parallelism for better utilization of

the processing capability of the system.

This unit describes the concepts of process migration and threads.

Issues in Process Management

Transparent relocation of processes

– Preemptive process migration – costly

– Non-preemptive process migration

Selecting the source and destination nodes for migration

Cost of migration – size of the address space and time taken to migrate

Address space transfer mechanisms – total freezing, pre-transferring, transfer on reference

Message forwarding for migrated processes

– Resending the message

– The origin site mechanism

– Link traversal mechanism

– Link update mechanism

Process migration in heterogeneous systems

Objectives:

This unit introduces the reader to the management of processes in a distributed network. It discusses the differences between processes running on a uniprocessor system and processes running on a distributed system. It speaks about process migration mechanisms in which

the processes may be shifted or migrated to different machines on the

network depending on the availability of resources to complete the process

execution. It also discusses the concept of threads, their mechanisms, and

differences between a thread and a process on uni-processor system and a

distributed system.

7.2 Process Migration

Definition:

The relocation of a process from its current location (the source system) to

some other location (Destination).

A process may be migrated either before it starts executing on its source node or during the course of its execution. The former is known as non-preemptive process migration and the latter as preemptive process migration.

Process migration involves the following steps:

1. Selection of a process to be migrated

2. Selection of destination system or node

3. Actual transfer of the selected process to the destination system or node

The following are the desirable features of a good process migration

mechanism:

A good process migration mechanism must possess transparency, minimal interference, minimal residual dependencies, efficiency, and robustness.

i) Transparency: Levels of transparency:

Access to objects such as files and devices should be done in a location-independent manner. To accomplish this, the system should provide a mechanism for transparent object naming.

System calls should be location-independent. However, system calls related to the physical properties of a node need not be location-independent.

Interprocess communication should be transparent. Messages sent to a migrated process should be delivered to it transparently; i.e., the sender does not have to resend them.

ii) Minimal Interference: Migration of a process should involve minimal

interference to the progress of the process and to the system as a

whole. For example, the freezing time should be minimized; this can be done by partial transfer of the address space.

iii) Minimal residual dependencies: A migrated process should not continue to depend in any way on its previous node, because such dependency diminishes the benefits of migration, and a failure of the previous node would cause the process to fail.

iv) Efficiency: Time required for migrating a process and cost of supporting

remote execution should be minimized.

v) Robustness: Failure of any node other than the one on which the

process is running should not affect the execution of the process.

Process Migration Mechanism

Migration of a process is a complex activity that involves proper handling of

several sub-activities in order to meet the requirements of a good process

migration mechanism. The four major subactivities involved in process

migration are as follows:

1. Freezing the process and restarting on another node.

2. Transferring the process’ address space from its source node to its

destination node

3. Forwarding messages meant for the migrant process


4. Handling communication between cooperating processes that have

been separated as a result of process migration.

The commonly used mechanisms for handling each of these subactivities

are described below:

1. Mechanisms for freezing the process:

In pre-emptive process migration, the usual approach is to take a "snapshot" of the process's state on its source node and reinstate the snapshot on the

destination node. For this, at some point during migration, the process is

frozen on its source node, its state information is transferred to its

destination node, and the process is restarted on its destination node using

this state information. By freezing this process, we mean that the execution

of the process is suspended and all external interactions with the process

are deferred.

Some general issues involved in these operations are described below:

i) Immediate and delayed blocking: When can these two approaches be

used?

If the process is not executing a system call, it can be blocked

immediately.

If a process is executing a system call, it may or may not be

possible to block it immediately, depending on the situation and

implementation.

ii) Fast and slow I/O operations: It is feasible to wait for fast I/O operations (e.g., disk I/O) after blocking. However, it is not feasible to wait for slow I/O operations, such as those on a terminal, so proper mechanisms are necessary for these I/O operations to continue after migration.

iii) Information about open files: Names of files, file descriptors, current modes, current positions of their file pointers, etc., need to be preserved and transferred. Also, temporary files are more efficiently created at the node on which the process is executing.

iv) Reinstating the process on the destination node: This involves creating an empty process on the destination node, copying the transferred state into it, and unfreezing it.

v) Address Transfer mechanisms: Migration of a process involves the

transfer of the process state (includes contents of registers, memory

tables, I/O states, process identifiers, etc.) and the process’s address

space (i.e., code, data, and the program stack).

There are three ways to transfer the address space:

a) Total freezing: Process execution is stopped while the address

space is being transferred. It is simple but inefficient

b) Pre-transferring: The address space is transferred while the process is still running on the source node. The pre-transfer is followed by repeated transfers of the pages modified during the previous transfer (see the sketch after this list).

c) Transfer on reference: Only part of the address space is

transferred. The rest of the address space is transferred only on

demand.
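The pre-transferring (pre-copy) mechanism of item (b) can be pictured as an iterative copy loop, sketched below. The page-tracking and transfer callables, the round limit, and the "small enough" cut-off are all illustrative assumptions rather than the mechanism of any particular system.

```python
# Sketch of pre-transferring: copy the whole address space while the process
# keeps running, then keep re-copying the pages dirtied in the meantime until
# the remaining set is small enough to move during a short freeze.

def pre_transfer(pages, dirtied_since, send_pages, freeze_and_finish,
                 small_enough=8, max_rounds=5):
    send_pages(pages)                         # first full copy; process still runs
    for _ in range(max_rounds):
        dirty = dirtied_since()               # pages modified during the last copy
        if len(dirty) <= small_enough:
            break
        send_pages(dirty)                     # iterative re-copy of dirty pages
    freeze_and_finish(dirtied_since())        # brief freeze: ship the residue
```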

vi) Message forwarding mechanisms: After the process has been migrated, messages bound for it should be forwarded to its current location. The types of messages involved and the mechanisms used to forward them are described in the next subsection.

Message Forwarding Mechanisms

In moving a message, it must be ensured that all pending, en-route, and

future messages arrive at the process’s new location. The messages to be

forwarded to the migrant process’s new location can be classified into the

following:

Type 1: Messages received at the source node after the process’s

execution has been stopped on its source node and the process’s execution

has not yet been started on its destination node.

Type 2: Messages received at the source node after the process’s

execution has started on its destination node.

Type 3: Messages that are to be sent to the migrant process from any other

node after it has started executing on the destination node.

The different mechanisms used for message forwarding in existing

distributed systems are described below:

1. Resending the message: Instead of the source node forwarding the

messages received for the migrated process, it notifies the sender about

the status of the process. The sender locates the process and resends

the message.

2. Origin site mechanism: Process’s origin site is embedded in the

process identifier.

Each site is responsible for keeping information about the current

locations of all the processes created on it.

Messages are always sent to the origin site. The origin site then

forwards it to the process’s current location.


A drawback of this approach is that the failure of the origin site will

disrupt message forwarding.

Another drawback is that there is continuous load on the origin site.

3. Link traversal mechanism: A forwarding address is left at the source

node

The forwarding address has two components

– The first component is a system-wide unique process identifier,

consisting of (id of the node on which the process was created,

local pid)

– The second component is the known location of the process.

This component is updated when the corresponding process is accessed from the node.

Co-processes Handling Mechanisms

In systems that allow process migration, an important issue is the necessity

to provide efficient communication between a process (parent) and its

sub-processes (children), which might have been migrated and placed on

different nodes. The two different mechanisms used by existing distributed

operating systems to take care of this problem are described below:

1. Disallowing separation of co-processes: There are two ways to do

this

Disallow migration of processes that wait for one or more of their

children to complete.

Migrate children processes along with their parent process.

2. Home node or origin site concept: This approach:

Allows the processes and sub-processes to migrate independently.

All communication between the parent and children processes takes place via the home node.


Process Migration in Heterogeneous Systems

The following are ways to handle heterogeneity:

Use an external data representation mechanism.

Issues related to handling floating-point representation need to be addressed; i.e., the number of bits allocated to the mantissa and the exponent should be at least as large as in the largest representation in the system.

Signed infinity and signed 0 representation: Not all nodes in the system

may support this.

Process Migration Merits

Reducing the average response time of the processes

Speeding up individual jobs

Gaining higher throughput

Utilizing resources effectively

Reducing network traffic

Improving system reliability

Improving system security

7.3 Threads

Threads are a popular way to improve application performance through

parallelism. In traditional operating systems the basic unit of CPU utilization

is a process. Each process has its own program counter, register states,

stack, and address space. In operating systems with threads facility, the

basic unit of CPU utilization is a thread. In these operating systems, a

process consists of an address space and one or more threads of control.

Each thread of a process has its own program counter, register states, and

stack. But all the threads of a process share the same address space.

Hence they also share the same global variables. In addition, all threads of

a process also share the same set of operating system resources such as


open files, child processes, semaphores, signals, accounting information,

and so on. Threads share the CPU in the same way as processes do. i.e. on

a uni-processor system, threads run in a time-sharing mode, whereas on a

shared memory multi-processor, as many threads can run simultaneously

as there are processors. Akin to traditional processes, threads can create

child threads, can block waiting for system calls to complete, and can

change states during their course of execution. At a particular instance of

time, a thread can be in any one of several states: Running, Blocked,

Ready, or Terminated. In operating systems with threading facility, a

process having a single thread corresponds to a process of a traditional

operating system. Threads are referred to as lightweight processes and

traditional processes are referred to as heavyweight processes.
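The defining property described above, that all threads of a process share the same address space and global variables, is easy to demonstrate. The sketch below uses Python's threading module purely as an example of a threads facility; the counter and lock are illustrative.

```python
# All threads of a process see the same global variables, so concurrent
# updates must be synchronised.

import threading

counter = 0                       # shared: lives in the process's address space
lock = threading.Lock()

def worker(times):
    global counter
    for _ in range(times):
        with lock:                # without the lock, updates could be lost
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                    # 40000: every thread saw the same variable
```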

Why Threads?

Some of the limitations of the traditional process model are listed below:

1. Many applications wish to perform several largely independent tasks

that can run concurrently, but must share the same address space and

other resources.

Examples include a database server or a file server. UNIX's make facility allows users to compile several files in parallel, using separate processes for each.

2. Creating several processes and maintaining them involves a lot of overhead. When a context switch occurs, the state information of the process (register values, page tables, file descriptors, outstanding I/O requests, etc.) needs to be saved.

3. On UNIX systems, new processes are created using the fork system

call. fork is an expensive system call.

4. Processes cannot take advantage of multiprocessor architectures,

because a process can only use one processor at a time. An application


must create a number of processes and dispatch them to the available

processors.

5. Switching between threads sharing the same address space is considerably cheaper than switching between processes. The traditional UNIX process is single-threaded.

Consider a set of single-threaded processes executing on a uniprocessor machine (Figure 7.1). The first three processes were spawned by a server in response to three clients. The lower two processes run some other server application.

Figure 7.1: Traditional UNIX system – Uniprocessor with single-threaded processes

Now consider two servers running on a uniprocessor system, each running as a single process with multiple threads sharing a single address space. Inter-thread context switching can be handled by either the OS kernel or a user-level threads library.

Eliminating multiple nearly identical address spaces for each application

reduces the load on the memory subsystem.


A disadvantage is that multithreaded processes must be concerned with synchronizing access to objects shared by several of their own threads.

Figure 7.2 shows two multithreaded processes running on a multiprocessor. All threads of one process share the same address space but run on different processors. We get improved performance, but synchronization is more complicated.

Figure 7.2: Multithreaded Processes in a Multiprocessor System

To summarize:

A Process can be divided into two components – a set of threads and a

collection of resources. The collection of resources includes an address space, open files, user credentials, quotas, etc., that are shared by all

threads in the process.

A Thread

is a dynamic object that represents a control point in the process and

that executes a sequence of instructions.

has its own private objects: a program counter, a stack, and a register context.


User-level thread libraries:

The IEEE POSIX standards group generated several drafts of a threads package known as pthreads.

Sun's Solaris OS supports the pthreads library. It has also implemented its own threads library.

Models for Organizing Threads

The following are some ways of organizing threads:

Dispatcher-workers model: A dispatcher thread accepts requests from clients and dispatches each request to an appropriate free worker thread for further processing (a sketch of this model follows the list).

Team model: All threads behave as equals in this model. Each thread gets and processes clients' requests on its own.

Pipeline model: In this model, threads are arranged in a pipeline so that the output data generated by the first thread is used for processing by the second thread, the output of the second thread is used by the third, and so on.
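The dispatcher-workers model mentioned above can be sketched as one thread feeding a pool of workers through a shared queue. The request strings and handler are illustrative stand-ins for real client requests.

```python
# Sketch of the dispatcher-workers model: a dispatcher enqueues client
# requests and a pool of worker threads services them.

import queue
import threading

requests = queue.Queue()

def worker():
    while True:
        req = requests.get()          # block until the dispatcher hands us work
        if req is None:               # sentinel: shut the worker down
            break
        print(f"handled {req} in {threading.current_thread().name}")
        requests.task_done()

workers = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(3)]
for w in workers:
    w.start()

# The dispatcher simply enqueues incoming client requests.
for client_request in ["req-1", "req-2", "req-3", "req-4"]:
    requests.put(client_request)

requests.join()                       # wait until all requests are processed
for _ in workers:
    requests.put(None)                # stop the workers
for w in workers:
    w.join()
```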

User-level Threads Libraries

The interface provided by the threads package must include several

important facilities such as for:

Creating and terminating threads

Suspending and resuming threads

Assigning priorities to the individual threads

Thread scheduling and context switching

Synchronizing activities through facilities such as semaphores and

mutual exclusion locks

Sending messages from one thread to another


Case Study – DCE threads

DCE threads comply with IEEE POSIX (Portable OS interface) standard

known as P-Threads.

DCE provides a set of user-level library procedures for the creation and

maintenance of threads.

To access the thread services, DCE provides an API that is compatible with the POSIX standard.

If a system supporting DCE has no intrinsic support for threads, the API

provides an interface to the thread library that is linked to the application.

If the system supporting DCE has OS kernel support for threads, DCE is

set up to use this facility. In this case the API serves as an interface to

kernel supported threads facility.

7.4 Terminal Questions

1. Differentiate between pre-emptive and non-preemptive process

migration. Mention their advantages and disadvantages.

2. Discuss the issues involved in freezing a migrant process on its source

node and restarting it on its destination node.

3. Discuss the threading issues with respect to process management in a

DSM system.


Unit 8 Distributed File Systems

Structure:

8.1 Introduction

Objectives

8.2 The Key Challenges of Distributed Systems

8.3 Client’s Perspective: File Services

8.4 File Access Semantics

8.5 Server's Perspective: Implementation

8.6 Stateful Versus Stateless Servers

8.7 Replication

8.8 Caching

8.9 Ceph

8.10 Terminal Questions

8.1 Introduction

In a distributed file system (DFS), multiple clients share files provided by a

shared file system. In the DFS paradigm communication between processes

is done using these shared files. Although this is similar to the DSM and

distributed object paradigms (in that communication is abstracted by shared

resources) a major difference between these paradigms and the DFS

paradigm is that the resources (files) in DFS are much longer lived. This

makes it, for example, much easier to provide asynchronous and persistent

communication using shared files than using DSM or distributed objects.

The basic model provided by distributed file systems is that of clients

accessing files and directories that are provided by one or more file servers.

A file server provides a client with a file service interface and a view of the

file system. Note that the view provided to different clients by the same

server may be different, for example, if clients only see files that they are


authorised to access. Access to files is achieved by clients performing

operations from the file service interface (such as create, delete, read, write,

etc.) on a file server. Depending on the implementation the operations may

be executed by the servers on the actual files, or by the client on local

copies of the file. We will return to this issue later.

Objectives:

This unit aims at teaching the students the key aspects of Distributed File

systems. It deals with the design concepts, client and server perspectives of

the file systems, and so on. It presents various examples of distributed file

systems in use.

8.2 The Key Challenges of Distributed Systems

A good distributed file system should have the features described below:

i) Transparency

Location: a client cannot tell where a file is located

Migration: a file can transparently move to another server

Replication: multiple copies of a file may exist

Concurrency: multiple clients access the same file

ii) Flexibility

In a flexible DFS it must be possible to add or replace file servers.

Also, a DFS should support multiple underlying file system types

(e.g., various Unix file systems, various Windows file systems, etc.)

iii) Reliability

In a good distributed file system, the probability of loss of stored data should be minimized as far as possible; i.e., users should not feel compelled to make backup copies of their files because of the unreliability of the system. Rather, the file system should automatically

generate backup copies of critical files that can be used in the event of


loss of the original ones. Stable storage is a popular technique used by

several file systems for higher reliability.

iv) Consistency:

Employing replication and allowing concurrent access to files may

introduce consistency problems.

v) Security:

Clients must authenticate themselves and servers must determine

whether clients are authorised to perform requested operation.

Furthermore communication between clients and the file server must

be secured.

vi) Fault tolerance:

Clients should be able to continue working if a file server crashes.

Likewise, data must not be lost and a restarted file server must be able

to recover to a valid state.

vii) Performance:

In order for a DFS to offer good performance it may be necessary to

distribute requests across multiple servers. Multiple servers may also

be required if the amount of data stored by a file system is very large.

viii) Scalability:

A scalable DFS will avoid centralised components such as a

centralised naming service, a centralised locking facility, and a

centralised file store. A scalable DFS must be able to handle an

increasing number of files and users. It must also be able to handle

growth over a geographic area (e.g., clients that are widely spread

over the world), as well as clients from different administrative

domains.


8.3 Client’s Perspective: File Services

The File Service Interface represents files as an uninterpreted sequence of

bytes that are associated with a set of attributes (owner, size, creation date,

permissions, etc.) including information regarding protection (i.e., access

control lists or capabilities of clients). Moreover, there is a choice between

the upload/download model and the remote access model. In the first

model, files are downloaded from the server to the client. Modifications are

performed directly at the client after which the file is uploaded back to the

server. In the second model all operations are performed at the server itself,

with clients simply sending commands to the server.

There are benefits and drawbacks to both models. The first model, for

example, can avoid generating traffic every time it performs operations on a

file. Also, a client can potentially use a file even if it cannot access the file

server. A drawback of performing operations locally and then sending an

updated file back to the server is that concurrent modification of a file by

different clients can cause problems. The second approach makes it

possible for the file server to order all operations and therefore allow

concurrent modifications to the files. A drawback is that the client can only

use files if it has contact with the file server. If the file server goes down, or

the network connection is broken, then the client loses access to the files.
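The difference between the two models can be sketched as a tiny client-side file wrapper for the upload/download case. The server object and its get/put methods are assumed for illustration only; they are not part of any particular file service interface described in the text.

```python
# Sketch of the upload/download model: the client fetches the whole file,
# works on a local copy, and writes it back when done.

class DownloadUploadFile:
    def __init__(self, server, name):
        self.server, self.name = server, name
        self.data = bytearray(server.get(name))       # download the whole file

    def read(self):
        return bytes(self.data)                       # local operation: no network

    def write(self, payload):
        self.data = bytearray(payload)                # modify the local copy only

    def close(self):
        self.server.put(self.name, bytes(self.data))  # upload the result

# In the remote access model, read() and write() would instead be sent to the
# server as individual operations, so the server sees (and can order) every access.
```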

8.4 File Access Semantics

Ideally, the client would perceive remote files just like local ones.

Unfortunately, the distributed nature of a DFS makes this goal hard to

achieve. In the following discussion, we present the various file access

semantics available, and discuss how appropriate they are to a DFS.


The first type of access semantics that we consider are called Unix

semantics and they imply the following:

A read after a write returns the value just written.

When two writes follow in quick succession, the second persists.

In the case of a DFS, it is possible to achieve such semantics if there is only

a single file server and no client-side caching is used. In practice, such a

system is unrealistic because caches are needed for performance and write-

through caches (which would make Unix semantics possible to combine

with caching) are expensive. Furthermore, deploying only a single file server is bad for scalability. Because of this, it is impossible to achieve Unix semantics with distributed file systems.

Alternative semantic models that are better suited for a distributed

implementation include:

1. Session semantics,

2. Immutable files, and

3. Atomic transactions.

1. Session Semantics:

In the case of session semantics, changes to an open file are only

locally visible. Only after a file is closed, are changes propagated to the

server (and other clients). This raises the issue of what happens if two

clients modify the same file simultaneously. It is generally up to the

server to resolve conflicts and merge the changes. Another problem with

session semantics is that parent and child processes cannot share file

pointers if they are running on different machines.

2. Immutable Files:

Immutable files cannot be altered after they have been closed. In order

to change a file, instead of overwriting the contents of the existing file a

new file must be created. This file may then replace the old one as a


whole. This approach to modifying files does require that directories

(unlike files) be updatable. Problems with this approach include a race

condition when two clients try to replace the same file as well as the

question of what to do with processes that are reading a file at the same

time as it is being replaced by another process.

3. Atomic Transactions:

In the transaction model, a sequence of file manipulations can be

executed indivisibly, which implies that two transactions can never

interfere. This is the standard model for databases, but it is expensive to

implement.

8.5 Server’s Perspective: Implementation

Observations about the expected use of a file system can be used to guide

the design of a DFS. For example, a study by Satyanarayanan found the

following usage patterns for Unix systems at a university:

Most files are small – less than 10k

Reading is much more common than writing

Usually access is sequential; random access is rare

Most files have a short lifetime

File sharing is unusual

Most processes use only a few files

Distinct file classes with different properties exist

These usage patterns (small files, sequential access, high read-write ratio)

would suggest that an upload/download model for a DFS would be

appropriate. Note, however, that different usage patterns may be observed

at different kinds of institutions. In situations where the files are large, and

are updated more often it may make more sense to use a DFS that

implements a remote access model.


Besides the usage characteristics, implementation tradeoffs may depend on

the requirements of a DFS. These include supporting a large file system,

supporting many users, the need for high performance, and the need for

fault tolerance. Thus, for example, a fault tolerant DFS may sacrifice some

performance for better reliability guarantees, while a high performance DFS

may sacrifice security and wide-area scalability in order to achieve extra

performance.

8.6 Stateful Vs Stateless Servers

The file servers that implement a distributed file service can be stateless or

stateful. Stateless file servers do not store any session state. This means

that every client request is treated independently, and not as part of a new

or existing session. Stateful servers, on the other hand, do store session

state. They may, therefore, keep track of which clients have opened which

files, current read and write pointers for files, which files have been locked

by which clients, etc.

The main advantage of stateless servers is that they can easily recover from

failure. Because there is no state that must be restored, a failed server can

simply restart after a crash and immediately provide services to clients as

though nothing happened. Furthermore, if clients crash, the server is not stuck with abandoned open or locked files. Another benefit is that the

server implementation remains simple because it does not have to

implement the state accounting associated with opening, closing, and

locking of files.

The main advantage of stateful servers, on the other hand, is that they can

provide better performance for clients. Because clients do not have to

provide full file information every time they perform an operation, the size of

messages to and from the server can be significantly decreased. Likewise


the server can make use of knowledge of access patterns to perform read-

ahead and do other optimisations. Stateful servers can also offer clients

extra services such as file locking, and remember read and write positions.
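The contrast can be sketched in a few lines. The request formats, class names, and data below are assumptions made purely for illustration and do not correspond to any particular file service protocol:

# Toy sketch: the same read served by a stateless and by a stateful server.

FILES = {"f1": b"hello world"}

# Stateless: every request carries everything needed (file, offset, count),
# so a restarted server can answer it without recovering any session state.
def stateless_read(request):
    data = FILES[request["file"]]
    return data[request["offset"]: request["offset"] + request["count"]]

# Stateful: the client opens the file once; the server remembers the read
# pointer, so later requests can be much smaller.
class StatefulServer:
    def __init__(self):
        self.open_files = {}      # session state: handle -> [file name, position]
        self.next_handle = 0

    def open(self, name):
        self.next_handle += 1
        self.open_files[self.next_handle] = [name, 0]
        return self.next_handle

    def read(self, handle, count):
        name, pos = self.open_files[handle]
        self.open_files[handle][1] = pos + count   # advance the stored read pointer
        return FILES[name][pos: pos + count]

print(stateless_read({"file": "f1", "offset": 0, "count": 5}))   # b'hello'
server = StatefulServer()
h = server.open("f1")
print(server.read(h, 5), server.read(h, 6))                      # b'hello' b' world'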

8.7 Replication

The main approach to improving the performance and fault tolerance of a

DFS is to replicate its content. A replicating DFS maintains multiple copies

of files on different servers. This can prevent data loss, protect a system

against down time of a single server, and distribute the overall workload.

There are three approaches to replication in a DFS:

1. Explicit replication: The client explicitly writes files to multiple servers.

This approach requires explicit support from the client and does not

provide transparency.

2. Lazy file replication: The server automatically copies files to other

servers after the files are written. Remote files are only brought up to

date when the files are sent to the server. How often this happens is up

to the implementation and affects the consistency of the file state.

3. Group file replication: write requests are simultaneously sent to a

group of servers. This keeps all the replicas up to date, and allows

clients to read consistent file state from any replica.
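A minimal sketch of the third approach, group file replication, is given below; the Replica class and the plain loop standing in for a multicast RPC are assumptions made only for illustration:

# Toy group file replication: a write goes to every replica, so a consistent
# copy can afterwards be read from any of them.

class Replica:
    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files.get(path)

replicas = [Replica(), Replica(), Replica()]

def replicated_write(path, data):
    # In a real system this would be a (multicast) RPC to a server group;
    # here a simple loop stands in for it.
    for replica in replicas:
        replica.write(path, data)

replicated_write("notes.txt", b"v1")
print(all(r.read("notes.txt") == b"v1" for r in replicas))   # True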

8.8 Caching

Besides replication, caching is often used to improve the performance of a

DFS. In a DFS, caching involves storing either a whole file, or the results of

file service operations. Caching can be performed at two locations: at the

server and at the client. Server-side caching makes use of file caching

provided by the host operating system. This is transparent to the server and

helps to improve the server’s performance by reducing costly disk accesses.


Client-side caching comes in two flavours: on-disk caching, and in-memory

caching. On-disk caching involves the creation of (temporary) files on the

client’s disk. These can either be complete files (as in the upload/download

model) or they can contain partial file state, attributes, etc. In-memory

caching stores the results of requests in the client-machine’s memory. This

can be process-local (in the client process), in the kernel, or in a separate

dedicated caching process.

The issue of cache consistency in DFS has obvious parallels to the

consistency issue in shared memory systems, but there are other tradeoffs

(for example, disk access delays come into play, the granularity of sharing is

different, sizes are different, etc.). Furthermore, because write-through

caches are too expensive to be useful, the consistency of caches will be

weakened. This makes implementing Unix semantics impossible.

Approaches used in DFS caches include delayed writes, where writes are not propagated to the server immediately but in the background later on, and write-on-close, where the server receives updates only after the file is

closed. Adding a delay to write-on-close has the benefit of avoiding

superfluous writes if a file is deleted shortly after it has been closed.
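The combination of delayed writes and write-on-close can be sketched as follows. This is only an illustrative, in-memory client cache; the server object and its store() method are hypothetical placeholders, not a real DFS interface:

# Toy client cache using write-on-close plus a short delay, so that a file
# deleted shortly after being closed never causes a write to the server.
import time

class CachingClient:
    def __init__(self, server, delay=2.0):
        self.server = server
        self.delay = delay
        self.cache = {}       # path -> locally modified contents
        self.pending = {}     # path -> time at which the file was closed

    def write(self, path, data):
        self.cache[path] = data                # delayed write: nothing sent yet

    def close(self, path):
        self.pending[path] = time.time()       # schedule a write-back

    def delete(self, path):
        self.cache.pop(path, None)             # a quick delete cancels the write-back
        self.pending.pop(path, None)

    def flush(self):
        # Called periodically: push only closed files whose delay has expired.
        now = time.time()
        for path, closed_at in list(self.pending.items()):
            if now - closed_at >= self.delay:
                self.server.store(path, self.cache[path])   # hypothetical server call
                del self.pending[path]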

1. Example: Network File System (NFS)

NFS is a remote access DFS that was introduced by Sun in 1985. The

currently used version is version 3; however, a new version (4) has also

been defined. NFS integrates well into Unix’s model of mount points, but

does not implement Unix semantics. NFS servers are stateless (i.e., NFS

does not provide open & close operations). It supports caching, but no

replication. NFS has been ported to many platforms and, because the NFS

protocol is independent of the underlying file system, supports many

different underlying file systems. On Unix, an NFS server runs as a daemon

and reads the file /etc/exports to determine what directories are exported to

whom under which policy (for example, who is allowed to mount them, who


is allowed to access them, etc.). Server-side caching makes use of file

caching as provided by the underlying operating system and is, therefore,

transparent.

On the client side, exported file systems can be explicitly mounted or

mounted on demand (called automounting). NFS can be used on diskless

workstations, so it does not require local disk space for caching files. It does,

however, support client-side caching, and allows both file contents as well

as file attributes to be cached. Although NFS allows caching, it leaves the

specifics up to the implementation. As such, file caching details are

implementation specific. Cache entries are generally discarded after a fixed

period of time and a form of delayed write-through is employed.

Traditionally, NFS trusts clients and servers and thus has only minimal

security mechanisms in place. Typically, the clients simply pass the Unix user ID

and group ID of the process performing a request to the server. This implies

that NFS users must not have root access on the clients, otherwise they

could simply switch their identity to that of another user and then access that

user’s files. New security mechanisms have been put in place, but they also

have their drawbacks:

Secure NFS using Diffie-Hellman public key cryptography is more complex to implement and its keys are harder to manage, and the key lengths used are too short to provide security in practice. Using Kerberos would make NFS more secure, but it has high entry costs.

Example: Andrew File System (AFS)

The Andrew File System (AFS) is a DFS that came out of the Andrew

research project at Carnegie Mellon University (CMU). Its goal was to

develop a DFS that would scale to all computers on the university’s campus.

It was further developed into a commercial product and an open source

branch was later released under the name “OpenAFS”. AFS offers the same


API as Unix, implements Unix semantics for processes on the same

machine, but implements write-on-close semantics globally. All data in AFS

is mounted in the /afs directory and organised in cells (e.g. /afs/cs.cmu.edu).

Cells are administrative units that manage users and servers.

Files and directories are stored on a collection of trusted servers called Vice.

Client processes accessing AFS are redirected by the file system layer to a

local user-level process called Venus (the AFS daemon), which then

connects to the servers. The servers serve whole files, which are cached as

a whole on the clients’ local disks. For cached files a callback is installed on

the corresponding server. After a process finishes modifying a file by closing

it, the changes are written back to the server. The server then uses the

callbacks to invalidate the file in other clients’ caches. As a result, clients do

not have to validate cached files on access (except after a reboot) and

hence there is only very little cache validation traffic. Data is stored on

flexible volumes, which can be resized and moved between the servers of a

cell. Volumes can be marked as read only, e.g. for software installations.

AFS does not trust Unix user IDs and instead uses its own IDs which are

managed at a cell level. Users have to authenticate with Kerberos by using

the klog command. On successful authentication, a token will be installed in

the client's cache manager. When a process tries to access a file, the

cache manager checks if there is a valid token and enforces the access

rights. Tokens have a time stamp and expire, so users have to renew their

token from time to time. Authorisation is implemented by directory-based

ACLs, which allow finer grained access rights than Unix.

2. Example: Coda

Coda is an experimental DFS developed at CMU by M. Satyanarayanan’s

group. It is the successor of the Andrew File System (AFS), but supports

disconnected, mobile operation of clients. Its design is much more ambitious

than that of NFS.


Coda has quite a number of similarities with AFS. On the client side, there is

only a single mount point /coda. This means that the name space appears

the same to all clients (and files therefore have the same name at all

clients). File names are location transparent (servers cannot be

distinguished). Coda provides client-side caching of whole files. The caching

is implemented in a user-level cache process called Venus. Coda provides

Unix semantics for files shared by processes on one machine, but applies

write-on-close (session) semantics globally. Because high availability is one

of Coda's goals, access to a cached copy of a file is only denied if it is known

to be inconsistent.

In contrast to AFS, Coda supports disconnected operation, which works as

follows. While disconnected (a client is disconnected with regards to a file if

it cannot contact any servers that serve copies of that file) all updates are

logged in a client modification log (CML). Upon reconnection, the operations

registered in the CML are replayed on the server. In order to allow clients to

work in disconnected mode, Coda tries to make sure that a client always

has up-to-date cached copies of the files that it might require. This process is

called file hoarding. The system builds a user hoard database which it uses

to update frequently used files using a process called a hoard walk.

Conflicts upon reconnection are resolved automatically where possible,

otherwise, manual intervention becomes necessary.
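The core of disconnected operation, logging updates in a CML and replaying them on reconnection, can be sketched as follows. The classes and method names are invented for the example and are not Coda's actual interfaces:

# Minimal sketch of a client modification log (CML) for disconnected operation.

class Server:
    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

class Client:
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.cml = []                                # operations logged while disconnected

    def write(self, path, data):
        if self.connected:
            self.server.write(path, data)
        else:
            self.cml.append(("write", path, data))   # log instead of sending

    def reconnect(self):
        # Replay the logged operations on the server in order, then clear the log.
        self.connected = True
        for op, path, data in self.cml:
            self.server.write(path, data)
        self.cml.clear()

srv = Server()
client = Client(srv)
client.connected = False
client.write("thesis.txt", b"chapter 1")   # goes into the CML, not to the server
client.reconnect()                         # the CML is replayed on the server
print(srv.files)                           # {'thesis.txt': b'chapter 1'}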

Files in Coda are organised in organisational units called volumes. A volume

is a small logical unit of files (e.g., the home directory of a user or the source

tree of a program). Volumes can be mounted anywhere below the /coda

mount point (in particular, within other volumes). Coda allows files to be

replicated on read/write servers. Replication is organised on a per volume

basis, that is, the unit of replication is the volume. Updates are sent to all

replicas simultaneously using multicast RPCs (Coda defines its own RPC


protocol that includes a multicast RPC protocol). Read operations can be

performed at any replica.

3. Example: Google File System

The Google File System (GFS) is a distributed file system developed to

support a system with very different requirements than traditionally assumed

when developing file systems. GFS was designed and built to support

operations (both production and research) at Google that typically involve

large amounts of data, run distributed over very large clusters, and include

much concurrent access to files. GFS assumes that most data operations

are large sequential reads and large concurrent appends. One of the key

assumptions driving the design is that, because very large clusters (built

from commodity parts) are used, failure (of hardware or software resulting in

crashes or corrupt data) is a regular occurrence rather than an anomaly.

8.9 Ceph

Ceph is a scalable, high-performance research DFS. It targets systems with

huge amounts of data (“petascale systems”) and, like GFS, assumes that

node failures are the norm, not an exception. It assumes that such systems

are built incrementally, that they are inherently dynamic and that workloads

shift over the lifetime of the system. It has three key design features. First,

Ceph decouples data and metadata by using a mapping function that maps

from a file’s unique ID to intelligent object storage devices (OSDs) which

store the file’s data, thus eliminating the need to store explicit allocation lists.

Secondly, Ceph adaptively and intelligently distributes responsibility of

metadata to a cluster of metadata servers. It can thus adapt to changing

workloads which require access to different parts of the metadata and

prevents hot spots from becoming potential bottlenecks. Thirdly, Ceph uses

intelligent OSDs to reliably and autonomically store data. A cluster of OSDs

collectively manages data migration, replication, failure detection and failure

recovery.
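The first idea, replacing explicit allocation lists by a mapping function, can be illustrated with a toy placement function. Ceph's real placement algorithm (CRUSH) is far more sophisticated; the sketch below only shows that any client can compute the same placement deterministically, without consulting a central allocation table. The OSD names and the placement policy are assumptions for the example:

# Toy placement: hash an object ID onto an assumed set of object storage devices.
import hashlib

OSDS = ["osd0", "osd1", "osd2", "osd3"]      # assumed cluster of OSDs

def place(object_id, replicas=2):
    h = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    start = h % len(OSDS)
    # Place the requested number of replicas on consecutive OSDs (toy policy).
    return [OSDS[(start + i) % len(OSDS)] for i in range(replicas)]

print(place("file-42.chunk-0"))   # e.g. ['osd1', 'osd2']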


8.10 Terminal Questions

1. In what aspects is the design of a distributed file system different from

that of a centralized file system?

2. Name the main components of a distributed file system. What might be

the reasons for separating the various functions of a distributed file

system into these components?

3. Discuss the client's and server's perspectives of a distributed file system.

4. Discuss any two example network file systems in use.


Unit 9 Naming

Structure:

9.1 Introduction

Objectives

9.2 Desirable Features of a Good Naming system

9.3 Fundamental Terminologies and Concepts

9.4 System Oriented Names

9.5 Object – Locating Mechanisms

9.6 Human – Oriented Names

9.7 Name Caches

9.8 Naming and Security

9.9 Terminal Questions

9.1 Introduction

In this unit, we first concentrate on different kinds of names, and how names

are organized into name spaces. We then continue with a discussion of the

important issue of how to resolve a name such that the entity it refers to can

be accessed. Also, we explain various options for distributing and

implementing large name spaces across multiple machines. The Internet

Domain Name System and OSI’s X.500 will be discussed as examples of

large-scale naming services.

Names, Identifiers, and Addresses

A name in a distributed system is a string of bits or characters that is used to

refer to an entity. An entity in a distributed system can be practically

anything. Typical examples include resources such as hosts, printers, disks,

and files. Other well-known examples of entities that are often explicitly

named are processes, users, mailboxes, newsgroups, Web pages,

graphical windows, messages, network connections, and so on.


Entities can be operated on. For example, a resource such as a printer

offers an interface containing operations for printing a document, requesting

the status of a print job, and the like. Furthermore, an entity such as a

network connection may provide operations for sending and receiving data,

setting quality-of-service parameters, requesting the status, and so forth.

To operate on an entity, it is necessary to access it, for which we need an

access point. An access point is yet another, but special, kind of entity in a

distributed system. The name of an access point is called an address. The

address of an access point of an entity is also simply called an address of

that entity.

An entity can offer more than one access point. As a comparison, a

telephone can be viewed as an access point of a person, whereas the

telephone number corresponds to an address. Indeed, many people

nowadays have several telephone numbers, each number corresponding to

a point where they can be reached. In a distributed system, a typical

example of an access point is a host running a specific server, with its

address formed by the combination of, for example, an IP address and port

number (i.e., the server’s transport-level address).

An entity may change its access points in the course of time. For example,

when a mobile computer moves to another location, it is often assigned a

different IP address than the one it had before. Likewise, when a person

moves to another city or country, it is often necessary to change telephone

numbers as well. In a similar fashion, changing jobs or Internet Service Providers often means changing your e-mail address.

An address is thus just a special kind of name: it refers to an access point of

an entity. Because an access point is tightly associated with an entity, it

would seem convenient to use the address of an access point as a regular

name for the associated entity. Nevertheless, this is hardly ever done.


Objectives:

This unit discusses the naming structures used in addressing the individual

systems located within a network. It also describes the features useful for

designing a distributed system. It also addresses the various issues

concerned with human naming mechanisms, object locating mechanisms,

and security aspects.

9.2 Desirable Features of a Good Naming System

A good naming system for a distributed system should have the following

features:

i) Location transparency

The name of an object should not reveal any hint about the physical

location of the object

ii) Location independency

The name of an object should not need to be changed when the object's location changes. Thus:

A location independent naming system must support a dynamic

mapping scheme

An object at any node can be accessed without the knowledge of its

physical location

An object at any node can issue an access request without the

knowledge of its own physical location

iii) Scalability

Naming system should be able to handle the dynamically changing

scale of a distributed system

iv) Uniform naming convention

Should use the same naming conventions for all types of objects in the

system


v) Multiple user-defined names for the same object

Naming system should provide the flexibility to assign multiple user-

defined names for the same object.

vi) Grouping name

Naming system should allow many different objects to be identified by

the same name.

vii) Meaningful names

A naming system should support at least two levels of object identifiers,

one convenient for human users and the other convenient for machines.

9.3 Fundamental Terminologies and Concepts

i) Name Server

Name servers manage the name spaces. A name server binds an object to

its location. Partitioned name spaces are easier to manage when compared

to a flat name space, because each server needs to maintain information for

only one domain.

ii) Name agent

Name agents are known by various names: in the Internet Domain Name Service (DNS) they are called "resolvers", and in the DCE directory service they are called "clerks". A name agent:

Acts between name servers and their clients

Maintains knowledge of existing name servers

Transfers user requests to proper name servers

iii) Context

A context is the environment in which a name is valid. Often contexts

represent a division of name space along regional, organizational or

functional boundaries. Contexts can be nested in an hierarchical name

space.


iv) Name resolution

Process of mapping an object’s name to its properties such as location. It is

basically the process of mapping an object’s name to the authoritative name

servers of that object. In partitioned name space, the name resolution

mechanism traverses a resolution chain from one context to another until

the authoritative name servers of the named object are encountered.

v) Abbreviation/Alias

Users can define their own abbreviation for qualified names. Abbreviations

defined by a user form a private context for that user.

vi) Absolute and relative names

In a tree-structured name space, the fully qualified name of an object need not be specified within the current working context, e.g., the Unix directory structure,

Internet domain names, etc.

vii) Generic and Multicast names

In generic naming facility, a name is mapped to any one of the set of objects

to which it is bound. In group or multicast naming facility, a name is mapped

to all members of the set of objects to which it is bound.

9.4 System Oriented Names

System oriented names normally have the following characteristic features:

i) Characteristics of System-oriented names

They are large integers or bit strings.

These are also called unique identifiers because they are unique in time

and space.

System oriented names are of the same size

They are generally shorter than human-oriented names and are easy to manipulate (hashing, sorting, and so on).


ii) Approaches for Generating System-Oriented names

1. Centralized approach: In this approach, a global identifier is generated

for each object by a centralized generator. The central node is the

bottleneck.

2. Distributed approach: In this approach, hierarchical concatenation is

used for creating global unique identifiers. Each identification domain is

identified by a unique identifier. Global identifier is obtained by

concatenating the identifier of domain with an identifier used within the

domain.

3. Generating Unique Identifiers in the event of crashes: A crash may

lead to loss of state information and hence may result in the generation

of non-unique identifiers. Two basic approaches to handle this problem:

– Using a clock that operates across failures: A clock is used at the

location of the unique identifier generator. The clock is guaranteed to

operate across failures.

– Using two or more levels of storage: In this approach, two or more

levels of storage are used and the unique identifiers are structured in

a hierarchical fashion with one field for each level.
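A minimal sketch of the distributed (hierarchical concatenation) approach is shown below. The field widths, the node identifier, and the use of a coarse clock that keeps advancing across failures are illustrative assumptions only:

# Toy generator of system-oriented unique identifiers:
# UID = node/domain identifier | clock value | local counter.
import time

NODE_ID = 42          # assumed unique identifier of this node (its domain identifier)
_counter = 0

def new_uid():
    global _counter
    _counter += 1
    # The clock field keeps advancing across crashes, so a restart that loses
    # the in-memory counter cannot reproduce an identifier issued earlier.
    epoch = int(time.time())
    return (NODE_ID << 48) | (epoch << 16) | (_counter & 0xFFFF)

print(hex(new_uid()))
print(hex(new_uid()))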

9.5 Object-Locating Mechanisms

Object locating is mapping the system oriented names of objects to the

location of the object. Some object locating mechanisms are listed below:

i) Broadcasting

Object’s location is found by broadcasting a request from the client node.

Expanding ring broadcast: This approach is employed in an internetwork

consisting of LANs connected by gateways. A ring is a set of LANs that are

a certain distance (measured in terms of the number of gateways) away

from a processor. First a broadcast message is sent to the set of processors

at distance 0; if the object is not located, then the search goes to processors


at distance 1 and so on until a copy of the object is found. Cost of locating

an object is proportional to the distance of the object from the client.

ii) Encoding Location of objects within UID (unique identifier)

One field of UID identifies the location of the object. It is easy for the client to

locate the object. Disadvantages of this approach are:

– An object is fixed to one node throughout its life time.

– Limited to distributed systems that do not support object migration

– Object naming is not location transparent

iii) Searching creator node first and then broadcasting

This approach is an extension of the above approach and based on the

assumption that objects do not migrate often. The UID contains the identifier

of the node on which the object was created. To locate an object, first a

request is sent to the node that created the object. If the object has

migrated, then a search is done using broadcast.

Using forward location pointers

This is an extension of the above scheme and avoids broadcast. Whenever

an object migrates to another node, a forward location pointer is left at the

node. To locate an object, the creator is contacted first, and the location

pointer is followed, if necessary, until the object is found.

Some disadvantages of this approach are:

The object-locating cost is directly proportional to the length of the chain

of pointers

It is difficult if an intermediate pointer is lost due to node failure

Using hint cache and broadcasting

In this method, each node contains a hint on the current location of a

number of recently referenced objects in the form of (UID, last known

location) pairs. Object request is sent to the node indicated by the hint. If the

object is found to have migrated, then a broadcast message is sent


throughout the network requesting the current object location. This approach is very efficient if a high degree of locality is exhibited in locating objects from a node. It is flexible since it can support object migration. The method of on-use update of cache information avoids the expense and delay of having to notify other nodes when an object migrates. If the hint has incorrect information, the broadcast will cause a lot of overhead. This approach is widely used in modern distributed OSs such as Amoeba, V-System, Mach, etc.
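A small sketch can tie the last two mechanisms together: consult the hint cache first, fall back to a broadcast (for instance an expanding-ring broadcast as in (i) above) when the hint is missing or stale, and update the hint on use. The tables and helper functions below are toy stand-ins, not a real location service:

# Toy object locating with a hint cache and a broadcast fallback.

locations = {"obj-7": "nodeC"}       # ground truth: where the object really is
hint_cache = {"obj-7": "nodeA"}      # stale hint: the object has migrated since

def ask_node(node, uid):
    # Stand-in for sending a location query to a single node.
    return locations.get(uid) == node

def broadcast(uid):
    # Stand-in for an (expanding-ring) broadcast that eventually finds the object.
    return locations[uid]

def locate(uid):
    node = hint_cache.get(uid)
    if node is not None and ask_node(node, uid):
        return node                   # hint was correct: cheap lookup
    node = broadcast(uid)             # hint missing or stale: fall back to broadcast
    hint_cache[uid] = node            # on-use update of the cached hint
    return node

print(locate("obj-7"))   # 'nodeC', and the stale hint has been corrected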

9.6 Human-Oriented Names

System oriented names such as 31A5, 2B5F, etc. though useful for machine

handling, are not suitable for use by users. Users will have a tough time if

they are required to remember these names or type them in. Further, each

object has only a single system-oriented name, and therefore all the users

sharing an object must remember and use its only name. To overcome

these limitations, almost all naming systems provide the facility to the users

to define and use their own suitable names for the various objects in the

system. These user-defined object names, which form a name space on the

top of the name space for system-oriented names, are called Human-

Oriented Names.

i) Characteristics of human-oriented names

Character strings that are meaningful to the users

They are defined by the users

Different users can define their own suitable names for a shared object

They are variable in length and different names could be used for the

same object

Same name can be used by different users to refer to different objects.

So, human-oriented names are not unique in space or time


ii) Human-Oriented Hierarchical Naming Schemes

Basically there are four approaches for assigning system wide unique

human oriented names to the various objects in a distributed system. They

are described below:

1. Combining an object’s local name with its host name:

In this approach, the naming scheme uses a name space that is

comprised of several isolated name spaces. Each isolated name space

corresponds to a node in the distributed system, and a name in this

name space uniquely identifies an object in the node. In the global

system, objects are named by some combination of their hostname and

local name, such as host-name:local-name. The disadvantage of this approach is that it is neither location transparent nor location independent.

2. Interlinking isolated name spaces into a single name space

In this scheme, the global name space consists of several isolated name spaces; these isolated name spaces are joined together to form a

single naming structure. The position of the component name spaces in

the naming hierarchy is arbitrary. A component name space can be

placed below any other component name space either directly or

through some other component name space. There is no notion of

absolute path name. Each path name is relative to some context, either to the current working context or the current component name space. An advantage of this scheme is that it is simple to join existing name spaces into

a single global name space.

iii) Sharing remote name spaces on explicit request: Used by Sun NFS

This scheme is based on the idea of attaching isolated name spaces of

various nodes to create a new name space. Unlike the above schemes

users are given the flexibility to attach a context of the remote name space

to one of the contexts of their local name space. So, the global view of the

resulting name structure is a forest of trees, not a single tree. In NFS, the


mount protocol is used to attach a remote name space to a local directory. A

client can mount the directory using one of the following ways:

Manual mounting: The client uses the mount and umount commands to mount and unmount a remote server's directories into the client's name space.

Static mounting: Allows clients to mount the directories automatically

without manual intervention. This is done by running a shell script at the

time the client machine is booted.

Automounting: Allows the servers’ directories to be mounted and

unmounted on a need basis.

iv) A single global name space

In this approach a single name space spans across all nodes in the system.

The same name space is visible to all users and an object’s absolute name

is always the same irrespective of the location of the object and the user

accessing it. This approach is used in many modern distributed operating

systems such as Sprite and V-System.

v) Issues involved in using a single global name space

Partitioning name space into contexts:

Storing complete naming information at one node, or replicating it at every node, is not desirable. So, naming information should be kept decentralized and replicated. The question is how to decompose and distribute the naming information database among different servers:

The notion of context is used for partitioning name space into smaller

components

Partitioning into contexts is done by using clustering conditions

Three basic clustering methods used are:

Algorithmic clustering

Syntactic clustering

Attribute clustering


vi) Issues in context binding

When the name server is presented with the name to be resolved

The server looks at the authoritative name servers for the named

object.

If the authority attribute does not contain the name server corresponding to the given name, additional configuration data, called context bindings, is used to find the authoritative name servers.

A context binding associates the context within which it is stored to

another context that is more knowledgeable about the named object.

Two strategies are commonly used for context binding in naming systems.

vii) Table-based strategy

Most commonly used approach in tree-structured name spaces

Each context is a table having two fields: the first field stores a

component name and the second field stores either the context binding

information or the authority attribute information.

viii) Procedure-Based strategy

In this method a context binding takes the form of a procedure, which, when

executed, supplies information about the next context to be consulted for the

named object.

ix) Distribution of context and name resolution mechanisms

Centralized approach

A single name server in the entire distributed system is located at a central

node

The location of the central server is known to all other nodes

The name server resolves a name by traversing the complete resolution

chain of contexts locally and finally returns the attributes of the named

object.


Fully replicated approach: Each node has a name server:

Distribution based on physical structure of name space: A commonly

used approach for hierarchical tree-structured name spaces

Name space tree is divided into several subtrees, called zones, or

domains.

There are several name servers in the distributed system.

Each name server provides storage for one or more of these zones.

So name resolution involves sending the name resolution request to the

appropriate server.

To facilitate the mapping of names to servers, each client maintains

a name prefix table that is built and updated dynamically.

This approach is used in Sprite file systems.

Advantages and disadvantages of this approach:

Number of prefix table entries will be small.

As opposed to global directory look up, in which all directories starting

from the root to the last component need to be searched one by one, the

prefix table helps in bypassing part of the directory lookup mechanism.

Bypassing upper-level directories can have consequences for the system's security mechanisms.

Consistency of the prefix table is checked and updated if necessary only

when it is used, and there is no need to inform all clients when a table

entry they are storing becomes invalid.
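A prefix table lookup of this kind can be sketched in a few lines; the table contents and server names below are assumptions for illustration, and component-boundary checks are omitted for brevity:

# Toy prefix-table name resolution (in the spirit of Sprite).

prefix_table = {
    "/": "server0",
    "/home": "server1",
    "/home/projects": "server2",
}

def resolve(path):
    # Longest-prefix match: try the most specific known context first.
    best = max((p for p in prefix_table if path.startswith(p)), key=len)
    return prefix_table[best], path[len(best):]

print(resolve("/home/projects/os/notes.txt"))   # ('server2', '/os/notes.txt')
print(resolve("/etc/passwd"))                   # ('server0', 'etc/passwd')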


9.7 Name Caches

Caching can help increase the performance of name resolution operations

for the following reasons:

i) High degree of locality of name lookup: Due to locality of reference,

a reasonably sized cache, used for caching the recently used naming information, can increase performance.

ii) Slow update of name information database: Cost of maintaining

consistency of cached data is very low because naming data does not

change fast. i.e., the read/write ratio of naming data is very high.

iii) On-use consistency of cached information is possible

Name cache consistency can be maintained by detecting and

discarding stale cache entries on use.

Issues related to Name Caches:

Types of name caches

Directory cache: All recently used directory pages that are brought to the

client node during name resolution are cached for a while.

Advantages and disadvantages of this approach

When a directory is accessed, it is likely that the contents of the directory pages will be used for operations such as ls, .., etc.

To obtain one useful entry, namely the directory entry, an entire page of directory blocks occupies a large area of the cache.

Prefix cache: Used in Zone-based context distribution mechanisms that we

saw earlier.

Full-name cache: In this type of cache, each entry consists of an object’s

full path name and the identifier and location of its authoritative name

server.


Approaches for name cache implementation:

A cache per process: A separate cache is maintained for each process.

Advantages and disadvantages:

Since each cache is maintained in the process’s address space,

accessing is fast.

Every new process must create its own name cache from scratch.

Cache hit ratio will be small due to start-up misses. To minimize startup

misses, a process can inherit the name cache from its parent (V-system

uses this approach).

Possibility of naming information being duplicated unnecessarily at a

node.

A cache per node: All processes at a node share the same cache. Some of

the problems related to the above approach are overcome. However, cache

needs to be in the OS area and hence access could be slow.

Approaches for maintaining consistency of name caches:

1. Immediate invalidate: In this method, all related name cache entries

are immediately invalidated. This can be done in one of the following

ways.

Whenever a naming data update is done, an invalidate message

identifying the data to be invalidated is sent to all nodes so each

node can update its cache. This approach is expensive in large

systems.

Invalidation message is sent to only the nodes that have cached the

data.

2. On-Use update: When a client uses a stale cached data, it is informed

by the naming system that the data is stale so that the client can get the

updated data.


9.8 Naming and Security

An important job of the naming system of several centralized and distributed

operating systems is to control unauthorized access to both the named

objects and the information in the naming database. This section describes

only those security issues that are pertinent to object naming. Three basic

naming-related access control mechanisms are described below:

i) Object Names as Protection Keys

In this method, an object's name acts as a protection key for the object. A

user who knows the name of an object (i.e. has the key for the object) can

access the object by using its name. An object may have several keys in

those systems that allow an object to have multiple names. In this case, any

of the keys can be used to access the object.

In systems using this method, users are not allowed by the system to define

a name for an object that they are not authorized to access. This scheme is

based on the assumption that object names cannot be forged or stolen. The

following are the limitations of this scheme:

The scheme does not guarantee a reliable access control mechanism.

It does not provide the flexibility of specifying the modes of access

control.

ii) Capabilities

This is a simple extension of the above scheme that overcomes its

limitations. As shown below, a capability is a special type of object identifier

that contains additional information redundancy for protection.

Figure 9.1: The two basic parts of a capability

Object Identifier | Rights Information


It may be considered as an unforgeable ticket that allows its holders to

access the object (identified by its object identifier) in one or more

permission modes (specified by its access control information part).

When a process wants to perform an operation on an object, it must send to

the name server a message containing the object’s capability. The name

server verifies if the capability provided by the client allows the type of

operation requested by the client on the relevant object. If not, a permission-denied message is returned to the client process. If allowed, the client's request is forwarded to the manager of the object (a minimal sketch of such a check appears at the end of this section).

iii) Associating Protection with Name Resolution Path

Protection can be associated with an object or with the name resolution path

of the name used to identify the object. The more common scheme provides

protection on the name resolution path.

Systems using this approach usually employ access control list (ACL) based

protection, which controls access dependent on the identity of the user. The

mechanism based on ACL requires, in addition to the object identifier,

another trusted identifier representing the accessing principal, the entity with

which access rights are associated. This trusted identifier might be a

password, address, or any other identifier form that cannot be forged or

stolen. An ACL is associated with an object and specifies the user name

(user identifier) and the types of access allowed for each user of that object.

When a user requests access to an object, the operating system checks the

ACL associated with that object. If the user is listed for the requested

access, the access is allowed. Otherwise, a protection violation occurs and

the user job is denied access to the object.

By associating an ACL with each context (directory) of the name space,

access can be controlled to both named objects and the information in the

naming database. When a name server receives an access request for a


directory, it first verifies if the accessing process is authorized for the

requested access. With this approach, name servers do not provide

information to clients that are not authorized to have it, and at the same time

name servers do not accept unauthorized updates to naming information

stored in the context of name space.
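The capability check described in (ii) above can be sketched as follows. The rights encoding and the object identifiers are assumptions made for the example; a real capability would also carry redundancy (for instance a cryptographic check field) so that it cannot be forged:

# Minimal sketch of a capability check at a name server.

READ, WRITE, DELETE = 1, 2, 4          # rights bits (assumed encoding)

def check(capability, requested_op):
    obj_id, rights = capability
    if rights & requested_op:
        return "forward request for " + obj_id + " to its manager"
    return "permission denied"

cap = ("obj-17", READ | WRITE)         # a capability for object obj-17
print(check(cap, READ))                # request is forwarded to the object's manager
print(check(cap, DELETE))              # permission denied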

9.9 Terminal Questions

1. List the main jobs performed by the naming subsystem of a distributed operating system.

2. Differentiate between the terms location transparency and location

independency. Which is more powerful and why?

3. Differentiate between human-oriented and system-oriented names used

in the operating system.

4. Discuss the Naming and Security mechanisms in a distributed operating

system.


Unit 10 Security in Distributed Systems

Structure:

10.1 Introduction

Objectives

10.2 Potential attacks to Computer Systems

10.3 Cryptography

10.4 Authentication

10.5 Access Control

10.6 Digital Signatures

10.7 Design Principles

10.8 Terminal Questions

10.1 Introduction

Before we embark on our journey of understanding the various concepts

and technical issues related to security, it is essential to know what we are

trying to protect. What are the various dangers when we use computers,

computer networks, and the biggest network of them all, the Internet? What

can happen if we do not set up the right security policies, framework and

technology implementations?

Why is security required in the first place? People sometimes say that

security is like statistics: What it reveals is trivial, what it conceals is vital!

The right security infrastructure opens up just enough doors that are

mandatory.

We discuss the principles of security that help us identify various areas,

which are crucial while determining the security threats and possible

solutions to tackle them. Since electronic documents and messages are

now becoming equivalent to the paper documents in terms of their legal

validity and binding, we examine the various implications in this regard.


This would be followed by a discussion of the types of attacks. There are

certain theoretical concepts associated with attacks, and there is a practical

side to it as well.

With the introduction of the computer, the need for automated tools for

protecting files and other information stored on the computer became

evident. This is especially the case for a shared system, such as a time-

sharing system, and the need is even more acute for systems that can be

accessed over a public telephone or data network. The generic name for the

collection of tools designed to protect data and to thwart hackers is

Computer Security.

The second major change that affected security is the introduction of

distributed systems and the use of networks and communication facilities for

carrying data between terminal user and computer and between computer

and computer. Network security measures are needed to protect data during

their transmission.

One of the most publicized types of attack on information systems is the

computer virus. A virus may be introduced into a system physically when it

arrives on a diskette and is subsequently loaded onto a computer. Viruses

may also arrive over the Internet. In either case, once the virus is resident on

a computer system, internal computer security tools are needed to detect

and recover from the virus.

This unit focuses on Internet security that consists of measures to deter,

prevent, detect, and correct security violations that involve the transmission

of information. This is a broad statement that covers a host of possibilities. Security involving communications and networks is not as simple as it might appear to a layman to understand and implement. Most of the major

requirements for security services include:

Confidentiality


Authentication

Non-Repudiation

Integrity

In developing a particular security mechanism or algorithm, one must

always consider potential counter measures. In most of the cases, counter

measures are designed by looking at the problem in a completely different

way, thereby exploiting an unexpected weakness in the mechanism.

Security mechanisms involve more than a particular algorithm or protocol.

They usually also require that participants be in possession of some secret

information (like an encryption key), which raises questions about creation,

distribution, and protection of that secret information.

A Model for Network Security

A message is to be transferred from one party to another across some sort

of Internet. The two parties, who are the principals in this transaction, must

cooperate for the exchange to take place. A logical information channel is

established by defining a route through the Internet from source to

destination and by the cooperative use of communication protocols

(like TCP / IP, HTTP) by the two principals.

Security aspects come into play when it is necessary or desirable to protect the information transmission from an opponent who may present a

threat to the confidentiality, authenticity, and so on. All the techniques for

providing security have two components:

1. A security related transformation on the information to be sent.

Examples include the encryption of message, which scrambles the

message so that it is unreadable by the opponent, and the addition of a

code based on the contents of the message, which can be used to verify

the identity of the sender.


2. Some secret information shared by the two principals and, it is hoped,

unknown to the opponent. An example is an encryption key used in

conjunction with the transformation to scramble the message before

transmission and unscramble it on reception.

A trusted third party may be needed to achieve secure transmission. As an

example, a third party may be responsible for distributing the secret

information to the two principals while keeping it away from the opponent. A

third party may also be necessary to arbitrate disputes between the two

principals concerning the authenticity of a message transmission.

The above stated theory regarding the general model shows that there are

four basic tasks in designing a particular security service:

1. Design an algorithm for performing security related transformations. The

algorithm should be such that an opponent cannot defeat its purpose.

2. Generate the secret information to be used with the algorithm.

3. Develop methods for the distribution and sharing of secret information.

4. Specify a protocol to be used by the two principals that make use of the

security algorithm and the secret information to achieve a particular

security service.


Figure 10.1: Model for Network Security

Objectives:

This unit makes the user familiar with security issues to be taken up in case

of distributed systems. It discusses the types of possible attacks on nodes in

a distributed system and also the protection mechanisms to counter these

attacks. It describes the secure way of transmitting messages, i.e., the aspects of encoding and decoding data and the underlying principles behind them. It also describes authentication and access control mechanisms, digital signatures, and the design principles to be followed in designing a secure distributed system.

10.2 Potential attacks to Computer Systems

Attacks on the security of a computer system or network are best

characterized by viewing the function of the computer system as providing

information. In general, there is a flow of information from a source, such as

a file or a region of main memory, to a destination, such as another file or

user.



Figure 10.2: Security Threats & General Categories of Attacks (panels: (a) normal flow from information source to destination, (b) interruption, (c) interception, (d) modification, (e) fabrication)

The following points describe the four general categories of attacks:

Interruption: An asset of the system is destroyed or becomes

unavailable or unusable. This is an attack on Availability.

Examples: Destruction of hard disk, cutting of communication lines, and

so on.

Interception: An unauthorized party gains access to an asset. This is an

attack on Confidentiality. The unauthorized party may be a person, a

program, or a computer.



Examples: Wiretapping to capture data in a network, unauthorized

copying of files or programs.

Modification: An unauthorized party not only gains access to but

tampers with an asset. This is an attack on Integrity.

Examples: Changing values in a data file, altering a program so that it

performs differently, modification of contents of messages transmitted on

a network.

Fabrication: An unauthorized party inserts counterfeit objects into the

system. This is an attack on Authenticity.

Examples: Insertion of spurious messages in a network, Addition of

records to a file.

There are two types of possible attacks on a computer system:

1. Passive Attacks, and

2. Active Attacks

Figure 10.3: Possible attacks on a computer system

1. Passive Attacks

In this type of attack, the attacker indulges in eavesdropping or

monitoring of data transmission, i.e. the attacker aims to obtain information

that is in transit. The term passive indicates that the attacker does not

attempt to perform any modification to the data. This is why passive attacks



are harder to detect. Therefore the general approach to deal with passive

attacks is to think about prevention, rather than detection or corrective

actions.

Figure 10.4: Categories of Passive Attacks

Release of Message Contents: When we send a confidential email

message to our friend, we desire that only he / she would be able to access

it. Otherwise, the contents of the message are released against our wishes

to someone else.

Traffic Analysis: Suppose we encode messages using a coding language, so that only the desired parties understand the contents of a message, because only they know the code language. If many such messages are passing through, a passive attacker could try to figure out the similarities between them to come up with some sort of pattern that provides some clues regarding the communication that is taking place. Such attempts at analyzing (encoded) messages to come up with likely patterns constitute a traffic analysis attack.



2. Active Attacks

Active attacks are based on modification of the original message in

some manner, or on creation of a false message. These attacks cannot be

prevented easily. However, they can be detected with some effort, and

attempts can be made to recover from them. These attacks can be in the

form of interruption, modification, and fabrication.

Figure 10.5: Active Attacks (interruption: masquerade; modification: replay attacks and alteration; fabrication: denial of service)

Masquerade: Occurs when an unauthorized entity pretends to be another entity. A user C might pose as user A and send a message to user B. User B might be led to believe that the message indeed came from user A.

Replay Attack: A user captures a sequence of events, or some data units, and resends them. For instance, suppose user A wants to transfer some amount to user C's bank account. Both users A and C have accounts with bank B. User A might send an electronic message to the bank requesting a funds transfer. User C could capture the message, and send a second copy of the same to bank B. Bank B would have no idea that this is an



unauthorized message, and would treat this as a second, and different,

funds transfer request from user A. Therefore, user C would get the benefit

of the funds transfer twice: once authorized, once through a replay attack.

Alteration of Messages: It involves some change to the original message.

For example, assume that user A sends an electronic message "Transfer $1000 to D's account" to bank B. User C might capture this, and change it to "Transfer $10000 to C's account". Note that both the beneficiary and the amount have been changed.

Denial of Service (DOS): These attacks attempt to prevent legitimate users from accessing services for which they are eligible. For instance, an unauthorized user might send too many login requests to a server, using random user IDs one after the other in quick succession, so as to flood the network and deny other legitimate users access to the network.

10.3 Cryptography

Network security is mostly achieved through the use of Cryptography, a

science based on abstract algebra.

Definition: Cryptography, a word with Greek origins, means "secret writing". However, we use the term to refer to the science and art of transforming messages to make them secure and immune to attacks. The figure below shows the components involved in cryptography:


Figure 10.6: Components of Cryptography

The original message before being transformed is called plaintext. After the message is transformed, it is called ciphertext. An encryption algorithm transforms the plaintext into ciphertext; a decryption algorithm transforms the ciphertext back into plaintext. The sender uses an encryption algorithm and the receiver uses a decryption algorithm.

Cipher: The encryption and decryption algorithms are referred to as ciphers. The term is also used to refer to different categories of algorithms in cryptography. One cipher can serve millions of communicating pairs.

Key: It is a number (or a set of numbers) that the cipher, as an algorithm,

operates on. To encrypt a message, we need an encryption algorithm, an

encryption key, and the plaintext. These create the ciphertext. To decrypt a

message, we need a decryption algorithm, a decryption key, and the

ciphertext. These reveal the original plaintext.

Alice, Bob, and Eve

In cryptography, it is customary to use three characters in an information

exchange scenario: we use Alice, Bob, and Eve. Alice is the person who

needs to send secure data. Bob is the recipient of data. Eve is the person



who somehow disturbs the communication between Alice and Bob by

intercepting messages to uncover the data or by sending her own disguised

messages. These three names represent computers or processes that

actually send or receive data, or intercept or change data.

Cryptographic algorithms can be divided into two groups:

Symmetric (Also called Secret – Key)

Asymmetric (Also called Public – Key)

Symmetric Key Cryptography: In this approach, both parties use the same key. The sender uses this key and an encryption algorithm to encrypt data; the receiver uses the same key and the corresponding decryption algorithm to decrypt the data.
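The idea can be illustrated with a deliberately simple Python sketch (not part of the original text, and far too weak for real use): a single shared key drives both the encrypt and decrypt operations, here as a repeating XOR mask.

import itertools

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR each byte with the repeating key; applying the same operation
    # twice with the same key restores the original bytes.
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

shared_key = b"k3y"                                  # secret held by both Alice and Bob
ciphertext = xor_cipher(b"HELLO BOB", shared_key)    # Alice encrypts
plaintext = xor_cipher(ciphertext, shared_key)       # Bob decrypts with the same key
assert plaintext == b"HELLO BOB"

A real system would use a standard algorithm such as AES; the point here is only that one shared key serves both directions.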

Figure 10.7: Symmetric – Key Cryptography

Asymmetric Key Cryptography (or Public Key Cryptography): In this, there are two keys: a private key and a public key. The private key is kept by the receiver. The public key is announced to the public. In the figure shown below, assume that Alice wants to send a message to Bob. Alice uses Bob's public key to encrypt the message. When the message is



received by Bob, the private key is used to decrypt the message. In this

method the public key used for encryption is different from the private key

used for decryption. The public key is available to the public; the private key

is available only to an individual.
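A minimal textbook-RSA sketch in Python (tiny numbers, purely illustrative and not secure) shows how the key used for encryption differs from the key used for decryption:

# Toy RSA with tiny primes, for illustration only.
p, q = 61, 53
n = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent: public key is (e, n)
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

message = 42                   # a small number standing in for the plaintext
ciphertext = pow(message, e, n)     # Alice encrypts with Bob's public key
recovered = pow(ciphertext, d, n)   # Bob decrypts with his private key
assert recovered == message

In practice the private exponent never leaves Bob, (e, n) can be published freely, and real RSA uses very large primes together with padding.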

Three types of Keys

There are three types of keys dealt with in the context of cryptography:

1. Secret Key: A shared key used in Symmetric Key Cryptography

2. Public Key

3. Private Key

The second and third keys are the public and private keys used in

asymmetric-key cryptography.

Encryption can be thought of as electronic locking; decryption as electronic

unlocking. The sender puts the message in a box and locks the box by

using a key; the receiver unlocks the box with a key and takes out the

message. The difference lies in the mechanism of the locking and unlocking

and the type of keys used.

In symmetric key cryptography, the same key locks and unlocks the box. In

asymmetric key cryptography, one key locks the box, but another key is

needed to unlock it.

Figure 10.8: Symmetric Key Cryptography



Figure 10.9: Asymmetric Key Cryptography

Key management is the set of techniques and procedures supporting the

establishment and maintenance of keying relationships between authorized

parties.

Key management encompasses techniques and procedures supporting:

1. Initialization of system users within a domain,

2. Generation, distribution, and installation of keying material,

3. Controlling the use of keying material,

4. Update, revocation, and destruction of keying material, and

5. Storage, backup/recovery, and archival of keying material.

Point-to-point and centralized key management

Point-to-point communications and centralized key management, using key

distribution centers or key translation centers, are examples of simple key

distribution (communications) models relevant to symmetric-key systems.

Here “simple” implies involving at most one third party. These are illustrated

in Figure 10.10 and described below, where KXY denotes a symmetric key

shared by X and Y.

a) Point-to-Point Key distribution



b) Key Distribution Center (KDC)

c) Key Translation Center

Figure 10.10: Simple Key Distribution Models (Symmetric Key)

1. Point-to-point mechanisms. These involve two parties communicating

directly.

2. Key Distribution Centers (KDCs): KDCs are used to distribute keys between users who share distinct keys with the KDC, but not with each other.

A basic KDC protocol proceeds as follows. Upon request from A to

share a key with B, the KDC T generates or otherwise acquires a key K,

then sends it encrypted under KAT to A, along with a copy of K (for B)

encrypted under KBT. Alternatively, T may communicate K (secured

under KBT) to B directly.


3. Key Translation Centers (KTCs): The assumptions and objectives of

KTCs are as for KDCs above, but here one of the parties (e.g., A)

supplies the session key rather than the trusted center.

A basic KTC protocol proceeds as follows. A sends a key K to the KTC T

encrypted under KAT. The KTC deciphers and re-enciphers K under KBT,

then returns this to A (to relay to B) or sends it to B directly.

KDCs provide centralized key generation, while KTCs allow distributed key

generation. Both are centralized techniques in that they involve an on-line

trusted server.
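The basic KDC exchange described above can be sketched in Python; the XOR stand-in for a real symmetric cipher and the function names are assumptions made purely for illustration:

import secrets

def encrypt(key: bytes, data: bytes) -> bytes:
    # Placeholder for a real symmetric cipher: XOR with a repeating key.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

decrypt = encrypt   # XOR is its own inverse, so the same routine decrypts

# Long-term keys established out of band: K_AT (A with T) and K_BT (B with T).
K_AT = secrets.token_bytes(16)
K_BT = secrets.token_bytes(16)

def kdc_request(k_at: bytes, k_bt: bytes):
    # The trusted center T generates a fresh session key K and returns it
    # encrypted under K_AT (for A), plus a copy under K_BT for A to relay to B.
    K = secrets.token_bytes(16)
    return encrypt(k_at, K), encrypt(k_bt, K)

for_a, ticket_for_b = kdc_request(K_AT, K_BT)
session_key_a = decrypt(K_AT, for_a)            # A recovers K
session_key_b = decrypt(K_BT, ticket_for_b)     # B recovers K from the relayed copy
assert session_key_a == session_key_b

In a KTC the flow is reversed: A would generate K itself, send it encrypted under K_AT, and the center would re-encipher it under K_BT.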

Note: (Initial keying requirements) Point-to-point mechanisms require that A

and B share a secret key a priori. Centralized key management involving a

trusted party T requires that A and B each share a secret key with T. These

shared long-term keys are initially established by non-cryptographic, out-of-

band techniques providing confidentiality and authenticity (e.g., in person, or

by trusted courier). By comparison, with public keys confidentiality is not

required; initial distribution of these need only guarantee authenticity.

Techniques for distributing public keys

Protocols involving public-key cryptography are typically described

assuming a priori possession of (authentic) public keys of appropriate

parties. This allows full generality among various options for acquiring such

keys. Alternatives for distributing explicit public keys with guaranteed or

verifiable authenticity, including public exponentials for Diffie-Hellman key

agreement (or more generally, public parameters), include the following:

1. Point-to-point delivery over a trusted channel: Authentic public keys

of other users are obtained directly from the associated user by personal

exchange, or over a direct channel, originating at that user, and which

(procedurally) guarantees integrity and authenticity (e.g., a trusted

courier or registered mail). This method is suitable if used infrequently


(e.g., one-time user registration), or in small closed systems. A related

method is to exchange public keys and associated information over an

untrusted electronic channel, and provide authentication of this

information by communicating a hash thereof (using a collision-resistant

hash function) via an independent, lower bandwidth authentic channel,

such as registered mail.

Drawbacks of this method include: inconvenience (elapsed time); the

requirement of non-automated key acquisition prior to secured

communications with each new party (chronological timing); and the cost

of the trusted channel.

2. Direct access to a trusted public file (public-key registry): A public

database, the integrity of which is trusted, may be set up to contain the

name and authentic public key of each system user. This may be

implemented as a public-key registry operated by a trusted party. Users

acquire keys directly from this registry.

While remote access to the registry over unsecured channels is

acceptable against passive adversaries, a secure channel is required for

remote access in the presence of active adversaries. One method of

authenticating a public file is by tree authentication of public keys.

3. Use of an on-line trusted server: An on-line trusted server provides

access to the equivalent of a public file storing authentic public keys,

returning requested (individual) public keys in signed transmissions;

confidentiality is not required. The requesting party possesses a copy of

the server‟s signature verification public key, allowing verification of the

authenticity of such transmissions.

Disadvantages of this approach include: the trusted server must be on-

line; the trusted server may become a bottleneck; and communications


links must be established with both the intended communicant and the

trusted server.

4. Use of an off-line server and certificates: In a one-time process, each

party A contacts an off-line trusted party referred to as a certification

authority (CA), to register its public key and obtain the CA‟s signature

verification public key (allowing verification of other users‟ certificates).

The CA certifies A‟s public key by binding it to a string identifying A,

thereby creating a certificate. Parties obtain authentic public keys by

exchanging certificates or extracting them from a public directory.

5. Use of systems implicitly guaranteeing authenticity of public

parameters: In such systems, including identity-based systems and

those using implicitly certified keys, by algorithmic design, modification

of public parameters results in detectable, non-compromising failure of

cryptographic techniques.

10.4 Authentication

In most computer security contexts, user authentication is the fundamental

building block and the primary line of defense. User authentication is the

basis for most types of access control and for user accountability.

Authentication is the process of verifying an identity claimed by or for a system entity. An authentication process consists of two steps:

Identification step: Presenting an identifier to the security system.

(Identifiers should be assigned carefully, because authenticated

identities are the basis for other security services, such as access

control service.)

Verification step: Presenting or generating authentication information

that corroborates the binding between the entity and the identifier.


For example, user Alice Toklas could have the user identifier ABTOKLAS.

This information needs to be stored on any server or computer system that

Alice wishes to use and could be known to system administrators and other

users. A typical item of authentication information associated with this user

ID is a password, which is kept secret (known only to Alice and to the

system). If no one is able to obtain or guess Alice‟s password, then the

combination of Alice‟s user ID and password enables administrators to set

up Alice‟s access permissions and audit her activity. Because Alice‟s ID is

not secret, system users can send her e-mail, but because her password is

secret, no one can pretend to be Alice.

In essence, identification is the means by which a user provides a claimed

identity to the system; user authentication is the means of establishing the

validity of the claim. Note that user authentication is distinct from message

authentication.

Message authentication is a procedure that allows communicating parties to

verify that the contents of a received message have not been altered and

that the source is authentic. This unit is concerned solely with user

authentication.

Means of Authentication

There are four general means of authenticating a user‟s identity, which can

be used alone or in combination:

Something the individual knows: Examples include a password, a

personal identification number (PIN), or answers to a prearranged set of

questions.

Something the individual possesses: Examples include electronic

keycards, smart cards, and physical keys. This type of authenticator is

referred to as a token.


Something the individual is (static biometrics): Examples include

recognition by fingerprint, retina, and face.

Something the individual does (dynamic biometrics): Examples

include recognition by voice pattern, handwriting characteristics, and

typing rhythm.

All of these methods, properly implemented and used, can provide secure

user authentication. However, each method has problems. An adversary

may be able to guess or steal a password. Similarly, an adversary may be

able to forge or steal a token. A user may forget a password or lose a token.

Further, there is a significant administrative overhead for managing

password and token information on systems and securing such information

on systems. With respect to biometric authenticators, there are a variety of

problems, including dealing with false positives and false negatives, user

acceptance, cost, and convenience.

Password-Based Authentication

A widely used line of defense against intruders is the password system.

Virtually all multi-user systems, network-based servers, Web-based

e-commerce sites, and other similar services require that a user provide not

only a name or identifier (ID) but also a password. The system compares

the password to a previously stored password for that user ID, maintained in

a system password file. The password serves to authenticate the ID of the

individual logging on to the system. In turn, the ID provides security in the

following ways:

The ID determines whether the user is authorized to gain access to a

system. In some systems, only those who already have an ID filed on

the system are allowed to gain access.

The ID determines the privileges accorded to the user. A few users may

have supervisory or “superuser” status that enables them to read files


and perform functions that are especially protected by the operating

system. Some systems have guest or anonymous accounts, and users

of these accounts have more limited privileges than others.

The ID is used in what is referred to as discretionary access control. For

example, by listing the IDs of the other users, a user may grant

permission to them to read files owned by that user.

The Use of Hashed Passwords: A widely used password security

technique is the use of hashed passwords and a salt value. This scheme is

found on virtually all UNIX variants as well as on a number of other

operating systems. The following procedure is employed (see Figure 10.11 (a)).

To load a new password into the system, the user selects or is assigned a

password. This password is combined with a fixed-length salt value. In

older implementations, this value is related to the time at which the

password is assigned to the user. Newer implementations use a

pseudorandom or random number. The password and salt serve as inputs

to a hashing algorithm to produce a fixed-length hash code. The hash

algorithm is designed to be slow to execute to thwart attacks. The hashed

password is then stored, together with a plaintext copy of the salt, in the

password file for the corresponding user ID. The hashed-password method

has been shown to be secure against a variety of cryptanalytic attacks.

When a user attempts to log on to a UNIX system, the user provides an ID and a password (see Figure 10.11 (b)). The operating system uses the ID to index into the password file and retrieve the plaintext salt and the hashed password. The salt and user-supplied password are used as input to the hashing routine. If the result matches the stored value, the password is accepted.
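The load-and-verify procedure can be sketched with Python's standard hashlib module, which provides the deliberately slow PBKDF2 function; the in-memory password file and function names below are illustrative assumptions, not the actual UNIX implementation:

import hashlib, hmac, os

password_file = {}   # user ID -> (salt, hashed password); stands in for the system file

def store_password(user_id: str, password: str) -> None:
    salt = os.urandom(16)                          # random salt, stored in plaintext
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    password_file[user_id] = (salt, digest)

def verify_password(user_id: str, password: str) -> bool:
    salt, stored = password_file[user_id]
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(digest, stored)     # constant-time comparison

store_password("ABTOKLAS", "correct horse battery staple")
assert verify_password("ABTOKLAS", "correct horse battery staple")
assert not verify_password("ABTOKLAS", "wrong guess")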


a) Loading a new password

b) Verifying a Password

Figure 10.11: Unix Password Scheme

The salt serves three purposes:

It prevents duplicate passwords from being visible in the password file.

Even if two users choose the same password, those passwords will be

assigned different salt values. Hence, the hashed passwords of the two

users will differ.


It greatly increases the difficulty of offline dictionary attacks. For a salt of length b bits, the number of possible passwords is increased by a factor of 2^b, increasing the difficulty of guessing a password in a dictionary attack.

It becomes nearly impossible to find out whether a person with

passwords on two or more systems has used the same password on all

of them.

To see the second point, consider the way that an offline dictionary attack

would work. The attacker obtains a copy of the password file. Suppose first

that the salt is not used. The attacker‟s goal is to guess a single password.

To that end, the attacker submits a large number of likely passwords to the

hashing function. If any of the guesses matches one of the hashes in the

file, then the attacker has found a password that is in the file. But faced with

the UNIX scheme, the attacker must take each guess and submit it to the

hash function once for each salt value in the dictionary file, multiplying the

number of guesses that must be checked.
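A small, purely illustrative sketch of this effect: without salts, one hash per guess is enough to test it against every entry in the stolen file, whereas with salts each guess must be rehashed once per salt value.

import hashlib

def hash_pw(password: str, salt: bytes) -> bytes:
    return hashlib.sha256(salt + password.encode()).digest()

# Stolen password file: user -> (salt, hash); two users happen to share a password,
# yet their stored hashes differ because their salts differ (the first purpose above).
stolen = {
    "alice": (b"\x01" * 8, hash_pw("sunshine", b"\x01" * 8)),
    "bob": (b"\x02" * 8, hash_pw("sunshine", b"\x02" * 8)),
}

dictionary = ["password", "123456", "sunshine"]

hashes_computed = 0
for guess in dictionary:
    for user, (salt, stored) in stolen.items():
        hashes_computed += 1                   # one hash per guess per salt
        if hash_pw(guess, salt) == stored:
            print(user, "uses", guess)
print("hash operations:", hashes_computed)     # guesses x salt values, not just guesses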

There are two threats to the UNIX password scheme. First, a user can gain

access on a machine using a guest account or by some other means and

then run a password guessing program, called a password cracker, on that

machine. The attacker should be able to check many thousands of possible

passwords with little resource consumption. In addition, if an opponent is

able to obtain a copy of the password file, then a cracker program can be

run on another machine at leisure. This enables the opponent to run through

millions of possible passwords in a reasonable period.

Token - Based Authentication

Objects that a user possesses for the purpose of user authentication are

called tokens. In this subsection, we examine two types of tokens that are


widely used; these are cards that have the appearance and size of bank

cards.

Memory Cards: Memory cards can store but not process data. The most

common such card is the bank card with a magnetic stripe on the back. A

magnetic stripe can store only a simple security code, which can be read

(and unfortunately reprogrammed) by an inexpensive card reader. There are

also memory cards that include an internal electronic memory.

Memory cards can be used alone for physical access, for example to a hotel room.

For computer user authentication, such cards are typically used with some

form of password or personal identification number (PIN). A typical

application is an automatic teller machine (ATM).

The memory card, when combined with a PIN or password, provides

significantly greater security than a password alone. An adversary must gain

physical possession of the card (or be able to duplicate it) and must also gain knowledge of the PIN.

Among the potential drawbacks are the following:

Requires special reader: This increases the cost of using the token

and creates the requirement to maintain the security of the reader‟s

hardware and software.

Token loss: A lost token temporarily prevents its owner from gaining

system access. Thus there is an administrative cost in replacing the lost

token. In addition, if the token is found, stolen, or forged, then an

adversary now need only determine the PIN to gain unauthorized

access.

User dissatisfaction: Although users may have no difficulty in

accepting the use of a memory card for ATM access, its use for

computer access may be deemed inconvenient.


Smart Cards: A wide variety of devices qualify as smart tokens. These can

be categorized along three dimensions that are not mutually exclusive:

Physical characteristics: Smart tokens include an embedded

microprocessor. A smart token that looks like a bank card is called a smart

card. Other smart tokens can look like calculators, keys, or other small

portable objects.

Interface: Manual interfaces include a keypad and display for human/

token interaction. Smart tokens with an electronic interface communicate

with a compatible reader/writer.

Authentication protocol: The purpose of a smart token is to provide a

means for user authentication. We can classify the authentication

protocols used with smart tokens into three categories:

– Static: With a static protocol, the user authenticates himself or

herself to the token and then the token authenticates the user to the

computer. The latter half of this protocol is similar to the operation of

a memory token.

– Dynamic password generator: In this case, the token generates a

unique password periodically (e.g., every minute). This password is

then entered into the computer system for authentication, either

manually by the user or electronically via the token. The token and

the computer system must be initialized and kept synchronized so

that the computer knows the password that is current for this token.

– Challenge-response: In this case, the computer system generates

a challenge, such as a random string of numbers. The smart token

generates a response based on the challenge. For example, public-

key cryptography could be used and the token could encrypt the

challenge string with the token‟s private key.
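A minimal challenge-response sketch, assuming for brevity a secret shared between the token and the server and an HMAC in place of the public-key variant described above (the names are illustrative):

import hashlib, hmac, secrets

token_secret = secrets.token_bytes(32)   # provisioned into the smart token
server_copy = token_secret               # the server holds the same secret

def token_respond(secret: bytes, challenge: bytes) -> bytes:
    # The token proves possession of the secret without ever revealing it.
    return hmac.new(secret, challenge, hashlib.sha256).digest()

challenge = secrets.token_bytes(16)                 # server picks a random challenge
response = token_respond(token_secret, challenge)   # token answers

expected = hmac.new(server_copy, challenge, hashlib.sha256).digest()
assert hmac.compare_digest(response, expected)      # server accepts the user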


For user authentication to computer, the most important category of smart

token is the smart card, which has the appearance of a credit card, has an

electronic interface, and may use any of the types of protocols just described.

The remainder of this section discusses smart cards.

A smart card contains within it an entire microprocessor, including

processor, memory, and I/O ports. Some versions incorporate a special co-

processing circuit for cryptographic operation to speed the task of encoding

and decoding messages or generating digital signatures to validate the

information transferred. In some cards, the I/O ports are directly accessible

by a compatible reader by means of exposed electrical contacts. Other

cards rely instead on an embedded antenna for wireless communication

with the reader.

Biometric Authentication

A biometric authentication system attempts to authenticate an individual

based on his or her unique physical characteristics. These include static

characteristics, such as fingerprints, hand geometry, facial characteristics,

and retinal and iris patterns; and dynamic characteristics, such as voiceprint

and signature. In essence, biometrics is based on pattern recognition.

Compared to passwords and tokens, biometric authentication is both

technically complex and expensive. While it is used in a number of specific

applications, biometrics has yet to mature as a standard tool for user

authentication to computer systems.

A number of different types of physical characteristics are either in use or

under study for user authentication. The most common are the following:

Facial characteristics: Facial characteristics are the most common

means of human-to-human identification; thus it is natural to consider

them for identification by computer. The most common approach is to

define characteristics based on relative location and shape of key facial


features, such as eyes, eyebrows, nose, lips, and chin shape. An

alternative approach is to use an infrared camera to produce a face

thermogram that correlates with the underlying vascular system in the

human face.

Fingerprints: Fingerprints have been used as a means of identification

for centuries, and the process has been systematized and automated

particularly for law enforcement purposes. A fingerprint is the pattern of

ridges and furrows on the surface of the fingertip. Fingerprints are

believed to be unique across the entire human population. In practice,

automated fingerprint recognition and matching systems extract a number

of features from the fingerprint for storage as a numerical surrogate for

the full fingerprint pattern.

Hand geometry: Hand geometry systems identify features of the hand,

including shape, and lengths and widths of fingers.

Retinal pattern: The pattern formed by veins beneath the retinal surface

is unique and therefore suitable for identification. A retinal biometric

system obtains a digital image of the retinal pattern by projecting a low-

intensity beam of visual or infrared light into the eye.

Iris: Another unique physical characteristic is the detailed structure of

the iris.

Signature: Each individual has a unique style of handwriting, and this is

reflected especially in the signature, which is typically a frequently

written sequence. However, multiple signature samples from a single

individual will not be identical. This complicates the task of developing a

computer representation of the signature that can be matched to future

samples.

Voice: Whereas the signature style of an individual reflects not only the

unique physical attributes of the writer but also the writing habit that has


developed, voice patterns are more closely tied to the physical and

anatomical characteristics of the speaker. Nevertheless, there is still a

variation from sample to sample over time from the same speaker,

complicating the biometric recognition task.

10.5 Access Control

An access control policy dictates what types of access are permitted, under

what circumstances, and by whom. Access control policies are generally

grouped into the following categories:

Discretionary access control (DAC): Controls access based on the identity of the requestor and on access rules (authorizations) stating what requestors are (or are not) allowed to do. This policy is termed discretionary because an entity might have access rights that permit the entity, by its own volition, to enable another entity to access some resource.

Mandatory access control (MAC): Controls access based on comparing security labels (which indicate how sensitive or critical system resources are) with security clearances (which indicate which system entities are eligible to access certain resources). This policy is termed mandatory because an entity that has clearance to access a resource may not, just by its own volition, enable another entity to access that resource.

Role-based access control (RBAC): Controls access based on the roles

that users have within the system and on rules stating what accesses are

allowed to users in given roles.

DAC is the traditional method of implementing access control. MAC is a concept that evolved out of requirements for military information security and is beyond the scope of this unit. RBAC has become increasingly popular and is introduced later in this section.


These three policies are not mutually exclusive (see Figure 10.12). An access control mechanism can employ two or even all three of these policies to cover different classes of system resources.

Discretionary Access Control (DAC)

This section introduces a general model for DAC developed by Lampson,

Graham, and Denning. The model assumes a set of subjects, a set of

objects, and a set of rules that govern the access of subjects to objects. Let

us define the protection state of a system to be the set of information, at a

given point in time, that specifies the access rights for each subject with

respect to each object. We can identify three requirements: representing the

protection state, enforcing access rights, and allowing subjects to alter the

protection state in certain ways. The model addresses all three

requirements, giving a general, logical description of a DAC system.

Figure 10.12: Access Control Policies

To represent the protection state, we extend the universe of objects in the

access control matrix to include the following:

Processes: Access rights include the ability to delete a process, stop

(block), and wake up a process.

Devices: Access rights include the ability to read/write the device, to

control its operation (e.g., a disk seek), and to block/unblock the device

for use.



Memory locations or regions: Access rights include the ability to read/write certain locations or regions of memory that are protected so that the default is that access is not allowed.

Subjects: Access rights with respect to a subject have to do with the

ability to grant or delete access rights of that subject to other objects, as

explained subsequently.

Figure 10.13 shows an example of an access control matrix A. Each entry A[S, X] contains strings, called access attributes, that specify the access rights of subject S to object X. For example, in Figure 10.13, S1 may read file F1, because 'read' appears in A[S1, F1].

Figure 10.13: Access Control Matrix

From a logical or functional point of view, a separate access control module is associated with each type of object (see Figure 10.14). The module evaluates each request by a subject to access an object to determine if the access right exists. An access attempt triggers the following steps:

1. A subject S0 issues a request of type α for object X.


2. The request causes the system (the operating system or an access

control interface module of some sort) to generate a message of the

form (S0,α,X) to the controller for X.

Figure 10.14: Organization of Access Control Function

3. The controller interrogates the access matrix A to determine if α is in

A[S0,X]. If so, the access is allowed; if not, the access is denied and a


protection violation occurs. The violation should trigger a warning and

appropriate action.
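A compact sketch of the controller's check against the matrix; representing the matrix as a Python dictionary is an assumption made only for illustration:

# Access matrix A: A[subject][object] is the set of access attributes.
A = {
    "S1": {"F1": {"read", "owner"}, "F2": {"read", "write"}},
    "S2": {"F1": {"write"}, "P1": {"wakeup"}},
}

def request(subject: str, alpha: str, obj: str) -> bool:
    # Step 3 above: the controller for obj checks whether alpha appears in
    # A[subject, obj]; if not, a protection violation is reported.
    allowed = alpha in A.get(subject, {}).get(obj, set())
    if not allowed:
        print("protection violation:", (subject, alpha, obj))
    return allowed

assert request("S1", "read", "F1")       # access is allowed
assert not request("S2", "read", "F1")   # access is denied and flagged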

Table 10.1: Access Control System Commands

Figure 10.14 suggests that every access by a subject to an object is

mediated by the controller for that object, and that the controller‟s decision is

based on the current contents of the matrix. In addition, certain subjects

have the authority to make specific changes to the access matrix. A request

to modify the access matrix is treated as an access to the matrix, with the

individual entries in the matrix treated as objects. Such accesses are

mediated by an access matrix controller, which controls updates to the

matrix. The model also includes a set of rules that govern modifications to

the access matrix, shown in Table 10.1. For this purpose, we introduce the

access rights „owner‟ and „control‟ and the concept of a copy flag, explained


in the subsequent paragraphs. The first three rules deal with transferring,

granting, and deleting access rights. Suppose that the entry α* exists in

A[S0, X]. This means that S0 has access right α to subject X and, because of

the presence of the copy flag, can transfer this right, with or without copy

flag, to another subject. Rule R1 expresses this capability. A subject would

transfer the access right without the copy flag if there were a concern that

the new subject would maliciously transfer the right to another subject that

should not have that access right. For example, S1 may place 'read' or 'read*' in any matrix entry in the F1 column. Rule R2 states that if S0 is designated as the owner of object X, then S0 can grant an access right to that object to any other subject; that is, S0 can add any access right to A[S, X] for any S, provided S0 has 'owner' access to X. Rule R3 permits S0 to delete any

access right from any matrix entry in a row for which S0 controls the subject

and for any matrix entry in a column for which S0 owns the object. Rule R4

permits a subject to read that portion of the matrix that it owns or controls.

The remaining rules in Table 10.1 govern the creation and deletion of

subjects and objects. Rule R5 states that any subject can create a new

object, which it owns, and can then grant and delete access to the object.

Under rule R6, the owner of an object can destroy the object, resulting in the

deletion of the corresponding column of the access matrix. Rule R7 enables

any subject to create a new subject; the creator owns the new subject and

the new subject has control access to itself. Rule R8 permits the owner of a

subject to delete the row and column (if there are subject columns) of the

access matrix designated by that subject.

The set of rules in Table 10.1 is an example of the rule set that could be

defined for an access control system. The following are examples of

additional or alternative rules that could be included. A transfer-only right

could be defined, which results in the transferred right being added to the

target subject and deleted from the transferring subject. The number of


owners of an object or a subject could be limited to one by not allowing the

copy flag to accompany the owner right.

The ability of one subject to create another subject and to have „owner‟

access right to that subject can be used to define a hierarchy of subjects.

For example, in Figure 10.13, S1 owns S2 and S3, so that S2 and S3 are

subordinate to S1. By the rules of Table 10.1, S1 can grant and delete to S2

access rights that S1 already has. Thus, a subject can create another

subject with a subset of its own access rights. This might be useful, for

example, if a subject is invoking an application that is not fully trusted, and

does not want that application to be able to transfer access rights to other

subjects.

Role-Based Access Control

Traditional DAC systems define the access rights of individual users and

groups of users. In contrast, RBAC is based on the roles that users assume

in a system rather than the user‟s identity. Typically, RBAC models define a

role as a job function within an organization. RBAC systems assign access

rights to roles instead of individual users. In turn, users are assigned to

different roles, either statically or dynamically, according to their

responsibilities.

RBAC now enjoys widespread commercial use and remains an area of

active research. The National Institute of Standards and Technology (NIST)

has issued a standard, Security Requirements for Cryptographic Modules,

that requires support for access control and administration through roles.

The relationship of users to roles is many to many, as is the relationship of roles to resources, or system objects (see Figure 10.15). The set of users

changes, in some environments frequently, and the assignment of a user to

one or more roles may also be dynamic. The set of roles in the system in

most environments is likely to be static, with only occasional additions or


deletions. Each role will have specific access rights to one or more

resources. The set of resources and the specific access rights associated

with a particular role are also likely to change infrequently.

We can use the access matrix representation to depict the key elements of

an RBAC system in simple terms, as shown in Figure 10.15. The upper

matrix relates individual users to roles. Typically there are many more users

than roles. Each matrix entry is either blank or marked, the latter indicating

that this user is assigned to this role. Note that a single user may be

assigned multiple roles (more than one mark in a row) and that multiple

users may be assigned to a single role (more than one mark in a

column). The lower matrix has the same structure as the DAC access control

matrix, with roles as subjects. Typically, there are few roles and many

objects or resources. In this matrix the entries are the specific access rights

enjoyed by the roles. Note that a role can be treated as an object, allowing

the definition of role hierarchies.

RBAC lends itself to an effective implementation of the principle of least

privilege. That is, each role should contain the minimum set of access rights

needed for that role. A user is assigned to a role that enables him or her to

perform only what is required for that role. Multiple users assigned to the

same role enjoy the same minimal set of access rights.
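The two matrices of Figure 10.15 can be sketched as two small mappings, users to roles and roles to access rights; the names below are illustrative only:

# Upper matrix: which roles each user is assigned to (many-to-many).
user_roles = {
    "alice": {"teller", "auditor"},
    "bob": {"teller"},
}

# Lower matrix: the access rights each role has for each object.
role_rights = {
    "teller": {"accounts_db": {"read", "update"}},
    "auditor": {"accounts_db": {"read"}, "audit_log": {"read"}},
}

def allowed(user: str, right: str, obj: str) -> bool:
    # A user enjoys exactly the union of the rights of his or her roles,
    # which keeps each role down to the least privilege it needs.
    return any(right in role_rights.get(role, {}).get(obj, set())
               for role in user_roles.get(user, set()))

assert allowed("alice", "read", "audit_log")       # via the auditor role
assert not allowed("bob", "update", "audit_log")   # the teller role lacks this right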


Figure 10.15: Users, Roles, and Resources

10.6 Digital Signatures

A digital signature of a message is a number dependent on some secret

known only to the signer, and, additionally, on the content of the message

being signed. Signatures must be verifiable; if a dispute arises as to whether

a party signed a document (caused by either a lying signer trying to


repudiate a signature it did create, or a fraudulent claimant), an unbiased

third party should be able to resolve the matter equitably, without requiring

access to the signer‟s secret information (private key).

Digital signatures have many applications in information security, including

authentication, data integrity, and non-repudiation. One of the most

significant applications of digital signatures is the certification of public keys

in large networks. Certification is a means for a trusted third party (TTP) to

bind the identity of a user to a public key, so that at some later time, other

entities can authenticate a public key without assistance from a trusted third

party.

The concept and utility of a digital signature was recognized several years

before any practical realization was available. The first method discovered

was the RSA signature scheme, which remains today one of the most

practical and versatile techniques available. Subsequent research has

resulted in many alternative digital signature techniques. Some offer

significant advantages in terms of functionality and implementation.

Basic definitions

1. A digital signature is a data string which associates a message (in

digital form) with some originating entity.

2. A digital signature generation algorithm (or signature generation

algorithm) is a method for producing a digital signature.

3. A digital signature verification algorithm (or verification algorithm) is

a method for verifying that a digital signature is authentic (i.e., was

indeed created by the specified entity).

4. A digital signature scheme (or mechanism) consists of a signature

generation algorithm and an associated verification algorithm.


5. A digital signature signing process (or procedure) consists of a

(mathematical) digital signature generation algorithm, along with a

method for formatting data into messages which can be signed.

6. A digital signature verification process (or procedure) consists of a

verification algorithm, along with a method for recovering data from the

message.

Table 10.2: Notation for Digital Signature Mechanisms

(messages) M is the set of elements to which a signer can affix a digital

signature.

(signing space) M_S is the set of elements to which the signature transformations are applied. The signature transformations are not applied directly to the set M.

(signature space) S is the set of elements associated to messages in

M. These elements are used to bind the signer to the message.

(indexing set) R is used to identify specific signing transformations.


A classification of digital signature schemes

There are two general classes of digital signature schemes, which can be

briefly summarized as follows:

1. Digital signature schemes with appendix require the original message as

input to the verification algorithm.

2. Digital signature schemes with message recovery do not require the

original message as input to the verification algorithm. In this case, the

original message is recovered from the signature itself.

Definition: A digital signature scheme (with either message recovery or

appendix) is said to be a randomized digital signature scheme if |R| > 1;

otherwise, the digital signature scheme is said to be deterministic.

Figure 10.16 illustrates this classification. Deterministic digital signature

mechanisms can be further subdivided into one-time signature schemes and

multiple-use schemes.

Figure 10.16: A taxonomy of Digital Signature schemes

Digital signature schemes with appendix

Digital signature schemes with appendix, as discussed in this section, are

the most commonly used in practice. They rely on cryptographic hash


functions rather than customized redundancy functions, and are less prone

to existential forgery attacks.

Definition: Digital signature schemes which require the message as input to

the verification algorithm are called digital signature schemes with appendix.

Examples of mechanisms providing digital signatures with appendix are the

DSA, ElGamal, and Schnorr signature schemes.

Algorithm: Key generation for digital signature schemes with appendix

Each entity creates a private key for signing messages, and a

corresponding public key to be used by other entities for verifying

signatures.

1. Each entity A should select a private key which defines a set S_A = {S_A,k : k ∈ R} of transformations. Each S_A,k is a 1-1 mapping from M_h to S and is called a signing transformation.

2. S_A defines a corresponding mapping V_A from M_h × S to {true, false} such that V_A(m̃, s*) = true if S_A,k(m̃) = s*, and false otherwise, for all m̃ ∈ M_h and s* ∈ S. V_A is called a verification transformation and is constructed such that it may be computed without knowledge of the signer's private key.

3. A's public key is V_A; A's private key is the set S_A.

Algorithm: Signature generation and verification (digital signature schemes

with appendix)

The entity A produces a signature s ∈ S for a message m ∈ M, which can later be verified by any entity B.


Figure 10.17 provides a schematic overview of a digital signature scheme with appendix. As with message recovery schemes (described next), each signing transformation S_A,k and the verification transformation V_A should be efficient to compute, and it should be computationally infeasible for any entity other than A to produce a message and signature that V_A accepts.

Figure 10.17: Overview of a digital signature scheme with appendix
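A toy hash-then-sign sketch of a scheme with appendix, reusing textbook RSA with tiny numbers (purely illustrative; real schemes such as DSA or RSA with proper padding differ in detail). The hash of the message plays the role of the element of M_h that is signed, and the signature s* travels alongside the message as an appendix:

import hashlib

# Tiny RSA key pair, for illustration only.
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)
e = 17                        # public verification key (e, n)
d = pow(e, -1, phi)           # private signing key (modular inverse, Python 3.8+)

def h(message: bytes) -> int:
    # Hash the message and reduce it into the signing space.
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(message: bytes) -> int:
    return pow(h(message), d, n)          # signing transformation (private key d)

def verify(message: bytes, s: int) -> bool:
    return pow(s, e, n) == h(message)     # verification transformation (public key e)

m = b"pay 100 to Bob"
s_star = sign(m)                          # A sends the pair (m, s*)
assert verify(m, s_star)                  # B needs m itself to verify
tampered = verify(b"pay 1000 to Bob", s_star)   # expected False: the hash no longer matches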


Digital signature schemes with message recovery

The digital signature schemes described in this section have the feature that

the message signed can be recovered from the signature itself. In practice,

this feature is of use for short messages.

Definition: A digital signature scheme with message recovery is a digital

signature scheme for which a priori knowledge of the message is not

required for the verification algorithm. Examples of mechanisms providing

digital signatures with message recovery are RSA, Rabin, and Nyberg-

Rueppel public-key signature schemes.

Algorithm: Key generation for digital signature schemes with message

recovery. Each entity creates a private key to be used for signing messages,

and a corresponding public key to be used by other entities for verifying

signatures.

Algorithm: Signature generation and verification for schemes with message

recovery.

The entity A produces a signature s ∈ S for a message m ∈ M, which can later be verified by any entity B. The message m is recovered from s.

1. Signature generation: Entity A should do the following:

a) Select an element k ∈ R.

b) Compute m̃ = R(m) and s* = S_A,k(m̃), where R is a redundancy function.

c) A's signature is s*; this is made available to entities which may wish to verify the signature and recover m from it.


2. Verification: Entity B should do the following:

a) Obtain A's authentic public key V_A.

b) Compute m̃ = V_A(s*).

c) Verify that m̃ ∈ M_R. (If m̃ ∉ M_R, then reject the signature.)

d) Recover m from m̃ by computing R^(-1)(m̃).

Figure 10.18: Overview of a digital signature scheme with message recovery

Figure 10.18 provides a schematic overview of a digital signature scheme

with message recovery. The following properties are required of the signing

and verification transformations:

i. for each k ∈ R, S_A,k should be efficient to compute;

ii. V_A should be efficient to compute; and

iii. it should be computationally infeasible for an entity other than A to find any s* ∈ S such that V_A(s*) ∈ M_R.
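A toy sketch of message recovery in the same textbook-RSA setting (illustrative only): a redundancy function R embeds the short message with recognizable structure, the signature is computed on m̃ = R(m), and verification recovers m̃, and hence m, from s* alone.

# Reuse tiny RSA parameters: public (e, n), private d (illustration only).
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)

def R(m: int) -> int:
    # Redundancy function: duplicate the low 4 bits (deliberately weak, toy only).
    return (m << 4) | (m & 0xF)

def R_inv(m_tilde: int) -> int:
    return m_tilde >> 4

def has_redundancy(m_tilde: int) -> bool:
    # Membership test for M_R: the duplicated bits must agree.
    return (m_tilde & 0xF) == ((m_tilde >> 4) & 0xF)

def sign(m: int) -> int:
    return pow(R(m), d, n)                # s* = S_A(R(m))

def verify_and_recover(s: int):
    m_tilde = pow(s, e, n)                # m~ = V_A(s*)
    if not has_redundancy(m_tilde):       # reject if m~ is not in M_R
        return None
    return R_inv(m_tilde)                 # recover m = R^(-1)(m~)

m = 42
assert verify_and_recover(sign(m)) == m   # the verifier never needed m beforehand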

10.7 Design Principles

Designers of security components of a distributed operating system should

follow the following guidelines while designing a secured network:

1. Least Privilege: This principle is also known as the need-to-know principle. It states that any process should be given only those access

rights that enable it to access, at any time, what it needs to accomplish


its function and nothing more and nothing less. That is, the security system must be flexible enough to allow the access rights of a process

to grow and shrink with its changing access requirements. This

principle serves to limit the damage when a system‟s security is

broken.

2. Fail-Safe defaults: Access rights should be acquired by explicit

permission only and the default should be no access. This principle

requires that access control decisions should be based on why an

object should be accessible to a process rather than on why it should

not be accessible.

3. Open design: This principle requires that the design should not be

secret but should be public. It is a mistake on the part of a designer to

assume that intruders will not know how the security mechanism of the

system works.

4. Built into the system: This principle requires that the security be

designed into the systems at their inception and be built into the lowest

layers of the systems. That is, security should not be treated as an add-on

feature because security problems cannot be resolved very effectively

by patching the penetration holes detected in an existing system.

5. Check for current authority: This principle requires that every access

to every object must be checked using an access control database for

authority. This is necessary to have immediate effect of revocation of

previously given access rights.

6. Easy granting and revocation of access rights: For greater

flexibility, a security system must allow access rights for an object to

be granted or revoked dynamically. It should be possible to restrict

some of the rights and to grant to a user only those rights that are

sufficient to accomplish its functions. On the other hand, a good


security system should allow immediate revocation with the flexibility of

selective and partial revocation.

7. Never trust other parties: For producing a secured distributed

system, the system components must be designed with the

assumption that other parties (human or programs) are not trustworthy

until they are demonstrated to be trustworthy.

8. Always ensure freshness of messages: To avoid security violations

through the replay of messages, the security of a distributed system

must be designed to always ensure freshness of messages

exchanged between two communicating entities.

9. Build firewalls: To limit the damage in case of a system‟s security

being compromised, the system must have firewalls built into it. One

way to meet these requirements is to allow only short-lived passwords

and keys in the system.

10. Efficient: The security mechanisms used must execute efficiently and

be simple to implement.

11. Convenient to use: To be psychologically acceptable, the security

mechanisms must be convenient to use. Otherwise, they are likely to

be bypassed or incorrectly used by the users.

12. Cost Effective: It is often the case that security needs to be traded off

with other goals of the system, such as performance or ease of use.

10.8 Terminal Questions

1. Discuss the major requirements for security services and with a labeled

diagram explain the Security Model. (Refer to Section 10.1)

2. Discuss about the potential attacks on a computer system. Describe the

four general categories of attacks. (Refer to Section 10.2)


3. Define Cryptography. Describe the components of Cryptography with a

neat labeled diagram. (Refer to Section 10.3)

4. Define Authentication. Explain various methods of implementing

authentication (Refer to Section 10.4)

5. Describe the following two types of Access Control Mechanisms:

Discretionary Access Control

Role based Access Control

(Refer to Section 10.5)

––––––––––––––––––––––––––