Advanced Operating Systems (Distributed Systems) Unit 1
Unit 1 Introduction
Structure:
1.0 Objectives
1.1 Distributed Computing Systems
1.2 Distributed Computing System Models
1.3 Advantages of Distributed Systems
1.4 Distributed Operating Systems
1.5 Issues in Designing Distributed Operating Systems
1.6 Distributed Computing Environment
1.7 Summary
1.8 Terminal Questions
1.0 Objectives
After studying this unit, you should be familiar with:
Fundamentals of distributed computing systems
Distributed design models
Distributed operating systems and their design issues
Distributed computing environment
1.1 Distributed Computing Systems
Over the past two decades, advancements in microelectronic technology
have resulted in the availability of fast, inexpensive processors, and
advancements in communication technology have resulted in the availability
of cost-effective and highly efficient computer networks. The advancements
in these two technologies favour the use of interconnected, multiple
processors in place of a single, high-speed processor.
Computer architectures consisting of interconnected, multiple processors
are basically of two types:
In tightly coupled systems, there is a single system wide primary
memory (address space) that is shared by all the processors (Fig. 1.1).
If any processor writes, for example, the value 100 to the memory
location x, any other processor subsequently reading from location x will
get the value 100. Therefore, in these systems, any communication
between the processors usually takes place through the shared
memory.
In loosely coupled systems, the processors do not share memory, and
each processor has its own local memory (Fig. 1.2). If a processor writes
the value 100 to the memory location x, this write operation will only
change the contents of its local memory and will not affect the contents
of the memory of any other processor. Hence, if another processor
reads the memory location x, it will get whatever value was there before
in that location of its own local memory. In these systems, all physical
communication between the processors is done by passing messages
across the network that interconnects the processors.
Usually, tightly coupled systems are referred to as parallel processing
systems, and loosely coupled systems are referred to as distributed
computing systems, or simply distributed systems. In contrast to the
tightly coupled systems, the processors of distributed computing
systems can be located far from each other to cover a wider
geographical area. Furthermore, in tightly coupled systems, the number
of processors that can be usefully deployed is usually small and limited
by the bandwidth of the shared memory. This is not the case with
distributed computing systems that are more freely expandable and can
have an almost unlimited number of processors.
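To make the contrast concrete, the following small sketch (a minimal illustration, not part of the original text; Python's multiprocessing primitives merely stand in for the hardware) replays the value-100 example: a shared Value plays the role of the system-wide memory of a tightly coupled system, while a Queue plays the role of the message-passing network of a loosely coupled system.

```python
from multiprocessing import Process, Queue, Value

def shared_memory_reader(x):
    # Tightly coupled: a write by one processor is visible to every
    # other processor that subsequently reads the shared location x.
    print("shared read of x:", x.value)           # -> 100

def message_passing_reader(network):
    # Loosely coupled: the only way to learn the new value is to
    # receive a message over the interconnection network.
    print("message received:", network.get())     # -> ('x', 100)

if __name__ == "__main__":
    # Tightly coupled system: one system-wide memory location x.
    x = Value("i", 0)
    x.value = 100                                  # "processor 1" writes 100 to x
    p = Process(target=shared_memory_reader, args=(x,))
    p.start(); p.join()

    # Loosely coupled system: private memories; communication happens
    # only by passing messages across the interconnecting network.
    network = Queue()
    network.put(("x", 100))                        # "processor 1" sends its update
    q = Process(target=message_passing_reader, args=(network,))
    q.start(); q.join()
```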
Fig. 1.1: Tightly Coupled Multiprocessor Systems
Fig. 1.2: Loosely Coupled Multiprocessor Systems
Hence, a distributed computing system is basically a collection of
processors interconnected by a communication network in which each
processor has its own local memory and other peripherals, and the
communication between any two processors of the system takes place by
message passing over the communication network. For a particular
processor, its own resources are local, whereas the other processors and
their resources are remote. Together, a processor and its resources are
usually referred to as a node or site or machine of the distributed computing
system.
1.2 Distributed Computing System Models
Distributed computing system models can be broadly classified into five
categories:
Minicomputer model
Workstation model
Workstation – server model
Processor – pool model
Hybrid model
Minicomputer Model
The minicomputer model (Fig. 1.3) is a simple extension of the centralized
time-sharing system. A distributed computing system based on this model
consists of a few minicomputers (they may be large supercomputers as
well) interconnected by a communication network. Each minicomputer
usually has multiple users simultaneously logged on to it. For this, several
interactive terminals are connected to each minicomputer. Each user is
logged on to one specific minicomputer, with remote access to other
minicomputers. The network allows a user to access remote resources that
are available on some machine other than the one onto which the user is
currently logged.
The minicomputer model may be used when resource sharing (such as
sharing of information databases of different types, with each type of
database located on a different machine) with remote users is desired.
The early ARPAnet is an example of a distributed computing system based
on the minicomputer model.
Fig. 1.3: A Distributed Computing System based on Minicomputer Model
Workstation Model
A distributed computing system based on the workstation model (Fig. 1.4)
consists of several workstations interconnected by a communication
network. An organization may have several workstations located throughout
a building or campus, each workstation equipped with its own disk and
serving as a single-user computer. It has often been found that in such an
environment, at any one time a significant proportion of the workstations are
idle (not being used), resulting in the waste of large amounts of CPU time.
Therefore, the idea of the workstation model is to interconnect all these
workstations by a high-speed LAN so that idle workstations may be used to
process jobs of users who are logged onto other workstations and do not
have sufficient processing power at their own workstations to get their jobs
processed efficiently.
Fig. 1.4: A Distributed Computing System based on Workstation Model
In this model, a user logs onto one of the workstations called his or her
"home" workstation and submits jobs for execution. When the system finds
that the user's workstation does not have sufficient processing power for
executing the processes of the submitted jobs efficiently, it transfers one or
more of the processes from the user's workstation to some other workstation
that is currently idle and gets the process executed there, and finally the
result of execution is returned to the user's workstation.
This model is not as simple to implement as it might appear at first sight,
because several issues must be resolved. Tanenbaum summarizes these
issues as follows; they must be handled carefully to achieve maximum
efficiency:
1. How does the system find an idle workstation?
2. How is a process transferred from one workstation to get it executed on
another workstation?
3. What happens to a remote process if a user logs onto a workstation that
was idle until now and was being used to execute a process of another
workstation?
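To make the first of these questions concrete, one common approach (among several) is a registry-based scheme: each workstation periodically reports its load to a coordinator, which picks a machine whose load is below some threshold. The sketch below is purely illustrative; the names (WorkstationRegistry, find_idle) and the 10% threshold are assumptions, not drawn from any particular system.

```python
class WorkstationRegistry:
    """Hypothetical coordinator-side registry: each workstation
    periodically reports its current CPU load (a heartbeat)."""

    IDLE_THRESHOLD = 0.10            # below 10% load, treat the machine as idle

    def __init__(self):
        self.load = {}               # workstation name -> last reported load

    def report(self, name, load):
        self.load[name] = load       # record a workstation's heartbeat

    def find_idle(self):
        # Return some workstation whose reported load is under the
        # threshold; None means every machine is currently busy.
        for name, load in self.load.items():
            if load < self.IDLE_THRESHOLD:
                return name
        return None

registry = WorkstationRegistry()
registry.report("ws1", 0.85)         # ws1 is busy
registry.report("ws2", 0.02)         # ws2 is idle
print(registry.find_idle())          # -> "ws2": a candidate for remote jobs
```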
Workstation – Server Model
The workstation model is a network of personal workstations, each with its
own disk and a local file system. A workstation with its own local disk is
usually called a diskful workstation and a workstation without a local disk is
called a diskless workstation. With the proliferation of high-speed networks,
diskless workstations have become more popular in network environments
than diskful workstations, making the workstation-server model more
popular than the workstation model for building distributed computing
systems.
A distributed computing system based on the workstation-server model
(Fig. 1.5) consists of a few minicomputers and several workstations (most of
which are diskless, but a few of which may be diskful) interconnected by a
communication network.
Fig. 1.5: A Distributed Computing System based on Workstation-server Model
Note that when diskless workstations are used on a network, the file system
to be used by these workstations must be implemented either by a diskful
workstation or by a minicomputer equipped with a disk for file storage. One
or more of the minicomputers are used for implementing the file system.
Other minicomputers may be used for providing other types of services,
such as database service and print service. Therefore, each minicomputer is
used as a server machine to provide one or more types of services.
Thus, in the workstation-server model, in addition to the workstations,
there are specialized machines (may be specialized workstations) for
running server processes (called servers) for managing and providing
access to shared resources.
For a number of reasons, such as higher reliability and better scalability,
multiple servers are often used for managing the resources of a particular
type in a distributed computing system. For example, there may be multiple
file servers, each running on a separate minicomputer and cooperating via
the network, for managing the files of all the users in the system. Due to this
reason, a distinction is often made between the services that are provided to
clients and the servers that provide them. That is, a service is an abstract
entity that is provided by one or more servers. For example, one or more file
servers may be used in a distributed computing system to provide file
service to the users.
In this model, a user logs onto a workstation called his or her home
workstation. Normal computation activities required by the user's processes
are performed at the user's home workstation, but requests for services
provided by special servers (such as a file server or a database server) are
sent to a server providing that type of service that performs the user's
requested activity and returns the result of request processing to the user's
workstation. Therefore, in this model, the user's processes need not be
migrated to the server machines for getting the work done by those
machines.
For better overall system performance, the local disk of a diskful workstation
is normally used for such purposes as storage of temporary files, storage of
unshared files, storage of shared files that are rarely changed, paging
activity in virtual-memory management, and caching of remotely accessed
data.
Compared to the workstation model, the workstation-server model has
several advantages:
1. In general, it is much cheaper to use a few minicomputers equipped with
large, fast disks that are accessed over the network than a large number
of diskful workstations, with each workstation having a small, slow disk.
2. Diskless workstations are also preferred to diskful workstations from a
system maintenance point of view. Backup and hardware maintenance
are easier to perform with a few large disks than with many small disks
scattered all over a building or campus. Furthermore, installing new
releases of software (such as a file server with new functionalities) is
easier when the software is to be installed on a few file server machines
than on every workstation.
3. In the workstation-server model, since all files are managed by the file
servers, users have the flexibility to use any workstation and access the
files in the same manner irrespective of the workstation onto which the user
is currently logged. Note that this is not true with the workstation model,
in which each workstation has its local file system, because different
mechanisms are needed to access local and remote files.
4. In the workstation-server model, the request-response protocol
described below is mainly used to access the services of the server
machines. Therefore, unlike the workstation model, this model does not
need a process migration facility, which is difficult to implement.
The request-response protocol is known as the client-server model of
communication. In this model, a client process (which in this case
resides on a workstation) sends a request to a server process (which in
this case resides on a minicomputer) for getting some service such as
reading a block of a file. The server executes the request and sends
back a reply to the client that contains the result of request processing.
The client-server model provides an effective general-purpose approach
to the sharing of information and resources in distributed computing
systems. It is not only meant for use with the workstation-server model
but also can be implemented in a variety of hardware and software
environments. The computers used to run the client and server
processes need not necessarily be workstations and minicomputers.
They can be of many types and there is no need to distinguish between
them. It is even possible for both the client and server processes to be
run on the same computer. Moreover, some processes are both client
and server processes; that is, a server process may use the services of
another server, appearing as a client to the latter. A minimal sketch of
the request-response exchange appears after this list.
5. A user has guaranteed response time because workstations are not
used for executing remote processes. However, the model does not
utilize the processing capability of idle workstations.
The V-System proposed by Cheriton in 1988 is an example of a
distributed computing system that is based on the workstation-server
model.
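The request-response exchange described in point 4 above can be sketched as follows, assuming plain TCP sockets and a made-up "read a block" request; the fragment shows only the shape of the protocol, not the API of any real system.

```python
import socket
import threading

# Server side: a listening socket; the server process accepts a
# request, performs the requested service, and returns a reply.
listener = socket.socket()
listener.bind(("localhost", 5000))
listener.listen(1)

def serve_one():
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024).decode()   # e.g. "READ block 7"
        reply = "result of: " + request      # perform the requested service
        conn.sendall(reply.encode())

threading.Thread(target=serve_one, daemon=True).start()

# Client side: send a request and block until the reply arrives.
with socket.socket() as client:
    client.connect(("localhost", 5000))
    client.sendall(b"READ block 7")
    print(client.recv(1024).decode())        # -> "result of: READ block 7"

listener.close()
```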
Processor – Pool Model
The processor-pool model is based on the observation that most of the time
a user does not need any computing power but once in a while the user may
need a very large amount of computing power for a short time (e.g., when
recompiling a program consisting of a large number of files after changing a
basic shared declaration). Therefore, unlike the workstation-server model in
which a processor is allocated to each user, in the processor-pool model the
processors are pooled together to be shared by the users as needed. The
pool of processors consists of a large number of microcomputers and
minicomputers attached to the network. Each processor in the pool has its
own memory to load and run a system program or an application program of
the distributed computing system.
In the pure processor-pool model (Fig. 1.6), the processors in the pool have
no terminals attached directly to them, and users access the system from
terminals that are attached to the network via special devices. These
terminals are either small diskless workstations or graphic terminals, such
as X terminals. A special server (called a run server) manages and allocates
the processors in the pool to different users on a demand basis. When a
user submits a job for computation, an appropriate number of processors
are temporarily assigned to his or her job by the run server. For example, if
the user's computation job is the compilation of a program having n
segments, in which each of the segments can be compiled independently to
produce separate relocatable object files, n processors from the pool can be
allocated to this job to compile all the n segments in parallel. When the
computation is completed, the processors are returned to the pool for use by
other users.
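As a rough illustration of the run server's job, the sketch below (hypothetical; no real run-server protocol is implied) assigns one pooled processor to each independently compilable segment and collects the relocatable object files in parallel, after which the workers return to the pool.

```python
from multiprocessing import Pool

def compile_segment(segment):
    # Stand-in for compiling one independently compilable segment
    # into a relocatable object file on one pooled processor.
    return segment + ".o"

if __name__ == "__main__":
    segments = ["seg1", "seg2", "seg3", "seg4", "seg5"]   # n = 5 segments

    # The "run server" temporarily assigns n processors from the pool,
    # one per segment, so all n segments are compiled in parallel.
    with Pool(processes=len(segments)) as pool:
        object_files = pool.map(compile_segment, segments)

    print(object_files)   # ['seg1.o', ..., 'seg5.o']; processors rejoin the pool
```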
In the processor-pool model there is no concept of a home machine. That is,
a user does not log onto a particular machine but to the system as a whole.
This is in contrast to other models in which each user has a home machine
(e.g., a workstation or minicomputer) onto which he or she logs and runs
most of his or her programs by default.
Fig. 1.6: A distributed computing system based on the processor-pool model
As compared to the workstation-server model, the processor-pool model
allows better utilization of the available processing power of a distributed
computing system. This is because in the processor-pool model, the entire
processing power of the system is available for use by the currently logged-
on users, whereas this is not true for the workstation-server model in which
several workstations may be idle at a particular time but they cannot be
used for processing the jobs of other users. Furthermore, the processor-pool
model provides greater flexibility than the workstation-server model in the
sense that the system's services can be easily expanded without the need
to install any more computers; the processors in the pool can be allocated to
act as extra servers to carry any additional load arising from an increased
user population or to provide new services. However, the processor-pool
model is usually considered to be unsuitable for high-performance
interactive applications, especially those using graphics or window systems.
This is mainly because of the slow speed of communication between the
computer on which the application program of a user is being executed and
the terminal via which the user is interacting with the system. The
workstation-server model is generally considered to be more suitable for
such applications.
Amoeba, proposed by Mullender et al. in 1990, is an example of a distributed
computing system based on the processor-pool model.
Hybrid Model
Out of the four models described above, the workstation-server model is
the most widely used model for building distributed computing systems. This
is because a large number of computer users only perform simple
interactive tasks such as editing jobs, sending electronic mail, and
executing small programs. The workstation-server model is ideal for such
simple usage. However, in a working environment that has groups of users
who often perform jobs needing massive computation, the processor-pool
model is more attractive and suitable.
To combine the advantages of both the workstation-server and processor-
pool models, a hybrid model may be used to build a distributed computing
system. The hybrid model is based on the workstation-server model but with
the addition of a pool of processors. The processors in the pool can be
allocated dynamically for computations that are too large for workstations or
that require several computers concurrently for efficient execution. In
addition to efficient execution of computation-intensive jobs, the hybrid
model gives guaranteed response to interactive jobs by allowing them to be
processed on local workstations of the users. However, the hybrid model is
more expensive to implement than the workstation-server model or the
processor-pool model.
1.3 Advantages of Distributed Systems
From the models of distributed computing systems presented above, it is
obvious that distributed computing systems are much more complex and
difficult to build than traditional centralized systems (those consisting of a
single CPU, its memory, peripherals, and one or more terminals). The
increased complexity is mainly due to the fact that in addition to being
capable of effectively using and managing a very large number of distributed
resources, the system software of a distributed computing system should
also be capable of handling the communication and security problems that
are very different from those of centralized systems. For example, the
performance and reliability of a distributed computing system depends to a
great extent on the performance and reliability of the underlying
communication network. Special software is usually needed to handle loss
of messages during transmission across the network or to prevent
overloading of the network, which degrades the performance and
responsiveness to the users. Similarly, special software security measures
are needed to protect the widely distributed shared resources and services
against intentional or accidental violation of access control and privacy
constraints.
Despite the increased complexity and the difficulty of building distributed
computing systems, the installation and use of distributed computing
systems are rapidly increasing. This is mainly because the advantages of
distributed computing systems outweigh their disadvantages. The technical
needs, the economic pressures, and the major advantages that have led to
the emergence and popularity of distributed computing systems are
described next.
Inherently Distributed Applications
Distributed computing systems come into existence in some very natural
ways. For example, several applications are inherently distributed in nature
and require a distributed computing system for their realization. For
instance, in an employee database of a nationwide organization, the data
pertaining to a particular employee are generated at the employee's branch
office, and in addition to the global need to view the entire database, there is
a local need for frequent and immediate access to locally generated data at
each branch office. Applications such as these require that some processing
power be available at the many distributed locations for collecting,
preprocessing, and accessing data, resulting in the need for distributed
computing systems. Some other examples of inherently distributed
applications are a computerized worldwide airline reservation system, a
computerized banking system in which a customer can deposit/withdraw
money from his or her account from any branch of the bank, and a factory
automation system controlling robots and machines all along an assembly
line.
Information Sharing among Distributed Users
Another reason for the emergence of distributed computing systems was a
desire for efficient person-to-person communication facility by sharing
information over great distances. In a distributed computing system,
information generated by one of the users can be easily and efficiently
shared by the users working at other nodes of the system. This facility may
be useful in many ways. For example, a project can be performed by two or
more users who are geographically far off from each other but whose
computers are a part of the same distributed computing system. In this
case, although the users are geographically separated from each other, they
can work in cooperation, for example, by transferring the files of the project,
logging onto each other's remote computers to run programs, and
exchanging messages by electronic mail to coordinate the work.
Resource Sharing
Information is not the only thing that can be shared in a distributed
computing system. Sharing of software resources such as software libraries
and databases as well as hardware resources such as printers, hard disks,
and plotters can also be done in a very effective way among all the
computers and the users of a single distributed computing system. For
example, we saw that in a distributed computing system based on the
workstation-server model the workstations may have no disk or only a small
disk (10-20 megabytes) for temporary storage, and access to permanent
files on a large disk can be provided to all the workstations by a single file
server.
Better Price-Performance Ratio
This is one of the most important reasons for the growing popularity of
distributed computing systems. With the rapidly increasing power and
reduction in the price of microprocessors, combined with the increasing
speed of communication networks, distributed computing systems
potentially have a much better price-performance ratio than a single large
centralized system. For example, we saw how a small number of CPUs in a
distributed computing system based on the processor-pool model can be
effectively used by a large number of users from inexpensive terminals,
giving a fairly high price-performance ratio as compared to either a
centralized time-sharing system or a personal computer. Another reason for
distributed computing systems to be more cost-effective than centralized
systems is that they facilitate resource sharing among multiple computers.
For example, a single unit of expensive peripheral devices such as color
laser printers, high-speed storage devices, and plotters can be shared
among all the computers of the same distributed computing system. If these
computers are not linked together with a communication network, each
computer must have its own peripherals, resulting in higher cost.
Shorter Response Times and Higher Throughput
Due to multiplicity of processors, distributed computing systems are
expected to have better performance than single-processor centralized
systems. The two most commonly used performance metrics are response
time and throughput of user processes. That is, the multiple processors of a
distributed computing system can be utilized properly for providing shorter
response times and higher throughput than a single-processor centralized
system. For example, if there are two different programs to be run, two
processors are evidently more powerful than one because the programs can
be simultaneously run on different processors. Furthermore, if a particular
computation can be partitioned into a number of subcomputations that can
run concurrently, in a distributed computing system all the subcomputations
can be simultaneously run with each one on a different processor.
Distributed computing systems with very fast communication networks are
increasingly being used as parallel computers to solve single complex
problems rapidly. Another method often used in distributed computing
systems for achieving better overall performance is to distribute the load
more evenly among the multiple processors by moving jobs from currently
overloaded processors to lightly loaded ones. For example, in a distributed
computing system based on the workstation model, if a user currently has
two processes to run, out of which one is an interactive process and the
other is a process that can be run in the background, it may be
advantageous to run the interactive process on the home node of the user
and the other one on a remote idle node (if any node is idle).
Higher Reliability
Reliability refers to the degree of tolerance against errors and component
failures in a system. A reliable system prevents loss of information even in
the event of component failures. The multiplicity of storage devices and
processors in a distributed computing system allows the maintenance of
multiple copies of critical information within the system and the execution of
important computations redundantly to protect them against catastrophic
failures. With this approach, if one of the processors fails, the computation
can be successfully completed at the other processor, and if one of the
storage devices fails, the information can still be used from the other storage
device. Furthermore, the geographical distribution of the processors and
other resources in a distributed computing system limits the scope of
failures caused by natural disasters.
An important aspect of reliability is availability, which refers to the fraction of
time for which a system is available for use. In comparison to a centralized
system, a distributed computing system also enjoys the advantage of
increased availability. For example, if the processor of a centralized system
fails (assuming that it is a single-processor centralized system), the entire
system breaks down and no useful work can be performed. However, in the
case of a distributed computing system, a few parts of the system can be
down without interrupting the jobs of the users who are using the other parts
of the system. For example, if a workstation of a distributed computing
system that is based on the workstation-server model fails, only the user of
that workstation is affected. Other users of the system are not affected by
this failure. Similarly, in a distributed computing system based on the
processor-pool model, if some of the processors in the pool are down at any
moment, the system can continue to function normally, simply with some
loss in performance that is proportional to the number of processors that are
down. In this case, no user is affected, and the users may not even notice
that some of the processors are down.
The advantage of higher reliability is an important reason for the use of
distributed computing systems for critical applications whose failure may be
disastrous. However, often reliability comes at the cost of performance.
Therefore, it is necessary to maintain a balance between the two.
Extensibility and Incremental Growth
Another major advantage of distributed computing systems is that they are
capable of incremental growth. That is, it is possible to gradually extend the
power and functionality of a distributed computing system by simply adding
additional resources (both hardware and software) to the system as and
when the need arises. For example, additional processors can be easily
added to the system to handle the increased workload of an organization
that might have resulted from its expansion. Incremental growth is a very
attractive feature because for most existing and proposed applications it is
practically impossible to predict future demands of the system. Extensibility
is also easier in a distributed computing system because addition of new
resources to an existing system can be performed without significant
disruption of the normal functioning of the system. Properly designed
distributed computing systems that have the property of extensibility and
incremental growth are called open distributed systems.
Better Flexibility in Meeting Users’ Needs
Different types of computers are usually more suitable for performing
different types of computations. For example, computers with ordinary
power are suitable for ordinary data processing jobs, whereas high-
performance computers are more suitable for complex mathematical
computations. In a centralized system, the users have to perform all types of
computations on the only available computer. However, a distributed
computing system may have a pool of different types of computers, in which
case the most appropriate one can be selected for processing a user's job
depending on the nature of the job. For instance, we saw that in a
distributed computing system that is based on the hybrid model, interactive
jobs can be processed at a user's own workstation and the processors in
the pool may be used to process noninteractive, computation-intensive jobs.
1.4 Distributed Operating Systems
Tanenbaum and Van Renesse define an operating system as a program
that controls the resources of a computer system and provides its users with
an interface or virtual machine that is more convenient to use than the bare
machine. According to this definition, the two primary tasks of an operating
system are as follows:
1. To present users with a virtual machine that is easier to program than
the underlying hardware.
2. To manage the various resources of the system. This involves
performing such tasks as keeping track of who is using which resource,
granting resource requests, accounting for resource usage, and
mediating conflicting requests from different programs and users.
Therefore, the users' view of a computer system, the manner in which the
users access the various resources of the computer system, and the ways
in which the resource requests are granted depend to a great extent on the
operating system of the computer system. The operating systems commonly
used for distributed computing systems can be broadly classified into two
types – network operating systems and distributed operating systems. The
three most important features commonly used to differentiate between these
two types of operating systems are system image, autonomy, and fault
tolerance capability. These features are given below:
System image: Under a network OS, the user views the distributed system
as a collection of machines connected by a communication subsystem,
i.e., the user is aware of the fact that multiple computers are used. A
distributed OS hides the existence of multiple computers and provides a
single system image to its users.
Autonomy: A network OS is built on a set of existing centralized OSs and
handles the interfacing and coordination of remote operations and
communications between these OSs. So, in this case, each machine has its
own OS. With a distributed OS, there is a single system-wide OS and each
computer runs part of this global OS.
Fault tolerance capability: A network operating system provides little or no
fault tolerance capability in the sense that if 10% of the machines of the
entire distributed computing system are down at any moment, at least 10%
of the users are unable to continue with their work. On the other hand, with
a distributed operating system, most of the users are normally unaffected by
the failed machines and can continue to perform their work normally, with
only a 10% loss in performance of the entire distributed computing system.
Therefore, the fault tolerance capability of a distributed operating system is
usually very high as compared to that of a network operating system.
1.5 Issues in Designing Distributed Operating Systems
In general, designing a distributed operating system is more difficult than
designing a centralized operating system for several reasons. In the design
of a centralized operating system, it is assumed that the operating system
has access to complete and accurate information about the environment in
which it is functioning. For example, a centralized operating system can
request status information, being assured that the interrogated component
will not change state while awaiting a decision based on that status
information, since only the single operating system asking the question may
give commands. However, a distributed operating system must be designed
with the assumption that complete information about the system
environment will never be available. In a distributed system, the resources
are physically separated, there is no common clock among the multiple
processors, delivery of messages is delayed, and messages could even be
lost. Due to all these reasons, a distributed operating system does not have
up-to-date, consistent knowledge about the state of the various components
of the underlying distributed system. Obviously, lack of up-to-date and
consistent information makes many things (such as management of
resources and synchronization of cooperating activities) much harder in the
design of a distributed operating system. For example, it is hard to schedule
the processors optimally if the operating system is not sure how many of
them are up at the moment.
Despite these complexities and difficulties, a distributed operating system
must be designed to provide all the advantages of a distributed system to its
users. That is, the users should be able to view a distributed system as a
virtual centralized system that is flexible, efficient, reliable, secure, and easy
to use. To meet this challenge, the designers of a distributed operating
system must deal with several design issues. Some of the key design issues
are described below.
Transparency
One of the main goals of a distributed operating system is to make the
existence of multiple computers invisible (transparent) and provide a single
system image to its users. That is, a distributed operating system must be
designed in such a way that a collection of distinct machines connected by a
communication subsystem appears to its users as a virtual uniprocessor.
Achieving complete transparency is a difficult task and requires that several
different aspects of transparency be supported by the distributed operating
system. The eight forms of transparency identified by the International
Standards Organization's Reference Model for Open Distributed Processing
[ISO 1992] are access transparency, location transparency, replication
transparency, failure transparency, migration transparency, concurrency
transparency, performance transparency, and scaling transparency.
Access Transparency
Access transparency means that users should not need to know, nor be
able to tell, whether a resource (hardware or software) is remote or local.
This implies that the distributed operating system should allow users to
access remote resources in the same way as local resources. That is, the
user interface, which takes the form of a set of system calls, should not
distinguish between local and remote resources, and it should be the
responsibility of the distributed operating system to locate the resources and
to arrange for servicing user requests in a user-transparent manner.
This requirement leads to the development and deployment of a well-
designed set of system calls that are meaningful in both centralized and
distributed environments and a global resource naming facility. Due to the
need to handle communication failures in distributed systems, it is not
possible to design system calls that provide complete access transparency.
However, the area of designing a global resource naming facility has been
well researched with considerable success. The distributed shared memory
mechanism is also meant to provide a uniform set of system calls for
accessing both local and remote memory objects. Although this mechanism
is quite useful in providing access transparency, it is suitable only for limited
types of distributed applications due to its performance limitation.
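The flavor of access transparency can be sketched with a single hypothetical read call: the caller names a resource, and the system, not the user, decides whether to satisfy the request locally or by contacting a remote server. All names below (read_resource, is_local, fetch_remote) are illustrative assumptions, not real system calls.

```python
def is_local(name):
    # Stand-in for the global naming facility: map a system-wide
    # resource name to the node that currently holds the resource.
    return name.startswith("/local/")

def fetch_remote(name):
    # Stand-in for a request-response exchange with a remote server.
    return "<contents of " + name + " fetched over the network>"

def read_resource(name):
    # One uniform call for both cases: the user cannot tell, and need
    # not know, whether the named resource is local or remote.
    if is_local(name):
        with open(name[len("/local"):]) as f:
            return f.read()
    return fetch_remote(name)

print(read_resource("/remote/projects/report.txt"))
```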
Location Transparency
The two main aspects of location transparency are as follows:
Name transparency refers to the fact that the name of a resource
(hardware or software) should not reveal any hint as to the physical
location of the resource. That is, the name of a resource should be
independent of the physical connectivity or topology of the system or the
current location of the resource. Furthermore, such resources, which are
capable of being moved from one node to another in a distributed
system (such as a file), must be allowed to move without having their
names changed. Therefore, resource names must be unique
systemwide.
User mobility refers to the fact that no matter which machine a user is
logged onto, he or she should be able to access a resource with the
same name. That is, the user should not be required to use different
names to access the same resource from two different nodes of the
system. In a distributed system that supports user mobility, users can
freely log on to any machine in the system and access any resource
without making any extra effort.
Both the name transparency and user mobility requirements call for a
systemwide, global resource naming facility.
Replication Transparency
For better performance and reliability, almost all distributed operating
systems have the provision to create replicas (additional copies) of files and
other resources on different nodes of the distributed system. In these
systems, both the existence of multiple copies of a replicated resource and
the replication activity should be transparent to the users. That is, two
important issues related to replication transparency are naming of replicas
and replication control. It is the responsibility of the system to name the
various copies of a resource and to map a user-supplied name of the
resource to an appropriate replica of the resource. Furthermore, replication
control decisions such as how many copies of the resource should be
created, where should each copy be placed, and when should a copy be
created/deleted should be made entirely automatically by the system in a
user-transparent manner.
Failure Transparency
Failure transparency deals with masking partial failures in the system from
the users, such as a communication link failure, a machine failure, or a
storage device crash. A distributed operating system having failure
transparency property will continue to function, perhaps in a degraded form,
in the face of partial failures. For example, suppose the file service of a
distributed operating system is to be made failure transparent. This can be
done by implementing it as a group of file servers that closely cooperate
with each other to manage the files of the system and that function in such a
manner that the users can utilize the file service even if only one of the file
servers is up and working. In this case, the users cannot notice the failure of
one or more file servers, except for slower performance of file access
operations. Any type of service can be implemented in this way for failure
transparency. However, in this type of design, care should be taken to
ensure that the cooperation among multiple servers does not add too much
overhead to the system.
Complete failure transparency is not achievable with the current state of the
art in distributed operating systems because all types of failures cannot be
handled in a user-transparent manner. For example, failure of the
communication network of a distributed system normally disrupts the work of
its users and is noticeable by the users. Moreover, an attempt to design a
completely failure-transparent distributed system will result in a very slow
and highly expensive system due to the large amount of redundancy
required for tolerating all types of failures. The design of such a distributed
system, although theoretically possible, is not practically justified.
Migration Transparency
For better performance, reliability, and security reasons, an object that is
capable of being moved (such as a process or a file) is often migrated from
one node to another in a distributed system. The aim of migration
transparency is to ensure that the movement of the object is handled
automatically by the system in a user-transparent manner. Three important
issues in achieving this goal are as follows:
i) Migration decisions such as which object is to be moved from where
to where should be made automatically by the system.
ii) Migration of an object from one node to another should not require
any change in its name.
iii) When the migrating object is a process, the interprocess
communication mechanism should ensure that a message sent to the
migrating process reaches it without the need for the sender process
to resend it if the receiver process moves to another node before the
message is received.
Concurrency Transparency
In a distributed system, multiple users who are spatially separated use the
system concurrently. In such a situation, it is economical to share the
system resources (hardware or software) among the concurrently executing
user processes. However, since the number of available resources in a
computing system is restricted, one user process must necessarily influence
the action of other concurrently executing user processes, as it competes for
resources. For example, concurrent update to the same file by two different
processes should be prevented. Concurrency transparency means that
each user has a feeling that he or she is the sole user of the system and
other users do not exist in the system. For providing concurrency
transparency, the resource sharing mechanisms of the distributed operating
system must have the following four properties:
i) An event-ordering property ensures that all access requests to various
system resources are properly ordered to provide a consistent view to
all users of the system.
ii) A mutual-exclusion property ensures that at any time at most one
process accesses a shared resource, which must not be used
simultaneously by multiple processes if program operation is to be
correct (a minimal sketch of this property follows this list).
iii) A no-starvation property ensures that if every process that is granted
a resource, which must not be used simultaneously by multiple
processes, eventually releases it, every request for that resource is
eventually granted.
iv) A no-deadlock property ensures that a situation will never occur in
which competing processes prevent their mutual progress even
though no single one requests more resources than available in the
system.
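A minimal sketch of the mutual-exclusion property from item (ii), using threads on one machine for brevity: a single lock guarantees that at most one "process" updates the shared resource at a time. In a real distributed operating system a distributed mutual-exclusion algorithm would be used instead, but the property being guaranteed is the same.

```python
import threading

balance = 0                      # a shared resource
lock = threading.Lock()          # admits at most one accessor at a time

def deposit(amount, times):
    global balance
    for _ in range(times):
        with lock:               # mutual exclusion around the update
            balance += amount    # the read-modify-write is now atomic

threads = [threading.Thread(target=deposit, args=(1, 10000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)                   # always 40000; without the lock, updates may be lost
```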
Performance Transparency
The aim of performance transparency is to allow the system to be
automatically reconfigured to improve performance, as loads vary
dynamically in the system. As far as practicable, a situation in which one
processor of the system is overloaded with jobs while another processor is
idle should not be allowed to occur. That is, the processing capability of the
system should be uniformly distributed among the currently available jobs in
the system. This requirement calls for the support of intelligent resource
allocation and process migration facilities in distributed operating systems.
Scaling Transparency
The aim of scaling transparency is to allow the system to expand in scale
without disrupting the activities of the users. This requirement demands for
open-system architecture and the use of scalable algorithms for designing
the distributed operating system components.
Reliability
In general, distributed systems are expected to be more reliable than
centralized systems due to the existence of multiple instances of resources.
However, the existence of multiple instances of the resources alone cannot
increase the system's reliability. Rather, the distributed operating system,
which manages these resources, must be designed properly to increase the
system's reliability by taking full advantage of this characteristic feature of a
distributed system.
A fault is a mechanical or algorithmic defect that may generate an error; an
error, in turn, may lead to system failure. Depending on the manner in which
a failed system behaves, system failures are of two types – fail-stop and
Byzantine. In the case of fail-stop failure, the system stops functioning after
changing to a state in which its failure can be detected. On the other hand,
in the case of Byzantine failure, the system continues to function but
produces wrong results. Undetected software bugs often cause Byzantine
failure of a system. Obviously, Byzantine failures are much more difficult to
deal with than fail-stop failures.
For higher reliability, the fault-handling mechanisms of a distributed
operating system must be designed properly to avoid faults, to tolerate
faults, and to detect and recover from faults. Commonly used methods for
dealing with these issues are:
Fault Avoidance: Fault avoidance deals with designing the components of
the system in such a way that the occurrence of faults is minimized.
Conservative design practices such as using high-reliability components are
often employed for improving the system's reliability based on the idea of
fault avoidance. Although a distributed operating system often has little or
no role to play in improving the fault avoidance capability of a hardware
component, the designers of the various software components of the
distributed operating system must test them thoroughly to make these
components highly reliable.
Fault Tolerance: Fault tolerance is the ability of a system to continue
functioning in the event of partial system failure. The performance of the
system might be degraded due to partial failure, but otherwise the system
functions properly.
Fault Detection and Recovery: The fault detection and recovery method of
improving reliability deals with the use of hardware and software
mechanisms to determine the occurrence of a failure and then to correct the
system to a state acceptable for continued operation.
Flexibility
Another important issue in the design of distributed operating systems is
flexibility. Flexibility is the most important feature for open distributed
systems. The design of a distributed operating system should be flexible
due to the following reasons:
Ease of modification. From the experience of system designers, it has
been found that some parts of the design often need to be replaced/
modified either because some bug is detected in the design or because
the design is no longer suitable for the changed system environment or
new-user requirements. Therefore, it should be easy to incorporate
changes in the system in a user-transparent manner or with minimum
interruption caused to the users.
Ease of enhancement. In every system, new functionalities have to be
added from time to time to make it more powerful and easy to use.
Therefore, it should be easy to add new services to the system.
Furthermore, if a group of users do not like the style in which a particular
service is provided by the operating system, they should have the
flexibility to add and use their own service that works in the style with
which the users of that group are more familiar and feel more
comfortable.
The most important design factor that influences the flexibility of a
distributed operating system is the model used for designing its kernel. The
kernel of an operating system is its central controlling part that provides
basic system facilities. It operates in a separate address space that is
inaccessible to user processes. It is the only part of an operating system
that a user cannot replace or modify. In the case of a distributed operating
system, identical kernels are run on all the nodes of the distributed system.
Performance
If a distributed system is to be used, its performance must be at least as
good as a centralized system. That is, when a particular application is run
on a distributed system, its overall performance should be better than or at
least equal to that of running the same application on a single-processor
system. However, to achieve this goal, it is important that the various
components of the operating system of a distributed system be designed
properly; otherwise, the overall performance of the distributed system may
turn out to be worse than a centralized system. Some design principles
considered useful for better performance are as follows:
Batch if possible. Batching often helps in improving performance
greatly. For example, transfer of data across the network in large chunks
rather than as individual pages is much more efficient. Similarly,
piggybacking of acknowledgment of previous messages with the next
message during a series of messages exchanged between two
communicating entities also improves performance.
Cache whenever possible. Caching of data at clients' sites frequently
improves overall system performance because it makes data available
wherever it is being currently used, thus saving a large amount of
computing time and network bandwidth. In addition, caching reduces
contention on centralized resources.
Minimize copying of data. Data copying accounts for a substantial
part of the CPU cost of many operations. For example, while being
transferred from its sender to its receiver, message data may take the
following path on the sending side:
a. From sender's stack to its message buffer
b. From the message buffer in the sender's address space to the
message buffer in the kernel's address space
c. Finally, from the kernel to the network interface board
On the receiving side, the data probably takes a similar path in the
reverse direction. Therefore, in this case, a total of six copy operations
are involved in the message transfer operation. Similarly, in several
systems, the data copying overhead is also large for read and write
operations on block I/O devices. Therefore, for better performance, it is
desirable to avoid copying of data, although this is not always simple to
achieve. Making optimal use of memory management often helps in
eliminating much data movement between the kernel, block I/O devices,
clients, and servers.
Minimize network traffic. System performance may also be improved
by reducing internode communication costs. For example, accesses to
remote resources require communication, possibly through intermediate
nodes. Therefore, migrating a process closer to the resources it is using
most heavily may be helpful in reducing network traffic in the system if
the decreased cost of accessing its favorite resource offsets the possible
increased cost of accessing its less favored ones. Another way to
reduce network traffic is to use the process migration facility to cluster
two or more processes that frequently communicate with each other on
the same node of the system. Avoiding the collection of global state
information for making some decision also helps in reducing network
traffic.
Take advantage of fine-grain parallelism for multiprocessing.
Performance can also be improved by taking advantage of fine-grain
parallelism for multiprocessing. For example, threads (described in
a later unit) are often used for structuring server processes. Servers
structured as a group of threads can operate efficiently because they
can simultaneously service requests from several clients. Fine-grained
concurrency control of simultaneous accesses by multiple processes to
a shared resource is another example of application of this principle for
better performance.
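As a small sketch of the last principle, a server structured as a group of threads can service several clients' requests simultaneously, since threads that block on I/O do not hold up the others. The fragment below uses a thread pool and is illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def handle_request(client_id):
    # Stand-in for servicing one client's request; while this thread
    # waits (e.g., on disk or network I/O), other threads make progress.
    time.sleep(0.1)
    return "reply to client %d" % client_id

# A server structured as a group of eight threads services eight
# clients' requests simultaneously instead of strictly one at a time.
with ThreadPoolExecutor(max_workers=8) as server_threads:
    replies = list(server_threads.map(handle_request, range(8)))
print(replies)
```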
Scalability
Scalability refers to the capability of a system to adapt to increased service
load. It is inevitable that a distributed system will grow with time since it is
very common to add new machines or an entire subnetwork to the system to
take care of increased workload or organizational changes in a company.
Therefore, a distributed operating system should be designed to easily cope
with the growth of nodes and users in the system. That is, such growth
should not cause serious disruption of service or significant loss of
performance to users. Some guiding principles for designing scalable
distributed systems are as follows:
Avoid centralized entities, such as a single central file server.
Avoid centralized algorithms.
Perform most operations on client workstations, since servers are shared
by several clients and can become performance bottlenecks.
Heterogeneity
A heterogeneous distributed system consists of interconnected sets of
dissimilar hardware or software systems. Because of this diversity, designing
heterogeneous distributed systems is far more difficult than designing
homogeneous distributed systems, in which each system is based on the
same, or closely related, hardware and software. However, as a
consequence of large scale, heterogeneity is often inevitable in distributed
systems. Furthermore, heterogeneity is often preferred by many users
because it gives them the flexibility of using different computer platforms for
different applications. For example, a user may have the flexibility of a
supercomputer for simulations, a Macintosh for document processing, and a
UNIX workstation for program development.
Incompatibilities in a heterogeneous distributed system may be of different
types. For example, the internal formatting schemes of different
communication and host processors may be different; or when several
networks are interconnected via gateways, the communication protocols
and topologies of different networks may be different; or the servers
operating at different nodes of the system may be different. For instance,
some hosts use 32-bit word lengths while others use word lengths of 16 or
64 bits. Byte ordering within these data constructs can vary as well,
requiring special converters to enable data sharing between incompatible
hosts.
In a heterogeneous distributed system, some form of data translation is
necessary for interaction between two incompatible nodes. Some earlier
systems left this translation to the users, but this is no longer acceptable.
The data translation job may be performed either at the sender's node or at
the receiver's node. Suppose this job is performed at the receiver's node.
With this approach, at every node there must be a translator to convert each
format in the system to the format used on the receiving node. Therefore, if
there are n different formats, n - 1 pieces of translation software must be
supported at each node, resulting in a total of n (n - 1) pieces of translation
software in the system. This is undesirable, as adding a new type of format
becomes a more difficult task over time. Performing the translation job at the
sender's node instead of the receiver's node also suffers from the same
drawback.
The software complexity of this translation process can be greatly reduced
by using an intermediate standard data format. In this method, an
intermediate standard data format is declared, and each node only requires
a translation software for converting from its own format to the standard
format and from the standard format to its own format. In this case, when
two incompatible nodes interact, the data to be sent is
first converted to the standard format at the sender node, transferred across
the network in the standard format, and finally converted from the standard
format to the receiver's format at the receiver node. By choosing the
standard format to be the most common format in the system, the number of
conversions can be reduced.
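A minimal sketch of this idea in Python, assuming the agreed standard format is a big-endian 32-bit integer: each node supplies only these two converters, so n formats require 2n converters in total instead of n(n - 1):

    import struct

    def to_standard(value: int) -> bytes:
        # Convert from this node's native representation to the agreed
        # standard format ('>' selects big-endian byte order).
        return struct.pack('>i', value)

    def from_standard(data: bytes) -> int:
        # Convert from the standard format back to the native representation.
        return struct.unpack('>i', data)[0]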
Security
In order that the users can trust the system and rely on it, the various
resources of a computer system must be protected against destruction and
unauthorized access. Enforcing security in a distributed system is more
difficult than in a centralized system because of the lack of a single point of
control and the use of insecure networks for data communication. In a
centralized system, all users are authenticated by the system at login time,
and the system can easily check whether a user is authorized to perform the
requested operation on an accessed resource. In a distributed system,
however, since the client-server model is often used for requesting and
providing services, when a client sends a request message to a server, the
server must have some way of knowing who the client is. This is not so
simple as it might appear because any client identification field in the
message cannot be trusted. This is because an intruder (a person or
program trying to obtain unauthorized access to system resources) may
pretend to be an authorized client or may change the message contents
during transmission. Therefore, as compared to a centralized system,
enforcement of security in a distributed system has the following additional
requirements:
It should be possible for the sender of a message to know that the
message was received by the intended receiver.
It should be possible for the receiver of a message to know that the
message was sent by the genuine sender.
It should be possible for both the sender and receiver of a message to
be guaranteed that the contents of the message were not changed while
it was in transfer.
Cryptography is the only known practical method for dealing with these
security aspects of a distributed system. In this method, comprehension of
private information is prevented by encrypting the information, which can
then be decrypted only by authorized users.
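The first two requirements concern authenticating the communicating parties, and the third concerns message integrity. As one deliberately simplified illustration (a real design would also need encryption and a key distribution mechanism; the key below is a placeholder), a message authentication code computed with a shared secret key lets a receiver detect forged or altered messages:

    import hashlib
    import hmac

    SECRET_KEY = b'shared-secret'   # hypothetical pre-shared key

    def protect(message: bytes) -> bytes:
        # Prepend an authentication tag computed over the message.
        tag = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
        return tag + message

    def verify(packet: bytes) -> bytes:
        # Recompute the tag; a mismatch means the message was forged or
        # its contents were changed while in transfer.
        tag, message = packet[:32], packet[32:]
        expected = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError('message failed authentication')
        return message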
Another guiding principle for security is that a system whose security
depends on the integrity of the fewest possible entities is more likely to
remain secure as it grows. For example, it is much simpler to ensure
security based on the integrity of the much smaller number of servers rather
than trusting thousands of clients. In this case, it is sufficient to only ensure
the physical security of these servers and the software they run.
1.6 Introduction to Distributed Computing Environment (DCE)
A vendor-independent distributed computing environment, DCE was defined
by the Open Software Foundation (OSF), a consortium of computer
manufacturers, including IBM, DEC, and Hewlett-Packard. It is not an
operating system, nor is it an application. Rather, it is an integrated set of
services and tools that can be installed as a coherent environment on top of
existing operating systems and serve as a platform for building and running
distributed applications.
A primary goal of DCE is vendor independence. It runs on many different
kinds of computers, operating systems, and networks produced by different
vendors. For example, some operating systems to which DCE can be easily
ported include OSF/1, AIX, DOMAIN OS, ULTRIX, HP-UX, SINIX, SunOS,
UNIX System V, VMS, WINDOWS, and OS/2. On the other hand, it can be
used with virtually any network hardware and transport software, including
TCP/IP and X.25.
As shown in Fig. 1.7, DCE is middleware layered between the
DCE applications layer and the operating system and networking layer. The
basic idea is to take a collection of existing machines (possibly from different
vendors), interconnect them by a communication network, add the DCE
software platform on top of the native operating systems of the machines,
and then be able to build and run distributed applications. Each machine
has its own local operating system, which may be different from that of other
machines. The DCE software layer on top of the operating system and
networking layer hides the differences between machines by automatically
performing data-type conversions when necessary. Therefore, the
heterogeneous nature of the system is transparent to the applications
programmers, making their job of writing distributed applications much
simpler.
DCE applications
DCE software
Operating systems and networking
Fig. 1.7: Position of DCE Software in a DCE-based Distributed System
DCE Components
DCE is a blend of various technologies developed independently and nicely
integrated by OSF. Each of these technologies forms a component of DCE.
The main components of DCE are as follows:
Threads package: It provides a simple programming model for building
concurrent applications. It includes operations to create and control
multiple threads of execution in a single process and to synchronize
access to global data within an application.
Remote Procedure Call (RPC) facility: It provides programmers with a
number of powerful tools necessary to build client-server applications. In
fact, the DCE RPC facility is the basis for all communication in DCE
because the programming model underlying all of DCE is the client-
server model. It is easy to use, network-independent and protocol-
independent, provides secure communication between a client and a
server, and hides differences in data requirements by automatically
converting data to the appropriate forms needed by clients and servers.
Distributed Time Service (DTS): It closely synchronizes the clocks of
all the computers in the system. It also permits the use of time values
from external time sources to synchronize the clocks of the computers in
the system with external time. This facility can also be used to
synchronize the clocks of the computers of one distributed environment
with the clocks of the computers of another distributed environment.
Name services: The name services of DCE include the Cell Directory
Service (CDS), the Global Directory Service (GDS), and the Global
Directory Agent (GDA). These services allow resources such as servers,
files, devices, and so on, to be uniquely named and accessed in a
location-transparent manner.
Security Service: It provides the tools needed for authentication and
authorization to protect system resources against illegitimate access.
Distributed File Service (DFS): It provides a systemwide file system
that has such characteristics as location transparency, high
performance, and high availability. A unique feature of DCE DFS is that
it can also provide file services to clients of other file systems.
DCE Cells
The DCE system is highly scalable in the sense that a system running DCE
can have thousands of computers and millions of users spread over a
worldwide geographic area. To accommodate such large systems, DCE
uses the concept of cells. This concept helps break down a large system
into smaller, manageable units called cells.
In a DCE system, a cell is a group of users, machines, or other resources
that typically have a common purpose and share common DCE services.
The minimum cell configuration requires a cell directory server, a security
server, a distributed time server, and one or more client machines. Each
DCE client machine has client processes for security service, cell directory
service, distributed time service, RPC facility, and threads facility. A DCE
client machine may also have a process for distributed file service if a cell
configuration has a DCE distributed file server. Due to the use of the method
of intersection for clock synchronization, it is recommended that each cell in
a DCE system should have at least three distributed time servers.
An important decision to be made while setting up a DCE system is to
decide the cell boundaries. The following four factors should be taken into
consideration for making this decision.
i) Purpose: The machines of users working on a common goal should be
put in the same cell, as they need easy access to a common set of
system resources. That is, users of machines in the same cell have
closer interaction with each other than with users of machines in
different cells. For example, if a company manufactures and sells
various types of products, depending on the manner in which the
company functions, either a product-oriented or a function-oriented
approach may be taken to decide cell boundaries. In the product-
oriented approach, separate cells are formed for each product, with the
users of the machines belonging to the same cell being responsible for
all types of activities (design, manufacturing, marketing, and support
services) related to one particular product. On the other hand, in the
function-oriented approach, separate cells are formed for each type of
activity, with the users belonging to the same cell being responsible for
a particular activity, such as design, of all types of products.
ii) Administration: Each system needs an administrator to register new
users in the system and to decide their access rights to the system's
resources. To perform his or her job properly, an administrator must
know the users and the resources of the system. Therefore, to simplify
administration jobs, all the machines and their users that are known to
and manageable by an administrator should be put in a single cell. For
example, all machines belonging to the same department of a company
or a university can belong to a single cell. From an administration point
of view, each cell has a different administrator.
iii) Security: Machines of those users who have greater trust in each
other should be put in the same cell. That is, users of machines of a
cell trust each other more than they trust the users of machines of other
cells. In such a design, cell boundaries act like firewalls in the sense
that accessing a resource that belongs to another cell requires more
sophisticated authentication than accessing a resource that belongs to
a user's own cell.
iv) Overhead: Several DCE operations, such as name resolution and user
authentication, incur more overhead when they are performed between
cells than when they are performed within the same cell. Therefore,
machines of users who frequently interact with each other and the
resources frequently accessed by them should be placed in the same
cell. The need to access a resource of another cell should arise only
infrequently for better overall system performance.
1.7 Summary
A distributed computing system is a collection of processors interconnected
by a communication network in which each processor has its own local
memory and other peripherals and communication between any two
processors of the system takes place by message passing over the
communication network.
The existing models for distributed computing systems can be broadly
classified into five models: minicomputer, workstation, workstation-server,
processor-pool, and hybrid.
Distributed computing systems are much more complex and difficult to build
than the traditional centralized systems. Despite the increased complexity
and the difficulty of building, the installation and use of distributed computing
systems are rapidly increasing. This is mainly because the advantages of
distributed computing systems outweigh their disadvantages. The main
advantages of distributed computing systems are (a) suitability for inherently
distributed applications, (b) sharing of information among distributed users,
(c) sharing of resources, (d) better price-performance ratio, (e) shorter
response times and higher throughput, (f) higher reliability, (g) extensibility
and incremental growth, and (h) better flexibility in meeting users' needs.
The operating systems commonly used for distributed computing systems
can be broadly classified into two types: network operating systems and
distributed operating systems. As compared to a network operating system,
a distributed operating system has better transparency and fault tolerance
capability and provides the image of a virtual uniprocessor to the users.
The main issues involved in the design of a distributed operating system are
transparency, reliability, flexibility, performance, scalability, heterogeneity,
security, and emulation of existing operating systems.
DCE is an integrated set of services and tools that can be installed as a
coherent environment on top of existing operating systems and serve as a
platform for building and running distributed applications. A primary goal of
DCE is vendor independence. It runs on many different kinds of computers,
operating systems, and networks produced by different vendors.
1.8 Terminal Questions
1. Discuss the relative advantages and disadvantages of the various
commonly used models for configuring distributed computing systems.
2. What are the main differences between a network operating system and
a distributed operating system?
3. What are the major issues in designing a distributed operating system?
4. Why is scalability an important feature in the design of a distributed
system?
5. What are the main components of DCE?
Unit 2 Message Passing
Structure:
2.1 Introduction
Objectives
2.2 Features of Message Passing
2.3 Issues in IPC by Message Passing
2.4 Synchronization
2.5 Buffering
2.6 Process Addressing
2.7 Failure Handling
2.8 Group Communication
2.9 Terminal Questions
2.1 Introduction
A process is a program in execution. When we say that two computers of a
distributed system are communicating with each other, we mean that two
processes, one running on each computer, are in communication with each
other. A distributed operating system needs to provide interprocess
communication (IPC) mechanisms to facilitate such communication
activities. A message passing system is a subsystem of the distributed
operating system which shields the details of complex network protocols
from the programmer. It enables processes to communicate by exchanging
messages and allows programs to be written by using simple
communication primitives such as send and receive. Interprocess
communication basically requires information sharing among two or more
processes. The two basic methods for information sharing are as follows:
i) Original Sharing or Shared Data approach
Message is placed in a common memory area that is accessible
to all processes.
This is not possible in a distributed system, unless it is a
distributed shared memory (DSM) system.
ii) Copy Sharing or Message Passing approach
Message is physically copied from sender’s address space to the
receiver’s address space.
This is the basic IPC mechanism in distributed systems.
In the shared data approach, the information to be shared is placed in a
common memory area that is accessible to all the processes involved in an
IPC. The shared data paradigm gives the conceptual communication pattern
illustrated in figure 2.1 below:
Figure 2.1: Communications in Shared Data Paradigm
In the method of message passing, the information to be shared is
physically copied from the sender process’s address space to the address
space of all the receiver processes, and this is done by transmitting the data
to be copied in the form of messages (a message is a block of information).
The message passing paradigm gives the conceptual communication
pattern as shown in figure 2.2 below. In this case the communicating
processes interact directly with each other.
Figure 2.2: Communication in Message Passing Paradigm
Since computers in a network do not share memory, processes in a
distributed system normally communicate by exchanging messages among
themselves. Therefore, message passing is the basic IPC mechanism in
distributed systems.
2.2 Features of a Message Passing System
Desirable features of a good message passing system are:
Simplicity
Efficiency
Reliability
Correctness
Flexibility
Security
Portability
Simplicity
The message passing system should be
– easy to use
– easy to develop new applications that communicate with the existing
ones
– able to hide the details of underlying network protocols used
Efficiency
– Should reduce the number of message exchanges (e.g., acknowledgments)
– Avoid the costs of establishing and terminating connections between
the same pair of processes for each and every message
– Piggyback acknowledgments with the normal messages
– Send acknowledgments selectively
Reliability
– Should handle node and link failures
– Normally handled by acknowledgments, timeouts and
retransmissions.
– Should handle duplicate messages that arise due to retransmissions
(generally sequence numbers of the messages are used for this
purpose).
Correctness
– Atomicity: messages sent to a group of processes will be delivered
to all of them or none of them.
– Ordered delivery: Messages are received by all receivers in an
order acceptable to the application.
– Survivability: Guarantees messages will be delivered correctly in
spite of failures.
Flexibility
– IPC protocols should be flexible enough to cater to the various needs
of different applications (e.g., some may not require atomicity, while
others may not require ordered delivery)
– IPC primitives should be flexible to permit any kind of control flow
between cooperating processes, including synchronous and
asynchronous send and receive.
Security
– Message passing system should be capable of providing secure
end-to-end communication.
– Support mechanisms for authentication of the receivers of a
message by a sender.
– Support mechanisms for authentication of the sender by its receivers
– Support encryption of a message before sending it over the network.
Portability: There are two different aspects of portability in a message-
passing system:
1. The message-passing system should itself be portable. It should be
possible to easily construct a new IPC facility on another system by
reusing the basic design of the existing message-passing system.
2. The applications written by using the primitives of the IPC protocols
of the message-passing system should themselves be portable. This
requires that heterogeneity be considered while designing a
message-passing system, which may require the use of an external
data representation format for the communication taking place
between two or more processes running on computers of different
architectures.
2.3 Issues in IPC (Inter-process Communication) by Message
Passing
A message is a meaningful formatted block of information sent by the
sender process to the receiver process. The message block consists of a
fixed-length header followed by a variable-size collection of typed data
objects.
The header block of a message may have the following elements:
Address: A set of characters that uniquely identify both the sender and
receiver.
Sequence Number: A message identifier used to detect duplicate and
lost messages in case of system failures.
Structural Information: It has two parts. The type part specifies whether
the data to be sent to the receiver is included within the message or
whether the message contains only a pointer to the data. The length part
specifies the length of the variable-size message data.
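A minimal sketch of such a header as a data structure (the field names are illustrative, not part of any standard):

    from dataclasses import dataclass

    @dataclass
    class MessageHeader:
        sender: str         # uniquely identifies the sending process
        receiver: str       # uniquely identifies the receiving process
        seq_no: int         # used to detect duplicate and lost messages
        data_inline: bool   # type part: data in the message, or only a pointer
        length: int         # length of the variable-size message data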
In a message oriented IPC protocol, the users are fully aware of the
message formats used in the communication process and the mechanisms
used to send and receive messages.
The following are some important issues to be considered for the design of
an IPC protocol based message passing system:
The Sender’s Identity
The Receiver’s Identity
Number of Receivers
Guaranteed acceptance of sent messages by the receiver
Acknowledgement by the sender
Handling system crashes or link failures
Handling of buffers
Order of delivery of messages
The above issues are addressed by the semantics of the communication
primitives provided by the IPC protocol. A general description of the various
ways in which these issues are addressed by message-oriented IPC
protocols is presented below.
2.4 Synchronization
A major issue in communication is the synchronization imposed on the
communicating processes by the communication primitives. A
communication primitive has one of two types of semantics: blocking
semantics or non-blocking semantics.
Blocking Semantics: A communication primitive is said to have
blocking semantics if its invocation blocks the execution of its invoker
(for example in the case of send, the sender blocks until it receives an
acknowledgement from the receiver.)
Non-blocking Semantics: A communication primitive is said to have
non-blocking semantics if its invocation does not block the execution of
its invoker.
The synchronization imposed on the communicating processes basically
depends on one of the two types of semantics used for the send and receive
primitives.
Blocking and Non-Blocking Primitives
Blocking Send Primitive: In this case, after execution of the send
statement, the sending process is blocked until it receives an
acknowledgement from the receiver that the message has been received.
Non-Blocking Send Primitive: In this case, after execution of the send
statement, the sending process is allowed to proceed with its execution as
soon as the message is copied to the buffer.
Blocking Receive Primitive: In this case, after execution of the receive
statement, the receiving process is blocked until it receives a message.
Non-Blocking Receive Primitive: In this case, the receiving process
proceeds with its execution after the execution of receive statement, which
returns the control almost immediately just after telling the kernel where the
message buffer is.
Handling non-blocking receives: The following are the two ways of doing
this:
– Polling: a test primitive is used by the receiver to check the buffer status
– Interrupt: When a message arrives in the buffer, a software interrupt is
used to notify the receiver. However, user-level interrupts make
programming difficult.
Handling blocking receives: A timeout value may be used with a blocking
receive primitive to prevent a receiving process from getting blocked
indefinitely if the sender has failed.
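A minimal sketch of a blocking receive with a timeout, using an in-process queue to stand in for the kernel's message buffer (names are illustrative):

    import queue

    mailbox = queue.Queue()   # stand-in for the kernel's message buffer

    def blocking_receive(timeout=5.0):
        # Blocks until a message arrives, but gives up after `timeout`
        # seconds so the receiver is not blocked indefinitely if the
        # sender has failed.
        try:
            return mailbox.get(timeout=timeout)
        except queue.Empty:
            return None   # caller may retry or report a failure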
Synchronous vs. Asynchronous Communication
When both send and receive primitives of a communication between two
processes use blocking semantics, the communication is said to be
synchronous. If one or both of the primitives is non-blocking, then the
communication is said to be asynchronous.
Synchronous communication is easy to implement and contributes to the
reliable delivery of messages. However, it limits concurrency and is prone to
communication deadlocks.
2.5 Buffering
The transmission of messages from one process to another can be done by
copying the body of the message from the sender’s address space to the
receiver’s address space. In some cases, the receiving process may not be
ready to receive the message but it wants the operating system to save that
message for later reception. In such cases, the operating system relies on
the receiver's buffer space, in which the transmitted messages can be
stored prior to the receiving process executing specific code to receive the
message.
The synchronous and asynchronous modes of communication correspond
to the two extremes of buffering: a null buffer, or no buffering, and a buffer
with unbounded capacity. Two other commonly used buffering strategies are
single-message and finite-bound, or multiple message buffers. These four
types of buffering strategies are given below:
No buffering: In this case, message remains in the sender’s address
space until the receiver executes the corresponding receive.
Single message buffer: A buffer to hold a single message at the
receiver side is used. It is used for implementing synchronous
communication because in this case an application can have only one
outstanding message at any given time.
Unbounded-capacity buffer: Convenient for supporting asynchronous
communication. However, an unbounded buffer is impossible to support
in practice.
Finite-bound buffer: Used in practice for supporting asynchronous
communication.
Buffer overflow can be handled in one of the following ways:
Unsuccessful communication: send returns an error message to the
sending process, indicating that the message could not be delivered to
the receiver because the buffer is full.
Flow-controlled communication: The sender is blocked until the
receiver accepts some messages. This violates the semantics of
asynchronous send and may also result in communication deadlock.
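A minimal sketch of a finite-bound buffer with both overflow-handling strategies, using a bounded in-process queue in place of operating system buffer space (the bound of 8 is arbitrary):

    import queue

    buf = queue.Queue(maxsize=8)   # finite-bound message buffer

    def send_unsuccessful(msg):
        # Strategy 1: report an error to the sender if the buffer is full.
        try:
            buf.put_nowait(msg)
            return True
        except queue.Full:
            return False   # message could not be delivered

    def send_flow_controlled(msg):
        # Strategy 2: block the sender until the receiver drains the buffer;
        # note that this violates asynchronous-send semantics.
        buf.put(msg)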
A message's data should be meaningful to the receiving process. Ideally,
this implies that the structure of the program objects should be preserved
while they are being transmitted from the address space of the sending
process to the address space of the receiving process. This is not possible
in heterogeneous systems, in which the sending and receiving processes
are on computers of different architectures. Even in homogeneous systems,
it is very difficult to achieve this goal, mainly for two reasons:
1. An absolute pointer value has no meaning on another machine (more on
this in the discussion of RPC); consider, for example, a pointer to a tree
or a linked list. Proper encoding mechanisms must therefore be adopted
to pass such objects.
2. Different program objects, such as integers, long integers, short
integers, and character strings occupy different storage space. So, from
the encoding of these objects, the receiver should be able to identify the
type and size of the objects.
One of the following two representations may be used for the encoding and
decoding of a message data:
1. Tagged representation: The type of each program object as well as its
value is encoded in the message. In this method, it is a simple matter for
the receiving process to check the type of each program object in the
message because of the self-describing nature of the coded data format.
2. Untagged representation: The message contains only program
objects, no information is included in the message about the type of
each program object. In this method, the receiving object should have a
prior knowledge of how to decode the received data because the coded
data format is not self-describing.
The untagged representation is used in Sun's XDR format, while the tagged
representation is used in the Mach distributed operating system.
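A minimal sketch contrasting the two representations (the one-byte tags 'I' and 'S' are invented for illustration):

    import struct

    def encode_tagged(value):
        # Tagged: the type travels with the value, so the coded data
        # is self-describing.
        if isinstance(value, int):
            return b'I' + struct.pack('>q', value)
        if isinstance(value, str):
            raw = value.encode('utf-8')
            return b'S' + struct.pack('>I', len(raw)) + raw
        raise TypeError('unsupported type')

    def encode_untagged(value: int) -> bytes:
        # Untagged: only the value is sent; the receiver must already know
        # that a big-endian 64-bit integer is coming.
        return struct.pack('>q', value)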
2.6 Process Addressing
A message passing system generally supports two types of addressing:
Explicit Addressing: The process with which communication is desired
is explicitly specified as a parameter in the communication primitive. e.g.
send (pid, msg), receive (pid, msg).
Implicit Addressing: A process does not explicitly name a process for
communication; instead, it can specify a service rather than a particular
process, e.g. send any (service id, msg), receive any (pid, msg).
Methods for process addressing:
machine id@local id: UNIX uses this form of addressing (IP address,
port number).
Advantages: No global coordination needed for process addressing.
Disadvantages: Does not allow process migration.
machine id1@local id@machine id2: machine id1 identifies the node
on which the process is created. local id is generated by the node on
which the process is created.
machine id2 identifies the last known location of the process. When a
process migrates to another node, the link information (the machine id to
which the process migrates) is left with the current machine. This
information is used for forwarding messages to migrated processes.
Disadvantages:
– Overhead involved in locating a process may be large.
– If the node on which the process was executing is down, it may not
be possible to locate the process.
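A minimal sketch of the link-based forwarding that this scheme implies: each node a process leaves keeps a link to the node it migrated to, and messages follow the chain (the table contents below are hypothetical):

    # Each entry maps a node to the node the process migrated to.
    forwarding_links = {'nodeA': 'nodeB', 'nodeB': 'nodeC'}

    def locate(last_known_node):
        # Follow the chain of links left behind by the migrating process.
        node, hops = last_known_node, 0
        while node in forwarding_links:
            node = forwarding_links[node]
            hops += 1      # locating overhead grows with each migration
        return node, hops  # locating fails if any node in the chain is down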
2.7 Failure Handling
While a distributed system may offer potential for parallelism, it is also prone
to partial failures such as a node crash or a communication link failure.
During Interprocess communication, such failures may lead to the following
problems:
Loss of request message: This can be due to a link failure or because
the receiver node is down.
Loss of response message: This may be due to a link failure or
because the sender is down when the response reaches it.
Unsuccessful execution of request: This may be due to the receiver
node crashing while processing the request.
The above problems may be overcome using one of the following
methods:
1. A four message reliable IPC:
In this method there are four messages involved: Request and
Acknowledgement from the client machine, Reply and Acknowledgement
from the server machine. In this case, the kernels of both the client and
server continue to retransmit after a timeout until an acknowledgement is
received; i.e., the client machine sends a request message to the server
machine and waits for an acknowledgement from the server. If the
acknowledgement is not received within the specified timeout period, the
client retransmits its request to the server and again waits for an
acknowledgement. This process continues until an acknowledgement is
received. The same process occurs on the server side.
The server sends a reply message to the client and waits for the
acknowledgement until the specified timeout period. On non-receipt of the
acknowledgement within the timeout period, it resends the reply back to the
client machine and the process continues till the client responds with an
acknowledgement.
2. Three message reliable IPC:
The scenario here varies slightly from the four-message method described
above, in that the client machine does not wait for an acknowledgement
from the server machine; the client machine just sends the request to the
specified server. The server machine, however, expects an
acknowledgement from the client machine when it responds to the client's
request message. The server now waits for an acknowledgement from the
client and on non-receipt of the acknowledgement within the specified time
period, it retransmits the reply message to the client and this cycle continues
until the client responds with an acknowledgement.
In this method, the server may use the concept of piggybacking, attaching
its acknowledgement of the client's request to the reply message it sends to
the client.
3. Two message reliable IPC:
In this method, neither the client nor the server requires acknowledgements
from the other. They simply exchange messages in the form of requests
and replies, assuming that their messages have been delivered (an ideal
scenario), which may be impractical in real situations.
Idempotency and handling of duplicate request messages
Idempotency basically means “repeatability”; that is, an idempotent
operation produces the same result, without any side effects, no matter how
many times it is performed with the same arguments. For example, consider
an sqrt procedure for calculating the square root of a given number:
sqrt (64) always returns 8.
On the other hand, operations that do not necessarily produce the same
results when executed repeatedly with the same arguments are said to be
non-idempotent, for example, a debit operation on a bank account.
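A minimal sketch contrasting the two kinds of operations:

    import math

    def sqrt_op(x: int) -> int:
        # Idempotent: repeated calls with the same argument always return
        # the same result and cause no side effects.
        return math.isqrt(x)    # sqrt_op(64) is always 8

    balance = 100

    def debit(amount: int) -> int:
        # Non-idempotent: every execution changes the account state, so
        # replaying a retransmitted debit corrupts the balance.
        global balance
        balance -= amount
        return balance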
An idempotent operation produces the same result without any side
effect no matter how many times it is executed.
Not all operations are idempotent
So, if requests can be retransmitted, care should be taken to
implement the requested operations as idempotent.
Even if the same request is retransmitted several times, the server
should execute the request only once; or, if it does execute it several
times, the net result should be equivalent to the result of exactly one
execution. This is called exactly-once semantics. Primitives based on
exactly-once semantics are desirable but difficult to implement.
Implementation of exactly-once semantics:
– Each request carries a unique sequence number.
– The kernel makes sure a request is handed to the server only once.
– After receiving the reply from the server, the kernel caches a copy of
the reply and retransmits it when it receives the same request from the
client again.
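A minimal server-side sketch of this scheme (names are illustrative): a duplicate, identified by its sequence number, is answered from the reply cache instead of being re-executed:

    reply_cache = {}   # (client_id, seq_no) -> cached reply

    def handle_request(client_id, seq_no, operation, *args):
        key = (client_id, seq_no)
        if key in reply_cache:
            return reply_cache[key]   # duplicate: retransmit the old reply
        result = operation(*args)     # executed exactly once
        reply_cache[key] = result
        return result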
2.8 Group Communication
The most elementary form of message-based interaction is one-to-one
communication, in which a single sender process sends a message to a
single receiver process. However, for performance and ease of
programming, several highly parallel distributed applications require that a
message passing system also provide a group communication facility.
Depending on single or multiple senders and receivers, the following three
types of group communication are possible:
1. One-to-many (Single sender and multiple receivers)
2. Many-to-one (multiple senders and single receiver)
3. Many-to-many (multiple senders and multiple receivers)
The following issues must be addressed in one-to-many (multicast)
group communication:
i) Group Management:
In case of one-to-many communication, receiver processes of a message
form a group. Such groups are of two types – closed and open. A closed
group is one in which only the members of the group can send a message
to the group. An outside process cannot send a message to the group as a
whole, although it may send a message to an individual member of the
group. On the other hand an open group is one in which any process in the
system can send a message to the group as a whole.
Whether to use a closed group or an open group is application dependent. A
message passing system with a group communication facility provides the
flexibility to create and delete groups dynamically and to allow a process to
join or leave a group at any time.
ii) Group Addressing:
A two-level naming scheme is normally used for group addressing. The
high-level group name is an ASCII string that is independent of the location
information of the processes in the group. The low-level group name
depends to a large extent on the underlying hardware. For example, on
some networks it is possible to create a special network address to which
multiple machines can listen.
A special network address, called a multicast address, can be created;
a packet sent to a multicast address is delivered to all processes that
have subscribed to that group.
For example, on the Internet, class D IP addresses are used for
multicast. The format of a class D IP address for IP multicasting is:
---------------------------------------
| 1 | 1 | 1 | 0 |  group identification  |
---------------------------------------
The first four bits contain 1110 and identify the address as multicast;
the remaining 28 bits specify a particular multicast group.
Broadcast address: A certain address is declared as a broadcast
address and packets sent to that address are delivered to all in the
network.
If there is no facility to create multicast or broadcast addresses, the
underlying unicast facility is used; the disadvantage is that a separate
copy of each packet must be sent to each member.
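As a concrete illustration of multicast group addressing, the following sketch subscribes to a class D group address and receives packets sent to it (the group address and port are hypothetical):

    import socket
    import struct

    GROUP, PORT = '224.1.1.1', 5007   # hypothetical class D address

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', PORT))

    # Ask the network layer to deliver packets addressed to the group.
    membership = struct.pack('4sl', socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

    data, sender = sock.recvfrom(1024)   # one send reaches every subscriber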
iii) Message Delivery Approach: The following are the two possible
approaches for message delivery.
Centralized approach: A centralized group server maintains
information about the groups and their members.
Decentralized approach: No central server keeps the information.
Buffered or Unbuffered: A multicast packet can be buffered until the
receiver is ready to receive. If unbuffered, packets could be lost. Multicast
send is inherently asynchronous, for two reasons:
It is unrealistic to expect the sending process to wait until all the
receiving processes that belong to the multicast group are ready to receive.
The sending process may not be aware of all the receiving processes.
Flexible Reliability in Multicast Communication: Different levels of reliability
are possible:
0-reliable: No response is expected from any receiver.
1-reliable: The sender expects a response from one receiver (the
multicast server may take this responsibility).
m-out-of-n-reliable: The sender expects response from m out of
n receivers.
All-reliable: The sender expects response from all receivers.
Atomic Multicast: A multicast message is received by all the members of
the group or none.
Different Implementation methods:
• The kernel of the sender is responsible for retransmitting until every
member receives the message. This method works only if the sender's
machine does not fail and none of the receiver processes fail.
• Each receiver of the multicast message performs an atomic multicast of
the same message to the same group. This method ensures all
surviving processes will receive the message even if some receivers fail
after receiving the message or the sender machine fails after sending
the message.
iv) Many-to-One Communication: In this type of communication, multiple
senders send messages to a single receiver. For example,
• A buffer process may receive messages from several consumers
and producers.
• Multicast recipients may be sending acknowledgements to the
sender.
• A database server may be receiving requests from several clients.
v) Many-to-Many Communication: In this type of communication, multiple
senders send messages to multiple receivers. An important issue here is
that of ordered delivery of messages. Ordered delivery ensures that all
messages are delivered to all receivers in an order acceptable to the
application.
The following are the various message ordering semantics followed in case
of a Many-to-Many communication:
i) Absolute Ordering: In this type, all messages are delivered to all
processes in the exact order in which they were sent.
• Not possible to implement in the absence of global clock.
• Moreover, absolute ordering is not required by many applications.
ii) Consistent Ordering: In this type, all messages are received by all
processes in the same order.
iii) Causal Ordering: For some applications, consistent-ordering
semantics is not necessary and even weaker semantics is acceptable.
An application can have better performance if the message-passing
system used supports a weaker ordering semantics that is acceptable
to the application. One such weak ordering semantics that is
acceptable to many applications is the causal ordering semantics.
This semantics ensures that if the event of sending one message is causally
related to the event of sending another message, the two messages are
delivered to all receivers in the correct order. Two message sending events
are said to be causally related if they are correlated by the happened-before
relation. i.e. two message sending events are causally related if there is any
possibility of the second one being influenced in any way by the first one.
The basic idea behind causal ordering semantics is that when it matters,
messages are always delivered in proper order, but when it does not matter,
they may be delivered in any arbitrary order.
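One standard way to realize causal ordering is with vector timestamps, where every message carries the sender's vector clock. A minimal sketch of the delivery test, in the style of the Birman-Schiper-Stephenson protocol (one possible realization, not the only one):

    def deliverable(msg_vc, sender, local_vc):
        # Deliver a message from `sender` only if (a) it is the next message
        # expected from that sender, and (b) every message the sender had
        # already seen has also been delivered here.
        if msg_vc[sender] != local_vc[sender] + 1:
            return False
        return all(msg_vc[k] <= local_vc[k]
                   for k in range(len(msg_vc)) if k != sender)

Messages that fail the test are held back in a buffer and re-checked after each delivery, so causally related messages are never delivered out of order.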
2.9 Terminal Questions
1. What is a message passing system? Discuss the desirable features of a
message passing system.
2. Discuss the synchronization issues in a message passing system.
3. Discuss the issues of buffering and process addressing.
4. Discuss group communication mechanisms in message passing.
Unit 3 Remote Procedure Calls
Structure:
3.1 Introduction
Objectives
3.2 The RPC Model
3.3 Transparency of RPC
3.4 Implementation of RPC Mechanism
3.5 STUB Generation
3.6 RPC Messages
3.7 Marshaling Arguments and Results
3.8 Server Management
3.9 Parameter Passing, Call Semantics
3.10 Communication Protocol for RPCs
3.11 Complicated RPC
3.12 Client-Server Binding
3.13 Security
3.14 Terminal Questions
3.1 Introduction
Many distributed systems have been based on explicit message exchange
between processes. However, the procedures send and receive do not
conceal communication, which is important to achieve access transparency
in distributed systems. This problem has long been known, but little was
done about it until a paper by Birrell and Nelson (1984) introduced a
completely different way of handling communication. Although the idea is
refreshingly simple (once someone has thought of it), the implications are
often subtle. In this section we will examine the concept, its implementation,
its strengths, and its weaknesses.
In a nutshell, what Birrell and Nelson suggested was allowing programs to
call procedures located on other machines. When a process on machine A
calls a procedure on machine B, the calling process on A is suspended, and
execution of the called procedure takes place on B. Information can be
transported from the caller to the callee in the parameters and can come
back in the procedure result. No message passing at all is visible to the
programmer. This method is known as Remote Procedure Call, or often
just RPC.
While the basic idea sounds simple and elegant, subtle problems exist. To
start with, because the calling and called procedures run on different
machines, they execute in different address spaces, which causes
complications. Parameters and results also have to be passed, which can
be complicated, especially if the machines are not identical. Finally, both
machines can crash and each of the possible failures causes different
problems. Still, most of these can be dealt with, and RPC is a widely-used
technique that underlies many distributed systems.
Objectives:
This unit deals with remote procedure call mechanisms in a distributed
system, wherein the caller and callee are separated. It starts by introducing
the RPC mechanism and discusses the RPC model in a distributed
environment. It then discusses various implementation issues concerned
with RPC, such as stub generation, server management, parameter passing
mechanisms, communication protocols, and client-server binding.
3.2 The RPC Model
The RPC mechanism is an extension of a normal procedure call
mechanism. It enables a call to be made to a procedure that does not reside
in the address space of the calling process. The called procedure may be on
a remote machine or on the same machine. The caller and callee have
separate address spaces, so the called procedure has no access to the
caller's environment.
The RPC model is used for transfer of control and data within a program in
the following manner:
1. For making a procedure call, the caller places arguments to the
procedure in some well-specified location.
2. Control is then transferred to the sequence of instructions that
constitutes the body of the procedure.
3. The procedure body is executed in a newly created execution
environment that includes copies of the arguments given in the calling
instruction.
4. After the procedure’s execution is over, control returns to the calling
point, possibly returning a result.
When a remote procedure call is made, the caller and the callee processes
interact in the following manner:
The caller (also known as the client process) sends a call (request)
message to the callee (also known as the server process) and waits (blocks)
for a reply message. The server executes the procedure and returns the
result of the procedure execution to the client. After extracting the result of
the procedure execution, the client resumes execution. In the above model,
RPC calls are synchronous; however, an implementation may choose to
make RPC calls asynchronous to allow parallelism. Also, the server can
create a separate thread to process each request so that it remains free to
receive other requests.
Figure 3.1: A Model of Remote Procedure Call
3.3 Transparency of RPC
A major issue in the design of an RPC facility is its transparency property. A
transparent RPC mechanism is one in which local procedures and remote
procedures are indistinguishable to programmers. This requires the
following two types of transparencies:
Syntactic Transparency: A remote procedure call should have the same
syntax as a local procedure call, which is not very difficult to achieve.
Semantic Transparency: Semantics of remote procedure calls are identical
to those of local procedure calls.
Achieving semantic transparency is not easy because:
Unlike local procedure calls, the called procedure is executed in an
address space that is disjoint from the calling program’s address space.
– called procedure has no access to the local environment.
– Passing addresses (pointers) as arguments is meaningless.
– So, passing pointers as parameters is not attractive. An alternative
may be to send a copy of the value pointed to.
– Call by reference can be replaced by copy-in/copy-out, but at the cost
of slightly different semantics.
Remote procedure calls are more vulnerable to failure than local
procedure calls
– Programs that make use of RPC must have the capability to handle
this type of error.
– This makes it more difficult to make RPCs transparent.
RPCs consume much more time (100 to 1000 times more) than local
procedure calls due to the involvement of the communication network.
So, achieving semantic transparency is not easy.
3.4 Implementation of RPC Mechanism
To achieve the goal of semantic transparency, the implementation of RPC is
based on the concept of stubs. Stubs provide a perfectly normal local
procedure call abstraction and conceal from programs the interface to the
underlying RPC system. A separate stub procedure is associated with each
of the client side and the server side. To hide the functional details of the
underlying network, an RPC communication package (called the RPC
runtime) is used on both the client and server sides.
Thus implementation of an RPC mechanism involves the following five
elements:
1. The Client
2. The Client stub
3. The RPC Runtime
4. The server stub, and
5. The server
Figure 3.2: Implementation of RPC Mechanism
The job of each of these elements is described below:
1. Client: To invoke a remote procedure, a client makes a perfectly
normal local call that invokes the corresponding procedure in the
client stub.
2. Client Stub: The client stub is responsible for performing the following
tasks:
On receipt of a call request from the client, it packs the specification
of the target procedure and the arguments into a message and asks
the local runtime system to send it to the server stub.
On receipt of the result of procedure execution, it unpacks the result
and passes it to the client.
3. RPC Runtime:
The RPC runtime handles the transmission of the messages across the
network between client and server machines. It is responsible for
retransmissions, acknowledgements, and encryption.
On the client side, it receives the call request from the client stub and
sends it to the server machine. It also receives reply message (result of
procedure execution) from the server machine and passes it to the client
stub.
On the server side, it receives the results of the procedure execution
from the server stub and sends it to the client machine. It also receives
the request message from the client machine and passes it to the server
stub.
4. Server Stub: The functions of server stub are similar to that of the client
stub. It performs the following two tasks:
The server stub unpacks the call request message received from the
local RPC runtime and makes a perfectly normal local call to invoke
the appropriate procedure in the server.
The server stub packs the result of the procedure execution received
from the server and asks the local RPC runtime to send it to the
client stub.
5. Server: On receiving the call request from the server stub, the server
executes the appropriate procedure and returns the result to the server
stub.
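A minimal sketch of the client-side half of this machinery, collapsing the client stub and the RPC runtime into one function; the transport and serialization choices (TCP and pickle) are illustrative stand-ins, not how any particular RPC system works:

    import pickle
    import socket

    def remote_call(host, port, procedure, *args):
        # Client stub: pack the target procedure and its arguments.
        request = pickle.dumps((procedure, args))
        with socket.create_connection((host, port)) as s:
            s.sendall(request)             # RPC runtime: transmit the call
            s.shutdown(socket.SHUT_WR)     # signal the end of the request
            reply = b''
            while chunk := s.recv(4096):   # RPC runtime: receive the reply
                reply += chunk
        return pickle.loads(reply)         # client stub: unpack the result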
3.5 STUB Generation
The stubs can be generated in the following two ways:
Manual Stub Generation: The RPC implementer provides a set of
translation functions from which a user can construct his or her own stubs.
This approach is simple to implement and can handle complex parameter
types.
Automatic Stub Generation: This is the most commonly used technique
for stub generation. It uses an Interface Definition Language (IDL) for
defining the interface between the client and server. An interface definition is
mainly a list of procedure names supported by the interface, together with
the types of their arguments and results, which helps the client and server
perform compile-time type checking and generate appropriate calling
sequences. An interface definition also contains information to indicate
whether each argument is an input, an output, or both; this helps avoid
unnecessary copying, since an input argument need only be copied from
client to server and an output argument need only be copied from server to
client. It also contains information about type definitions, enumerated types,
and defined constants, so the clients do not have to store this information.
A server program that implements procedures in an interface is said to
export the interface. A client program that calls the procedures is said to
import the interface. When writing a distributed application, a programmer
first writes the interface definition using IDL, then can write a server program
that exports the interface and a client program that imports the interface.
The interface definition is processed using an IDL compiler (the IDL
compiler in Sun RPC is called rpcgen) to generate components that can be
combined with both client and server programs, without making changes to
the existing compilers. In particular, an IDL compiler generates a client stub
procedure and a server stub procedure for each procedure in the interface.
It generates the appropriate marshaling and un-marshaling operations in
each stub procedure. It also generates a header file that supports the data
types in the interface definition to be included in the source files of both
client and server. The client stubs are compiled and linked with the client
program and the server stubs are compiled and linked with server program.
3.6 RPC Messages
Any remote procedure call involves a client process and a server process
that are possibly located on different computers. The mode of interaction
between the client and server is that the client asks the server to execute a
remote procedure and the server returns the result of execution of the
concerned procedure to the client. Based on this mode of interaction, the
two types of messages involved in the implementation of an RPC system
are as follows:
i) Call messages, sent by the client to the server to request execution of a
particular remote procedure.
Components of a call message:
Since a call message is used to request execution of a particular remote
procedure, the basic components in a call message are as follows:
identification information of the remote procedure to be executed – such
as program number, version number, and procedure number
arguments necessary for the execution of the procedure
a message identification field that consists of a sequence number
a message type to distinguish call and reply messages
a client identification field
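Such a call message could be laid out as in the following C sketch; the field names and fixed widths are illustrative, since a real system such as Sun RPC defines the exact layout through its XDR encoding.

#include <stdint.h>

/* Illustrative layout of an RPC call message header; the marshaled
   arguments follow this header in the message buffer. */
typedef struct {
    uint32_t message_id;     /* sequence number that matches replies to calls */
    uint32_t message_type;   /* e.g., 0 = call, 1 = reply */
    uint32_t client_id;      /* identifies the calling client */
    uint32_t program_num;    /* identification of the remote procedure: */
    uint32_t version_num;    /*   program number, version number,       */
    uint32_t procedure_num;  /*   and procedure number                  */
} rpc_call_header;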
ii) Reply messages sent by the server to the client for returning the result.
When the server of an RPC receives a call message from a client, it could
be faced with one of the following conditions:
The message is not intelligible to it, possibly because the call message
violates the RPC protocol; the server simply discards such calls.
The server finds that the client is not authorized to use the service, that
the requested service is not available, or that an exception condition
such as division by zero occurred during execution; it then returns an
appropriate unsuccessful reply.
The specified remote procedure is executed successfully; the server then
sends a reply message containing the result.
3.7 Marshaling Arguments and Results
Implementation of Remote Procedure calls involves the transfer of
arguments from the client process to the server process and the transfer of
results from the server process to the client process. These arguments and
results are basically language-level data structures (program objects),
which are transferred in the form of message data between the two
computers involved in the call. The transfer of message data between two
computers requires encoding and decoding of the message data. In the
case of RPCs, this operation is known as marshaling and involves the
following actions:
1. Taking the arguments (of a client process) or the result (of a server
process) that will form the message data to be sent to the remote
process.
2. Encoding the message data of step 1 on the sender’s computer. This
encoding process involves the conversion of program objects into a
stream form that is suitable for transmission and placing them into a
message buffer.
3. Decoding of the message data on the receiver’s computer. This
decoding process involves the reconstruction of program objects from
the message data that was received in the stream form.
In order that encoding and decoding of an RPC message can be performed
successfully, the order and the representation method used to marshal
arguments and results must be known to both the client and the server of
the RPC. This provides a degree of type safety between a client and a
server because the server will not accept a call from a client until the client
uses the same interface definition as the server.
The marshaling process must reflect the structure of all types of program
objects used in the concerned language.
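As a small illustration of steps 1 and 2, the following C sketch marshals an integer and a string into a flat byte buffer in network byte order. Real RPC systems use a standardized representation such as XDR, but the principle is the same.

#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <string.h>

/* Encode one 32-bit integer and one length-prefixed string into buf.
   Returns the number of bytes written; the caller must make sure
   that buf is large enough. */
size_t marshal_int_and_string(uint8_t *buf, int32_t value, const char *s)
{
    size_t off = 0;

    uint32_t v = htonl((uint32_t)value);            /* network byte order */
    memcpy(buf + off, &v, sizeof v);  off += sizeof v;

    uint32_t len  = (uint32_t)strlen(s);
    uint32_t nlen = htonl(len);                     /* length prefix */
    memcpy(buf + off, &nlen, sizeof nlen);  off += sizeof nlen;

    memcpy(buf + off, s, len);  off += len;         /* the string bytes */
    return off;                                     /* size of the stream form */
}

The receiver performs the inverse operation (step 3): it reads the fields back in the same order, converting each integer with ntohl(), to reconstruct the original program objects.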
3.8 Server Management
In RPC based applications, two important issues that need to be considered
for server management are server implementation and server creation.
i) Server Implementation: Based on the style of implementation used,
servers may be of two types: stateful and stateless.
Stateful servers: A stateful server maintains client’s state information from
one remote procedure call to the next. For example, let us consider a server
that supports the following operations for files:
Open (filename, mode): used to open filename in specified mode. When
the server executes this operation, it creates an entry for this file in a file-
table that is used for maintaining state information.
Read (fid, n, buffer): This operation returns n bytes of file data starting from
the byte currently addressed by the read-write pointer and then increments
the pointer by n.
Write (fid, n, buffer): The server takes n bytes of data from the buffer,
writes them to the file fid starting at the byte addressed by the read-write
pointer, and then increments the pointer by n.
Seek (fid, position): causes the server to set the read-write pointer of the
file fid to position.
Close (fid): causes the server to delete the file state information from the
file-table.
The file server mentioned is stateful because it maintains the current state
information of a file that has been opened for use by a client.
Stateless Server: A stateless server does not maintain any client state
information, so every request must be accompanied by all the necessary
parameters. Some operations that a stateless file server can support are:
Read (filename, position, n, buffer): Read n bytes of data from the file
filename starting at position into buffer.
Write (filename, position, n, buffer): Write n bytes of data from buffer to
the file filename starting at position.
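A stateless Read can be sketched in C as follows: because the file name and position arrive with every request, the server opens the file, reads at the explicit offset, and closes it again, keeping no per-client state. The POSIX file calls are used here purely for illustration.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Stateless read: everything the server needs is in the request. */
ssize_t stateless_read(const char *filename, off_t position,
                       size_t n, char *buffer)
{
    int fd = open(filename, O_RDONLY);             /* open per request */
    if (fd < 0)
        return -1;
    ssize_t got = pread(fd, buffer, n, position);  /* read at explicit offset */
    close(fd);                                     /* nothing survives the call */
    return got;
}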
Merits of a stateful server: A stateful server provides an easier
programming paradigm and is typically more efficient than a stateless server.
Demerits of a stateful server: If the server crashes and restarts, the state
information it was holding may be lost, and subsequent client requests may
produce inconsistent results. Similarly, if a client process crashes and
restarts, the server will be left holding stale, inconsistent information about
that client.
Handling failures under a stateless server: When the server crashes and
restarts, no inconsistencies result, because there was no state to lose.
Likewise, when a client crashes and restarts, it does not lead to any
inconsistencies. Which approach to take when designing servers depends
on the application.
Server Creation Semantics
Based on the time duration for which the RPC servers survive, they can be
classified as follows:
Instance-per-call servers: They exist only for the duration of a single call.
Such a server is created by the RPCRuntime when the call arrives and is
deleted when the call has been executed. This is not a commonly used
semantics because:
– These servers are stateless; any state that has to be preserved
across calls must be handled by the OS.
– The overhead involved in the creation and destruction of servers is
expensive, especially if it is repeated for the same type of service.
Instance-per-session servers: Servers belonging to this category exist
for the entire session for which the client and server interact. These
servers can maintain state information across calls, and the overhead of
creating a server for each call is avoided. Under this approach:
– There is a server manager for each type of service.
– All the server managers register with the binding agent (described
later).
– The client first contacts the binding agent with the type of service
needed.
– The binding agent returns to the client the address of the server
manager that provides that type of service.
– The client contacts that server manager, asking it to create a server
for it.
– The server manager spawns a server and returns the address of the
server to the client.
– The client then interacts with this server for the entire session.
– The server is destroyed when the client informs the corresponding
server manager that the server is no longer needed.
Persistent Servers: This type of server generally remains in existence
indefinitely. It is shared by many clients. Servers of this type are created and
installed before the clients use them. Each server independently exports its
service by registering itself with the binding agent. When a client contacts
the binding agent for a particular service, the binding agent selects a server
of that type and returns its address to the client. The client then interacts
with the server.
An advantage of this approach is that it can improve performance, since a
server may interleave the requests of several clients. Care should be taken
in designing the service procedures so that interleaved concurrent requests
from different clients do not interfere with each other.
3.9 Parameter Passing, Call Semantics
The choice of parameter passing semantics is crucial to the design of an
RPC mechanism. The two choices are call by value and call by reference.
i) Call-by-Value: All parameters are copied into a message that is
transmitted. This poses no problem for simple data types such as
integers and small arrays. Passing large data types such as multi-
dimensional arrays, trees, and so on, however, can consume much time
transmitting data that may never be used.
ii) Call-by-Reference: In general, this is possible only in a distributed
shared memory system. It is also possible in object-based systems,
because in that case the client passes the names of objects, which act as
references; in object-based systems it is called call-by-object-reference.
A remote invocation operation may itself cause another remote invocation,
and so on. To avoid many remote references, another parameter-passing
mode, called call-by-move, was proposed; in this approach, the object to
which a reference is made is first moved to the site of the callee, where
the call is then executed.
As we saw earlier, the following types of failures can occur
The call message gets lost
The response message gets lost
The callee node crashes and is restarted
The caller node crashes and is restarted
Mechanisms for handling such failures are described below:
The RPCRuntime should be designed to provide flexibility to the
application programmers to select from the different call semantics
supported by an RPC system.
Possibly or May-be Call Semantics: This is the weakest semantics and is
not really appropriate for RPC. The caller waits for a predetermined amount
of time and then continues with its execution, whether or not a reply has
arrived. It is suitable only in an environment with a high probability of
successful transmission of messages.
Last-One Call Semantics: The calling of the remote procedure by the
caller, execution of the procedure by the callee, and return of the result to
the caller are repeated until the result of the procedure execution is received
by the caller; that is, the results of the last executed call are used by the
caller. This is easy to achieve if only two processors are involved, but it
becomes harder with nested calls. For example, suppose a process P1 on
node N1 calls F1 on N2, and F1 in turn calls F2 on N3; if N1 fails and
restarts, P1's call to F1 will be repeated, which in turn calls F2 again; N3 is
unaware of N1's failure, so N3 may send the results of the two executions in
any order, violating last-one semantics.
The above problem occurs due to orphan calls. An orphan call is one whose
parent (caller) is dead due to a node crash. To achieve last-one semantics,
these orphan calls must be terminated before restarting.
Last-of-Many Call Semantics: Similar to last-one semantics, except
orphan calls are neglected.
– Call identifiers are used to uniquely identify each call.
– When a call is repeated, it is assigned a new identifier
– Each response message has the corresponding call identifier
– A caller accepts a response only if the call identifier matches with that of
the most recently repeated call.
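The caller's side of this identifier check can be sketched as follows; the function names are illustrative, not part of any particular RPC package.

#include <stdint.h>

static uint64_t current_call_id = 0;

/* Each call, and each repetition of a call, gets a fresh identifier. */
uint64_t start_call(void)
{
    return ++current_call_id;
}

/* A response is accepted only if it carries the identifier of the most
   recently (re)issued call; replies to orphan calls are simply dropped. */
int accept_response(uint64_t response_call_id)
{
    return response_call_id == current_call_id;
}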
At-Least-Once Call Semantics: This is weaker than last-of-many. It just
guarantees that the call is executed one or more times, but does not specify
which execution's result will be returned to the caller.
Exactly-Once Call Semantics: The strongest and most desirable
semantics. This eliminates the possibility of a procedure being executed
more than once no matter how many times a call is retransmitted.
3.10 Communication Protocol for RPCs
Different systems developed on the basis of remote procedure calls have
different IPC requirements. Based on the needs of different systems,
several communication protocols have been proposed for RPCs. A brief
description of these protocols is given below:
i) The Request Protocol: Also known as the R protocol. It is useful for
RPCs in which the called procedure has nothing to return and the client
does not require confirmation for the procedure having been executed.
An RPC protocol that uses R protocol is also called asynchronous
RPC. For asynchronous RPC, the RPCRuntime does not take
responsibility for retrying a request in case of communication failure.
So, if an unreliable transport protocol such as UDP is used, then
request messages could be lost. Asynchronous RPCs with unreliable
transport protocols are generally useful for implementing periodic
updates. For example, a time server node in a distributed system, may
send synchronization messages every T seconds.
ii) Request/Reply Protocol (RR protocol): Its basic idea is to eliminate
explicit acknowledgement messages.
A server’s reply message is regarded as an acknowledgment of the
client’s request. A subsequent call message is regarded as an
acknowledgement of the server's reply. The RR protocol by itself does not
possess failure-handling capabilities; a timeout-and-retry mechanism is
normally used along with the RR protocol to take care of lost messages. If
duplicate messages are not filtered, the RR protocol provides at-least-once
semantics. Servers can support exactly-once semantics by keeping
records of replies in a reply cache. But how long does a reply need to be
kept?
iii) The Request/Reply/Acknowledge-Reply Protocol (RRA): It is useful
for the design of systems involving simple RPCs. The server needs to
keep a copy of the reply only until it receives the acknowledgement for
reply from client. Exactly-once semantics can be implemented easily
using this protocol. In this protocol a server’s reply message is
regarded as an acknowledgement of the client’s request message. A
subsequent call packet from a client is regarded as an
acknowledgement of the server’s reply of the previous call made by the
client.
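The following C sketch shows one way a server might combine a reply cache with RRA-style acknowledgements to get exactly-once behavior; the fixed-size direct-mapped cache and the helper execute_procedure() are assumptions made for this example.

#include <stdint.h>

#define CACHE_SIZE 128

struct cached_reply {
    uint64_t call_id;        /* identifies the client call */
    int      valid;
    char     reply[256];     /* marshaled reply data */
};

static struct cached_reply cache[CACHE_SIZE];

extern void execute_procedure(uint64_t call_id, char *reply_out);

/* Return the reply for call_id, executing the procedure only once
   per call identifier while its cache slot survives. */
const char *handle_request(uint64_t call_id)
{
    struct cached_reply *slot = &cache[call_id % CACHE_SIZE];
    if (slot->valid && slot->call_id == call_id)
        return slot->reply;               /* duplicate: resend cached reply */
    execute_procedure(call_id, slot->reply);
    slot->call_id = call_id;
    slot->valid   = 1;
    return slot->reply;
}

/* Called when the client's acknowledgement of the reply arrives (RRA):
   the cached copy is no longer needed and the slot can be reused. */
void handle_ack(uint64_t call_id)
{
    struct cached_reply *slot = &cache[call_id % CACHE_SIZE];
    if (slot->call_id == call_id)
        slot->valid = 0;
}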
3.11 Complicated RPC
Birrell and Nelson categorized the following two types of complicated RPCs
and the methods to handle them.
i) RPCs involving long-duration calls or large gaps between calls. How to
handle such calls?
Periodic probing of the server by the client: After a client sends a request, it
periodically sends a probe packet, which the server acknowledges. This helps
the client to detect a server crash or communication link failure. The
acknowledgement of a probe message can also indicate that the request was
lost, in which case the client retransmits it.
Periodic generation of acknowledgements by the server: The server itself
periodically generates acknowledgements and sends them to the client
while the reply is pending; the longer it takes to produce the reply, the
greater the number of acknowledgements generated.
ii) RPCs involving arguments and/or results that are too large to fit in a
single datagram packet.
How to handle such calls?
– Use several physical RPCs for one logical RPC
– Use multi-datagram messages. i.e., RPC argument is fragmented and
transmitted in multiple packets.
– For example, Sun RPC is limited to 8 KB, so RPCs involving data larger
than the allowed limit must be handled by breaking them into several
physical RPCs.
3.12 Client-Server Binding
It is necessary for a client (more precisely, a client stub) to know the
location of the server before a remote procedure call can take place. The
process by which a client becomes associated with a server so that calls
can take place is known as binding.
Client-server binding involves handling several issues:
– How does a client specify a server to which it wants to get bound?
– How does the binding process locate the specified server?
– When is it proper to bind a client to server?
– Is it possible for a client to change a binding during execution?
– Can a client be simultaneously bound to multiple servers that provide
the same service?
Server Naming: Birrell and Nelson’s proposal
The specification by a client of a server with which it wants to communicate
is primarily a naming issue. An interface name has two parts - a type and an
instance. The type specifies the interface itself, and the instance specifies
a server providing the services within that interface. For example, there
may be an interface type file server, and there may be many instances of
servers providing file service. The type part also generally has a version-
number field to distinguish between old and new versions of the interface
(which may provide different sets of services). Interface names are created
by users. The RPC package only dictates the means by which an importer
uses the interface name to locate an exporter.
Server Locating:
The interface name of a server is its unique identifier. When the client
specifies the interface name of a server for making a remote procedure call,
the server must be located before the client’s request message can be sent
to it. This is primarily a locating issue and any locating mechanism can be
used for this purpose. The most common methods used for locating are
described below:
i) Broadcasting: A broadcast message is sent to locate the server, and
the first server responding to this message is used by the client. This
works for small networks but does not scale well.
ii) Binding Agent: A binding agent is basically a name server used to
bind a client to a server by providing information about the desired
server. The binding agent maintains a binding table which is a mapping
of the server’s interface name to its locations. All servers register
themselves with the binding agent as a part of their initialization
process.
To register, the server gives the binder its identification information and a
handle used to locate it (for example, its IP address). The server can
deregister when it is no longer prepared to offer the service. The binding
agent's location is known to all nodes. The binding agent interface has
three primitives: register and deregister (used by servers) and lookup
(used by clients). The time at which a client can
be bound to a server is called the Binding Time. If the client and server
modules are programmed as if they were linked together, it is known as
Binding at Compile Time. For example, a server's network address can be
compiled into the client's code. This scheme is very inflexible, because if
the server moves, the server is replicated, or the interface changes, all
client programs need to be recompiled. However, it is useful in an
application whose configuration is expected to last for a fairly long time.
iii) Binding at Link Time: A server exports its service by registering with
the binding agent as part of its initialization process.
A client then makes an import request to the binding agent before
making a call.
The binding agent binds the client and server by returning the
server's handle to the client.
The server's handle is cached by the client to avoid contacting the
binding agent again for subsequent calls.
iv) Binding at Call Time: A client is bound to a server at the time when it
calls the server for the first time during execution.
v) Indirect Call Method: When a client calls the server for the first time, it
passes the server's interface name and the arguments of the RPC call
to the binding agent. The binding agent looks up the location of the
target server and forwards the RPC message to it. When the target
server returns the results to the binding agent, the binding agent
returns the results along with the handle of the target server to the
client. The client can subsequently call the target server directly.
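A binding agent's table and its register and lookup primitives can be sketched in C as follows; the in-memory array, sizes, and names are illustrative (deregister would simply remove an entry and is omitted for brevity).

#include <stdio.h>
#include <string.h>

#define MAX_BINDINGS 64

struct binding {
    char interface_name[32];   /* type and instance, e.g. "file_server:1" */
    char address[64];          /* server handle, e.g. an IP address */
};

static struct binding table[MAX_BINDINGS];
static int count = 0;

/* register primitive: called by a server during its initialization */
void binder_register(const char *iface, const char *addr)
{
    if (count < MAX_BINDINGS) {
        snprintf(table[count].interface_name,
                 sizeof table[count].interface_name, "%s", iface);
        snprintf(table[count].address,
                 sizeof table[count].address, "%s", addr);
        count++;
    }
}

/* lookup primitive: called by a client before its first call */
const char *binder_lookup(const char *iface)
{
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].interface_name, iface) == 0)
            return table[i].address;
    return NULL;   /* no server currently exports this interface */
}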
3.13 Security
Some implementations of RPC include facilities for client and server
authentication, as well as for providing encryption-based security for calls.
The encryption techniques provide protection from eavesdropping and
detect attempts at modification, replay, or creation of calls.
In other implementations of RPC that do not include security facilities, the
arguments and results of RPC are readable by anyone monitoring
communication between the caller and the callee. In this case if security is
desired, the user must implement his or her own authentication and data
encryption mechanisms.
The following security issues need to be addressed when the user designs a
security system for communication:
Is the authentication of the server by the client required?
Is the authentication of client by server required?
Is it alright if the arguments and results are accessible to users other
than the caller and the callee?
3.14 Terminal Questions
1. How does an RPC facility make the job of a distributed application
programmer easier? Mention the similarities and differences between
RPC model and ordinary procedure call.
2. What is a stub? How are stubs generated? Explain how the use of
stubs helps in making an RPC mechanism transparent.
3. Describe the following:
Parameter Passing Semantics
Communication protocols for RPCs
Client Server Binding
Unit 4 Distributed Shared Memory
Structure:
4.1 Introduction
Objectives
4.2 Distributed Shared Memory Systems (DSM)
4.3 DSM – Design and Implementation Issues
4.4 Granularity – Block Size
4.5 Structure of Shared Memory Space in a DSM System
4.6 Memory Coherence (Consistency) Models
4.7 Memory Consistency models
4.8 Implementing Sequential Consistency
4.9 Centralized – Server Algorithm
4.10 Fixed Distributed – Server Algorithm
4.11 Dynamic Distributed Server Algorithm
4.12 Implementing under RNMBs Strategy
4.13 Thrashing
4.14 Terminal Questions
4.1 Introduction
Practice shows that programming multi-computers is much harder than
programming multiprocessors. The difference is caused by the fact that
expressing communication in terms of processes accessing shared data
and using simple synchronization primitives like semaphores and monitors
is much easier than having only message-passing facilities available. Issues
like buffering, blocking, and reliable communication only make things worse.
For this reason, there has been considerable research in emulating shared
memory on multi-computers. The goal is to provide a virtual shared memory
machine, running on a multicomputer, for which applications can be written
using the shared memory model even though physical shared memory is
not present. The multicomputer operating system plays a crucial role here.
One approach is to use the virtual memory capabilities of each individual
node to support a large virtual address space. This leads to what is called a
page-based distributed shared memory (DSM). The principle of page-
based distributed shared memory is as follows. In a DSM system, the
address space is divided up into pages (typically 4 KB or 8 KB), with the
pages being spread over all the processors in the system. When a
processor references an address that is not present locally, a trap occurs,
and the operating system fetches the page containing the address and
restarts the faulting instruction, which now completes successfully. This
concept is illustrated in Fig. 4.1(a) for an address space with 16 pages and
four processors. It is essentially normal paging, except that remote RAM is
being used as the backing store instead of the local disk.
Fig. 4.1: (a) Pages of address space distributed among four machines. (b) Situation after CPU 1 references page 10. (c) Situation if page 10 is read-only and replication is used.
In this example, if processor 1 references instructions or data in pages 0, 2,
5, or 9, the references are done locally. References to other pages cause
traps. For example, a reference to an address in page 10 will cause a trap to
the operating system, which then moves page 10 from machine 2 to
machine 1, as shown in Fig. 4.1(b).
One improvement to the basic system that can frequently improve
performance considerably is to replicate pages that are read only, for
example, pages that contain program text, read-only constants, or other
read-only data structures. For example, if page 10 in Fig. 4.1 is a section of
program text, its use by processor 1 can result in a copy being sent to
processor 1, without the original in processor 2’s memory being disturbed,
as shown in Fig. 4.1(c). In this way, processors 1 and 2 can both reference
page 10 as often as needed without causing traps to fetch missing memory.
Another possibility is to replicate not only read-only pages, but all pages. As
long as reads are being done, there is effectively no difference between
replicating a read-only page and replicating a read-write page. However, if a
replicated page is suddenly modified, special action has to be taken to
prevent having multiple, inconsistent copies in existence. Typically all copies
but one are invalidated before allowing the write to proceed.
Further performance improvements can be made if we let go of strict
consistency between replicated pages. In other words, we allow a copy to
be temporarily different from the others. Practice has shown that this
approach may indeed help, but unfortunately, can also make life much
harder for the programmer as he has to be aware of such inconsistencies.
Considering that ease of programming was an important reason for
developing DSM systems in the first place, weakening consistency may not
be a real alternative.
Another issue in designing efficient DSM systems is deciding how large
pages should be. Here, we are faced with similar trade-offs as in deciding
on the size of pages in uni-processor virtual memory systems. For example,
the cost of transferring a page across a network is primarily determined by
the cost of setting up the transfer and not by the amount of data that is
transferred. Consequently, having large pages may possibly reduce the total
number of transfers when large portions of contiguous data need to be
accessed. On the other hand, if a page contains data of two independent
processes on different processors, the operating system may need to
repeatedly transfer the page between those two processors, as shown in
Fig. 4.2. Having data belonging to two independent processes in the same
page is called false sharing.
After almost 15 years of research on distributed shared memory, DSM
researchers are still struggling to combine efficiency and programmability.
To attain high performance on large-scale multi-computers, programmers
resort to message passing despite its higher complexity compared to
programming (virtual) shared memory systems. It seems therefore justified
to conclude that DSM for high-performance parallel programming cannot
fulfill its initial expectations.
Fig. 4.2: False sharing of a page between two independent processes
Objectives:
This unit discusses the memory aspects of a Distributed System, wherein
sharing of the memory is done between the nodes of the system. It provides
an architectural specification of the DSM Memory Structure, and also
discusses the Design and Implementation issues. It describes the Memory
Coherence (Consistency) models. It also describes various Server based
algorithms.
4.2 Distributed Shared Memory Systems (DSM)
This is also called DSVM (Distributed Shared Virtual Memory). It is a
loosely coupled distributed-memory system that has implemented a
software layer on top of the message passing system to provide a shared
memory abstraction for the programmers. The software layer can be
implemented in the OS kernel or in runtime library routines with proper
kernel support. It is an abstraction that integrates local memory of different
machines in a network environment into a single logical entity shared by
cooperating processes executing on multiple sites. Shared memory exists
only virtually.
DSM Systems: A comparison with message passing and with tightly
coupled multiprocessor systems
DSM provides a simpler abstraction than the message passing model. It
relieves the programmer of the burden of explicitly using communication
primitives in programs.
In message passing systems, passing complex data structures between two
different processes is difficult. Moreover, passing data structures containing
pointers is generally expensive in message passing model.
Distributed Shared Memory takes advantage of the locality of reference
exhibited by programs and improves efficiency.
Distributed Shared Memory systems are cheaper to build than tightly
coupled multiprocessor systems.
The large total physical memory available (the sum of the memories of all
nodes) facilitates the efficient execution of programs requiring large
amounts of memory.
DSM can scale well when compared to tightly coupled multiprocessor
systems.
Message passing system allows processes to communicate with each other
while being protected from one another by having private address spaces,
whereas in DSM one can cause another to fail by erroneously altering data.
When message passing is used between heterogeneous computers,
marshaling of data takes care of differences in data representation; it is not
obvious, however, how memory can be shared between computers with
different integer representations.
DSM can be made persistent, i.e., processes communicating via DSM may
execute with non-overlapping lifetimes: a process can leave information in
an agreed location for another process to pick up later. Processes
communicating via message passing, in contrast, must execute at the
same time.
Which is better? Message passing or Distributed Shared Memory?
Distributed Shared Memory appears to be a promising tool if it can be
implemented efficiently.
Distributed Shared Memory Architecture
As shown in the above figure, DSM provides a virtual address space
shared among processes on loosely coupled processors. DSM is basically
an abstraction that integrates the local memory of different machines in a
network environment into a single logical entity shared by cooperating
processes executing on multiple sites. The shared memory itself exists only
virtually. The application programs can use it in the same way as traditional
virtual memory, except that processes using it can run on different machines
in parallel.
Architectural Components:
Each node in a distributed system consists of one or more CPUs and a
memory unit. The nodes are connected by a communication network. A
simple message-passing system allows processes on different nodes to
exchange messages with each other. DSM abstraction presents a single
large shared memory space to the processors of all nodes. Shared memory
of DSM exists only virtually. Memory map manager running at each node
maps the local memory onto the shared virtual memory. To facilitate this
mapping, shared-memory space is partitioned into blocks. Data caching is
used to reduce network latency. When a memory block accessed by a
process is not resident in local memory:
a block fault is generated and control goes to the OS.
the OS gets this block from the remote node and maps it to the
application’s address space and the faulting instruction is restarted.
Thus data keeps migrating from one node to another node but no
communication is visible to the user processes.
Network traffic is highly reduced if applications show a high degree of
locality of data accesses.
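The fault-handling path just described can be sketched in C as follows; locate_owner(), fetch_block(), and map_block() are assumed helper routines, not the API of any real DSM system.

#define BLOCK_SIZE 4096

extern int  locate_owner(int block);                    /* e.g., ask a server */
extern void fetch_block(int owner, int block, void *dst);
extern void map_block(int block, void *src);

static char incoming[BLOCK_SIZE];

/* Invoked by the OS when a process touches a non-resident shared block. */
void dsm_block_fault(unsigned long fault_addr)
{
    int block = (int)(fault_addr / BLOCK_SIZE);  /* which shared block */
    int owner = locate_owner(block);             /* find the current holder */
    fetch_block(owner, block, incoming);         /* copy it over the network */
    map_block(block, incoming);                  /* map into the address space */
    /* the faulting instruction is then restarted and now succeeds */
}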
Variations of this general approach are used for different implementations
depending on whether the DSM allows replication and/or migration of
shared memory.
4.3 DSM – Design and Implementation Issues
The important issues involved in the design and implementation of DSM
systems are as follows:
Granularity: It refers to the block size of the DSM system, i.e., the unit of
sharing and the unit of data transfer across the network when a network
block fault occurs. Possible units are a few words, a page, or a few pages.
Structure of Shared Memory Space: The structure refers to the layout of
the shared data in memory. It is dependent on the type of applications that
the DSM system is intended to support.
Memory coherence and access synchronization: Coherence
(consistency) refers to the memory coherence problem, which deals with
the consistency of shared data that lies in the main memories of two or
more nodes. Synchronization refers to the synchronization of concurrent
accesses to shared data using synchronization primitives such as
semaphores.
Data Location and Access: A DSM system must implement mechanisms
to locate data blocks in order to service the network data block faults to
meet the requirements of the memory coherence semantics being used.
Block Replacement Policy: If the local memory of a node is full, a cache
miss at that node implies not only a fetch of the accessed data block from a
remote node but also a replacement, i.e., a data block of the local memory
must be replaced by the new data block. Therefore, a block replacement
policy is also necessary in the design of a DSM system.
Thrashing: In a DSM system, data blocks migrate between nodes on
demand. If two nodes compete for write access to a single data item, the
corresponding data block may be transferred back and forth at such a high
rate that no real work can get done. A DSM system must use a policy to
avoid this situation (known as Thrashing).
Heterogeneity: DSM systems built for homogeneous environments need
not address the heterogeneity issue. However, if the underlying system
environment is heterogeneous, the DSM system must be designed to take
care of heterogeneity so that it functions properly with machines having
different architectures.
4.4 Granularity – Block Size
Choosing appropriate block size should take the following into
consideration:
Paging overhead: A large block size minimizes the paging overhead,
since it takes advantage of locality of reference.
Directory size: The larger the block size, the smaller the directory; a
smaller directory reduces directory management overhead.
Thrashing: Data items in the same data block may be updated by
multiple nodes at the same time, causing a large number of block
transfers; thrashing is therefore more likely with large blocks.
False sharing: Two different processes may access two unrelated
variables that happen to reside in the same data block. This too can
lead to thrashing.
Why not use the page size of the virtual memory system as the block size?
Some advantages of such an approach are:
• It allows the use of existing page-fault handling schemes to trigger DSM
page faults.
• If a page can fit into a packet, the page size does not impose undue
communication overhead.
4.5 Structure of Shared Memory Space in a DSM System
Three commonly used approaches for structuring:
1. No structuring: The shared memory space is simply a linear array of
words. The IVY DSM system uses this approach.
2. Structuring by data type: The shared memory space is structured as a
collection of objects or as a collection of variables in a source language.
Since the sizes of objects and variables vary, one has to use a variable
grain size, which complicates the design and implementation.
3. Structuring as a database: Structure the shared memory as a
database.
• Shared memory space is ordered as an associative memory, called
a tuple space, which is a collection of tuples with data items in their
fields.
4.6 Memory Coherence (Consistency) Models
What is a memory Consistency Model?
• A set of rules that the applications must obey if they want the DSM
system to provide the degree of consistency guaranteed by the
consistency model.
• The weaker the consistency model, the better the concurrency.
• Researchers try to invent new consistency models which are weaker
than the existing ones in such a way that a set of applications will
function correctly under the new consistency model.
• Note that an application written for a DSM that implements a stronger
consistency model may not work correctly under a DSM that implements
a weaker consistency model.
4.7 Memory Consistency models
i) Strict consistency: Each read operation returns the most recently
written value. This is possible only in systems with a notion of absolute
global time, which a distributed system lacks; the model is therefore
impossible to implement in practice. Hence, DSM systems built on top of
distributed systems have to use weaker consistency models.
ii) Sequential consistency: Proposed by Lamport (1979). All
processes in the system observe the same order of all memory access
operations on the shared memory. That is, if three operations read(r1),
write(w1), and read(r2) are performed on a memory address in that
order, then any of the six orderings (r1, w1, r2), (r2, w1, r1), (w1, r2,
r1), and so on, is acceptable, provided all processes see the same ordering. It
can be implemented by serializing all requests on a central server
node. This model is weaker than the strict consistency model. This
model provides one-copy/single-copy semantics because all processes
sharing a memory location always see exactly the same contents
stored in it. Sequential consistency is the most intuitively expected
semantics for memory coherence. So, sequential consistency is
acceptable for most applications.
iii) Causal consistency model: Proposed by Hutto and Ahamad (1990).
In this model, all write operations that are potentially causally related
are seen by all processes in the same (correct) order. For example, if a
process did a read operation and then performed a write operation,
then the value written may have depended in some way on the value
read. A write operation performed by one process P1 is not causally
related to a write operation performed by another process P2 if P1 has
read neither the value written by P2 nor any memory variable that was
directly or indirectly derived from the value written by P2, and vice versa.
To implement a DSM that supports causal consistency, one has to keep
track of which memory operations depend on which other operations.
This model is weaker than the sequential consistency model.
iv) Pipelined Random-Access Memory (PRAM) consistency model
This model was proposed by Lipton and Sandberg (1988). In this
model, all write operations performed by a single process are seen by
all other processes in the order in which they were performed. This
model can be implemented easily by sequencing the write operations
performed by each node independently.
This model is weaker than all the above consistency models.
v) Processor Consistency Model: Proposed by Goodman (1989).
In addition to PRAM consistency, for any memory location, all
processes agree on the same order of all write operations to that
location.
vi) Weak Consistency Model: Proposed by Dubois et al. (1988).
This model distinguishes between ordinary accesses and
synchronization accesses. It requires that memory become consistent
only on synchronization accesses. A DSM that supports weak
consistency model uses a special variable, called synchronization
variable. The operations on it are used to synchronize memory. For
supporting weak consistency, the following should be satisfied:
All accesses to synchronization variables must obey sequential
consistency semantics.
All previous write operations must be completed everywhere before
an access to a synchronization variable is allowed.
All previous accesses to synchronization variables must be completed
before an access to a non-synchronization variable is allowed.
vii) Release Consistency Model: In the weak consistency model, the
entire shared memory is synchronized when a synchronization variable
is accessed by a process, i.e.:
• All changes made to the memory are propagated to other nodes.
• All changes made to the memory by other processes are propagated
from other nodes to the process’s node.
This is not really necessary, because the first operation needs to be
performed only when a process exits from a critical section, and the
second operation needs to be performed only when a process enters a
critical section. So, instead of one synchronization variable, two
synchronization operations, called acquire and release, have been
proposed.
– Acquire is used by a process to tell the system that it is about to
enter a critical section.
– Release is used to tell the system that it has just exited a critical
section.
If processes use the synchronization accesses properly, a release-
consistent DSM system will produce the same results for an application
as if the application were executed on a sequentially consistent DSM
system.
viii) Lazy Release Consistency Model: This is a variation of the release
consistency model. In this approach, when a process performs a release
access, the contents of its modifications are not immediately sent to
other nodes; they are sent only on demand. That is, when a process
performs an acquire access, the modifications made at other nodes are
fetched by the acquiring process's node. This minimizes network traffic.
4.8 Implementing Sequential Consistency
Sequential consistency supports the intuitively expected semantics, so it is
the preferred choice of most DSM system designers. The replication and
migration strategies for DSM design include:
i) Non-replicated, non-migrating blocks (NRNMBs)
ii) Non-replicated, migrating blocks (NRMBs)
iii) Replicated, migrating blocks (RMBs)
iv) Replicated, non-migrating blocks (RNMBs)
i) Implementing under NRNMBs strategy:
Under this strategy, only one copy of each block of the shared memory is in
the system and its location is fixed. All requests for a block are sent to the
owner node of the block. Upon receiving a request from a client node, the
memory management unit (MMU) and the operating system of the owner
node perform the access request and return the result. Sequential
consistency can be trivially enforced, because the owner node only needs
to process all requests on a block in the order in which it receives them.
Disadvantages: The serialization of data access creates a bottleneck.
Parallelism is not possible in this strategy.
Locating data in the NRNMB strategy: A mapping between blocks and
nodes needs to be maintained at each node.
ii) Implementing under NRMBs strategy
Under this strategy, only the processes executing on one node can read or
write a given data item at any time, so sequential consistency is ensured.
The advantages of this strategy include:
– No communication cost for local data access.
– Allows applications to take advantage of data access locality
The disadvantages of this strategy include:
– Prone to thrashing
– Parallelism cannot be achieved in this method also
Locating a block in the NRMB strategy:
1. Broadcasting: Under this approach:
– Each node maintains an owned-blocks table.
– When a block fault occurs, the fault handler broadcasts a request on the
network.
– The node that currently owns the block responds by transferring the
block.
– This approach does not scale well.
2. Centralized Server Algorithm: A central server maintains a block table
that contains the location information for all blocks in the shared memory
space
– When a block fault occurs, the fault handler sends a request to the
central server.
– The central server forwards the request to the node holding the block
and updates its block table.
– Upon receiving the request, the owner transfers the block to the
requesting node.
– Drawbacks:
Central node is a bottleneck.
If the central node fails, the DSM stops functioning.
3. Fixed Distributed – Server Algorithm: Under this scheme:
• Several nodes have block managers, each block manager manages a
predetermined set of blocks
• Each node maintains a mapping from data blocks to block managers
• When a block fault occurs, the fault handler sends a request to the
corresponding block manager
• The block manager forwards the request to the corresponding node
and updates its table to reflect the new owner (the node requesting
the block)
• Upon receiving the request, the owner transfers the block to the
requesting node.
4. Dynamic Distributed Server Algorithm: Under this approach there is
no block manager. Each node maintains information about the probable
owner of each block. When a block fault occurs, the fault handler sends
a request to the probable owner of the block. Upon receiving the
request, if the receiving node is the owner of the block, it updates its
block table and transfers the block to the requesting node; otherwise, it
forwards the request to the probable owner of the block as indicated by
its block table.
Implementing under RMBs strategy
A major disadvantage of the non-replication strategies is lack of parallelism,
because only the processes on one node can access the data contained in
any given block at any given time. To increase parallelism, virtually all DSM
systems replicate blocks. With replicated blocks, read operations can be
carried out in parallel at multiple nodes by accessing the local copy of the
data. Therefore the average cost of read operations is reduced because no
communication overhead is involved if a replica of the data exists at the
local node. However, replication tends to increase the cost of write
operations because for a write to a block all its replicas must be invalidated
or updated to maintain consistency.
The two basic protocols that may be used for ensuring sequential
consistency in this case are as follows:
1. Write-Invalidate: In this scheme, all copies of a piece of data except
one are invalidated before a write can be performed on it. Therefore, when a
write fault occurs at a node, its fault handler copies the accessed block from
one of the block's current nodes to its own node, invalidates all other copies
of the block by sending an invalidate message containing the block address
to the nodes having a copy of the block, changes the access of the local
copy of the block to write, and returns to the faulting instruction.
After returning, the node “owns” that block and can proceed with the write
operation and other read/write operations until the block ownership is
relinquished to some other node.
Protocols for implementing Sequential Consistency
i) Write-Invalidate Protocol: All copies of a data block except one are
invalidated before a write can be performed on it. If one of the nodes that
had a copy of the block before invalidation tries to perform a memory access
operation on the block after invalidation, a block fault will occur and the fault
handler will fetch the block again from a node having a valid copy, thus
achieving sequential consistency.
ii) Write-Update Protocol: Under this scheme, a write operation is carried
out by updating all copies of the data on which the write is performed. When
a write fault occurs at a node, the fault handler copies the accessed block
from a node having a valid copy, updates all copies and the local copy and
then returns to the faulting instruction. In this method, sequential
consistency can be achieved by using a mechanism to totally order the write
operations of all the nodes. One way to accomplish this is through a global
sequencer. The set of reads that take place between any two writes is well
defined and their order is immaterial to sequential consistency.
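A global sequencer can be sketched in C as follows; multicast_update() is an assumed helper that delivers the numbered update to every node holding a replica.

#include <stddef.h>
#include <stdint.h>

extern void multicast_update(uint64_t seq, int block,
                             const void *data, size_t len);

static uint64_t next_seq = 1;

/* Runs at the sequencer node: every write from every node passes
   through here, so all replicas see updates in one total order (each
   node applies an update only after all lower-numbered updates). */
void sequence_write(int block, const void *data, size_t len)
{
    multicast_update(next_seq++, block, data, len);
}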
Demerit: This protocol is very expensive for use with loosely coupled
systems, because every write operation requires network access.
Locating a block in the RMB strategy:
iii) Broadcasting: Under this approach, each node maintains an owned-
blocks table. Each entry in the table has a copy-set field containing the list of
nodes that have a valid copy of the corresponding block. When a read fault
for a block occurs at node N, the fault handler at node N broadcasts a read
request for the block. Upon receiving the request, the node that currently
owns the block adds N to the copy-set field and transfers the block to
node N.
When a write fault for a block occurs at node N, the fault handler at node N
broadcasts a write request for the block. The node that currently owns the
block relinquishes its ownership to node N and transfers the block to
node N along with its copy-set. Node N, upon receiving the block, sends an
invalidation message to all nodes in the copy-set, adds an entry for the
block in its local owned-blocks table to reflect that N is now the owner, and
initializes the copy-set to {N}. This approach does not scale well.
4.9 Centralized-Server Algorithm
A central server maintains a block table containing owner-node and copy-
set information for each block. When a read/write fault for a block occurs at
node N, the fault handler at node N sends a read/write request to the central
server.
Upon receiving the request, the central-server does the following:
If it is a read request:
• adds N to the copy-set field and
• sends the owner node information to node N
• upon receiving this information, N sends a request for the block to
the owner node.
• upon receiving this request, the owner returns a copy of the block
to N.
If it is a write request:
• it sends the copy-set and owner information of the block to node N
and initializes the copy-set to {N}
• node N sends a request for the block to the owner node and an
invalidation message to all nodes in the copy-set
• upon receiving this request, the owner sends the block to node N
4.10 Fixed Distributed-Server Algorithm
Under this scheme
Several nodes have block managers, each block manager manages a
predetermined set of blocks
When a read/write fault occurs, the request for the block is sent to the
corresponding block manager.
Upon receiving the request, the block manager takes actions similar to
those of the central server in the centralized-server approach.
4.11 Dynamic Distributed Server Algorithm
Under this approach, there is no block manager. Each node maintains
information about the probable owner of each block, and also the copy-set
information for each block that it owns. When a block fault occurs, the fault
handler sends a request to the probable owner of the block.
Upon receiving the request
if the receiving node is not the owner, it forwards the request to the
probable owner of the block according to its table.
if the receiving node is the owner, then
If the request is a read request, it adds the entry N to the copy-set
field of the entry corresponding to the block and sends a copy of the
block to node N.
If the request is a write request, it sends the block and copy-set
information to the node N and deletes the entry corresponding to the
block from its block table.
Node N, upon receiving the block, sends invalidation request to all
nodes in the copy-set, and updates its block table to reflect the fact
that it is the owner of the block
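The per-node request handling of this algorithm can be sketched in C as follows; the table layout and helper functions are illustrative.

#define MAX_BLOCKS 1024

extern int  i_am_owner[MAX_BLOCKS];      /* true for blocks this node owns */
extern int  probable_owner[MAX_BLOCKS];  /* per-block ownership hint */
extern void forward_request(int node, int block, int requester, int is_write);
extern void send_block_copy(int block, int requester);
extern void send_block_and_copyset(int block, int requester);
extern void add_to_copyset(int block, int node);
extern void delete_block_entry(int block);

void handle_block_request(int block, int requester, int is_write)
{
    if (!i_am_owner[block]) {
        /* not the owner: forward along the probable-owner chain */
        forward_request(probable_owner[block], block, requester, is_write);
        return;
    }
    if (is_write) {
        send_block_and_copyset(block, requester);  /* ownership moves */
        delete_block_entry(block);
        i_am_owner[block]     = 0;
        probable_owner[block] = requester;         /* update the hint */
    } else {
        add_to_copyset(block, requester);          /* record the new reader */
        send_block_copy(block, requester);
    }
}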
4.12 Implementing under RNMBs Strategy
Under this strategy:
• Blocks are replicated, and blocks do not migrate to other nodes.
• Replicas can be kept consistent by using a write-update protocol.
• Sequential consistency can be achieved by using a global sequencer.
• For locating data, each node has a block table containing information
about the locations of all blocks.
Block Replacement Policy
The following are different approaches that may be used for block
replacement:
1. Usage-Based Replacement Policy: Least recently used (LRU), Most
recently used (MRU).
2. Non-Usage Based Replacement: Do not take usage into consideration,
example First in First out (FIFO), Random.
3. Fixed-Space versus Variable Space Approach: Fixed space
algorithms assume cache size is fixed; under variable space
replacement, cache size can change.
Which approach is suitable for DSM systems? Variable-space algorithms
are not suitable for a DSM system, because the memory at each node that
acts as a cache is of fixed size. IVY uses a priority-based scheme for block
replacement.
The following are the two most commonly used approaches of placing a
block to be replaced:
1. Using Secondary Store: The block is transferred to a local disk.
2. Using Memory Space at Other Nodes: The block is transferred to
another node which has free memory space. The node needs to know which
nodes have free memory space.
4.13 Thrashing
Thrashing is said to occur when the system spends a large amount of time
transferring shared data blocks from one node to another, compared to the
time spent doing the useful work of executing application processes. It is a
serious performance problem with DSM systems that allow data blocks to
migrate from one node to another. Thrashing may occur due to the following
reasons:
Interleaved data access by two or more processes on different nodes
that causes a data block to move back and forth from one node to
another in quick succession. (Ping-Pong Effect).
Blocks with read only permissions are repeatedly invalidated soon after
they are replicated.
Such situations indicate poor (node) locality in references. If not properly
handled, thrashing degrades system performance considerably.
The following are some of the proposed solutions for handling thrashing:
1. Application-Controlled Locks: Applications are allowed to lock data
for a short period of time. An application-controlled lock can be
associated with each data block to implement this method.
2. Nailing a block to a node for a minimum amount of time: In this
method, a block is disallowed from being taken away from a node until
a minimum amount of time, say t, elapses after its allocation to that
node. How is t determined? The time t can be fixed statically or tuned
dynamically on the basis of observed access patterns.
3. Tailor coherence algorithm to the shared-data usage patterns: Use
different coherence protocols for shared data with different
characteristics.
4.14 Terminal Questions
1. Explain the general architecture of a DSM system
2. Discuss the design and implementation issues of a DSM system
3. Discuss the following:
Consistency Models
Thrashing
Unit 5 Synchronization
Structure:
5.1 Introduction
Objectives
5.2 Clock Synchronization
5.3 Clock Synchronization Algorithms
5.4 Distributed Algorithms
5.5 Event Ordering
5.6 Mutual Exclusion
5.7 Deadlock
5.8 Election Algorithms
5.9 Terminal Questions
5.1 Introduction
A Distributed System is a collection of distinct processes which are spatially
separated and run concurrently. In systems with multiple concurrent
processes, it is economical to share the system resources among the
concurrently executing processes. The sharing of resources may be
cooperative or competitive. Since the number of available resources in a
computing system is restricted, one process must necessarily influence the
action of other concurrently running processes as it competes for resources.
Sometimes, concurrent processes must cooperate either to achieve the
desired performance of the computing system or due to the nature of the
computation being performed. For example, a client process and a server
process must cooperate when performing file access operations. Both
cooperative and competitive sharing require adherence to certain rules of
behavior that guarantee that correct interaction occurs. The rules for
enforcing correct interactions are implemented in the form of
synchronization mechanisms. This unit focuses on synchronization
mechanisms that are suitable for distributed systems.
Objectives:
This unit introduces the problem of synchronizing the disparate machines of a distributed network as they exchange messages. It describes the various ways of synchronizing the clocks of the sender and receiver machines, along with algorithms that implement this synchronization. It also covers event ordering when multiple messages are sent from multiple senders to multiple receivers, discusses the deadlocks that can occur when resources are shared among distributed systems, and describes the election algorithms used to elect a coordinator process or node for message sending and receiving.
5.2 Clock Synchronization
Time is an important concept when dealing with synchronization and coordination. In particular, it is often important to know when events occurred and in what order they occurred. In a non-distributed system, dealing with time is trivial, as there is a single shared clock: all processes see the same time. In a distributed system, on the other hand, each computer has its own clock. Because no clock is perfect, each of these clocks has its own skew, which causes clocks on different computers to drift and eventually become out of sync.
There are several notions of time that are relevant in a distributed system.
First of all, internally a computer clock simply keeps track of ticks that can
be translated into physical time (hours, minutes, seconds, etc.). This
physical time can be global or local. Global time is a universal time that is
the same for everyone and is generally based on some form of absolute
time. Currently, Coordinated Universal Time (UTC), which is based on oscillations of the cesium-133 atom, is the most accurate global time.
Besides global time, processes can also consider local time. In this case the
time is only relevant to the processes taking part in the distributed system
(or algorithm). This time may be based on physical or logical clocks.
Physical Clocks
Physical clocks keep track of physical time. In distributed systems that rely
on actual time it is necessary to keep individual computer clocks
synchronized. The clocks can be synchronized to global time (external
synchronization), or to each other (internal synchronization). Cristian’s
algorithm and the Network Time Protocol (NTP) are examples of algorithms
developed to synchronize clocks to an external global time source (usually
UTC). The Berkeley Algorithm is an example of an algorithm that allows
clocks to be synchronized internally.
Cristian’s algorithm requires clients to periodically synchronize with a central
time server (typically a server with a UTC receiver). One of the problems
encountered when synchronizing clocks in a distributed system is that
unpredictable communication latencies can affect the synchronization. For
example, when a client requests the current time from the time server, by
the time the server’s reply reaches the client the time will have changed.
The client must, therefore, determine what the communication latency was
and adjust the server’s response accordingly. Cristian’s algorithm deals with
this problem by attempting to calculate the communication delay based on
the time elapsed between sending a request and receiving a reply.
The Network Time Protocol is similar to Cristian’s algorithm in that
synchronization is also performed using time servers and an attempt is
made to correct for communication latencies.
Unlike Cristian’s algorithm, however, NTP is not centralised and is designed
to work on a wide area scale. As such, the calculation of delay is somewhat
more complicated. Furthermore, NTP provides a hierarchy of time servers,
with only the top layer containing UTC clocks. The NTP algorithm allows
client-server and peer-to-peer (mostly between time servers)
synchronization. It also allows clients and servers to determine the most
reliable servers to synchronize with. NTP typically provides accuracies
between 1 and 50 msec depending on whether communication is over a
LAN or WAN.
Unlike the previous two algorithms, the Berkeley algorithm does not
synchronize to a global time. Instead, in this algorithm, a time server polls
the clients to determine the average of everyone’s time. The server then
instructs all clients to set their clocks to this new average time. Note that in
all the above algorithms a clock should never be set backward. If time needs
to be adjusted backward, clocks are simply slowed down until time
'catches up'.
Logical Clocks
For many applications, the relative ordering of events is more important than
actual physical time. In a single process the ordering of events (e.g., state
changes) is trivial. In a distributed system, however, besides local ordering
of events, all processes must also agree on ordering of causally related
events (e.g., sending and receiving of a single message). Given a system
consisting of N processes pi, i ∈ {1, . . . , N}, we define the local event ordering →i as a binary relation such that, if pi observes e before e′, we have e →i e′. Based on this local ordering, we define a global ordering as a happened-before relation →, as proposed by Lamport [Lam78]: the relation → is the smallest relation such that
1. e →i e′ implies e → e′,
2. for every message m, send(m) → receive(m), and
3. e → e′ and e′ → e′′ implies e → e′′ (transitivity).
The relation → is almost a partial order (it lacks reflexivity). If a → b, then we say a causally affects b. Events are concurrent if they are unordered; i.e., a ̸→ b and b ̸→ a implies a ∥ b.
As an example, consider Figure 5.1. We have the following causal relations:
E11 → E12,E13,E14,E23,E24, . . .
E21 → E22,E23,E24,E13,E14, . . .
Figure 5.1: Example of Event Ordering
Moreover, the following events are concurrent: E11 ∥ E21, E12 ∥ E22, E13 ∥ E23, E11 ∥ E22, E13 ∥ E24, E14 ∥ E23, and so on.
How are Computer Clocks Implemented?
A computer clock usually consists of three components – a quartz crystal that oscillates at a well-defined frequency, a counter register, and a constant register. The constant register is used to store a constant value that is decided based on the frequency of oscillation of the quartz crystal. The counter register is used to keep track of the oscillations of the quartz crystal; i.e., the value in the counter register is decremented by 1 for each oscillation of the quartz crystal. When the value of the counter register becomes zero, an interrupt is generated and its value is reinitialized from the constant register. Each interrupt is called a clock tick.
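The mechanism above can be illustrated with a minimal Python sketch; the register value and oscillation count below are illustrative assumptions, not properties of any particular hardware clock.

def simulate_clock(constant_value, oscillations):
    # The constant register holds constant_value; the counter register
    # is decremented once per crystal oscillation. Each time it reaches
    # zero, an interrupt (a clock tick) is raised and the counter is
    # reinitialized from the constant register.
    counter, ticks = constant_value, 0
    for _ in range(oscillations):
        counter -= 1
        if counter == 0:
            ticks += 1
            counter = constant_value
    return ticks

print(simulate_clock(10, 10_000))   # 10,000 oscillations -> 1000 ticks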
Clock Synchronization Issues
No two clocks can be perfectly synchronized. Two clocks are said to be synchronized at a particular instant of time if the difference in the time values of the two clocks is less than some specified constant δ. The difference in the time values of two clocks is called clock skew. Therefore, a set of clocks is said to be synchronized if the clock skew of any two clocks in the set is less than δ.
Clock synchronization requires each node to read other nodes’ clock values.
Regardless of the clock reading mechanism, a node can obtain only an
approximate view of its clock skew with respect to other nodes’ clocks in the
system.
Errors occur mainly because of unpredictable communication delays during
message passing used to deliver a clock signal or a clock message from
one node to another.
An important issue in clock synchronization is that time must never run
backward because this could cause serious problems, such as the repetition
of certain operations that may be hazardous in certain cases. We know that
during synchronization a fast clock has to be slowed down. But if the time of
a fast clock is readjusted to the actual time all at once, it may lead to running
the time backward for that clock. Therefore, clock synchronizing algorithms
are normally designed to gradually introduce such a change in the fast
running clock instead of readjusting it to the correct time all at once.
5.3 Clock Synchronization Algorithms
Clock synchronization algorithms may be broadly classified as Centralized
and Distributed:
Centralized Algorithms
In centralized clock synchronization algorithms, one node has a real-time receiver. This node, called the time server node, has a clock time that is
regarded as correct and used as the reference time. The goal of these
algorithms is to keep the clocks of all other nodes synchronized with the
clock time of the time server node. Depending on the role of the time server
node, centralized clock synchronization algorithms are again of two types –
Passive Time Server and Active Time Server.
1. Passive Time Server Centralized Algorithm: In this method, each node periodically sends a request message ("time = ?") to the time server. When the time server receives the message, it quickly responds with a message ("time = T"), where T is the current time in the clock of the time server node. Assume that when the client node sends the "time = ?" message its clock time is T0, and when it receives the "time = T" message its clock time is T1. Since T0 and T1 are measured using the same clock, in the absence of any other information the best estimate of the time required for the propagation of the message "time = T" from the time server node to the client's node is (T1 - T0)/2. Therefore, when the reply is received at the client's node, its clock is readjusted to T + (T1 - T0)/2; a sketch of this computation appears after this list.
2. Active Time Server Centralized Algorithm: In this approach, the time server periodically broadcasts its clock time ("time = T"). The other nodes receive the broadcast message and use the clock time in the message to correct their own clocks. Each node has a priori knowledge of the approximate time (Ta) required for the propagation of the message "time = T" from the time server node to its own node. Therefore, when a broadcast message is received at a node, the node's clock is readjusted to the time T + Ta. A major drawback of this method is that it is not fault tolerant. If the broadcast message reaches a node too late due to some communication fault, the clock of that node will be readjusted to an incorrect value. Another disadvantage of this approach is that it requires a broadcast facility to be supported by the network.
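The adjustment made by the passive time server method (item 1 above) can be sketched in a few lines of Python; request_time() is a hypothetical helper standing in for the "time = ?" / "time = T" message exchange, and the real networking code is omitted.

import time

def adjusted_time(request_time):
    t0 = time.monotonic()     # client clock when "time = ?" is sent
    t = request_time()        # T, the reply from the time server
    t1 = time.monotonic()     # client clock when "time = T" arrives
    # The best estimate of the one-way propagation delay is half the
    # round-trip time, so the client readjusts its clock to:
    return t + (t1 - t0) / 2.0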
Another active time server algorithm that overcomes the drawbacks of the
above algorithm is the Berkeley algorithm proposed by Gusella and Zatti for
internal synchronization of clocks of a group of computers running the
Berkeley UNIX. In this algorithm, the time server periodically sends a
message (“time = ?”) to all the computers in the group. On receiving this
message, each computer sends back its clock value to the time server. The
time server has a priori knowledge of the approximate time required for the
propagation of a message from each node to its own node. Based on this
knowledge, it first readjusts the clock values in the reply messages. It then takes a fault-tolerant average of the clock values of all the computers (including its own). To take the fault-tolerant average, the time server chooses a subset of all clock values that do not differ from one another by more than a specified amount, and the average is taken only over the clock values in this subset. This approach eliminates readings from unreliable clocks, whose values could have a significant adverse effect if an ordinary average were taken.
The calculated average is the current time to which all the clocks should be readjusted. The time server readjusts its own clock to this value. Instead of sending the calculated current time back to the other computers, the time server sends the amount by which each individual computer's clock requires adjustment. This can be a positive or negative value and is calculated based on the knowledge the time server has about the approximate time required for the propagation of a message from each node to its own node.
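The fault-tolerant averaging step can be sketched as follows. For simplicity, this sketch compares each reading against the server's own clock rather than choosing a mutually close subset, and all names are illustrative.

def berkeley_adjustments(server_time, client_times, threshold):
    # client_times: clock values reported by the clients, already
    # corrected for the known message propagation delays.
    clocks = [server_time] + list(client_times)
    # Fault tolerance: discard readings that differ too much.
    usable = [c for c in clocks if abs(c - server_time) <= threshold]
    average = sum(usable) / len(usable)
    # Each machine (the server included) is told by how much to
    # adjust its clock, not what the new absolute time is.
    return [average - c for c in clocks]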
Centralized clock synchronization algorithms suffer from two major
drawbacks:
1. They are subject to single-point failure. If the time server node fails, the clock synchronization operation cannot be performed. This makes the system unreliable. Ideally, a distributed system should be more
reliable than its individual nodes. If one goes down, the rest should
continue to function correctly.
2. From a scalability point of view, it is generally not acceptable to have all the time requests serviced by a single time server. In a large system,
such a solution puts a heavy burden on that one process.
Distributed algorithms overcome these drawbacks.
5.4 Distributed Algorithms
We know that externally synchronized clocks are also internally
synchronized. That is, if each node’s clock is independently synchronized
with real time, all the clocks of the system remain mutually synchronized.
Therefore, a simple method for clock synchronization may be to equip each
node of the system with a real time receiver so that each node’s clock can
be independently synchronized with real time. Multiple real time clocks (one
for each node) are normally used for this purpose.
Theoretically, internal synchronization of clocks is not required in this approach. However, in practice, due to the inherent inaccuracy of real-time clocks, different real-time clocks produce slightly different times. Therefore, internal synchronization is normally performed for better accuracy. One of the following two approaches is used for internal synchronization in this case.
1. Global Averaging Distributed Algorithms: In this approach, the clock
process at each node broadcasts its local clock time in the form of a
special “resync” message when its local time equals T0 + iR for some integer i, where T0 is a fixed time in the past agreed upon by all nodes and R is a system parameter that depends on such factors as the total number of nodes in the system, the maximum allowable drift rate, and so on. That is, a resync message is broadcast from each node at the beginning of every fixed-length resynchronization interval. However,
since the clocks of different nodes run at slightly different rates, these broadcasts will not happen simultaneously at all nodes.
After broadcasting the clock value, the clock process of a node waits for time T, where T is a parameter to be determined by the algorithm. During this waiting period, the clock process records the time, according to its own clock, at which each resync message was received. At the end of the waiting period, the clock process estimates the skew of its clock with respect to each of the other nodes on the basis of the times at which it received resync messages. It then computes a fault-tolerant average of the estimated skews and uses it to correct the local clock before the start of the next resynchronization interval.
The global averaging algorithms differ mainly in the manner in which the
fault-tolerant average of the estimated skews is calculated. Two commonly
used algorithms are:
1. The simplest algorithm is to take the average of the estimated skews
and use it as the correction for the local clock. However, to limit the
impact of faulty clocks on the average value, the estimated skew with
respect to each node is compared against a threshold, and skews
greater than the threshold are set to zero before computing the average
of the estimated skews.
2. In another algorithm, each node limits the impact of faulty clocks by first
discarding the m highest and m lowest estimated skews and then
calculating the average of the remaining skews, which is then used as
the correction for the local clock. The value of m is usually decided
based on the total number of clocks (nodes).
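A minimal sketch of the second algorithm, with the list of estimated skews and the trimming parameter m supplied by the caller:

def skew_correction(estimated_skews, m):
    # Discard the m highest and m lowest estimates, then average the
    # rest; the result is the correction applied to the local clock.
    assert len(estimated_skews) > 2 * m   # something must remain
    ordered = sorted(estimated_skews)
    trimmed = ordered[m:len(ordered) - m]
    return sum(trimmed) / len(trimmed)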
Localized Averaging Distributed Algorithms: In this approach, the nodes
of a distributed system are logically arranged in some kind of pattern, such
as a ring or a grid. Periodically, each node exchanges its clock time with its
neighbors in the ring, grid, or other structure and then sets its clock time to
the average of its own clock time and the clock times of its neighbors.
5.5 Event Ordering
Lamport observed that for most applications it is not necessary to keep the
clocks in a distributed system synchronized. Rather, it is sufficient to ensure
that all events that occur in a distributed system be totally ordered in a
manner that is consistent with an observed behavior.
For partial ordering of events, Lamport defined a new relation called
happened-before and introduced the concept of logical clocks for ordering of
events based on the happened-before relation. He then gave a distributed
algorithm extending his idea of partial ordering to a consistent total ordering
of all the events in a distributed system. His idea is given below:
Happened-Before Relation
The happened-before relation (denoted by →) on a set of events satisfies the following conditions:
1. If a and b are events in the same process and a occurs before b, then a → b.
2. If a is the event of sending a message by one process and b is the event of the receipt of the same message by another process, then a → b. This condition holds by the law of causality because a receiver cannot receive a message until the sender sends it, and the time taken to propagate a message from its sender to its receiver is always positive.
3. If a → b and b → c, then a → c; i.e., happened-before is a transitive relation.
In a happened-before relation, two events a and b are said to be concurrent if they are not related by the happened-before relation, i.e., neither a → b nor b → a is true. This is possible if the two events occur
in different processes that do not exchange messages either directly or
indirectly via other processes. i.e. two events are concurrent if neither can
causally affect the other.
Given a system consisting of N processes pi, i ∈ {1, . . . , N}, we define the local event ordering →i as a binary relation such that, if pi observes e before e′, we have e →i e′. Based on this local ordering, we define a global ordering as a happened-before relation →, as proposed by Lamport [Lam78]: the relation → is the smallest relation such that
1. e →i e′ implies e → e′,
2. for every message m, send(m) → receive(m), and
3. e → e′ and e′ → e′′ implies e → e′′ (transitivity).
The relation → is almost a partial order (it lacks reflexivity). If a → b, then we say a causally affects b. Events are concurrent if they are unordered; i.e., a ̸→ b and b ̸→ a implies a ∥ b.
As an example, consider Figure 5.2. We have the following causal relations:
E11 → E12,E13,E14,E23,E24, . . .
E21 → E22,E23,E24,E13,E14, . . .
Figure 5.2: Example of Event Ordering
Moreover, the following events are concurrent: E11 ∥ E21, E12 ∥ E22, E13 ∥ E23, and so on.
Lamport Clocks
Lamport's logical clocks can be implemented as a software counter that locally computes the happened-before relation →. This means that each process pi maintains a logical clock Li. Given such a clock, Li(e) denotes the Lamport timestamp of event e at pi, and L(e) denotes the timestamp of event e at the process at which it occurred. Processes proceed as follows:
1. Before time-stamping a local event, a process pi executes Li := Li + 1.
2. Whenever a message m is sent from pi to pj:
Process pi executes Li := Li + 1 and sends the new Li with m.
Process pj receives Li with m and executes Lj := max(Lj, Li) + 1;
receive(m) is annotated with the new Lj.
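These rules translate directly into code. The following is a minimal sketch of a Lamport clock; the class and method names are illustrative.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1                 # rule 1: tick before stamping
        return self.time

    def send(self):
        self.time += 1                 # rule 2: tick, then ship Li with m
        return self.time

    def receive(self, sender_time):
        # rule 2 at the receiver: merge the two clocks, then tick
        self.time = max(self.time, sender_time) + 1
        return self.time

A message would carry the value returned by send(), and receive() would be invoked with that value when the message is delivered.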
In this scheme, a → b implies L(a) < L(b), but L(a) < L(b) does not necessarily imply a → b. As an example, consider Figure 5.3. In this figure E12 → E23 and L1(E12) < L2(E23) (i.e., 2 < 3); however, we also have E13 ∥ E24 while L1(E13) < L2(E24) (i.e., 3 < 4).
Figure 5.3: Example of the use of a Lamport’s Clocks
In some situations (e.g., to implement distributed locks), a partial ordering on events is not sufficient and a total ordering is required. In these cases, the partial ordering can be completed to a total ordering by including process identifiers. Given local timestamps Li(e) and Lj(e′), we define global timestamps ⟨Li(e), i⟩ and ⟨Lj(e′), j⟩. We then use standard lexicographical ordering, where ⟨Li(e), i⟩ < ⟨Lj(e′), j⟩ iff Li(e) < Lj(e′), or Li(e) = Lj(e′) and i < j.
Vector Clocks
Figure 5.4: Example of the lack of causality with Lamport’s clocks
The main shortcoming of Lamport's clocks is that L(a) < L(b) does not imply a → b; hence, we cannot deduce causal dependencies from timestamps. For example, in Figure 5.4, we have L1(E11) < L3(E33), but E11 ̸→ E33. The root of the problem is that clocks advance independently or via messages, but there is no history as to where an advance comes from.
This problem can be solved by moving from scalar clocks to vector clocks, where each process maintains a vector clock Vi. Vi is a vector of size N, where N is the number of processes. The component Vi[j] contains process pi's knowledge about pj's clock. Initially, we have Vi[j] := 0 for all i, j ∈ {1, . . . , N}. Clocks are advanced as follows:
1. Before pi timestamps an event, it executes Vi[i] := Vi[i] + 1.
2. Whenever a message m is sent from pi to pj:
Process pi executes Vi[i] := Vi[i] + 1 and sends Vi with m.
Process pj receives Vi with m and merges the vector clocks Vi and Vj as follows:
Vj[k] := max(Vj[k], Vi[k]) + 1, if j = k (as in scalar clocks)
Vj[k] := max(Vj[k], Vi[k]), otherwise.
This last part ensures that everything that subsequently happens at pj is now causally related to everything that previously happened at pi.
Under this scheme, we have, for all i, j, Vi[i] ≥ Vj[i] (i.e., pi always has the most up-to-date version of its own clock); moreover, a → b iff V(a) < V(b), where
• V = V′ iff V[i] = V′[i] for all i ∈ {1, . . . , N},
• V ≥ V′ iff V[i] ≥ V′[i] for all i ∈ {1, . . . , N},
• V > V′ iff V ≥ V′ and V ≠ V′; and
• V ∥ V′ iff V ≯ V′ and V′ ≯ V.
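A minimal sketch of vector clocks and the comparisons above, with illustrative names and N passed in explicitly:

class VectorClock:
    def __init__(self, pid, n):
        self.pid, self.v = pid, [0] * n

    def tick(self):
        self.v[self.pid] += 1          # advance own component

    def send(self):
        self.tick()
        return list(self.v)            # timestamp shipped with m

    def receive(self, other):
        # merge component-wise, then advance the own component,
        # matching the update rule given above
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.pid] += 1

def happened_before(v, w):
    # V(a) < V(b): less than or equal component-wise, and not equal
    return all(a <= b for a, b in zip(v, w)) and v != w

def concurrent(v, w):
    return not happened_before(v, w) and not happened_before(w, v)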
5.6 Mutual Exclusion
There are several resources within a system that must not be used
simultaneously by multiple processes if program operation is to be correct.
For example, a file must not be simultaneously updated by multiple
processes. Exclusive access to shared resources by a process must be
ensured. This exclusiveness of access is called Mutual Exclusion between
processes. The sections of a program that need exclusive access to shared
resources are referred to as critical sections. For mutual exclusion, means
are introduced to prevent processes from executing concurrently within their
associated critical sections.
Requirements for Mutual Exclusion
Any facility or capability that is to provide support for mutual exclusion
should meet the following requirements:
1. Mutual exclusion must be enforced: Only one process at a time is
allowed into its critical section, among all processes that have critical
sections for the same resource or shared object.
2. A process that halts in its non-critical section must do so without
interfering with other processes.
3. It must not be possible for a process requiring access to a critical section
to be delayed indefinitely: no deadlock or starvation.
4. When no process is in a critical section, any process that requests entry
to its critical section must be permitted to enter without delay.
5. No assumptions are made about relative process speeds or number of
processors.
6. A process remains inside its critical section for a finite time only.
There are a number of ways in which the requirements for mutual exclusion
can be satisfied. One way is to leave the responsibility with the processes
that wish to execute concurrently. Thus processes, whether they are system
programs or application programs, would be required to coordinate with one
another to enforce mutual exclusion, with no support from the programming
language or the OS. We can refer to these as software approaches.
Although this approach is prone to high processing overhead and bugs, it is
nevertheless useful to examine such approaches to gain a better
understanding of the complexity of concurrent processing.
An algorithm for implementing mutual exclusion must satisfy the following
requirements:
1. Mutual Exclusion: Given a shared resource accessed by multiple
concurrent processes, at any time only one process should access the
resource; i.e., a process that has been granted the resource must release it before it can be granted to another process.
2. No Starvation: If every process that is granted the resource eventually
releases it, every request must be eventually granted.
In uni-processor systems, mutual exclusion is implemented using
semaphores, monitors, and similar constructs. The three basic approaches
used by different algorithms for implementing mutual exclusion in distributed
systems are described below:
1. Centralized Approach:
In this approach, one of the processes in the system is elected as the coordinator, which coordinates entry to the critical sections. Each process
that wants to enter a critical section must first seek permission from the
coordinator. If no other process is currently in that critical section, the
coordinator can immediately grant the permission to the requesting process.
If two or more processes concurrently ask for permission to enter the same
critical section, the coordinator grants permission to only one process at a
time in accordance with some scheduling algorithm.
After executing a critical section, when a process exits the critical section, it
must notify the coordinator so that the coordinator can grant permission to
another process (if any) that has also asked permission to enter the same
critical section.
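A minimal sketch of such a coordinator follows, assuming FIFO scheduling of deferred requests; the names are illustrative and the actual message passing is omitted.

from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = {}   # critical section -> current holder
        self.queue = {}    # critical section -> waiting processes

    def request(self, pid, cs):
        # True: permission granted immediately; False: reply deferred.
        if self.holder.get(cs) is None:
            self.holder[cs] = pid
            return True
        self.queue.setdefault(cs, deque()).append(pid)
        return False

    def release(self, pid, cs):
        # The exiting process notifies the coordinator, which grants
        # the section to the next waiting process, if any.
        waiting = self.queue.get(cs)
        self.holder[cs] = waiting.popleft() if waiting else None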
2. Distributed Approach:
In this approach, the decision making for mutual exclusion is distributed
across the entire system. i.e. all processes that want to enter the critical
section cooperate with each other before reaching a decision on which
process will enter the critical section next. The first such algorithm was
presented by Lamport [1978] based on his event-ordering scheme.
When a process wants to enter a critical section, it sends a request
message to all other processes. The message contains the following
information:
1. The process identifier of the process.
2. The name of the critical section that the process wants to enter.
3. A unique timestamp generated by the process for the request message.
On receiving a request message, a process either immediately sends back
a reply message to the sender or defers sending a reply based on the
following rules:
1. If the receiver process is itself currently executing in the critical section, it
simply queues the request message and defers sending a reply.
2. If the receiver process is currently not executing in the critical section but
is waiting for its turn to enter the critical section, it compares the
timestamp in the received request message with the timestamp in its
own request message that it has sent to other processes. If the
timestamp of the received request message is lower, it means that the
sender process made a request before the receiver process to enter the
critical section. Therefore, the receiver process immediately sends back
a reply message to the sender. On the other hand, if the receiver
process’s own request message has a lower timestamp, the receiver
queues the received request message and defers sending a reply
message.
3. If the receiver process is neither in the critical section nor is waiting for
its turn to enter the critical section, it immediately sends back a reply
message.
A process that sends out a request message keeps waiting for reply
messages from other processes. It enters the critical section as soon as it
has received reply messages from all processes. After it finishes executing
in the critical section, it sends reply messages to all processes in its queue
and deletes them from its queue.
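The reply-or-defer decision embodied in the three rules above can be sketched as a single function. In this sketch, timestamps are (logical time, process id) pairs, so Python's tuple comparison breaks timestamp ties by process identifier; that tie-breaking convention is an assumption of the sketch.

def on_request(state, own_request_ts, incoming_ts):
    # state is one of "in_cs", "waiting", or "idle"
    if state == "in_cs":
        return "defer"                 # rule 1: queue the request
    if state == "waiting":
        # rule 2: the request with the lower timestamp wins
        return "reply" if incoming_ts < own_request_ts else "defer"
    return "reply"                     # rule 3: reply immediately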
3. Token – Passing Approach:
In this method, mutual exclusion is achieved by using a single token that is
circulated among the processes in the system. A token is a special type of
message that entitles its holder to enter a critical section. For fairness, the
processes in the system are logically organized in a ring structure, and the
token is circulated from one process to another around the ring always in
the same direction (clockwise or anticlockwise).
The algorithm works as follows. When a process receives the token, it
checks if it wants to enter a critical section and acts as follows:
If it wants to enter a critical section, it keeps the token, enters the critical
section, and exits from the critical section after finishing its work in the
critical section. It then passes the token along the ring to its neighbor
process. Note that the process can enter only one critical section when it
receives the token. If it wants to enter another critical section, it must
wait until it gets the token again.
If it does not want to enter a critical section, it just passes the token
along the ring to its neighbor process. Therefore, if none of the
processes is interested in entering a critical section, the token simply
keeps circulating around the ring.
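A sketch of the per-node token handler, with the ring topology, the critical-section work, and the messaging supplied as hypothetical callables:

def on_token(pid, successor, wants_cs, run_critical_section, send_token):
    # Invoked whenever the token arrives at process pid.
    if wants_cs(pid):
        run_critical_section(pid)      # at most one section per visit
    send_token(successor(pid))         # pass the token along the ring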
5.7 Deadlock
There are several resources in a system for which the resource allocation policy must ensure exclusive access by a process. Since a system consists of a finite number of units of each resource type, multiple concurrent processes normally compete for the use of these resources, and this competition can lead to deadlock.
Principles of Deadlock
Deadlock can be defined as the permanent blocking of a set of processes
that either compete for system resources or communicate with each other. A
set of processes is deadlocked when each process in the set is blocked
awaiting an event (typically the freeing up of some requested resource) that
can only be triggered by another blocked process in the set. Deadlock is
permanent because none of the events is ever triggered. Unlike other
problems in concurrent process management, there is no efficient solution in
the general case. All deadlocks involve conflicting needs for resources by
two or more processes.
Let us now look at a depiction of deadlock involving processes and
computer resources. Figure 5.5 below, which we refer to as a joint
progress diagram, illustrates the progress of two processes competing for
two resources. Each process needs exclusive use of both resources for a
certain period of time.
Figure 5.5: Example of a Deadlock
Two processes, P and Q, have the following general form:
Process P Process Q
• • • • • •
Get A Get B
• • • • • •
Get B Get A
• • • • • •
Release A Release B
• • • • • •
Release B Release A
• • • • • •
In Figure 5.5, the x-axis represents progress in the execution of P and the
y-axis represents progress in the execution of Q. The joint progress of the
two processes is therefore represented by a path that progresses from the
origin in a northeasterly direction. For a uniprocessor system, only one
process at a time may execute, and the path consists of alternating
horizontal and vertical segments, with a horizontal segment representing a
period when P executes and Q waits and a vertical segment representing a
period when Q executes and P waits. Figure 5.5 indicates areas in
which both P and Q require resource A (upward slanted lines); both P and Q
require resource B (downward slanted lines); and both P and Q require both
resources.
Because we assume that each process requires exclusive control of any
resource, these are all forbidden regions; that is, it is impossible for any path
representing the joint execution progress of P and Q to enter these regions.
The figure shows six different execution paths. These can be summarized
as follows:
1. Q acquires B and then A and then releases B and A. When P resumes
execution, it will be able to acquire both resources.
2. Q acquires B and then A. P executes and blocks on a request for A. Q releases B and A. When P resumes execution, it will be able to acquire both resources.
3. Q acquires B and then P acquires A. Deadlock is inevitable, because as
execution proceeds, Q will block on A and P will block on B.
4. P acquires A and then Q acquires B. Deadlock is inevitable, because as
execution proceeds, Q will block on A and P will block on B.
5. P acquires A and then B. Q executes and blocks on a request for B. P releases A and B. When Q resumes execution, it will be able to acquire both resources.
6. P acquires A and then B and then releases A and B. When Q resumes execution, it will be able to acquire both resources.
The gray-shaded area of Figure 5.5, which can be referred to as a fatal
region, applies to the commentary on paths 3 and 4. If an execution path
enters this fatal region, then deadlock is inevitable. Note that the existence
of a fatal region depends on the logic of the two processes. However,
deadlock is only inevitable if the joint progress of the two processes creates
a path that enters the fatal region.
Whether or not deadlock occurs depends on both the dynamics of the
execution and on the details of the application. For example, suppose that
P does not need both resources at the same time so that the two processes
have the following form:
Process P Process Q
• • • • • •
Get A Get B
• • • • • •
Release A Get A
• • • • • •
Get B Release B
• • • • • •
Release B Release A
• • • • • •
This situation is reflected in Figure 5.6 below. Some thought should
convince you that regardless of the relative timing of the two processes,
deadlock cannot occur. As shown, the joint progress diagram can be used
to record the execution history of two processes that share resources. In
cases where more than two processes may compete for the same resource,
a higher-dimensional diagram would be required. The principles concerning
fatal regions and deadlock would remain the same.
Figure 5.6: Example of No Deadlock
Resource Allocation Graphs
A useful tool in characterizing the allocation of resources to processes is the
resource allocation graph, introduced by Holt [HOLT72]. The resource
allocation graph is a directed graph that depicts a state of the system of
resources and processes, with each process and each resource
represented by a node.
A graph edge directed from a process to a resource indicates a resource
that has been requested by the process but not yet granted (Figure 5.7a).
Within a resource node, a dot is shown for each instance of that resource.
Examples of resource types that may have multiple instances are
I/O devices that are allocated by a resource management module in the OS.
A graph edge directed from a reusable resource node dot to a process
indicates a request that has been granted (Figure 5.7b); that is, the process
has been assigned one unit of that resource. A graph edge directed from a
consumable resource node dot to a process indicates that the process is the
producer of that resource.
Figure 5.7(c) shows an example deadlock. There is only one unit each of resources Ra and Rb. Process P1 holds Rb and requests Ra, while P2 holds Ra but requests Rb. Figure 5.7(d) has the same topology as Figure 5.7(c), but there is no deadlock because multiple units of each resource are available.
The resource allocation graph of Figure 5.7 corresponds to a deadlock
situation. Note that in this case, we do not have a simple situation in which
two processes each have one resource the other needs. Rather, in this
case, there is a circular chain of processes and resources that results in
deadlock.
Table 5.1: Summary of Deadlock Detection, Prevention, and Avoidance
Approaches for Operating Systems
Figure 5.7: Examples of Resource Allocation Graphs
The Conditions for Deadlock
Three conditions of policy must be present for a deadlock to be possible:
1. Mutual exclusion: Only one process may use a resource at a time. No
process may access a resource unit that has been allocated to another
process.
2. Hold and wait: A process may hold allocated resources while awaiting
assignment of other resources.
3. No preemption: No resource can be forcibly removed from a process holding it.
In many ways these conditions are quite desirable. For example, mutual exclusion is needed to ensure consistency of results and the integrity of a database. Similarly, preemption should not be done arbitrarily. For example, when data resources are involved, preemption must be supported by a rollback recovery mechanism, which restores a process and its resources to a suitable previous state from which the process can eventually repeat its actions. The first three conditions are necessary but not sufficient for a deadlock to exist. For deadlock to actually take place, a fourth condition is required.
4. Circular wait: A closed chain of processes exists, such that each
process holds at least one resource needed by the next process in the
chain (e.g., Figure 5.7 (c)).
The fourth condition is, actually, a potential consequence of the first three. That is, given that the first three conditions exist, a sequence of events may occur that leads to an unresolvable circular wait. The unresolvable circular wait is in fact the definition of deadlock. The circular wait listed as condition 4 is unresolvable because the first three conditions hold. Thus, the four conditions, taken together, constitute necessary and sufficient conditions for deadlock. Recall that we defined a fatal region as one such that once the processes have progressed into that region, those processes will deadlock.
A fatal region exists only if all of the first three conditions listed above are
met. If one or more of these conditions are not met, there is no fatal region
and deadlock cannot occur. Thus, these are necessary conditions for
deadlock. For deadlock to occur, there must not only be a fatal region, but
also a sequence of resource requests that has led into the fatal region. If a
circular wait condition occurs, then in fact the fatal region has been entered.
Thus, all four conditions listed above, taken together, are sufficient for deadlock.
Three general approaches exist for dealing with deadlock. First, one can
prevent deadlock by adopting a policy that eliminates one of the conditions
(conditions 1 through 4). Second, one can avoid deadlock by making the
appropriate dynamic choices based on the current state of resource
allocation. Third, one can attempt to detect the presence of deadlock
(conditions 1 through 4 hold) and take action to recover.
Deadlock Prevention
The strategy of deadlock prevention is, simply put, to design a system in
such a way that the possibility of deadlock is excluded. We can view
deadlock prevention methods as falling into two classes. An indirect method
of deadlock prevention is to prevent the occurrence of one of the three
necessary conditions listed previously (items 1 through 3). A direct method
of deadlock prevention is to prevent the occurrence of a circular wait
(item 4). We now examine techniques related to each of the four conditions.
Mutual Exclusion
In general, the first of the four listed conditions cannot be disallowed. If
access to a resource requires mutual exclusion, then mutual exclusion must
be supported by the OS. Some resources, such as files, may allow multiple
accesses for reads but only exclusive access for writes. Even in this case,
deadlock can occur if more than one process requires write permission.
Hold and Wait
The hold-and-wait condition can be prevented by requiring that a process
request all of its required resources at one time and blocking the process
until all requests can be granted simultaneously. This approach is inefficient
in two ways.
First, a process may be held up for a long time waiting for all of its resource
requests to be filled, when in fact it could have proceeded with only some of
the resources.
Second, resources allocated to a process may remain unused for a
considerable period, during which time they are denied to other processes.
Another problem is that a process may not know in advance all of the
resources that it will require.
There is also the practical problem created by the use of modular
programming or a multithreaded structure for an application. An application
would need to be aware of all resources that will be requested at all levels or
in all modules to make the simultaneous request.
No Preemption
This condition can be prevented in several ways. First, if a process holding
certain resources is denied a further request, that process must release its
original resources and, if necessary, request them again together with the
additional resource.
Alternatively, if a process requests a resource that is currently held by
another process, the OS may preempt the second process and require it to
release its resources. This latter scheme would prevent deadlock only if no
two processes possessed the same priority. This approach is practical only
when applied to resources whose state can be easily saved and restored
later, as is the case with a processor.
Circular Wait
The circular-wait condition can be prevented by defining a linear ordering of
resource types. If a process has been allocated resources of type R, then it
may subsequently request only those resources of types following R in the
ordering. To see that this strategy works, let us associate an index with each
resource type. Then resource Ri precedes Rj in the ordering if i < j. Now
suppose that two processes, A and B, are deadlocked because A has
acquired Ri and requested Rj, and B has acquired Rj and requested Ri. This
condition is impossible because it implies i < j and j < i.
As with hold-and-wait prevention, circular-wait prevention may be inefficient,
slowing down processes and denying resource access unnecessarily.
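In code, this linear ordering amounts to always acquiring locks in increasing index order. The sketch below, using Python threading locks, is an illustrative example of the technique rather than a prescribed implementation.

import threading

lock_a = threading.Lock()   # resource index 1
lock_b = threading.Lock()   # resource index 2

def acquire_in_order(*indexed_locks):
    # Sort by the global resource index before acquiring, so a
    # circular wait (i < j and j < i) can never arise.
    for _, lock in sorted(indexed_locks, key=lambda pair: pair[0]):
        lock.acquire()

acquire_in_order((2, lock_b), (1, lock_a))   # still takes lock_a first
lock_b.release(); lock_a.release()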
Deadlock Avoidance
An approach to solving the deadlock problem that differs subtly from
deadlock prevention is deadlock avoidance. In deadlock prevention, we
constrain resource requests to prevent at least one of the four conditions of
deadlock. This is either done indirectly, by preventing one of the three
necessary policy conditions (mutual exclusion, hold and wait, no
preemption), or directly by preventing circular wait. This leads to inefficient
use of resources and inefficient execution of processes. Deadlock
avoidance, on the other hand, allows the three necessary conditions but
makes judicious choices to assure that the deadlock point is never reached.
As such, avoidance allows more concurrency than prevention. With
deadlock avoidance, a decision is made dynamically whether the current
resource allocation request will, if granted, potentially lead to a deadlock.
Deadlock avoidance thus requires knowledge of future process resource
requests.
In this section, we describe two approaches to deadlock avoidance:
Do not start a process if its demands might lead to deadlock.
Do not grant an incremental resource request to a process if this
allocation might lead to deadlock.
Process Initiation Denial
Consider a system of n processes and m different types of resources. Let us
define the following vectors and matrices:
Table 5.2: Vector and Matrix Representations
The matrix Claim gives the maximum requirement of each process for each
resource, with one row dedicated to each process. This information must be
declared in advance by a process for deadlock avoidance to work. Similarly,
the matrix Allocation gives the current allocation to each process. The
following relationships hold:
1. Rj = Vj + Σ(i=1 to n) Aij, for all j: all resources are either available or allocated, where Vj denotes the number of currently available units of resource j.
2. Cij ≤ Rj, for all i, j: no process can claim more than the total amount of resources in the system.
3. Aij ≤ Cij, for all i, j: no process is allocated more resources of any type than the process originally claimed to need.
With these quantities defined, we can define a deadlock avoidance policy
that refuses to start a new process if its resource requirements might lead to
deadlock. Start a new process Pn+1 only if
Rj ≥ C(n+1)j + Σ(i=1 to n) Cij, for all j
That is, a process is only started if the maximum claim of all current
processes plus those of the new process can be met. This strategy is hardly
optimal, because it assumes the worst: that all processes will make their
maximum claims together.
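The start-test is a single inequality per resource type; a minimal sketch, with the vectors and matrix represented as plain Python lists:

def can_start(resource, claim_rows, new_claim):
    # resource[j]  : Rj, total units of resource type j
    # claim_rows   : the Claim rows Cij of the n running processes
    # new_claim[j] : C(n+1)j, maximum claim of the candidate process
    return all(
        resource[j] >= new_claim[j] + sum(row[j] for row in claim_rows)
        for j in range(len(resource))
    )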
Resource Allocation Denial
The strategy of resource allocation denial, referred to as the banker’s
algorithm, was first proposed in [DIJK65]. Let us begin by defining the
concepts of state and safe state. Consider a system with a fixed number of
processes and a fixed number of resources. At any time a process may
have zero or more resources allocated to it. The state of the system reflects
the current allocation of resources to processes. Thus, the state consists of
the two vectors, Resource and Available, and the two matrices, Claim and
Allocation, defined earlier. A safe state is one in which there is at least one
sequence of resource allocations to processes that does not result in a
deadlock (i.e., all of the processes can be run to completion). An unsafe
state is, of course, a state that is not safe.
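A safe state can be recognized by repeatedly finding some process whose remaining need (Claim minus Allocation) fits within the currently available resources and assuming that it runs to completion and releases everything it holds. A minimal sketch of this check, with matrices as lists of lists:

def is_safe(available, claim, allocation):
    n = len(claim)
    avail = list(available)
    finished = [False] * n
    progressed = True
    while progressed:
        progressed = False
        for i in range(n):
            need = [c - a for c, a in zip(claim[i], allocation[i])]
            if not finished[i] and all(x <= y for x, y in zip(need, avail)):
                # process i can complete; it releases its allocation
                avail = [y + a for y, a in zip(avail, allocation[i])]
                finished[i] = True
                progressed = True
    return all(finished)   # safe iff every process can run to completion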
Deadlock Detection
Deadlock prevention strategies are very conservative; they solve the
problem of deadlock by limiting access to resources and by imposing
restrictions on processes. At the opposite extreme, deadlock detection
strategies do not limit resource access or restrict process actions. With
deadlock detection, requested resources are granted to processes
whenever possible. Periodically, the OS performs an algorithm that allows it
to detect the circular wait condition.
5.8 Election Algorithms
Several distributed algorithms require that there be a coordinator process in
the entire system that performs some type of coordination activity needed
for the smooth running of other processes in the system. Two examples of
such coordinator processes encountered in this unit are the coordinator in
the centralized algorithm for mutual exclusion and the central coordinator in the centralized deadlock-detection algorithm. Since all other processes in the system
have to interact with the coordinator, they all must unanimously agree on
who the coordinator is. Furthermore, if the coordinator process fails due to
the failure of the site on which it is located, a new coordinator process must
be elected to take up the job of the failed coordinator. Election algorithms
are meant for electing a coordinator process from among the currently
running processes in such a manner that at any instance of time there is a
single coordinator for all processes in the system.
Election algorithms are based on the following assumptions:
1. Each process in the system has a unique priority number.
2. Whenever an election is held, the process having the highest priority
number among the currently active processes is elected as the
coordinator.
3. On recovery, a failed process can take appropriate actions to rejoin the set of active processes.
Therefore, whenever initiated, an election algorithm basically finds out which
of the currently active processes has the highest priority number and then
informs this to all the active processes.
(i) The Bully Algorithm
This algorithm was proposed by Garcia-Molina. In this algorithm it is
assumed that every process knows the priority number of every other
process in the system. The algorithm works as follows:
When a process (say Pi) sends a request message to the coordinator and
does not receive a reply within a fixed timeout period, it assumes that the
coordinator has failed. It then initiates an election by sending an election
message to every process with a higher priority number than itself. If Pi does
not receive any response to its election message within a fixed timeout
period, it assumes that among the currently active processes it has the
highest priority number. Therefore it takes up the job of the coordinator and
sends a message (call it the coordinator message) to all processes having
lower priority numbers than itself, informing that from now on it is the new
coordinator. On the other hand, if Pi receives a response for its election
message, this means that some other process having higher priority number
is alive, Therefore Pi does not take any further action and just waits to
receive the final result (a coordinator message from the new coordinator) of
the election it initiated.
When a process (say Pj) receives an election message, it sends a response message to the sender informing it that it is alive and will take over the election activity. Pj then holds an election if it is not already holding one. In this way, the election activity gradually moves to the process that has the highest priority number among the currently active processes, which eventually wins the election and becomes the new coordinator.
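The initiation step of the bully algorithm can be sketched as follows; the messaging primitives are hypothetical hooks standing in for a real transport layer.

def start_election(pid, all_pids, send_election, any_response, announce):
    # send_election(p): send an election message to process p
    # any_response():   True if some higher-priority process replied
    #                   within the timeout period
    # announce(p):      send the coordinator message to all lower
    #                   priority processes
    higher = [p for p in all_pids if p > pid]
    for p in higher:
        send_election(p)
    if not higher or not any_response():
        announce(pid)   # no higher process is alive: become coordinator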
(ii) A Ring Algorithm
This algorithm assumes that all the processes in the system are organized
in a logical ring. The ring is unidirectional in the sense that all the messages
related to the election algorithm are always passed only in one direction
(clockwise / anticlockwise). Every process in the system knows the structure
of the ring, so that while trying to circulate a message over the ring, if the
successor of the sender process is down, the sender can skip over the
successor, or the one after that, until an active member is located. The
algorithm works as follows:
When a process (say Pi) sends a request message to the current
coordinator and does not receive a reply within a fixed timeout period, it
assumes that the coordinator has crashed. Therefore it initiates an election
by sending an election message to its successor (actually to the first
successor that is currently active). This message contains the priority
number of process Pi. On receiving the election message, the successor
appends its own priority number to the message and passes it on to the next
active member in the ring. This member appends its own priority number to
the message and forwards it to its own successor. In this manner, the
election message circulates over the ring from one active process to another
and eventually returns to process Pi. Process Pi recognizes the
message as its own election message by seeing that in the list of priority
numbers held within the message the first priority number is its own priority
number.
Note that when process Pi receives its own election message, the message
contains the list of priority numbers of all processes that are currently active.
Therefore, of the processes in this list, it elects the process having the
highest priority number as the new coordinator. It then circulates a
coordinator message over the ring to inform all the other active processes
who the new coordinator is. When the coordinator message comes back to
process Pi after completing its one round along the ring, it is removed by
process Pi. At this point all the active processes know who the current
coordinator is.
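The handling of an election message at each active process can be sketched as follows, with the forwarding and announcement hooks left hypothetical:

def on_election_message(pid, priorities, forward, announce):
    # priorities: priority numbers accumulated so far around the ring
    if priorities and priorities[0] == pid:
        # the message has returned to its initiator: elect the highest
        # priority number and circulate the coordinator message
        announce(max(priorities))
    else:
        forward(priorities + [pid])   # append own number, pass it on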
When a process (say Pj) recovers after a failure, it creates an inquiry message and sends it to its successor. The message contains the identity of process Pj. If the successor is not the current coordinator, it simply forwards the inquiry message to its own successor. In this way, the inquiry message
moves forward along the ring until it reaches the current coordinator. On
receiving an inquiry message, the current coordinator sends a reply to
process Pj informing that it is the current coordinator.
Notice that in this algorithm two or more processes may almost
simultaneously discover that the coordinator has crashed and then each one
may circulate an election message over the ring. Although this results in a
little waste of network bandwidth, it does not cause any problem because
every process that initiated an election will receive the same list of active
processes, and all of them will choose the same process as the new
coordinator.
5.9 Terminal Questions
1. Discuss clock synchronization issues in a distributed system.
2. Discuss the following synchronization issues in a distributed system:
Event Ordering
Mutual Exclusion
Deadlocks
3. Discuss the election algorithms in a distributed system.
Unit 6 Resource Management
Structure:
6.1 Introduction
Objectives
6.2 Desirable Features of a Good Global Scheduling Algorithm
6.3 Task assignment Approach
6.4 Load – Balancing Approach
6.5 Load – Sharing Approach
6.6 Terminal Questions
6.1 Introduction
Every distributed system consists of a number of resources interconnected
by a network. Besides providing communication facilities, a network
facilitates resource sharing by migrating a local process and executing it at a
remote node of the network. A process may be migrated because the local
node does not have the required resources or the local node has to be shut
down. A process may also be executed remotely if the expected turnaround
time will be better. From a user’s point of view the set of available resources
in a distributed system acts like a single virtual system.
A resource can be logical, such as a shared file, or physical, such as a CPU.
For this unit, we consider a resource to be a processor of the system and
assume that each processor forms a node of the distributed system.
Figure 6.1: A Distributed System Connected by a Local Area Network
A resource manager schedules the processes in a distributed system to
make use of the system resources in such a manner that resource usage,
response time, network congestion, and scheduling overhead are optimized.
The following are different approaches for Process Scheduling:
1. Task Assignment Approach: Each process is viewed as a collection of
tasks. These tasks are scheduled to suitable processors to improve
performance. This is not a widely used approach because
It requires characteristics of all the processes to be known in
advance.
This approach does not take into consideration the dynamically
changing state of the system.
2. Load Balancing Approach: Processes are distributed among nodes to
equalize the load among all nodes.
3. Load-Sharing Approach: No node is allowed to be idle while processes
are waiting to be served at other nodes. This requires knowledge of the
load at other nodes in the system.
Objectives:
This unit discusses the management of various resources present at
different locations on a distributed network. For effective utilization of
resources, there should be proper management of these resources which
could be done through scheduling. The various scheduling algorithms for
resource management are discussed here. The topics of Task Assignment,
Load Balancing, and Load Sharing are discussed in detail.
6.2 Desirable Features of a Good Global Scheduling Algorithm
i) No a priori knowledge about the processes: A good process-
scheduling algorithm should operate with no a priori knowledge
about the processes.
ii) Dynamic in Nature: A good process-scheduling algorithm should be
able to handle the dynamically changing load at the various nodes.
Process assignment decisions should be based on the current load of
the system and not on some fixed static policy.
iii) Quick Decision Making: A good process scheduling algorithm must
be capable of taking quick decisions regarding node assignment for
processes.
iv) Scheduling overhead: The general observation is that as overhead is
increased in an attempt to obtain more information regarding the
global state of the system, the usefulness of the information is
decreased due to both the aging of the information gathered and the
low scheduling frequency as a result of the cost of gathering and
processing that information. Hence algorithms that provide near
optimal system performance with a minimum of global state
information gathering overhead are desirable.
v) Stability: The algorithm should be stable: i.e., the system should not
enter a state in which nodes spend all their time migrating processes
or exchanging control messages without doing any useful work.
vi) Scalable: The algorithm should be scalable, i.e., the system should be
able to handle both small and large networked systems. A simple
approach to make an algorithm scalable is to probe only m of
N nodes for selecting a host. The value of m can be dynamically
adjusted depending on the value of N.
vii) Fault Tolerance: The algorithm should not be affected by the crash
of one or more nodes in the system. At any instant of time, it should
continue functioning for the nodes that are up at that time. Algorithms
with decentralized decision-making capability that consider only the
available nodes in their decisions have better fault tolerance.
viii) Fairness of service: How fairly service is allocated is a common
concern. For example, two users simultaneously initiating equivalent
processes should receive roughly the same quality of service. What is
desirable is a fair strategy that improves the response time of users at
heavily loaded nodes without unduly affecting users at lightly loaded
ones. For this, the concept of load balancing has to be replaced by
load sharing, i.e., a node will share some of its resources as long as
its own users are not significantly affected.
6.3 Task Assignment Approach
In this approach, a process is considered to be composed of multiple tasks
and the goal is to find an optimal assignment policy for the tasks of an
individual process. The following are typical assumptions for the task
assignment approach:
A process is already split into pieces, called tasks
The amount of computation required for each task and the speed of the
processors are known
Cost of processing each task at every node is known
The cost of interprocess communication between every pair of tasks is
known
The resource requirements of each task are known
Reassignment of tasks is generally not possible
Some of the goals of a good task assignment algorithm are:
Minimize IPC cost (this problem can be modeled using a network flow
model)
Efficient resource utilization
Quick turnaround time
A high degree of parallelism
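As a purely hypothetical illustration (the task names and all cost figures are invented), the brute-force sketch below assigns the tasks of one process to two nodes so as to minimize total execution cost plus IPC cost; IPC cost is charged only when two communicating tasks are placed on different nodes. A real system would use the network flow model mentioned above rather than enumeration.

```python
from itertools import product

tasks = ["t1", "t2", "t3", "t4"]
exec_cost = {  # exec_cost[task] = (cost on node 0, cost on node 1)
    "t1": (5, 10), "t2": (2, 9), "t3": (4, 4), "t4": (6, 3),
}
ipc_cost = {("t1", "t2"): 12, ("t1", "t3"): 3, ("t2", "t4"): 2, ("t3", "t4"): 11}

def total_cost(assign):  # assign: dict mapping task -> node (0 or 1)
    cost = sum(exec_cost[t][assign[t]] for t in tasks)
    # IPC cost is incurred only when a communicating pair is split
    cost += sum(c for (a, b), c in ipc_cost.items() if assign[a] != assign[b])
    return cost

best = min((dict(zip(tasks, nodes))
            for nodes in product((0, 1), repeat=len(tasks))),
           key=total_cost)
print(best, total_cost(best))
```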
Why do we need Load Balancing or Load Sharing?
Consider a set of N identical servers (i.e., servers with the same task
arrival rate and the same service rate).
Let \rho = the utilization of each server
Let P_0 = 1 - \rho, the probability that a server is idle
Let P = the probability that at least one task is waiting for service while at
least one server is idle
Then
P = \sum_{i=1}^{N-1} \binom{N}{i} Q_i H_{N-i}
where
Q_i = the probability that a given set of i servers is idle
    = P_0^i, from the independence of the servers
H_{N-i} = the probability that a given set of N - i servers is not idle and a
task is waiting for service at one or more of them
= {probability that all N - i servers have at least one task} - {probability
that all N - i servers have exactly one task}
= (1 - P_0)^{N-i} - [(1 - P_0) P_0]^{N-i}
Expanding the sum using the binomial theorem, (a + b)^N =
\sum_{i=0}^{N} \binom{N}{i} a^i b^{N-i}, and substituting P_0 = 1 - \rho
yields the closed form
P = 1 - \rho^N (1 - (1 - \rho)^N) - (1 - \rho^2)^N
For moderate server utilization, P is close to 1: it is very likely that some
server is idle while tasks wait for service elsewhere, which is precisely the
situation that load balancing or load sharing exploits.
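Evaluating the closed form numerically (a quick sketch; the utilization values and system sizes are arbitrary) shows the point of the exercise: over a wide range of moderate utilizations, P stays close to 1, i.e., it is very likely that some server sits idle while tasks queue elsewhere.

```python
def p_wasted(rho, n):
    """P from the closed form above: the probability that at least one
    task waits for service while at least one of the n servers is idle."""
    return 1 - rho**n * (1 - (1 - rho)**n) - (1 - rho**2)**n

for rho in (0.2, 0.5, 0.8):
    print(rho, [round(p_wasted(rho, n), 3) for n in (5, 10, 20)])
```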
6.4 Load-Balancing Approach
The scheduling algorithms that use this approach are known as Load
Balancing or Load-Leveling Algorithms. These algorithms are based on the
intuition that for better resource utilization, it is desirable for the load in a
distributed system to be balanced evenly. Thus a load balancing algorithm
tries to balance the total system load by transparently transferring the
workload from heavily loaded nodes to lightly loaded nodes in an attempt to
ensure good overall performance relative to some specific metric of system
performance.
We can have the following categories of load balancing algorithms:
1. Static: Ignore the current state of the system, e.g., if a node is heavily
loaded, it picks a task at random and transfers it to a random node.
These algorithms are simpler to implement, but their performance may
not be good.
2. Dynamic: Use the current state information for load balancing. Although
there is overhead involved in collecting state information periodically,
these algorithms generally perform better than static ones.
3. Deterministic: Algorithms in this class use the processor and process
characteristics to allocate processes to nodes.
4. Probabilistic: Algorithms in this class use information regarding static
attributes of the system such as number of nodes, processing capability,
etc.
5. Centralized: System state information is collected by a single node.
This node makes all scheduling decisions.
6. Distributed: Most desired approach. Each node is equally responsible
for making scheduling decisions based on the local state and the state
information received from other sites.
7. Cooperative: A distributed dynamic scheduling algorithm. In these
algorithms, the distributed entities cooperate with each other to make
scheduling decisions. Therefore they are more complex and involve
larger overhead than non-cooperative ones. But the stability of a
cooperative algorithm is better than that of a non-cooperative one.
8. Non-cooperative: A distributed dynamic scheduling algorithm. In these
algorithms, individual entities act as autonomous entities and make
scheduling decisions independently of the action of other entities.
Load Estimation Policy: This policy makes an effort to measure the load at
a particular node in a distributed system according to the following criteria:
The number of processes running at a node as a measure of the load at
the node.
The CPU utilization as a measure of load
None of the above fully captures the load at a node; other parameters, such
as the resource demands of the processes, the architecture and speed of
the processor, and the total remaining execution time of the processes,
should be taken into consideration as well.
Process Transfer Policy: The strategy of load balancing algorithms is
based on the idea of transferring some processes from the heavily loaded
nodes to lightly loaded nodes. To facilitate this, it is necessary to devise a
policy to decide whether or not a node is lightly or heavily loaded. The
threshold value of a node is the limiting value of its workload and is used to
decide whether a node is lightly or heavily loaded.
The threshold value of a node may be determined by any of the following
methods:
1. Static Policy: Each node has a predefined threshold value. If the
number of processes exceeds this threshold, a process is transferred.
This can cause process thrashing under heavy load, leading to
instability.
2. Dynamic Policy: In this method, the threshold value is calculated
dynamically: it is increased under heavy load and decreased under light
load. Thus process thrashing does not occur.
3. High-Low Policy: Each node has two threshold values, high and low.
The state of a node is overloaded (number of processes greater than
high), under-loaded (number of processes less than low), or normal
(otherwise).
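A small sketch of the high-low policy (the threshold values are invented): the node's state determines whether it is a candidate to send or to receive processes.

```python
HIGH, LOW = 8, 3  # hypothetical thresholds on the number of processes

def node_state(num_processes):
    """Classify a node for the high-low process transfer policy."""
    if num_processes > HIGH:
        return "overloaded"    # candidate to transfer a process away
    if num_processes < LOW:
        return "under-loaded"  # candidate to accept a remote process
    return "normal"

print([node_state(n) for n in (1, 5, 12)])
# -> ['under-loaded', 'normal', 'overloaded']
```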
Location Policies:
Once a decision has been made through the transfer policy to transfer a
process from a node, the next step is to select the destination node for that
process’ execution. This selection is made by the location policy of a
scheduling algorithm. The main location policies proposed are as follows:
1. Threshold: A random node is polled to check its state and the task is
transferred if it will not be overloaded; polling is continued until a suitable
node is found or a threshold number of nodes have been polled.
Experiments show that polling 3 to 5 nodes performs as well as polling
a large number of nodes (say, 20), and even this limited polling gives a
substantial performance improvement over no load balancing at all.
2. Shortest: A predetermined number of nodes are polled and the node
with minimum load among these is picked for the task transfer; if that
node is overloaded the task is executed locally.
3. Bidding: In this method, each node acts both as a manager (one that
tries to transfer a task) and as a contractor (one that is able to accept a
new task). The manager broadcasts a request-for-bids message to all
the nodes. A contractor returns a bid (a quoted price based on its
processor capability, memory size, resource availability, etc.). The
manager chooses the best bidder and transfers the task to it. Problems
that could arise from concurrent broadcasts by two or more managers
need to be addressed.
4. Pairing: This approach tries to reduce the variance in load between
pairs of nodes. In this approach, two nodes that differ greatly in load are
paired with each other so they can exchange tasks. Each node asks a
randomly picked node if it will pair with it. After a pairing is formed, one
or more processes are transferred from the heavily loaded node to the
lightly loaded node.
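The threshold and shortest location policies reduce to a few lines of logic. In the sketch below, the node loads, the threshold, and the poll limit of 5 are all invented; the point is only the shape of the polling loop.

```python
import random

loads = {n: random.randint(0, 10) for n in range(20)}  # hypothetical loads
THRESHOLD, POLL_LIMIT = 8, 5

def threshold_policy(me):
    """Poll random nodes; transfer to the first that is not overloaded."""
    for _ in range(POLL_LIMIT):
        n = random.choice([x for x in loads if x != me])
        if loads[n] < THRESHOLD:      # destination would not be overloaded
            return n                  # transfer the task here
    return me                         # give up and execute locally

def shortest_policy(me):
    """Poll a fixed set of random nodes; pick the least loaded of them."""
    probed = random.sample([x for x in loads if x != me], POLL_LIMIT)
    best = min(probed, key=loads.get)
    return best if loads[best] < THRESHOLD else me

print(threshold_policy(0), shortest_policy(0))
```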
State Information Exchange Policies:
The dynamic policies require frequent exchange of state information among
the nodes of the system. In fact, a dynamic load-balancing algorithm faces a
transmission dilemma because of the two opposing impacts the
transmission of a message has on the overall performance of the system.
On one hand, transmission improves the ability of the algorithm to balance
the load. On the other hand, it raises the expected queuing time of
messages because of the increase in the utilization of the communication
channel. Thus proper selection of the state information exchange policy is
essential. The proposed load balancing algorithms use one of the following
policies for the purpose:
1. Periodic Broadcast: Each node broadcasts its state information
periodically, say every t time units. This does not scale well, causes
heavy network traffic, and may result in fruitless messages.
2. Broadcast When State Changes: This avoids fruitless messages. A
node broadcasts its state only when its state changes. For example,
when the state changes from normal to low or normal to high, etc.
3. On-Demand Exchange: Under this approach
A node broadcasts a state information request when its state
changes from normal load region to high or low load.
Upon receiving this request, other nodes send their current state
information to the requesting node.
If the requesting node includes its state information in the request,
then only those nodes that can cooperate with the requesting node
need to send a reply.
4. Exchange by Polling: In this approach the state information is
exchanged with a polled node only. Polling stops after a predetermined
number of polls or after a suitable partner is found, whichever happens
first.
Priority Assignment Policies: One of the following priority assignment
rules may be used to assign priorities to local and remote processes
(i.e., processes that have migrated from other nodes):
i) Selfish: Local processes are given higher priority than remote
processes.
Studies show this approach yields the worst response time of the
three policies.
This approach penalizes processes that arrive at a busy node
because they will be transferred and hence will execute as low
priority processes. It favors the processes that arrive at lightly
loaded nodes.
ii) Altruistic: Remote processes are given higher priority than local
processes.
Studies show this approach yields the best response time of the
three policies.
Under this approach, remote processes incur lower delays than
local processes.
iii) Intermediate: If local processes outnumber remote processes, local
processes get higher priority; otherwise, remote processes get
higher priority.
Studies show that the overall response time performance under
this policy is much closer to that of the altruistic policy.
Under this policy, local processes are treated better than the
remote processes for a wide range of loads.
Migration-Limiting Policies: This policy is used to decide the total
number of times a process should be allowed to migrate.
Uncontrolled: A remote process is treated like a local process, so
there is no limit on the number of times it can migrate.
Controlled: Most systems use a controlled policy to overcome the
instability problem.
Migrating a partially executed process is expensive, so many
systems limit the number of migrations to 1. For long-running
processes, however, it might be beneficial to migrate more than once.
6.5 Load Sharing Approach
Several researchers believe that load balancing, with its implication of
attempting to equalize workload on all the nodes of the system, is not an
appropriate objective. This is because the overhead involved in gathering
the state information to achieve this objective is normally very large,
especially in distributed systems having a large number of nodes. In fact, for
the proper utilization of resources of a distributed system, it is not required
to balance the load on all the nodes. It is necessary and sufficient to prevent
the nodes from being idle while some other nodes have more than two
processes. This rectified objective is called dynamic load sharing, as
opposed to dynamic load balancing.
Issues in Load-Sharing Algorithms:
The design of a load sharing algorithm requires that proper decisions be
made regarding load estimation policy, process transfer policy, state
information exchange policy, priority assignment policy, and migration
limiting policy. It is simpler to decide about most of these policies in case of
load sharing, because load sharing algorithms do not attempt to balance the
average workload of all the nodes of the system. Rather, they only attempt
to ensure that no node is idle while some other node is heavily loaded. The
priority assignment policies and the migration-limiting policies for load-
sharing algorithms are the same as those of load-balancing algorithms.
Load Estimation Policies: An attempt is made to ensure that no node is
idle while processes wait for service at some other node. In general, the
following two approaches are used for estimation:
Use number of processes at a node as a measure of load
Use the CPU utilization as a measure of load
Process Transfer Policies: Load sharing algorithms are interested in busy
or idle states only and most of them employ the all-or-nothing strategy given
below:
All or Nothing Strategy: It uses a single threshold policy. A node becomes
a candidate to accept tasks from remote nodes only when it becomes idle. A
node becomes a candidate for transferring a task as soon as it has more
than one task. Under this approach, a node that is about to become idle is
unable to acquire a new task in advance, thus wasting processing power. To
avoid this, the threshold value can be set to 2 instead of 1.
Location Policies: Location Policy decides the sender node or the receiver
node of a process that is to be moved within the system for load sharing.
Depending on the type of node that takes the initiative to globally search for
a suitable node for the process, the location policies are of the following
types:
1. Sender-Initiated Location Policy: Under this policy, heavily loaded
nodes search for lightly loaded nodes to which tasks may be transferred.
The search can be done by sending a broadcast message or by probing
randomly picked nodes.
An advantage of this approach is that the sender can transfer freshly
arrived tasks, so no preemptive task transfers occur.
A disadvantage is that it can cause system instability under high
system load.
2. Receiver-Initiated Location Policy: Under this policy, lightly loaded
nodes search for heavily loaded nodes from which tasks may be
transferred.
The search for a sender can be done by sending a broadcast
message or by probing randomly picked nodes.
A disadvantage of this approach is that it may result in preemptive
task transfers, because the sender may not have any freshly arrived
tasks when it is found.
An advantage is that it does not cause system instability: under high
system load a receiver quickly finds a sender, and under low system
load it is acceptable for nodes to process some additional control
messages.
3. Symmetrically Initiated Location Policy: Under this approach, both
senders and receivers search for receivers and senders respectively.
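A sketch of receiver-initiated probing under the all-or-nothing transfer policy described earlier (the node loads and the poll limit are invented): an under-loaded node polls randomly picked nodes looking for one with more than one task to hand over.

```python
import random

loads = {n: random.randint(0, 5) for n in range(10)}  # hypothetical loads
loads[0] = 0                                          # node 0 is idle
POLL_LIMIT = 5

def receiver_initiated_probe(me):
    """The idle receiver polls randomly picked nodes for a sender."""
    for n in random.sample([x for x in loads if x != me], POLL_LIMIT):
        if loads[n] > 1:        # all-or-nothing: >1 task makes it a sender
            loads[n] -= 1       # the sender hands over one task
            loads[me] += 1
            return n
    return None                 # no sender found within the poll limit

print(receiver_initiated_probe(0))
```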
State Information Exchange Policies: Since it is not necessary to
equalize the load at all nodes under load sharing, state information is
exchanged only when the state changes. The two commonly used policies
are:
1. Broadcast When State Changes: A node broadcasts a state
information request message when it becomes under-loaded or
overloaded.
In the sender-initiated approach, a node broadcasts this message
only when it is overloaded.
In the receiver-initiated approach, a node broadcasts this message
only when it is under-loaded.
2. Poll When State Changes: When a node's state changes,
It randomly polls other nodes one by one and exchanges state
information with the polled nodes.
Polling stops when a suitable node is found or a threshold number of
nodes have been polled.
Under the sender-initiated policy, the sender polls to find a suitable
receiver; under the receiver-initiated policy, the receiver polls to find
a suitable sender.
The Above-Average Algorithm by Krueger and Finkel (a dynamic load-
balancing algorithm) tries to maintain the load at each node within an
acceptable range of the system average. Its transfer policy uses two
adaptive thresholds, an upper threshold and a lower threshold:
A node with load lower than the lower threshold is considered a
receiver.
A node with load higher than the upper threshold is considered a
sender.
A node's estimated average load is supposed to lie in the middle of
the lower and upper thresholds.
6.6 Terminal Questions
1. Discuss the desirable features of a good global scheduling algorithm.
2. Discuss the Task Assignment approach.
3. Discuss the Load Sharing approach.
Unit 7 Process Management
Structure:
7.1 Introduction
Objectives
7.2 Process Migration
7.3 Threads
7.4 Terminal Questions
7.1 Introduction
The notion of a process is central to the understanding of operating
systems. There are quite a few definitions presented in the literature, but no
"perfect" definition has yet appeared.
Definition
The term "process" was first used by the designers of MULTICS in the
1960s. Since then, the term has been used somewhat interchangeably
with 'task' or 'job'. The process has been given many definitions, for
instance:
A program in Execution.
An asynchronous activity.
The 'animated spirit' of a procedure in execution.
The entity to which processors are assigned.
The 'dispatchable' unit.
and many more definitions have been given. As we can see from the above,
there is no universally agreed-upon definition, but "a program in execution"
seems to be the most frequently used one, and it is the concept adopted in
the present study of operating systems.
Now that we have agreed upon the definition of a process, the question is:
what is the relation between a process and a program? Is it the same beast
with a different name, i.e., when this beast is sleeping (not executing) it is
called a program, and when it is executing it becomes a process? To be
precise, a process is not the same as a program; a process is more than the
program code. A process is an 'active' entity, as opposed to a program,
which is considered a 'passive' entity. As we know, a program is an
algorithm expressed in some suitable notation (e.g., a programming
language). Being passive, a program is only a part of a process. A process,
on the other hand, includes:
The current value of the Program Counter (PC)
The contents of the processor's registers
The values of the variables
The process stack, which typically contains temporary data such as
subroutine parameters, return addresses, and temporary variables.
A data section that contains global variables.
A process is the unit of work in a system.
In the process model, all software on the computer is organized into a
number of sequential processes. A process includes its PC, registers, and
variables. Conceptually, each process has its own virtual CPU. In reality,
the CPU switches back and forth among processes (this rapid switching
back and forth is called multiprogramming).
Process Management
In a conventional (or centralized) operating system, process management
deals with mechanisms and policies for sharing the processor of the system
among all processes. In a Distributed Operating system, the main goal of
process management is to make the best possible use of the processing
resources of the entire system by sharing them among all the processes.
Three important concepts are used in distributed operating systems to
achieve this goal:
1. Processor Allocation: It deals with the process of deciding which
process should be assigned to which processor.
2. Process Migration: It deals with the movement of a process from its
current location to the processor to which it has been assigned.
3. Threads: They deal with fine-grained parallelism for better utilization of
the processing capability of the system.
This unit describes the concepts of process migration and threads.
Issues in Process Management
Transparent relocation of processes
– Preemptive process migration – costly
– Non-preemptive process migration
Selecting the source and destination nodes for migration
Cost of migration – size of the address space and time taken to migrate
Address space transfer mechanisms – total freezing, pre-transferring,
transfer on reference
Message forwarding for migrated processes
– Resending the message
– The origin site mechanism
– Link traversal mechanism
– Link update mechanism
Process migration in heterogeneous systems
Objectives:
This unit introduces the reader to the management of processes in a
distributed network. It discusses the differences between processes
running on a uni-processor system and those running on a distributed
system. It describes process migration mechanisms, in which
the processes may be shifted or migrated to different machines on the
network depending on the availability of resources to complete the process
execution. It also discusses the concept of threads, their mechanisms, and
differences between a thread and a process on uni-processor system and a
distributed system.
7.2 Process Migration
Definition:
The relocation of a process from its current location (the source system) to
some other location (Destination).
A process may be migrated either before it starts executing on its source
node or during the course of its execution. The former is known as
non-preemptive process migration, and the latter as pre-emptive process
migration.
Process migration involves the following steps:
1. Selection of a process to be migrated
2. Selection of destination system or node
3. Actual transfer of the selected process to the destination system or node
The following are the desirable features of a good process migration
mechanism:
A good process migration mechanism must possess transparency, minimal
interference, minimal residual dependencies, efficiency, and robustness.
i) Transparency: Levels of transparency:
Access to objects such as files and devices should be done in a
location-independent manner. To accomplish this, the system should
provide a mechanism for transparent object naming.
System calls should be location-independent. However, system
calls related to the physical properties of a node need not be
location-independent.
Interprocess communication should be transparent. Messages
sent to a migrated process should be delivered to it transparently,
i.e., the sender does not have to resend them.
ii) Minimal Interference: Migration of a process should involve minimal
interference to the progress of the process and to the system as a
whole. For example, the freezing time should be minimized; this can be
achieved by partial transfer of the address space.
iii) Minimal residual dependencies: A migrated process should not
continue to depend in any way on its previous node: such dependency
diminishes the benefits of migration, and a failure of the previous node
would cause the process to fail.
iv) Efficiency: Time required for migrating a process and cost of supporting
remote execution should be minimized.
v) Robustness: Failure of any node other than the one on which the
process is running should not affect the execution of the process.
Process Migration Mechanism
Migration of a process is a complex activity that involves proper handling of
several sub-activities in order to meet the requirements of a good process
migration mechanism. The four major subactivities involved in process
migration are as follows:
1. Freezing the process and restarting it on another node.
2. Transferring the process’ address space from its source node to its
destination node
3. Forwarding messages meant for the migrant process
4. Handling communication between cooperating processes that have
been separated as a result of process migration.
The commonly used mechanisms for handling each of these subactivities
are described below:
1. Mechanisms for freezing the process:
In pre-emptive process migration, the usual approach is to take a
"snapshot" of the process's state on its source node and reinstate the snapshot on the
destination node. For this, at some point during migration, the process is
frozen on its source node, its state information is transferred to its
destination node, and the process is restarted on its destination node using
this state information. By freezing this process, we mean that the execution
of the process is suspended and all external interactions with the process
are deferred.
Some general issues involved in these operations are described below:
i) Immediate and delayed blocking: When can these two approaches be
used?
If the process is not executing a system call, it can be blocked
immediately.
If a process is executing a system call, it may or may not be
possible to block it immediately, depending on the situation and
implementation.
ii) Fast and slow I/O operations: It is feasible to wait for fast I/O
operations (e.g., disk I/O) to complete after blocking the process.
However, it is not feasible to wait for slow I/O operations, such as
those on a terminal, so proper mechanisms are needed for these I/O
operations to continue after migration.
iii) Information about open files: The names of files, file descriptors,
current modes, current positions of their file pointers, etc. need to be
preserved and transferred. Also, temporary files are more efficiently
created on the node on which the process is currently executing.
iv) Reinstating the process on the destination node: This involves
creating an empty process on the destination node, copying the state
of the transferred process into it, and then unfreezing it.
v) Address transfer mechanisms: Migration of a process involves the
transfer of the process state (which includes the contents of registers,
memory tables, I/O states, process identifiers, etc.) and of the
process's address space (i.e., code, data, and the program stack).
There are three ways to transfer the address space:
a) Total freezing: Process execution is stopped while the address
space is being transferred. This is simple but inefficient.
b) Pre-transferring: The address space is transferred while the
process is still running on the source node. Pre-transfer is
followed by repeated transfer of pages modified during the
transfer.
c) Transfer on reference: Only part of the address space is
transferred initially. The rest of the address space is transferred
only on demand. (A sketch of the pre-transferring mechanism is
given after this list.)
vi) Message forwarding mechanisms: After the process has been
migrated, messages bound for that process should be forwarded to its
current node. The types of messages involved and the forwarding
mechanisms used for them are described below.
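Before turning to those mechanisms, a toy model of the pre-transferring approach from item (v) above may help (the page count and the chance of a page being re-dirtied are invented): the address space is copied while the process keeps running, the shrinking set of pages dirtied during each round is re-sent, and the process is frozen only for the small final round.

```python
import random

address_space = set(range(1000))   # hypothetical page numbers

def copy_round(pages):
    """'Transfer' the given pages; return those dirtied while copying."""
    return {p for p in pages if random.random() < 0.1}  # ~10% re-dirtied

dirty, rounds = address_space, 0
while len(dirty) > 10:             # keep pre-copying while it still helps
    dirty = copy_round(dirty)
    rounds += 1
# Now freeze the process and transfer the last few dirty pages.
print(f"froze after {rounds} pre-copy rounds, {len(dirty)} pages left")
```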
Message Forwarding Mechanisms
When a process is moved, it must be ensured that all pending, en-route,
and future messages arrive at the process's new location. The messages to
be
forwarded to the migrant process’s new location can be classified into the
following:
Type 1: Messages received at the source node after the process’s
execution has been stopped on its source node and the process’s execution
has not yet been started on its destination node.
Type 2: Messages received at the source node after the process’s
execution has started on its destination node.
Type 3: Messages that are to be sent to the migrant process from any other
node after it has started executing on the destination node.
The different mechanisms used for message forwarding in existing
distributed systems are described below:
1. Resending the message: Instead of the source node forwarding the
messages received for the migrated process, it notifies the sender about
the status of the process. The sender locates the process and resends
the message.
2. Origin site mechanism: Process’s origin site is embedded in the
process identifier.
Each site is responsible for keeping information about the current
locations of all the processes created on it.
Messages are always sent to the origin site. The origin site then
forwards it to the process’s current location.
A drawback of this approach is that failure of the origin site disrupts
message forwarding.
Another drawback is the continuous load imposed on the origin site.
3. Link traversal mechanism: A forwarding address is left at the source
node
The forwarding address has two components
– The first component is a system-wide unique process identifier,
consisting of (id of the node on which the process was created,
local pid)
– The second component is the last known location of the process.
This component is updated when the corresponding process is
accessed from the node.
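A compact sketch of this forwarding-address idea (the node names, tables, and process identifier are invented): every node the process has left holds a link to its next known location, delivery follows the chain, and the stale links on the path are updated to point at the final location.

```python
# forwarding[node][pid] -> where the process was last known to move;
# here p1 migrated A -> B -> C (a hypothetical three-node history).
forwarding = {"A": {"p1": "B"}, "B": {"p1": "C"}, "C": {}}

def deliver(msg, pid, node):
    hops = [node]
    while pid in forwarding[node]:   # traverse the chain of links
        node = forwarding[node][pid]
        hops.append(node)
    for h in hops[:-1]:              # link update: shortcut the stale links
        forwarding[h][pid] = node
    print(f"delivered {msg!r} to {pid} at {node} via {hops}")

deliver("hello", "p1", "A")  # -> delivered 'hello' to p1 at C via ['A', 'B', 'C']
```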
Co-processes Handling Mechanisms
In systems that allow process migration, an important issue is the necessity
to provide efficient communication between a process (parent) and its
sub-processes (children), which might have been migrated and placed on
different nodes. The two different mechanisms used by existing distributed
operating systems to take care of this problem are described below:
1. Disallowing separation of co-processes: There are two ways to do
this
Disallow migration of processes that wait for one or more of their
children to complete.
Migrate children processes along with their parent process.
2. Home node or origin site concept: This approach
Allows the processes and sub-processes to migrate independently.
All communication between the parent and children processes take
place via the home node.
Process Migration in Heterogeneous Systems
The following issues must be handled for process migration in
heterogeneous systems:
An external data representation mechanism can be used to handle
differing data formats.
Issues related to handling floating-point representation need to be
addressed, i.e., the number of bits allocated to the mantissa and the
exponent should be at least as large as in the largest representation
used in the system.
Signed infinity and signed zero representations: not all nodes in the
system may support these.
Process Migration Merits
Reducing the average response time of the processes
Speeding up individual jobs
Gaining higher throughput
Utilizing resources effectively
Reducing network traffic
Improving system reliability
Improving system security
7.3 Threads
Threads are a popular way to improve application performance through
parallelism. In traditional operating systems the basic unit of CPU utilization
is a process. Each process has its own program counter, register states,
stack, and address space. In operating systems with threads facility, the
basic unit of CPU utilization is a thread. In these operating systems, a
process consists of an address space and one or more threads of control.
Each thread of a process has its own program counter, register states, and
stack. But all the threads of a process share the same address space.
Hence they also share the same global variables. In addition, all threads of
a process also share the same set of operating system resources such as
open files, child processes, semaphores, signals, accounting information,
and so on. Threads share the CPU in the same way as processes do, i.e.,
on a uni-processor system threads run in a time-sharing mode, whereas on
a shared-memory multiprocessor as many threads can run simultaneously
as there are processors. Akin to traditional processes, threads can create
child threads, can block waiting for system calls to complete, and can
change states during their course of execution. At a particular instance of
time, a thread can be in any one of several states: Running, Blocked,
Ready, or Terminated. In operating systems with threading facility, a
process having a single thread corresponds to a process of a traditional
operating system. Threads are referred to as lightweight processes and
traditional processes are referred to as heavyweight processes.
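The central point, that all threads of a process share the same address space and global variables, can be demonstrated in a few lines (the counter, the iteration count, and the thread count are arbitrary); the sketch also shows why access to shared data must be synchronized.

```python
import threading

counter = 0                   # one global variable, visible to every thread
lock = threading.Lock()       # shared data, so updates must be synchronized

def work():
    global counter
    for _ in range(100_000):
        with lock:            # without the lock, increments could be lost
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                # 400000: all threads updated the same memory
```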
Why Threads?
Some of the limitations of the traditional process model are listed below:
1. Many applications wish to perform several largely independent tasks
that can run concurrently, but must share the same address space and
other resources.
For example, a database server or a file server. UNIX's make facility
allows users to compile several files in parallel, using a separate
process for each.
2. Creating several processes and maintaining them involves a lot of
overhead. When a context switch occurs, the state information of the
process (register values, page tables, file descriptors, outstanding I/O
requests, etc.) needs to be saved.
3. On UNIX systems, new processes are created using the fork system
call. fork is an expensive system call.
4. Processes cannot take advantage of multiprocessor architectures,
because a process can only use one processor at a time. An application
must create a number of processes and dispatch them to the available
processors.
5. Switching between threads sharing the same address space is
considerably cheaper than switching between processes. The
traditional UNIX process is single-threaded.
Consider a set of single-threaded processes executing on a uni-processor
machine (Figure 7.1). The first three processes were spawned by a server
in response to three clients. The lower two processes run some other
server application.
Figure 7.1: Traditional UNIX system – Uniprocessor with
single-threaded processes
Now consider two servers running on a uni-processor system. Each server
runs as a single process, with multiple threads sharing a single address
space. Inter-
thread context-switching can be handled by either the OS kernel or a user-
level threads library.
Eliminating multiple nearly identical address spaces for each application
reduces the load on the memory subsystem.
Disadvantage: Multithreaded processes must be concerned with
synchronizing access to shared objects among several of their own threads.
Figure 7.2 shows two multithreaded processes running on a multiprocessor.
All threads of one process share the same address space but run on
different processors. This gives improved performance, but synchronization
is more complicated.
Figure 7.2: Multithreaded Processes in a Multiprocessor System
To summarize:
A process can be divided into two components – a set of threads and a
collection of resources. The collection of resources includes an address
space, open files, user credentials, quotas, etc., that are shared by all
threads in the process.
A Thread
is a dynamic object that represents a control point in the process and
that executes a sequence of instructions.
has its own private objects: a program counter, a stack, and a register context.
User-level thread libraries:
The IEEE POSIX standards group generated several drafts of a threads
package known as pthreads.
Sun's Solaris OS supports the pthreads library and has also
implemented its own threads library.
Models for Organizing Threads
The following are some ways of organizing threads:
Dispatcher-Workers Model: A dispatcher thread accepts requests from
clients and dispatches each to an appropriate free worker thread for further
processing of the request.
Team Model: All threads are equals in this model; each thread gets and
processes a client's request on its own.
Pipeline Model: In this model, threads are arranged in a pipeline so that the
output data generated by the first thread is used for processing by the
second thread, the output of the second thread is used by the third, and
so on.
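A sketch of the dispatcher-workers model using Python's threading and queue modules (the requests and the worker count are invented): the dispatcher accepts client requests and hands each one to a free worker through a shared queue.

```python
import queue, threading

requests = queue.Queue()

def worker(wid):
    while True:
        req = requests.get()       # a free worker picks up the next request
        if req is None:            # sentinel: the dispatcher is shutting down
            break
        print(f"worker {wid} processed {req!r}")
        requests.task_done()

workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for w in workers:
    w.start()

# The dispatcher: accept client requests and dispatch them to the workers.
for client_req in ["read f1", "write f2", "stat f3", "read f4"]:
    requests.put(client_req)
requests.join()                    # wait until every request is processed
for _ in workers:
    requests.put(None)             # tell the workers to stop
for w in workers:
    w.join()
```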
User-level Threads Libraries
The interface provided by the threads package must include several
important facilities, such as facilities for:
Creating and terminating threads
Suspending and resuming threads
Assigning priorities to the individual threads
Thread scheduling and context switching
Synchronizing activities through facilities such as semaphores and
mutual exclusion locks
Sending messages from one thread to another
Case Study – DCE threads
DCE threads comply with IEEE POSIX (Portable OS interface) standard
known as P-Threads.
DCE provides a set of user-level library procedures for the creation and
maintenance of threads.
To access the thread services, DCE provides an API that is compatible with
the POSIX standard.
If a system supporting DCE has no intrinsic support for threads, the API
provides an interface to the thread library that is linked to the application.
If the system supporting DCE has OS kernel support for threads, DCE is
set up to use this facility. In this case the API serves as an interface to
kernel supported threads facility.
7.4 Terminal Questions
1. Differentiate between pre-emptive and non-preemptive process
migration. Mention their advantages and disadvantages.
2. Discuss the issues involved in freezing a migrant process on its source
node and restarting it on its destination node.
3. Discuss the threading issues with respect to process management in a
DSM system.
Unit 8 Distributed File Systems
Structure:
8.1 Introduction
Objectives
8.2 The Key Challenges of Distributed Systems
8.3 Client’s Perspective: File Services
8.4 File Access Semantics
8.5 Server's Perspective: Implementation
8.6 Stateful Versus Stateless Servers
8.7 Replication
8.8 Caching
8.9 Ceph
8.10 Terminal Questions
8.1 Introduction
In a distributed file system (DFS), multiple clients share files provided by a
shared file system. In the DFS paradigm communication between processes
is done using these shared files. Although this is similar to the DSM and
distributed object paradigms (in that communication is abstracted by shared
resources) a major difference between these paradigms and the DFS
paradigm is that the resources (files) in DFS are much longer lived. This
makes it, for example, much easier to provide asynchronous and persistent
communication using shared files than using DSM or distributed objects.
The basic model provided by distributed file systems is that of clients
accessing files and directories that are provided by one or more file servers.
A file server provides a client with a file service interface and a view of the
file system. Note that the view provided to different clients by the same
server may be different, for example, if clients only see files that they are
authorised to access. Access to files is achieved by clients performing
operations from the file service interface (such as create, delete, read, write,
etc.) on a file server. Depending on the implementation the operations may
be executed by the servers on the actual files, or by the client on local
copies of the file. We will return to this issue later.
Objectives:
This unit aims at teaching the students the key aspects of Distributed File
systems. It deals with the design concepts, client and server perspectives of
the file systems, and so on. It presents various examples of distributed file
systems in use.
8.2 The Key Challenges of Distributed Systems
A good distributed file system should have the features described below:
i) Transparency
Location: a client cannot tell where a file is located
Migration: a file can transparently move to another server
Replication: multiple copies of a file may exist
Concurrency: multiple clients access the same file
ii) Flexibility
In a flexible DFS it must be possible to add or replace file servers.
Also, a DFS should support multiple underlying file system types
(e.g., various Unix file systems, various Windows file systems, etc.)
iii) Reliability
In a good distributed file system, the probability of loss of stored data
should be minimized as far as possible, i.e., users should not feel
compelled to make backup copies of their files because of the
unreliability of the system. Rather, the file system should automatically
generate backup copies of critical files that can be used in the event of
loss of the original ones. Stable storage is a popular technique used by
several file systems for higher reliability.
iv) Consistency:
Employing replication and allowing concurrent access to files may
introduce consistency problems.
v) Security:
Clients must authenticate themselves, and servers must determine
whether clients are authorised to perform the requested operation.
Furthermore, communication between clients and the file server must
be secured.
vi) Fault tolerance:
Clients should be able to continue working if a file server crashes.
Likewise, data must not be lost and a restarted file server must be able
to recover to a valid state.
vii) Performance:
In order for a DFS to offer good performance it may be necessary to
distribute requests across multiple servers. Multiple servers may also
be required if the amount of data stored by a file system is very large.
viii) Scalability:
A scalable DFS will avoid centralised components such as a
centralised naming service, a centralised locking facility, and a
centralised file store. A scalable DFS must be able to handle an
increasing number of files and users. It must also be able to handle
growth over a geographic area (e.g., clients that are widely spread
over the world), as well as clients from different administrative
domains.
8.3 Client’s Perspective: File Services
The File Service Interface represents files as an uninterpreted sequence of
bytes that are associated with a set of attributes (owner, size, creation date,
permissions, etc.) including information regarding protection (i.e., access
control lists or capabilities of clients). Moreover, there is a choice between
the upload/download model and the remote access model. In the first
model, files are downloaded from the server to the client. Modifications are
performed directly at the client after which the file is uploaded back to the
server. In the second model all operations are performed at the server itself,
with clients simply sending commands to the server.
There are benefits and drawbacks to both models. With the first model, for
example, a client avoids generating network traffic every time it performs
an operation on a file. Also, a client can potentially use a file even if it
cannot reach the file
server. A drawback of performing operations locally and then sending an
updated file back to the server is that concurrent modification of a file by
different clients can cause problems. The second approach makes it
possible for the file server to order all operations and therefore allow
concurrent modifications to the files. A drawback is that the client can only
use files if it has contact with the file server. If the file server goes down, or
the network connection is broken, then the client loses access to the files.
8.4 File Access Semantics
Ideally, the client would perceive remote files just like local ones.
Unfortunately, the distributed nature of a DFS makes this goal hard to
achieve. In the following discussion, we present the various file access
semantics available, and discuss how appropriate they are to a DFS.
The first type of access semantics that we consider are called Unix
semantics and they imply the following:
A read after a write returns the value just written.
When two writes follow in quick succession, the second persists.
In the case of a DFS, it is possible to achieve such semantics if there is only
a single file server and no client-side caching is used. In practice, such a
system is unrealistic because caches are needed for performance and write-
through caches (which would make Unix semantics possible to combine
with caching) are expensive. Furthermore deploying only a single file server
is bad for scalability. Because of this it is impossible to achieve Unix
semantics with distributed file systems.
Alternative semantic models that are better suited for a distributed
implementation include:
1. Session semantics,
2. Immutable files, and
3. Atomic transactions.
1. Session Semantics:
In the case of session semantics, changes to an open file are only
locally visible. Only after a file is closed, are changes propagated to the
server (and other clients). This raises the issue of what happens if two
clients modify the same file simultaneously. It is generally up to the
server to resolve conflicts and merge the changes. Another problem with
session semantics is that parent and child processes cannot share file
pointers if they are running on different machines.
2. Immutable Files:
Immutable files cannot be altered after they have been closed. In order
to change a file, instead of overwriting the contents of the existing file a
new file must be created. This file may then replace the old one as a
whole. This approach to modifying files does require that directories
(unlike files) be updatable. Problems with this approach include a race
condition when two clients try to replace the same file as well as the
question of what to do with processes that are reading a file at the same
time as it is being replaced by another process.
3. Atomic Transactions:
In the transaction model, a sequence of file manipulations can be
executed indivisibly, which implies that two transactions can never
interfere. This is the standard model for databases, but it is expensive to
implement.
8.5 Server’s Perspective: Implementation
Observations about the expected use of a file system can be used to guide
the design of a DFS. For example, a study by Satyanarayanan found the
following usage patterns for Unix systems at a university:
Most files are small – less than 10k
Reading is much more common than writing
Usually access is sequential; random access is rare
Most files have a short lifetime
File sharing is unusual
Most processes use only a few files
Distinct file classes with different properties exist
These usage patterns (small files, sequential access, high read-write ratio)
suggest that an upload/download model for a DFS would be
appropriate. Note, however, that different usage patterns may be observed
at different kinds of institutions. In situations where the files are large, and
are updated more often it may make more sense to use a DFS that
implements a remote access model.
Besides the usage characteristics, implementation tradeoffs may depend on
the requirements of a DFS. These include supporting a large file system,
supporting many users, the need for high performance, and the need for
fault tolerance. Thus, for example, a fault tolerant DFS may sacrifice some
performance for better reliability guarantees, while a high performance DFS
may sacrifice security and wide-area scalability in order to achieve extra
performance.
8.6 Stateful Versus Stateless Servers
The file servers that implement a distributed file service can be stateless or
stateful. Stateless file servers do not store any session state. This means
that every client request is treated independently, and not as part of a new
or existing session. Stateful servers, on the other hand, do store session
state. They may, therefore, keep track of which clients have opened which
files, current read and write pointers for files, which files have been locked
by which clients, etc.
The main advantage of stateless servers is that they can easily recover from
failure. Because there is no state that must be restored, a failed server can
simply restart after a crash and immediately provide services to clients as
though nothing happened. Furthermore, if clients crash the server is not
stuck with abandoned opened or locked files. Another benefit is that the
server implementation remains simple because it does not have to
implement the state accounting associated with opening, closing, and
locking of files.
The main advantage of stateful servers, on the other hand, is that they can
provide better performance for clients. Because clients do not have to
provide full file information every time they perform an operation, the size of
messages to and from the server can be significantly decreased. Likewise
the server can make use of knowledge of access patterns to perform read-
ahead and do other optimisations. Stateful servers can also offer clients
extra services such as file locking, and remember read and write positions.
8.7 Replication
The main approach to improving the performance and fault tolerance of a
DFS is to replicate its content. A replicating DFS maintains multiple copies
of files on different servers. This can prevent data loss, protect a system
against down time of a single server, and distribute the overall workload.
There are three approaches to replication in a DFS:
1. Explicit replication: The client explicitly writes files to multiple servers.
This approach requires explicit support from the client and does not
provide transparency.
2. Lazy file replication: The server automatically copies files to other
servers after the files are written. Remote copies are brought up to date
only when the updated files are propagated to the other servers. How
often this happens is up to the implementation and affects the
consistency of the file state.
3. Group file replication: Write requests are simultaneously sent to a
group of servers. This keeps all the replicas up to date, and allows
clients to read consistent file state from any replica.
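A sketch of group file replication (the server class and its read/write interface are invented): a write is sent to every replica in the group, so a subsequent read from any replica returns the same state.

```python
class ReplicaServer:
    def __init__(self, name):
        self.name, self.files = name, {}
    def write(self, path, data):
        self.files[path] = data
    def read(self, path):
        return self.files[path]

group = [ReplicaServer(n) for n in ("s1", "s2", "s3")]

def group_write(path, data):
    for server in group:        # the write request goes to the whole group
        server.write(path, data)
    # (a real system would need an atomic multicast so that a partial
    #  failure cannot leave the replicas in disagreement)

group_write("/doc.txt", b"v1")
print(group[2].read("/doc.txt"))  # any replica serves the consistent state
```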
8.8 Caching
Besides replication, caching is often used to improve the performance of a
DFS. In a DFS, caching involves storing either a whole file, or the results of
file service operations. Caching can be performed at two locations: at the
server and at the client. Server-side caching makes use of file caching
provided by the host operating system. This is transparent to the server and
helps to improve the server’s performance by reducing costly disk accesses.
Client-side caching comes in two flavours: on-disk caching, and in-memory
caching. On-disk caching involves the creation of (temporary) files on the
client’s disk. These can either be complete files (as in the upload/download
model) or they can contain partial file state, attributes, etc. In-memory
caching stores the results of requests in the client-machine’s memory. This
can be process-local (in the client process), in the kernel, or in a separate
dedicated caching process.
The issue of cache consistency in DFS has obvious parallels to the
consistency issue in shared memory systems, but there are other tradeoffs
(for example, disk access delays come into play, the granularity of sharing is
different, sizes are different, etc.). Furthermore, because write-through
caches are too expensive to be useful, the consistency of caches will be
weakened. This makes implementing Unix semantics impossible.
Approaches used in DFS caches include delayed writes, where writes are
not propagated to the server immediately but in the background later on,
and write-on-close, where the server receives updates only after the file is
closed. Adding a delay to write-on-close has the benefit of avoiding
superfluous writes if a file is deleted shortly after it has been closed.
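A minimal Python sketch of write-on-close with a delay follows; the server is again modelled as a dictionary, and the delay value is an arbitrary illustrative choice:

# Sketch: client-side cache using write-on-close plus a short delay,
# so a file deleted right after being closed never reaches the server.

import threading

class WriteOnCloseCache:
    def __init__(self, server_store, delay=5.0):
        self.server = server_store   # stand-in for the file server
        self.dirty = {}              # path -> locally written data
        self.delay = delay

    def write(self, path, data):
        self.dirty[path] = data      # not propagated immediately

    def close(self, path):
        # Propagate in the background after a delay, not right away.
        threading.Timer(self.delay, self._flush, args=(path,)).start()

    def delete(self, path):
        # Deleting before the timer fires cancels the superfluous write.
        self.dirty.pop(path, None)

    def _flush(self, path):
        if path in self.dirty:
            self.server[path] = self.dirty.pop(path)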
1. Example: Network File System (NFS)
NFS is a remote access DFS that was introduced by Sun in 1985. The
currently used version is version 3; however, a new version (4) has also
been defined. NFS integrates well into Unix’s model of mount points, but
does not implement Unix semantics. NFS servers are stateless (i.e., NFS
does not provide open & close operations). It supports caching, but no
replication. NFS has been ported to many platforms and, because the NFS
protocol is independent of the underlying file system, supports many
different underlying file systems. On Unix, an NFS server runs as a daemon
and reads the file /etc/exports to determine which directories are exported to
whom and under which policy (for example, who is allowed to mount them, who
is allowed to access them, etc.). Server-side caching makes use of file
caching as provided by the underlying operating system and is, therefore,
transparent.
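The exact format of the exports file varies between NFS implementations; on Linux, for instance, an /etc/exports entry granting a subnet read-write access might look like the following (host names and options are purely illustrative):

/home/projects   192.168.1.0/24(rw,sync)   admin-host.example.com(rw,no_root_squash)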
On the client side, exported file systems can be explicitly mounted or
mounted on demand (called automounting). NFS can be used on diskless
workstations, as it does not require local disk space for caching files. It does,
however, support client-side caching, and allows both file contents as well
as file attributes to be cached. Although NFS allows caching, it leaves the
specifics up to the implementation, so file caching details vary between
implementations. Cache entries are generally discarded after a fixed
period of time and a form of delayed write-through is employed.
Traditionally, NFS trusts clients and servers and thus has only minimal
security mechanisms in place. Typically, the clients simply pass the Unix user ID
and group ID of the process performing a request to the server. This implies
that NFS users must not have root access on the clients; otherwise they
could simply switch their identity to that of another user and then access that
user’s files. New security mechanisms have been put in place, but they also
have their drawbacks:
Secure NFS, based on Diffie-Hellman public-key cryptography, is more
complex to implement, makes key management harder, and the key lengths
used are too short to provide security in practice. Using Kerberos would
make NFS more secure, but it has high entry costs.
2. Example: Andrew File System (AFS)
The Andrew File System (AFS) is a DFS that came out of the Andrew
research project at Carnegie Mellon University (CMU). Its goal was to
develop a DFS that would scale to all computers on the university’s campus.
It was further developed into a commercial product and an open source
branch was later released under the name “OpenAFS”. AFS offers the same
API as Unix, implements Unix semantics for processes on the same
machine, but implements write-on-close semantics globally. All data in AFS
is mounted in the /afs directory and organised in cells (e.g. /afs/cs.cmu.edu).
Cells are administrative units that manage users and servers.
Files and directories are stored on a collection of trusted servers called Vice.
Client processes accessing AFS are redirected by the file system layer to a
local user-level process called Venus (the AFS cache-manager daemon), which then
connects to the servers. The servers serve whole files, which are cached as
a whole on the clients’ local disks. For cached files a callback is installed on
the corresponding server. After a process finishes modifying a file by closing
it, the changes are written back to the server. The server then uses the
callbacks to invalidate the file in other clients’ caches. As a result, clients do
not have to validate cached files on access (except after a reboot) and
hence there is only very little cache validation traffic. Data is stored on
flexible volumes, which can be resized and moved between the servers of a
cell. Volumes can be marked as read only, e.g. for software installations.
AFS does not trust Unix user IDs and instead uses its own IDs which are
managed at a cell level. Users have to authenticate with Kerberos by using
the klog command. On successful authentication, a token will be installed in
the client’s cache manager. When a process tries to access a file, the
cache manager checks if there is a valid token and enforces the access
rights. Tokens have a time stamp and expire, so users have to renew their
token from time to time. Authorisation is implemented by directory-based
ACLs, which allow finer grained access rights than Unix.
3. Example: Coda
Coda is an experimental DFS developed at CMU by M. Satyanarayanan’s
group; it is the successor of the Andrew File System (AFS), but supports
disconnected, mobile operation of clients. Its design is much more ambitious
than that of NFS.
Coda has quite a number of similarities with AFS. On the client side, there is
only a single mount point /coda. This means that the name space appears
the same to all clients (and files therefore have the same name at all
clients). File names are location transparent (servers cannot be
distinguished). Coda provides client-side caching of whole files. The caching
is implemented in a user-level cache process called Venus. Coda provides
Unix semantics for files shared by processes on one machine, but applies
write-on-close (session) semantics globally. Because high availability is one
of Coda’s goals, access to a cached copy of a file is denied only if it is known
to be inconsistent.
In contrast to AFS, Coda supports disconnected operation, which works as
follows. While disconnected (a client is disconnected with regards to a file if
it cannot contact any servers that serve copies of that file) all updates are
logged in a client modification log (CML). Upon reconnection, the operations
registered in the CML are replayed on the server. In order to allow clients to
work in disconnected mode, Coda tries to make sure that a client always
has up-to-date cached copies of files that they might require. This process is
called file hoarding. The system builds a user hoard database which it uses
to update frequently used files using a process called a hoard walk.
Conflicts upon reconnection are resolved automatically where possible,
otherwise, manual intervention becomes necessary.
Files in Coda are organised in organisational units called volumes. A volume
is a small logical unit of files (e.g., the home directory of a user or the source
tree of a program). Volumes can be mounted anywhere below the /coda
mount point (in particular, within other volumes). Coda allows files to be
replicated on read/write servers. Replication is organised on a per volume
basis, that is, the unit of replication is the volume. Updates are sent to all
replicas simultaneously using multicast RPCs (Coda defines its own RPC
protocol that includes a multicast RPC protocol). Read operations can be
performed at any replica.
4. Example: Google File System
The Google File System (GFS) is a distributed file system developed to
support a system with very different requirements than traditionally assumed
when developing file systems. GFS was designed and built to support
operations (both production and research) at Google that typically involve
large amounts of data, run distributed over very large clusters, and include
much concurrent access to files. GFS assumes that most data operations
are large sequential reads and large concurrent appends. One of the key
assumptions driving the design is that, because very large clusters (built
from commodity parts) are used, failure (of hardware or software resulting in
crashes or corrupt data) is a regular occurrence rather than an anomaly.
8.9 Ceph
Ceph is a scalable, high-performance research DFS. It targets systems with
huge amounts of data (“petascale systems”) and, like GFS, assumes that
node failures are the norm, not an exception. It assumes that such systems
are built incrementally, that they are inherently dynamic and that workloads
shift over the lifetime of the system. It has three key design features. First,
Ceph decouples data and metadata by using a mapping function that maps
from a file’s unique ID to intelligent object storage devices (OSDs) which
store the file’s data, thus eliminating the need to store explicit allocation lists.
Secondly, Ceph adaptively and intelligently distributes responsibility of
metadata to a cluster of metadata servers. It can thus adapt to changing
workloads which require access to different parts of the metadata and
prevents hot spots from becoming potential bottlenecks. Thirdly, Ceph uses
intelligent OSDs to reliably and autonomically store data. A cluster of OSDs
collectively manages data migration, replication, failure detection and failure
recovery.
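Ceph's actual mapping function (CRUSH) is far more sophisticated, but the idea of computing placement instead of storing allocation lists can be illustrated with a toy hash-based stand-in:

# Sketch: deterministic placement of an object on OSDs by hashing its
# ID. Every client computes the same answer, so no central allocation
# table is needed (a toy stand-in for Ceph's CRUSH function).

import hashlib

NUM_OSDS = 8
REPLICATION = 3

def place(object_id):
    h = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    primary = h % NUM_OSDS
    # Replicas go on the next OSDs around a ring.
    return [(primary + i) % NUM_OSDS for i in range(REPLICATION)]

print(place("file-1234/chunk-0"))   # three OSD ids, same on every client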
8.10 Terminal Questions
1. In what aspects is the design of a distributed file system different from
that of a centralized file system?
2. Name the main components of a distributed file system. What might be
the reasons for separating the various functions of a distributed file
system into these components?
3. Discuss the client’s and the server’s perspectives of a distributed file system.
4. Discuss any two example network file systems in use.
Unit 9 Naming
Structure:
9.1 Introduction
Objectives
9.2 Desirable Features of a Good Naming system
9.3 Fundamental Terminologies and Concepts
9.4 System Oriented Names
9.5 Object – Locating Mechanisms
9.6 Human – Oriented Names
9.7 Name Caches
9.8 Naming and Security
9.9 Terminal Questions
9.1 Introduction
In this unit, we first concentrate on different kinds of names, and how names
are organized into name spaces. We then continue with a discussion of the
important issue of how to resolve a name such that the entity it refers to can
be accessed. Also, we explain various options for distributing and
implementing large name spaces across multiple machines. The Internet
Domain Name System and OSI’s X.500 will be discussed as examples of
large-scale naming services.
Names, Identifiers, and Addresses
A name in a distributed system is a string of bits or characters that is used to
refer to an entity. An entity in a distributed system can be practically
anything. Typical examples include resources such as hosts, printers, disks,
and files. Other well-known examples of entities that are often explicitly
named are processes, users, mailboxes, newsgroups, Web pages,
graphical windows, messages, network connections, and so on.
Entities can be operated on. For example, a resource such as a printer
offers an interface containing operations for printing a document, requesting
the status of a print job, and the like. Furthermore, an entity such as a
network connection may provide operations for sending and receiving data,
setting quality-of-service parameters, requesting the status, and so forth.
To operate on an entity, it is necessary to access it, for which we need an
access point. An access point is yet another, but special, kind of entity in a
distributed system. The name of an access point is called an address. The
address of an access point of an entity is also simply called an address of
that entity.
An entity can offer more than one access point. As a comparison, a
telephone can be viewed as an access point of a person, whereas the
telephone number corresponds to an address. Indeed, many people
nowadays have several telephone numbers, each number corresponding to
a point where they can be reached. In a distributed system, a typical
example of an access point is a host running a specific server, with its
address formed by the combination of, for example, an IP address and port
number (i.e., the server’s transport-level address).
An entity may change its access points in the course of time. For example,
when a mobile computer moves to another location, it is often assigned a
different IP address than the one it had before. Likewise, when a person
moves to another city or country, it is often necessary to change telephone
numbers as well. In a similar fashion, changing jobs or Internet Service
Providers often means changing your e-mail address.
An address is thus just a special kind of name: it refers to an access point of
an entity. Because an access point is tightly associated with an entity, it
would seem convenient to use the address of an access point as a regular
name for the associated entity. Nevertheless, this is hardly ever done.
Objectives:
This unit discusses the naming structures used in addressing the individual
systems located within a network. It describes the features desirable in a
naming system for a distributed system, and addresses the various issues
concerned with human-oriented naming mechanisms, object-locating
mechanisms, and security aspects.
9.2 Desirable Features of a Good Naming System
A good naming system for a distributed system should have the following
features:
i) Location transparency
The name of an object should not reveal any hint about the physical
location of the object
ii) Location independency
The name of an object should not need to be changed when the
object’s location changes. Thus:
A location independent naming system must support a dynamic
mapping scheme
An object at any node can be accessed without the knowledge of its
physical location
An object at any node can issue an access request without the
knowledge of its own physical location
iii) Scalability
Naming system should be able to handle the dynamically changing
scale of a distributed system
iv) Uniform naming convention
Should use the same naming conventions for all types of objects in the
system
v) Multiple user-defined names for the same object
Naming system should provide the flexibility to assign multiple user-
defined names for the same object.
vi) Grouping name
Naming system should allow many different objects to be identified by
the same name.
vii) Meaningful names
A naming system should support at least two levels of object identifiers,
one convenient for human users and the other convenient for machines.
9.3 Fundamental Terminologies and Concepts
i) Name Server
Name servers manage the name spaces. A name server binds an object to
its location. Partitioned name spaces are easier to manage when compared
to flat name space, because each server needs to maintain information for
only one domain.
ii) Name agent
Name agents are known by various names: in the Internet Domain Name
Service (DNS) they are called “resolvers”; in the DCE directory service they
are called “clerks”. A name agent:
Acts between name servers and their clients
Maintains knowledge of existing name servers
Transfers user requests to proper name servers
iii) Context
A context is the environment in which a name is valid. Often contexts
represent a division of name space along regional, organizational or
functional boundaries. Contexts can be nested in a hierarchical name
space.
iv) Name resolution
Process of mapping an object’s name to its properties such as location. It is
basically the process of mapping an object’s name to the authoritative name
servers of that object. In partitioned name space, the name resolution
mechanism traverses a resolution chain from one context to another until
the authoritative name servers of the named object are encountered.
v) Abbreviation/Alias
Users can define their own abbreviation for qualified names. Abbreviations
defined by a user form a private context for that user.
vi) Absolute and relative names
In a tree-structured name space, the fully qualified name of an object need
not be specified from within the current working context; a relative name
suffices. e.g., Unix directory structure,
Internet domain names, etc.
vii) Generic and Multicast names
In generic naming facility, a name is mapped to any one of the set of objects
to which it is bound. In group or multicast naming facility, a name is mapped
to all members of the set of objects to which it is bound.
9.4 System Oriented Names
System oriented names normally have the following characteristic features:
i) Characteristics of System-oriented names
They are large integers or bit strings.
These are also called unique identifiers because they are unique in time
and space.
System-oriented names are usually all of the same size
Generally shorter than human-oriented names and are easy for
manipulations like hashing, sorting and so on.
ii) Approaches for Generating System-Oriented names
1. Centralized approach: In this approach, a global identifier is generated
for each object by a centralized generator. The drawback is that the central
node can become a bottleneck.
2. Distributed approach: In this approach, hierarchical concatenation is
used for creating global unique identifiers. Each identification domain is
identified by a unique identifier. A global identifier is obtained by
concatenating the identifier of the domain with an identifier used within the
domain (a sketch of this approach appears after this list).
3. Generating Unique Identifiers in the event of crashes: A crash may
lead to loss of state information and hence may result in the generation
of non-unique identifiers. Two basic approaches to handle this problem:
– Using a clock that operates across failures: A clock is used at the
location of the unique identifier generator. The clock is guaranteed to
operate across failures.
– Using two or more levels of storage: In this approach, two or more
levels of storage are used and the unique identifiers are structured in
a hierarchical fashion with one field for each level.
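A minimal Python sketch of the distributed (hierarchical concatenation) approach; the field widths and domain ID are arbitrary choices made for illustration:

# Sketch: hierarchical concatenation for globally unique identifiers.
# Each domain has its own unique ID; a local counter gives uniqueness
# within the domain, so no central generator is needed.

import itertools

DOMAIN_ID = 42                 # assumed unique for this node/domain
_local = itertools.count()

def new_global_id():
    # 16 bits of domain identifier, 48 bits of local sequence number.
    return (DOMAIN_ID << 48) | next(_local)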
9.5 Object-Locating Mechanisms
Object locating is mapping the system oriented names of objects to the
location of the object. Some object locating mechanisms are listed below:
i) Broadcasting
Object’s location is found by broadcasting a request from the client node.
Expanding ring broadcast: This approach is employed in an internetwork
consisting of LANs connected by gateways. A ring is a set of LANs that are
a certain distance (measured in terms of the number of gateways) away
from a processor. First a broadcast message is sent to the set of processors
at distance 0; if the object is not located, then the search goes to processors
at distance 1, and so on, until a copy of the object is found. The cost of
locating an object is proportional to the distance of the object from the client.
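A sketch of the expanding ring search in Python; broadcast_at_distance is a hypothetical primitive standing in for a real broadcast to all nodes a given number of gateways away:

# Sketch: expanding ring broadcast. The search widens one ring
# (gateway hop count) at a time until some node reports the object.

def expanding_ring_search(object_uid, broadcast_at_distance, max_hops=8):
    # broadcast_at_distance(uid, d) asks all nodes exactly d gateways
    # away and returns a location or None. The cost paid grows with
    # the distance of the object from the client.
    for d in range(max_hops + 1):
        location = broadcast_at_distance(object_uid, d)
        if location is not None:
            return location
    return None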
ii) Encoding Location of objects within UID (unique identifier)
One field of UID identifies the location of the object. It is easy for the client to
locate the object. Disadvantages of this approach are:
– An object is fixed to one node throughout its lifetime.
– Limited to distributed systems that do not support object migration
– Object naming is not location transparent
iii) Searching creator node first and then broadcasting
This approach is an extension of the above approach and based on the
assumption that objects do not migrate often. The UID contains the identifier
of the node on which the object was created. To locate an object, first a
request is sent to the node that created the object. If the object has
migrated, then a search is done using broadcast.
Using forward location pointers
This is an extension of the above scheme and avoids broadcast. Whenever
an object migrates to another node, a forward location pointer is left at the
node. To locate an object, the creator is contacted first, and the location
pointer is followed, if necessary, until the object is found.
Some disadvantages of this approach are:
The object-locating cost is directly proportional to the length of the chain
of pointers
It is difficult to locate the object if an intermediate pointer is lost due to node failure
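The pointer-chasing scheme above can be sketched as follows; the per-node tables are hypothetical stand-ins for real node state:

# Sketch: following forward location pointers from the creator node.
# objects_at[node] is the set of UIDs currently stored at node;
# forward_ptr[(node, uid)] names the node the object moved on to.

def locate(uid, creator_node, objects_at, forward_ptr):
    node = creator_node
    while uid not in objects_at[node]:
        # Chase the chain; the cost grows with its length, and a
        # KeyError here corresponds to a pointer lost by node failure.
        node = forward_ptr[(node, uid)]
    return node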
Using hint cache and broadcasting
In this method, each node contains a hint on the current location of a
number of recently referenced objects in the form of (UID, last known
location) pairs. Object request is sent to the node indicated by the hint. If the
object is found to have migrated, then a broadcast message is sent
throughout the network requesting the current object location. This approach
is very efficient if high degree of locality is exhibited in locating objects from
a node. It is flexible since it can support object migration. The method of on-
use update of cache information avoids the expense and delay of having to
notify other nodes when an object migrates. If the hint has incorrect
information, the broadcast will cause a lot of overhead. This approach is
widely used in modern distributed OSs such as Amoeba, V-System,
Mach, etc.
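A sketch of the hint-cache scheme; probe and broadcast are assumed primitives for checking one node and for the network-wide search, respectively:

# Sketch: hint cache with broadcast fallback and on-use update.

hint_cache = {}   # uid -> last known node

def locate(uid, probe, broadcast):
    node = hint_cache.get(uid)
    if node is not None and probe(node, uid):
        return node                # hint was correct: the cheap path
    node = broadcast(uid)          # stale or missing hint: expensive
    hint_cache[uid] = node         # on-use update of the hint
    return node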
9.6 Human-Oriented Names
System oriented names such as 31A5, 2B5F, etc. though useful for machine
handling, are not suitable for use by users. Users will have a tough time if
they are required to remember these names or type them in. Further, each
object has only a single system-oriented name, and therefore all the users
sharing an object must remember and use its only name. To overcome
these limitations, almost all naming systems provide the facility to the users
to define and use their own suitable names for the various objects in the
system. These user-defined object names, which form a name space on the
top of the name space for system-oriented names, are called Human-
Oriented Names.
i) Characteristics of human-oriented names
Character strings that are meaningful to the users
They are defined by the users
Different users can define their own suitable names for a shared object
They are variable in length and different names could be used for the
same object
Same name can be used by different users to refer to different objects.
So, human-oriented names are not unique in space or time
ii) Human-Oriented Hierarchical Naming Schemes
Basically there are four approaches for assigning system wide unique
human oriented names to the various objects in a distributed system. They
are described below:
1. Combining an object’s local name with its host name:
In this approach, the naming scheme uses a name space that is
comprised of several isolated name spaces. Each isolated name space
corresponds to a node in the distributed system, and a name in this
name space uniquely identifies an object in the node. In the global
system, objects are named by some combination of their hostname and
local name, such as host-name:local-name. The disadvantage of this
approach is that it is neither location transparent nor location independent.
2. Interlinking isolated name spaces into a single name space
In this scheme, the global name space consists of several isolated
name spaces that are joined together to form a
single naming structure. The position of the component name spaces in
the naming hierarchy is arbitrary. A component name space can be
placed below any other component name space either directly or
through some other component name space. There is no notion of
absolute path name. Each path name is relative to some context, either
to current working context or current component name space. An
advantage of this scheme is that it is simple to join existing name spaces into
a single global name space.
3. Sharing remote name spaces on explicit request (used by Sun NFS)
This scheme is based on the idea of attaching isolated name spaces of
various nodes to create a new name space. Unlike the above schemes
users are given the flexibility to attach a context of the remote name space
to one of the contexts of their local name space. So, the global view of the
resulting name structure is a forest of trees, not a single tree. In NFS, the
mount protocol is used to attach a remote name space to a local directory. A
client can mount the directory using one of the following ways:
Manual mounting: The client uses the mount and umount commands to
mount and unmount a remote server’s directories in the client’s name space.
Static mounting: Allows clients to mount the directories automatically
without manual intervention. This is done by running a shell script at the
time the client machine is booted.
Automounting: Allows the servers’ directories to be mounted and
unmounted on a need basis.
4. A single global name space
In this approach a single name space spans across all nodes in the system.
The same name space is visible to all users and an object’s absolute name
is always the same irrespective of the location of the object and the user
accessing it. This approach is used in many modern distributed operating
systems such as Sprite and V-System.
iii) Issues involved in using a single global name space
Partitioning the name space into contexts:
Storing complete naming information at one node, or replicating it at every
node, is not desirable. So, naming information should be kept decentralized
and partially replicated. The question is how to decompose and distribute
the naming information database among different servers:
The notion of context is used for partitioning the name space into smaller
components
Partitioning into contexts is done by using clustering conditions
Three basic clustering methods used are:
Algorithmic clustering
Syntactic clustering
Attribute clustering
iv) Issues in context binding
When the name server is presented with the name to be resolved
The server looks at the authoritative name servers for the named
object.
If the authority attribute does not contain the name server
corresponding to the given name, additional configuration data,
called context bindings is used for finding the authoritative name
servers.
A context binding associates the context within which it is stored to
another context that is more knowledgeable about the named object.
Two strategies are commonly used for context binding in naming systems:
v) Table-based strategy
Most commonly used approach in tree-structured name spaces
Each context is a table having two fields: the first field stores a
component name and the second field stores either the context binding
information or the authority attribute information.
vi) Procedure-based strategy
In this method a context binding takes the form of a procedure, which, when
executed, supplies information about the next context to be consulted for the
named object.
vii) Distribution of context and name resolution mechanisms
Centralized approach
A single name server in the entire distributed system is located at a central
node
The location of the central server is known to all other nodes
The name server resolves a name by traversing the complete resolution
chain of contexts locally and finally returns the attributes of the named
object.
Fully replicated approach: Each node has a name server.
Distribution based on physical structure of name space: A commonly
used approach for hierarchical tree-structured name spaces
Name space tree is divided into several subtrees, called zones, or
domains.
There are several name servers in the distributed system.
Each name server provides storage for one or more of these zones.
So name resolution involves sending the name resolution request to the
appropriate server.
To facilitate the mapping of names to servers, each client maintains
name prefix table that is built and updated dynamically.
This approach is used in Sprite file systems.
Advantages and disadvantages of this approach:
Number of prefix table entries will be small.
As opposed to global directory look up, in which all directories starting
from the root to the last component need to be searched one by one, the
prefix table helps in bypassing part of the directory lookup mechanism.
Bypassing upper level directories can have consequences for the
system’s security mechanisms.
Consistency of the prefix table is checked and updated if necessary only
when it is used, and there is no need to inform all clients when a table
entry they are storing becomes invalid.
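A minimal sketch of a Sprite-style prefix table lookup; the server names and paths are illustrative:

# Sketch: client-side prefix table. The longest matching prefix names
# the server for that zone, bypassing upper-level directory lookups.

prefix_table = {
    "/":           "root-server",
    "/home":       "server1",
    "/home/alice": "server2",
}

def resolve(path):
    best = max((p for p in prefix_table
                if path == p or path.startswith(p.rstrip("/") + "/")),
               key=len)
    return prefix_table[best]

print(resolve("/home/alice/notes.txt"))   # -> server2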
9.7 Name Caches
Caching can help increase the performance of name resolution operations
for the following reasons:
i) High degree of locality of name lookup: Due to locality of reference,
a reasonable size cache, used for caching the recently used naming
information can increase performance.
ii) Slow update of name information database: Cost of maintaining
consistency of cached data is very low because naming data does not
change fast, i.e., the read-to-write ratio of naming data is very high.
iii) On-use consistency of cached information is possible
Name cache consistency can be maintained by detecting and
discarding stale cache entries on use.
Issues related to Name Caches:
Types of name caches
Directory cache: All recently used directory pages that are brought to the
client node during name resolution are cached for a while.
Advantages and disadvantages of this approach
When a directory is accessed, it is likely that the contents of the directory
pages will be used by subsequent operations (ls, traversal of .., etc.).
For getting one useful entry, namely the directory entry, an entire page
of the directory occupies a large area of the cache.
Prefix cache: Used in Zone-based context distribution mechanisms that we
saw earlier.
Full-name cache: In this type of cache, each entry consists of an object’s
full path name and the identifier and location of its authoritative name
server.
Approaches for name cache implementation:
A cache per process: A separate cache is maintained for each process.
Advantages and disadvantages:
Since each cache is maintained in the process’s address space,
accessing is fast.
Every new process must create its own name cache from scratch.
Cache hit ratio will be small due to start-up misses. To minimize startup
misses, a process can inherit the name cache from its parent (V-system
uses this approach).
Possibility of naming information being duplicated unnecessarily at a
node.
A cache per node: All processes at a node share the same cache. Some of
the problems related to the above approach are overcome. However, the cache
needs to reside in the OS area, and hence access could be slower.
Approaches for maintaining consistency of name caches:
1. Immediate invalidate: In this method, all related name cache entries
are immediately invalidated. This can be done in one of the following
ways.
Whenever a naming data update is done, an invalidate message
identifying the data to be invalidated is sent to all nodes so each
node can update its cache. This approach is expensive in large
systems.
Invalidation message is sent to only the nodes that have cached the
data.
2. On-use update: When a client uses stale cached data, it is informed
by the naming system that the data is stale so that the client can get the
updated data.
9.8 Naming and Security
An important job of the naming system of several centralized and distributed
operating systems is to control unauthorized access to both the named
objects and the information in the naming database. This section describes
only those security issues that are pertinent to object naming. Three basic
naming-related access control mechanisms are described below:
i) Object Names as Protection Keys
In this method, an object’s name acts as protection key for the object. A
user who knows the name of an object (i.e. has the key for the object) can
access the object by using its name. An object may have several keys in
those systems that allow an object to have multiple names. In this case, any
of the keys can be used to access the object.
In systems using this method, users are not allowed by the system to define
a name for an object that they are not authorized to access. This scheme is
based on the assumption that object names cannot be forged or stolen. The
following are the limitations of this scheme:
The scheme does not guarantee a reliable access control mechanism.
It does not provide the flexibility of specifying the modes of access
control.
ii) Capabilities
This is a simple extension of the above scheme that overcomes its
limitations. As shown in Figure 9.1, a capability is a special type of object
identifier that contains additional redundant information for protection.
Figure 9.1: The two basic parts of a capability
[ Object Identifier | Rights Information ]
It may be considered as an unforgeable ticket that allows its holder to
access the object (identified by its object identifier) in one or more
permission modes (specified by its access control information part).
When a process wants to perform an operation on an object, it must send to
the name server a message containing the object’s capability. The name
server verifies if the capability provided by the client allows the type of
operation requested by the client on the relevant object. If not, a permission-
denied message is returned to the client process. If it is allowed, the client’s
request is forwarded to the manager of the object.
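A simplified Python model of this check; real capabilities also carry redundancy (for example, a random or signed field) that makes them unforgeable, which is omitted here:

# Sketch: capability checking at a name server. A capability couples
# an object identifier with rights bits (simplified; no unforgeability).

READ, WRITE = 0x1, 0x2

def check_capability(capability, requested_op):
    object_id, rights = capability
    if requested_op & rights != requested_op:
        return "permission denied"
    return "forwarding request for %s to its manager" % object_id

cap = ("file-17", READ)                 # a read-only capability
print(check_capability(cap, READ))      # forwarded to the manager
print(check_capability(cap, WRITE))     # permission denied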
iii) Associating Protection with Name Resolution Path
Protection can be associated with an object or with the name resolution path
of the name used to identify the object. The more common scheme provides
protection on the name resolution path.
Systems using this approach usually employ access control list (ACL) based
protection, which controls access dependent on the identity of the user. The
mechanism based on ACL requires, in addition to the object identifier,
another trusted identifier representing the accessing principal, the entity with
which access rights are associated. This trusted identifier might be a
password, address, or any other identifier form that cannot be forged or
stolen. An ACL is associated with an object and specifies the user name
(user identifier) and the types of access allowed for each user of that object.
When a user requests access to an object, the operating system checks the
ACL associated with that object. If the user is listed for the requested
access, the access is allowed. Otherwise, a protection violation occurs and
the user job is denied access to the object.
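The check just described amounts to a simple lookup, as this sketch shows (users and access modes are illustrative):

# Sketch: ACL check. The ACL maps each user to the set of access
# modes allowed on the object.

acl = {"alice": {"read", "write"}, "bob": {"read"}}

def check_access(user, mode):
    return mode in acl.get(user, set())

assert check_access("bob", "read")
assert not check_access("bob", "write")   # protection violation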
By associating an ACL with each context (directory) of the name space,
access can be controlled to both named objects and the information in the
naming database. When a name server receives an access request for a
directory, it first verifies if the accessing process is authorized for the
requested access. With this approach, name servers do not provide
information to clients that are not authorized to have it, and at the same time
name servers do not accept unauthorized updates to naming information
stored in the context of name space.
9.9 Terminal Questions
1. List the main jobs performed by the naming subsystem of a distributed
operating system.
2. Differentiate between the terms location transparency and location
independency. Which is more powerful and why?
3. Differentiate between human-oriented and system-oriented names used
in the operating system.
4. Discuss the Naming and Security mechanisms in a distributed operating
system.
Unit 10 Security in Distributed Systems
Structure:
10.1 Introduction
Objectives
10.2 Potential attacks to Computer Systems
10.3 Cryptography
10.4 Authentication
10.5 Access Control
10.6 Digital Signatures
10.7 Design Principles
10.8 Terminal Questions
10.1 Introduction
Before we embark on our journey of understanding the various concepts
and technical issues related to security, it is essential to know what we are
trying to protect. What are the various dangers when we use computers,
computer networks, and the biggest network of them all, the Internet? What
can happen if we do not set up the right security policies, framework and
technology implementations?
Why is security required in the first place? People sometimes say that
security is like statistics: What it reveals is trivial, what it conceals is vital!
The right security infrastructure opens up only those doors that are
strictly necessary.
We discuss the principles of security that help us identify various areas,
which are crucial while determining the security threats and possible
solutions to tackle them. Since electronic documents and messages are
now becoming equivalent to the paper documents in terms of their legal
validity and binding, we examine the various implications in this regard.
This would be followed by a discussion of the types of attacks. There are
certain theoretical concepts associated with attacks, and there is a practical
side to it as well.
With the introduction of the computer, the need for automated tools for
protecting files and other information stored on the computer became
evident. This is especially the case for a shared system, such as a time-
sharing system, and the need is even more acute for systems that can be
accessed over a public telephone or data network. The generic name for the
collection of tools designed to protect data and to thwart hackers is
Computer Security.
The second major change that affected security is the introduction of
distributed systems and the use of networks and communication facilities for
carrying data between terminal user and computer and between computer
and computer. Network security measures are needed to protect data during
their transmission.
One of the most publicized types of attack on information systems is the
computer virus. A virus may be introduced into a system physically when it
arrives on a diskette and is subsequently loaded onto a computer. Viruses
may also arrive over the Internet. In either case, once the virus is resident on
a computer system, internal computer security tools are needed to detect
and recover from the virus.
This unit focuses on Internet security that consists of measures to deter,
prevent, detect, and correct security violations that involve the transmission
of information. This is a broad statement that covers a host of possibilities.
Security involving communications and networks is not as simple to
understand and implement as it appears to a layman. The major
requirements for security services include:
Confidentiality
Authentication
Non-Repudiation
Integrity
In developing a particular security mechanism or algorithm, one must
always consider potential countermeasures. In most cases, counter-
measures are designed by looking at the problem in a completely different
way, thereby exploiting an unexpected weakness in the mechanism.
Security mechanisms involve more than a particular algorithm or protocol.
They usually also require that participants be in possession of some secret
information (like an encryption key), which raises questions about creation,
distribution, and protection of that secret information.
A Model for Network Security
A message is to be transferred from one party to another across some sort
of Internet. The two parties, who are the principals in this transaction, must
cooperate for the exchange to take place. A logical information channel is
established by defining a route through the Internet from source to
destination and by the cooperative use of communication protocols
(like TCP / IP, HTTP) by the two principals.
Security aspects come into play when it is necessary or desirable to
protect the information transmission from an opponent who may present a
threat to the confidentiality, authenticity, and so on. All the techniques for
providing security have two components:
1. A security related transformation on the information to be sent.
Examples include the encryption of message, which scrambles the
message so that it is unreadable by the opponent, and the addition of a
code based on the contents of the message, which can be used to verify
the identity of the sender.
2. Some secret information shared by the two principals and, it is hoped,
unknown to the opponent. An example is an encryption key used in
conjunction with the transformation to scramble the message before
transmission and unscramble it on reception.
A trusted third party may be needed to achieve secure transmission. As an
example, a third party may be responsible for distributing the secret
information to the two principals while keeping it away from the opponent. A
third party may also be necessary to arbitrate disputes between the two
principals concerning the authenticity of a message transmission.
This general model shows that there are
four basic tasks in designing a particular security service:
1. Design an algorithm for performing security related transformations. The
algorithm should be such that an opponent cannot defeat its purpose.
2. Generate the secret information to be used with the algorithm.
3. Develop methods for the distribution and sharing of secret information.
4. Specify a protocol to be used by the two principals that make use of the
security algorithm and the secret information to achieve a particular
security service.
Figure 10.1: Model for Network Security
Objectives:
This unit makes the user familiar with the security issues that arise in
distributed systems. It discusses the types of possible attacks on nodes in
a distributed system and the protection mechanisms to counter these
attacks. It describes the secure way of transmitting messages, i.e., the
aspects of encoding and decoding data and the underlying principles behind
them. It also describes authentication and access control mechanisms,
digital signatures, and the design principles to be followed in designing a
secure distributed system.
10.2 Potential attacks to Computer Systems
Attacks on the security of a computer system or network are best
characterized by viewing the function of the computer system as providing
information. In general, there is a flow of information from a source, such as
a file or a region of main memory, to a destination, such as another file or
user.
Figure 10.2: Security Threats & General Categories of Attacks - (a) Normal Flow; (b) Interruption; (c) Interception; (d) Modification; (e) Fabrication
The following points describe the four general categories of attacks:
Interruption: An asset of the system is destroyed or becomes
unavailable or unusable. This is an attack on Availability.
Examples: Destruction of hard disk, cutting of communication lines, and
so on.
Interception: An unauthorized party gains access to an asset. This is an
attack on Confidentiality. The unauthorized party may be a person, a
program, or a computer.
Examples: Wiretapping to capture data in a network, unauthorized
copying of files or programs.
Modification: An unauthorized party not only gains access to but
tampers with an asset. This is an attack on Integrity.
Examples: Changing values in a data file, altering a program so that it
performs differently, modification of contents of messages transmitted on
a network.
Fabrication: An unauthorized party inserts counterfeit objects into the
system. This is an attack on Authenticity.
Examples: Insertion of spurious messages in a network, Addition of
records to a file.
There are two types of possible attacks on a computer system:
1. Passive Attacks, and
2. Active Attacks
Figure 10.3: Possible attacks on a computer system
1. Passive Attacks
In this type of attack, the attacker indulges in eavesdropping on, or
monitoring of, data transmissions, i.e. the attacker aims to obtain information
that is in transit. The term passive indicates that the attacker does not
attempt to perform any modification to the data. This is why passive attacks
are harder to detect. Therefore the general approach to deal with passive
attacks is to think about prevention, rather than detection or corrective
actions.
Figure 10.4: Categories of Passive Attacks
Release of Message Contents: When we send a confidential email
message to our friend, we desire that only he/she should be able to access
it. Otherwise, the contents of the message are released against our wishes
to someone else.
Traffic Analysis: Suppose we encode messages using a coding language, so
that only the desired parties, who alone know the code language, understand
the contents of a message. If many such messages are
passing through, a passive attacker could try to figure out the similarities
between them to come up with some sort of pattern that provides her some
clues regarding the communication that is taking place. Such attempts at
analyzing (encoded) messages to come up with likely patterns are the essence
of a traffic-analysis attack.
2. Active Attacks
Active attacks are based on modification of the original message in
some manner, or on creation of a false message. These attacks cannot be
prevented easily. However, they can be detected with some effort, and
attempts can be made to recover from them. These attacks can be in the
form of interruption, modification, and fabrication.
Figure 10.5: Active Attacks - interruption (masquerade), modification (replay attacks, alteration of messages), and fabrication (denial of service)
Masquerade: Caused when an unauthorized entity pretends to be another
entity. A user C might pose as user A and send a message to user B. User
B might be led to believe that the message indeed came from user A.
Replay Attack: A user captures a sequence of events, or some data units,
and resends them. For instance, suppose user A wants to transfer some
amount to user C’s bank account. Both users A and C have accounts with
bank B. User A might send an electronic message to the bank requesting a
funds transfer. User C could capture the message and send a second copy
of the same to bank B. Bank B would have no idea that this is an
unauthorized message, and would treat this as a second, and different,
funds transfer request from user A. Therefore, user C would get the benefit
of the funds transfer twice: once authorized, once through a replay attack.
Alteration of Messages: It involves some change to the original message.
For example, assume that user A sends an electronic message “Transfer
$1000 to D’s account” to bank B. User C might capture this and change it to
“Transfer $10000 to C’s account”. Note that both the beneficiary and the
amount have been changed.
Denial of Service (DOS): These attacks make an attempt to prevent
legitimate users from accessing some services, which they are eligible for.
For instance, an unauthorized user might send too many login requests to a
server using random user IDs one after the other in quick succession, so as
to flood the network and deny other legitimate users an access to the
network.
10.3 Cryptography
Network security is mostly achieved through the use of Cryptography, a
science based on abstract algebra.
Definition: Cryptography, a word with Greek origins, means “Secret
Writing”. However, we use the term to refer to the science and art of
transforming messages to make them secure and immune to attacks. Figure
below shows the components involved in the cryptography:
Figure 10.6: Components of Cryptography
The original message before being transformed is called Plaintext. After the
message is transformed, it is called Ciphertext. An Encryption algorithm
transforms the plain text into cipher text; A Decryption algorithm transforms
the cipher text back into plain text. The sender uses an encryption algorithm
and the receiver uses a decryption algorithm.
Cipher: The Encryption and Decryption algorithms are referred to as
Ciphers. The term is also used to refer to different categories of algorithms in
cryptography. One cipher can serve millions of communicating pairs.
Key: It is a number (or a set of numbers) that the cipher, as an algorithm,
operates on. To encrypt a message, we need an encryption algorithm, an
encryption key, and the plaintext. These create the ciphertext. To decrypt a
message, we need a decryption algorithm, a decryption key, and the
ciphertext. These reveal the original plaintext.
Alice, Bob, and Eve
In cryptography, it is customary to use three characters in an information
exchange scenario: we use Alice, Bob, and Eve. Alice is the person who
needs to send secure data. Bob is the recipient of data. Eve is the person
who somehow disturbs the communication between Alice and Bob by
intercepting messages to uncover the data or by sending her own disguised
messages. These three names represent computers or processes that
actually send or receive data, or intercept or change data.
Cryptographic algorithms can be divided into two groups:
Symmetric (Also called Secret – Key)
Asymmetric (Also called Public – Key)
Symmetric Key Cryptography: In this, both parties use the same key.
The sender uses this key and an encryption algorithm to encrypt data; the
receiver uses the same key and the corresponding decryption algorithm to
decrypt the data.
Figure 10.7: Symmetric – Key Cryptography
Asymmetric Key Cryptography: (or Public Key Cryptography)
In this, there are two keys: a private key and a public key. The private key
is kept by the receiver. The public key is announced to the public. In the
figure shown below, assume that Alice wants to send a message to Bob.
Alice uses the public key to encrypt the message. When the message is
received by Bob, the private key is used to decrypt the message. In this
method the public key used for encryption is different from the private key
used for decryption. The public key is available to the public; the private key
is available only to an individual.
Three types of Keys
There are three types of keys dealt in the context of cryptography:
1. Secret Key: A shared key used in Symmetric Key Cryptography
2. Public Key
3. Private Key
The second and third keys are the public and private keys used in
asymmetric-key cryptography.
Encryption can be thought of as electronic locking; decryption as electronic
unlocking. The sender puts the message in a box and locks the box by
using a key; the receiver unlocks the box with a key and takes out the
message. The difference lies in the mechanism of the locking and unlocking
and the type of keys used.
In symmetric key cryptography, the same key locks and unlocks the box. In
asymmetric key cryptography, one key locks the box, but another key is
needed to unlock it.
Figure 10.8: Symmetric Key Cryptography
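As a concrete illustration of the symmetric case, the widely used Python cryptography package provides a ready-made symmetric cipher (Fernet); this is one possible library choice, not something prescribed by the text:

# Sketch: symmetric-key encryption. The same shared secret key both
# locks (encrypts) and unlocks (decrypts) the message.

from cryptography.fernet import Fernet

key = Fernet.generate_key()          # the shared secret key
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"a confidential message")
assert cipher.decrypt(ciphertext) == b"a confidential message"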
Figure 10.9: Asymmetric Key Cryptography
Key management is the set of techniques and procedures supporting the
establishment and maintenance of keying relationships between authorized
parties.
Key management encompasses techniques and procedures supporting:
1. Initialization of system users within a domain,
2. Generation, distribution, and installation of keying material,
3. Controlling the use of keying material,
4. Update, revocation, and destruction of keying material, and
5. Storage, backup/recovery, and archival of keying material.
Point-to-point and centralized key management
Point-to-point communications and centralized key management, using key
distribution centers or key translation centers, are examples of simple key
distribution (communications) models relevant to symmetric-key systems.
Here “simple” implies involving at most one third party. These are illustrated
in Figure 10.10 and described below, where KXY denotes a symmetric key
shared by X and Y.
Figure 10.10: Simple Key Distribution Models (Symmetric Key) - (a) point-to-point key distribution; (b) key distribution center (KDC); (c) key translation center
1. Point-to-point mechanisms. These involve two parties communicating
directly.
2. Key Distribution Centers (KDCs): KDCs are used to distribute keys
between users who share distinct keys with the KDC, but not with each
other.
A basic KDC protocol proceeds as follows. Upon request from A to
share a key with B, the KDC T generates or otherwise acquires a key K,
then sends it encrypted under KAT to A, along with a copy of K (for B)
encrypted under KBT. Alternatively, T may communicate K (secured
under KBT) to B directly.
3. Key Translation Centers (KTCs): The assumptions and objectives of
KTCs are as for KDCs above, but here one of the parties (e.g., A)
supplies the session key rather than the trusted center.
A basic KTC protocol proceeds as follows. A sends a key K to the KTC T
encrypted under KAT. The KTC deciphers and re-enciphers K under KBT,
then returns this to A (to relay to B) or sends it to B directly.
KDCs provide centralized key generation, while KTCs allow distributed key
generation. Both are centralized techniques in that they involve an on-line
trusted server.
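The basic KDC flow described above can be modelled in a few lines, again using Fernet purely as an illustrative symmetric cipher:

# Sketch: basic KDC message flow. KAT and KBT are the long-term keys
# that A and B, respectively, share with the trusted center T.

from cryptography.fernet import Fernet

KAT, KBT = Fernet.generate_key(), Fernet.generate_key()

def kdc_share_key():
    K = Fernet.generate_key()             # fresh session key
    for_A = Fernet(KAT).encrypt(K)        # K under KAT, sent to A
    for_B = Fernet(KBT).encrypt(K)        # K under KBT, for A to relay
    return for_A, for_B                   # (or T sends for_B directly)

for_A, for_B = kdc_share_key()
assert Fernet(KAT).decrypt(for_A) == Fernet(KBT).decrypt(for_B)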
Note: (Initial keying requirements) Point-to-point mechanisms require that A
and B share a secret key a priori. Centralized key management involving a
trusted party T requires that A and B each share a secret key with T. These
shared long-term keys are initially established by non-cryptographic, out-of-
band techniques providing confidentiality and authenticity (e.g., in person, or
by trusted courier). By comparison, with public keys confidentiality is not
required; initial distribution of these need only guarantee authenticity.
Techniques for distributing public keys
Protocols involving public-key cryptography are typically described
assuming a priori possession of (authentic) public keys of appropriate
parties. This allows full generality among various options for acquiring such
keys. Alternatives for distributing explicit public keys with guaranteed or
verifiable authenticity, including public exponentials for Diffie-Hellman key
agreement (or more generally, public parameters), include the following:
1. Point-to-point delivery over a trusted channel: Authentic public keys
of other users are obtained directly from the associated user by personal
exchange, or over a direct channel, originating at that user, and which
(procedurally) guarantees integrity and authenticity (e.g., a trusted
courier or registered mail). This method is suitable if used infrequently
(e.g., one-time user registration), or in small closed systems. A related
method is to exchange public keys and associated information over an
untrusted electronic channel, and to authenticate this information by
communicating a hash of it (using a collision-resistant hash function)
via an independent, lower-bandwidth authentic channel, such as
registered mail (see the fingerprint sketch after this list).
Drawbacks of this method include: inconvenience (elapsed time); the
requirement of non-automated key acquisition prior to secured
communications with each new party (chronological timing); and the cost
of the trusted channel.
2. Direct access to a trusted public file (public-key registry): A public
database, the integrity of which is trusted, may be set up to contain the
name and authentic public key of each system user. This may be
implemented as a public-key registry operated by a trusted party. Users
acquire keys directly from this registry.
While remote access to the registry over unsecured channels is
acceptable against passive adversaries, a secure channel is required for
remote access in the presence of active adversaries. One method of
authenticating a public file is by tree authentication of public keys.
3. Use of an on-line trusted server: An on-line trusted server provides
access to the equivalent of a public file storing authentic public keys,
returning requested (individual) public keys in signed transmissions;
confidentiality is not required. The requesting party possesses a copy of
the server's signature verification public key, allowing verification of the
authenticity of such transmissions.
Disadvantages of this approach include: the trusted server must be on-
line; the trusted server may become a bottleneck; and communications
links must be established with both the intended communicant and the
trusted server.
4. Use of an off-line server and certificates: In a one-time process, each
party A contacts an off-line trusted party referred to as a certification
authority (CA), to register its public key and obtain the CA's signature
verification public key (allowing verification of other users' certificates).
The CA certifies A's public key by binding it to a string identifying A,
thereby creating a certificate. Parties obtain authentic public keys by
exchanging certificates or extracting them from a public directory.
5. Use of systems implicitly guaranteeing authenticity of public
parameters: In such systems, including identity-based systems and
those using implicitly certified keys, by algorithmic design, modification
of public parameters results in detectable, non-compromising failure of
cryptographic techniques.
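To make the hash-based variant of technique 1 concrete, the sketch below (standard-library hashlib only; the key bytes shown are a placeholder) computes a fingerprint of a public key received over an untrusted channel; the owner communicates the same digest over an authentic low-bandwidth channel, and the key is accepted only if the two fingerprints match.

    import hashlib

    def fingerprint(public_key_bytes):
        # Collision-resistant digest of the key received over the untrusted channel.
        return hashlib.sha256(public_key_bytes).hexdigest()

    received = b"-----BEGIN PUBLIC KEY-----..."   # arrived by e-mail (untrusted)
    print(fingerprint(received))
    # Compare this value with the digest sent via registered mail (authentic);
    # the key is accepted only if the two fingerprints are identical.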
10.4 Authentication
In most computer security contexts, user authentication is the fundamental
building block and the primary line of defense. User authentication is the
basis for most types of access control and for user accountability.
User authentication is the process of verifying an identity claimed by or
for a system entity. An authentication process consists of two steps:
Identification step: Presenting an identifier to the security system.
(Identifiers should be assigned carefully, because authenticated
identities are the basis for other security services, such as access
control service.)
Verification step: Presenting or generating authentication information
that corroborates the binding between the entity and the identifier.
For example, user Alice Toklas could have the user identifier ABTOKLAS.
This information needs to be stored on any server or computer system that
Alice wishes to use and could be known to system administrators and other
users. A typical item of authentication information associated with this user
ID is a password, which is kept secret (known only to Alice and to the
system). If no one is able to obtain or guess Alice's password, then the
combination of Alice's user ID and password enables administrators to set
up Alice's access permissions and audit her activity. Because Alice's ID is
not secret, system users can send her e-mail, but because her password is
secret, no one can pretend to be Alice.
In essence, identification is the means by which a user provides a claimed
identity to the system; user authentication is the means of establishing the
validity of the claim. Note that user authentication is distinct from message
authentication.
Message authentication is a procedure that allows communicating parties to
verify that the contents of a received message have not been altered and
that the source is authentic. This unit is concerned solely with user
authentication.
Means of Authentication
There are four general means of authenticating a user's identity, which can
be used alone or in combination:
Something the individual knows: Examples include a password, a
personal identification number (PIN), or answers to a prearranged set of
questions.
Something the individual possesses: Examples include electronic
keycards, smart cards, and physical keys. This type of authenticator is
referred to as a token.
Something the individual is (static biometrics): Examples include
recognition by fingerprint, retina, and face.
Something the individual does (dynamic biometrics): Examples
include recognition by voice pattern, handwriting characteristics, and
typing rhythm.
All of these methods, properly implemented and used, can provide secure
user authentication. However, each method has problems. An adversary
may be able to guess or steal a password. Similarly, an adversary may be
able to forge or steal a token. A user may forget a password or lose a token.
Further, there is a significant administrative overhead for managing
password and token information on systems and securing such information
on systems. With respect to biometric authenticators, there are a variety of
problems, including dealing with false positives and false negatives, user
acceptance, cost, and convenience.
Password-Based Authentication
A widely used line of defense against intruders is the password system.
Virtually all multi-user systems, network-based servers, Web-based
e-commerce sites, and other similar services require that a user provide not
only a name or identifier (ID) but also a password. The system compares
the password to a previously stored password for that user ID, maintained in
a system password file. The password serves to authenticate the ID of the
individual logging on to the system. In turn, the ID provides security in the
following ways:
The ID determines whether the user is authorized to gain access to a
system. In some systems, only those who already have an ID filed on
the system are allowed to gain access.
The ID determines the privileges accorded to the user. A few users may
have supervisory or “superuser” status that enables them to read files
and perform functions that are especially protected by the operating
system. Some systems have guest or anonymous accounts, and users
of these accounts have more limited privileges than others.
The ID is used in what is referred to as discretionary access control. For
example, by listing the IDs of the other users, a user may grant
permission to them to read files owned by that user.
The Use of Hashed Passwords: A widely used password security
technique is the use of hashed passwords and a salt value. This scheme is
found on virtually all UNIX variants as well as on a number of other
operating systems. The following procedure is employed (Figure 10.11 (a)).
To load a new password into the system, the user selects or is assigned a
password. This password is combined with a fixed-length salt value. In
older implementations, this value is related to the time at which the
password is assigned to the user. Newer implementations use a
pseudorandom or random number. The password and salt serve as inputs
to a hashing algorithm to produce a fixed-length hash code. The hash
algorithm is designed to be slow to execute to thwart attacks. The hashed
password is then stored, together with a plaintext copy of the salt, in the
password file for the corresponding user ID. The hashed-password method
has been shown to be secure against a variety of cryptanalytic attacks.
When a user attempts to log on to a UNIX system, the user provides an ID
and a password (Figure 10.11 (b)). The operating system uses the ID to
index into the password file and retrieve the plaintext salt and the stored
hashed password. The salt and user-supplied password are used as input
to the hashing routine. If the result matches the stored value, the password
is accepted.
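A minimal sketch of both procedures follows. It uses hashlib.pbkdf2_hmac from Python's standard library as the deliberately slow hash; the in-memory password file, iteration count, and salt length are illustrative assumptions, not the historical UNIX crypt routine.

    import hashlib, hmac, os

    password_file = {}   # user ID -> (plaintext salt, hashed password)

    def load_new_password(user_id, password):
        salt = os.urandom(16)                  # random salt, stored in plaintext
        h = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        password_file[user_id] = (salt, h)     # salt and hash stored together

    def verify_password(user_id, password):
        salt, stored = password_file[user_id]  # the ID indexes into the file
        h = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return hmac.compare_digest(h, stored)  # constant-time comparison

    load_new_password("ABTOKLAS", "correct horse")
    assert verify_password("ABTOKLAS", "correct horse")
    assert not verify_password("ABTOKLAS", "wrong guess")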
Figure 10.11: UNIX Password Scheme: a) loading a new password; b) verifying a password
The salt serves three purposes:
It prevents duplicate passwords from being visible in the password file.
Even if two users choose the same password, those passwords will be
assigned different salt values. Hence, the hashed passwords of the two
users will differ.
It greatly increases the difficulty of offline dictionary attacks. For a salt of
length b bits, the number of possible passwords is increased by a factor
of 2^b, increasing the difficulty of guessing a password in a dictionary
attack.
It becomes nearly impossible to find out whether a person with
passwords on two or more systems has used the same password on all
of them.
To see the second point, consider the way that an offline dictionary attack
would work. The attacker obtains a copy of the password file. Suppose first
that the salt is not used. The attacker's goal is to guess a single password.
To that end, the attacker submits a large number of likely passwords to the
hashing function. If any of the guesses matches one of the hashes in the
file, then the attacker has found a password that is in the file. But faced with
the UNIX scheme, the attacker must take each guess and submit it to the
hash function once for each salt value in the password file, multiplying the
number of guesses that must be checked.
There are two threats to the UNIX password scheme. First, a user can gain
access on a machine using a guest account or by some other means and
then run a password guessing program, called a password cracker, on that
machine. The attacker should be able to check many thousands of possible
passwords with little resource consumption. In addition, if an opponent is
able to obtain a copy of the password file, then a cracker program can be
run on another machine at leisure. This enables the opponent to run through
millions of possible passwords in a reasonable period.
Token-Based Authentication
Objects that a user possesses for the purpose of user authentication are
called tokens. In this subsection, we examine two types of tokens that are
widely used; these are cards that have the appearance and size of bank
cards.
Memory Cards: Memory cards can store but not process data. The most
common such card is the bank card with a magnetic stripe on the back. A
magnetic stripe can store only a simple security code, which can be read
(and unfortunately reprogrammed) by an inexpensive card reader. There are
also memory cards that include an internal electronic memory.
Memory cards can be used alone for physical access, such as a hotel room.
For computer user authentication, such cards are typically used with some
form of password or personal identification number (PIN). A typical
application is an automatic teller machine (ATM).
The memory card, when combined with a PIN or password, provides
significantly greater security than a password alone. An adversary must gain
physical possession of the card (or be able to duplicate it) plus must gain
knowledge of the PIN.
Among the potential drawbacks are the following:
Requires special reader: This increases the cost of using the token
and creates the requirement to maintain the security of the reader's
hardware and software.
Token loss: A lost token temporarily prevents its owner from gaining
system access. Thus there is an administrative cost in replacing the lost
token. In addition, if the token is found, stolen, or forged, then an
adversary now need only determine the PIN to gain unauthorized
access.
User dissatisfaction: Although users may have no difficulty in
accepting the use of a memory card for ATM access, its use for
computer access may be deemed inconvenient.
Smart Cards: A wide variety of devices qualify as smart tokens. These can
be categorized along three dimensions that are not mutually exclusive:
Physical characteristics: Smart tokens include an embedded
microprocessor. A smart token that looks like a bank card is called a smart
card. Other smart tokens can look like calculators, keys, or other small
portable objects.
Interface: Manual interfaces include a keypad and display for human/
token interaction. Smart tokens with an electronic interface communicate
with a compatible reader/writer.
Authentication protocol: The purpose of a smart token is to provide a
means for user authentication. We can classify the authentication
protocols used with smart tokens into three categories:
– Static: With a static protocol, the user authenticates himself or
herself to the token and then the token authenticates the user to the
computer. The latter half of this protocol is similar to the operation of
a memory token.
– Dynamic password generator: In this case, the token generates a
unique password periodically (e.g., every minute). This password is
then entered into the computer system for authentication, either
manually by the user or electronically via the token. The token and
the computer system must be initialized and kept synchronized so
that the computer knows the password that is current for this token.
– Challenge-response: In this case, the computer system generates
a challenge, such as a random string of numbers. The smart token
generates a response based on the challenge. For example, public-
key cryptography could be used, and the token could encrypt the
challenge string with the token's private key. (A simplified sketch
follows below.)
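A minimal challenge-response sketch is given below. For brevity it uses a keyed hash (HMAC) over a secret shared between the token and the system, rather than public-key encryption; the protocol framing is an assumption for illustration.

    import hashlib, hmac, os

    token_secret = os.urandom(32)   # provisioned into the smart token
    server_copy = token_secret      # and registered with the computer system

    # 1. The computer system generates a random challenge.
    challenge = os.urandom(16)

    # 2. The smart token computes a response bound to that challenge.
    response = hmac.new(token_secret, challenge, hashlib.sha256).digest()

    # 3. The system recomputes and compares; replaying an old response fails
    #    because every login uses a fresh challenge.
    expected = hmac.new(server_copy, challenge, hashlib.sha256).digest()
    assert hmac.compare_digest(response, expected)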
For user authentication to a computer, the most important category of
smart token is the smart card, which has the appearance of a credit card,
has an electronic interface, and may use any of the types of protocols just
described. The remainder of this section discusses smart cards.
A smart card contains within it an entire microprocessor, including
processor, memory, and I/O ports. Some versions incorporate a special co-
processing circuit for cryptographic operation to speed the task of encoding
and decoding messages or generating digital signatures to validate the
information transferred. In some cards, the I/O ports are directly accessible
by a compatible reader by means of exposed electrical contacts. Other
cards rely instead on an embedded antenna for wireless communication
with the reader.
Biometric Authentication
A biometric authentication system attempts to authenticate an individual
based on his or her unique physical characteristics. These include static
characteristics, such as fingerprints, hand geometry, facial characteristics,
and retinal and iris patterns; and dynamic characteristics, such as voiceprint
and signature. In essence, biometrics is based on pattern recognition.
Compared to passwords and tokens, biometric authentication is both
technically complex and expensive. While it is used in a number of specific
applications, biometrics has yet to mature as a standard tool for user
authentication to computer systems.
A number of different types of physical characteristics are either in use or
under study for user authentication. The most common are the following:
Facial characteristics: Facial characteristics are the most common
means of human-to-human identification; thus it is natural to consider
them for identification by computer. The most common approach is to
define characteristics based on relative location and shape of key facial
features, such as eyes, eyebrows, nose, lips, and chin shape. An
alternative approach is to use an infrared camera to produce a face
thermogram that correlates with the underlying vascular system in the
human face.
Fingerprints: Fingerprints have been used as a means of identification
for centuries, and the process has been systematized and automated
particularly for law enforcement purposes. A fingerprint is the pattern of
ridges and furrows on the surface of the fingertip. Fingerprints are
believed to be unique across the entire human population. In practice,
automated fingerprint recognition and matching systems extract a
number of features from the fingerprint for storage as a numerical
surrogate for the full fingerprint pattern.
Hand geometry: Hand geometry systems identify features of the hand,
including shape, and lengths and widths of fingers.
Retinal pattern: The pattern formed by veins beneath the retinal surface
is unique and therefore suitable for identification. A retinal biometric
system obtains a digital image of the retinal pattern by projecting a low-
intensity beam of visual or infrared light into the eye.
Iris: Another unique physical characteristic is the detailed structure of
the iris.
Signature: Each individual has a unique style of handwriting, and this is
reflected especially in the signature, which is typically a frequently
written sequence. However, multiple signature samples from a single
individual will not be identical. This complicates the task of developing a
computer representation of the signature that can be matched to future
samples.
Voice: Whereas the signature style of an individual reflects not only the
unique physical attributes of the writer but also the writing habit that has
developed, voice patterns are more closely tied to the physical and
anatomical characteristics of the speaker. Nevertheless, there is still a
variation from sample to sample over time from the same speaker,
complicating the biometric recognition task.
10.5 Access Control
An access control policy dictates what types of access are permitted, under
what circumstances, and by whom. Access control policies are generally
grouped into the following categories:
Discretionary access control (DAC): Controls access based on the
identity of the requestor and on access rules (authorizations) stating what
requestors are (or are not) allowed to do. This policy is termed discretionary
because an entity might have access rights that permit the entity, by its own
volition, to enable another entity to access some resource.
Mandatory access control (MAC): Controls access based on comparing
security labels (which indicate how sensitive or critical system resources
are) with security clearances (which indicate which system entities are eligible to
access certain resources). This policy is termed mandatory because an
entity that has clearance to access a resource may not, just by its own
volition, enable another entity to access that resource.
Role-based access control (RBAC): Controls access based on the roles
that users have within the system and on rules stating what accesses are
allowed to users in given roles.
DAC is the traditional method of implementing access control. MAC is a
concept that evolved out of requirements for military information security
and is beyond the scope of this unit. RBAC has become increasingly
popular and is introduced later in this section.
These three policies are not mutually exclusive (Figure 10.12). An access
control mechanism can employ two or even all three of these policies to
cover different classes of system resources.
Discretionary Access Control (DAC)
This section introduces a general model for DAC developed by Lampson,
Graham, and Denning. The model assumes a set of subjects, a set of
objects, and a set of rules that govern the access of subjects to objects. Let
us define the protection state of a system to be the set of information, at a
given point in time, that specifies the access rights for each subject with
respect to each object. We can identify three requirements: representing the
protection state, enforcing access rights, and allowing subjects to alter the
protection state in certain ways. The model addresses all three
requirements, giving a general, logical description of a DAC system.
Figure 10.12: Access Control Policies
To represent the protection state, we extend the universe of objects in the
access control matrix to include the following:
Processes: Access rights include the ability to delete a process, stop
(block), and wake up a process.
Devices: Access rights include the ability to read/write the device, to
control its operation (e.g., a disk seek), and to block/unblock the device
for use.
Memory locations or regions: Access rights include the ability to
read/write certain locations or regions of memory that are protected so
that the default is that access is not allowed.
Subjects: Access rights with respect to a subject have to do with the
ability to grant or delete access rights of that subject to other objects, as
explained subsequently.
Figure 10.13 shows an example of an access control matrix A. Each entry
A[S, X] contains strings, called access attributes, that specify the access
rights of subject S to object X. For example, in Figure 10.13, S1 may read
file F2, because 'read' appears in A[S1, F2].
Figure 10.13: Access Control Matrix
From a logical or functional point of view, a separate access control module
is associated with each type of object (Figure 10.14). The module evaluates
each request by a subject to access an object to determine if the access
right exists. An access attempt triggers the following steps:
1. A subject S0 issues a request of type α for object X.
2. The request causes the system (the operating system or an access
control interface module of some sort) to generate a message of the
form (S0,α,X) to the controller for X.
Figure 10.14: Organization of Access Control Function
3. The controller interrogates the access matrix A to determine if α is in
A[S0,X]. If so, the access is allowed; if not, the access is denied and a
protection violation occurs. The violation should trigger a warning and
appropriate action.
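A minimal sketch of steps 1 to 3 follows, with the access matrix held as a Python dictionary and a controller function mediating each request; the subjects, objects, and rights are invented for illustration.

    # Access matrix A: (subject, object) -> set of access attributes.
    A = {
        ("S1", "F1"): {"read*", "owner"},
        ("S1", "F2"): {"read"},
        ("S2", "F2"): {"write"},
    }

    class ProtectionViolation(Exception):
        pass

    def controller(subject, alpha, obj):
        """Mediate the message (S0, alpha, X) against the current matrix."""
        if alpha in A.get((subject, obj), set()):
            return True                          # access allowed
        raise ProtectionViolation(f"{subject} may not {alpha} {obj}")

    controller("S1", "read", "F2")               # allowed: 'read' is in A[S1, F2]
    try:
        controller("S2", "read", "F1")           # denied: a protection violation
    except ProtectionViolation as violation:
        print("warning:", violation)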
Table 10.1: Access Control System Commands
Figure 10.14 suggests that every access by a subject to an object is
mediated by the controller for that object, and that the controller‟s decision is
based on the current contents of the matrix. In addition, certain subjects
have the authority to make specific changes to the access matrix. A request
to modify the access matrix is treated as an access to the matrix, with the
individual entries in the matrix treated as objects. Such accesses are
mediated by an access matrix controller, which controls updates to the
matrix. The model also includes a set of rules that govern modifications to
the access matrix, shown in Table 10.1. For this purpose, we introduce the
access rights 'owner' and 'control' and the concept of a copy flag, explained
in the subsequent paragraphs. The first three rules deal with transferring,
granting, and deleting access rights. Suppose that the entry α* exists in
A[S0, X]. This means that S0 has access right α to subject X and, because of
the presence of the copy flag, can transfer this right, with or without copy
flag, to another subject. Rule R1 expresses this capability. A subject would
transfer the access right without the copy flag if there were a concern that
the new subject would maliciously transfer the right to another subject that
should not have that access right. For example, S1 may place 'read' or
'read*' in any matrix entry in the F1 column. Rule R2 states that if S0 is
designated as the owner of object X, then S0 can grant an access right to
that object to any other subject; that is, S0 can add any access right to
A[S, X] for any S, if S0 has 'owner' access to X. Rule R3 permits S0 to delete any
access right from any matrix entry in a row for which S0 controls the subject
and for any matrix entry in a column for which S0 owns the object. Rule R4
permits a subject to read that portion of the matrix that it owns or controls.
The remaining rules in Table 10.1 govern the creation and deletion of
subjects and objects. Rule R5 states that any subject can create a new
object, which it owns, and can then grant and delete access to the object.
Under rule R6, the owner of an object can destroy the object, resulting in the
deletion of the corresponding column of the access matrix. Rule R7 enables
any subject to create a new subject; the creator owns the new subject and
the new subject has control access to itself. Rule R8 permits the owner of a
subject to delete the row and column (if there are subject columns) of the
access matrix designated by that subject.
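The hypothetical helpers below model rules R1 and R2 on a dictionary-based matrix like the one in the controller sketch above; the copy flag is written as a trailing asterisk, and for simplicity R1 passes the right on without the flag.

    # The access matrix from the controller sketch above.
    A = {("S1", "F1"): {"read*", "owner"}}

    def transfer(A, s0, alpha_star, s, x):
        """R1: S0 holding 'alpha*' on X may copy the right into A[S, X]."""
        if alpha_star.endswith("*") and alpha_star in A.get((s0, x), set()):
            A.setdefault((s, x), set()).add(alpha_star.rstrip("*"))

    def grant(A, s0, alpha, s, x):
        """R2: an owner of X may add any right to A[S, X]."""
        if "owner" in A.get((s0, x), set()):
            A.setdefault((s, x), set()).add(alpha)

    transfer(A, "S1", "read*", "S2", "F1")   # S1 passes 'read' on F1 to S2
    grant(A, "S1", "write", "S3", "F1")      # S1 owns F1, so may grant freely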
The set of rules in Table 10.1 is an example of the rule set that could be
defined for an access control system. The following are examples of
additional or alternative rules that could be included. A transfer-only right
could be defined, which results in the transferred right being added to the
target subject and deleted from the transferring subject. The number of
owners of an object or a subject could be limited to one by not allowing the
copy flag to accompany the owner right.
The ability of one subject to create another subject and to have 'owner'
access right to that subject can be used to define a hierarchy of subjects.
For example, in Figure 10.13, S1 owns S2 and S3, so that S2 and S3 are
subordinate to S1. By the rules of Table 10.1, S1 can grant and delete to S2
access rights that S1 already has. Thus, a subject can create another
subject with a subset of its own access rights. This might be useful, for
example, if a subject is invoking an application that is not fully trusted, and
does not want that application to be able to transfer access rights to other
subjects.
Role-Based Access Control
Traditional DAC systems define the access rights of individual users and
groups of users. In contrast, RBAC is based on the roles that users assume
in a system rather than the user's identity. Typically, RBAC models define a
role as a job function within an organization. RBAC systems assign access
rights to roles instead of individual users. In turn, users are assigned to
different roles, either statically or dynamically, according to their
responsibilities.
RBAC now enjoys widespread commercial use and remains an area of
active research. The National Institute of Standards and Technology (NIST)
has issued a standard, Security Requirements for Cryptographic Modules,
that requires support for access control and administration through roles.
The relationship of users to roles is many to many, as is the relationship of
roles to resources, or system objects (Figure 10.15). The set of users
changes, in some environments frequently, and the assignment of a user to
one or more roles may also be dynamic. The set of roles in the system in
most environments is likely to be static, with only occasional additions or
deletions. Each role will have specific access rights to one or more
resources. The set of resources and the specific access rights associated
with a particular role are also likely to change infrequently.
We can use the access matrix representation to depict the key elements of
an RBAC system in simple terms, as shown in Figure 10.15. The upper
matrix relates individual users to roles. Typically there are many more users
than roles. Each matrix entry is either blank or marked, the latter indicating
that this user is assigned to this role. Note that a single user may be
assigned multiple roles (more than one mark in a row) and that multiple
users may be assigned to a single role (more than one mark in a
column). The lower matrix has the same structure as the DAC access control
matrix, with roles as subjects. Typically, there are few roles and many
objects or resources. In this matrix the entries are the specific access rights
enjoyed by the roles. Note that a role can be treated as an object, allowing
the definition of role hierarchies.
RBAC lends itself to an effective implementation of the principle of least
privilege. That is, each role should contain the minimum set of access rights
needed for that role. A user is assigned to a role that enables him or her to
perform only what is required for that role. Multiple users assigned to the
same role enjoy the same minimal set of access rights.
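The two matrices of Figure 10.15 reduce naturally to two mappings, as in the sketch below; the user, role, and resource names are invented for illustration.

    # Upper matrix: user -> set of roles (many-to-many).
    user_roles = {
        "alice": {"engineer", "auditor"},
        "bob":   {"engineer"},
    }

    # Lower matrix: (role, resource) -> access rights, with roles as subjects.
    role_rights = {
        ("engineer", "source_repo"): {"read", "write"},
        ("auditor",  "audit_log"):   {"read"},
    }

    def can_access(user, right, resource):
        """A user enjoys exactly the rights of the roles assigned to him or her."""
        return any(right in role_rights.get((role, resource), set())
                   for role in user_roles.get(user, set()))

    assert can_access("alice", "read", "audit_log")     # via the auditor role
    assert not can_access("bob", "read", "audit_log")   # bob holds no such role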
Figure 10.15: Users, Roles, and Resources
10.6 Digital Signatures
A digital signature of a message is a number dependent on some secret
known only to the signer, and, additionally, on the content of the message
being signed. Signatures must be verifiable; if a dispute arises as to whether
a party signed a document (caused by either a lying signer trying to
repudiate a signature it did create, or a fraudulent claimant), an unbiased
third party should be able to resolve the matter equitably, without requiring
access to the signer's secret information (private key).
Digital signatures have many applications in information security, including
authentication, data integrity, and non-repudiation. One of the most
significant applications of digital signatures is the certification of public keys
in large networks. Certification is a means for a trusted third party (TTP) to
bind the identity of a user to a public key, so that at some later time, other
entities can authenticate a public key without assistance from a trusted third
party.
The concept and utility of a digital signature was recognized several years
before any practical realization was available. The first method discovered
was the RSA signature scheme, which remains today one of the most
practical and versatile techniques available. Subsequent research has
resulted in many alternative digital signature techniques. Some offer
significant advantages in terms of functionality and implementation.
Basic definitions
1. A digital signature is a data string which associates a message (in
digital form) with some originating entity.
2. A digital signature generation algorithm (or signature generation
algorithm) is a method for producing a digital signature.
3. A digital signature verification algorithm (or verification algorithm) is
a method for verifying that a digital signature is authentic (i.e., was
indeed created by the specified entity).
4. A digital signature scheme (or mechanism) consists of a signature
generation algorithm and an associated verification algorithm.
5. A digital signature signing process (or procedure) consists of a
(mathematical) digital signature generation algorithm, along with a
method for formatting data into messages which can be signed.
6. A digital signature verification process (or procedure) consists of a
verification algorithm, along with a method for recovering data from the
message.
Table 10.2: Notation for Digital Signature Mechanisms
(messages) M is the set of elements to which a signer can affix a digital
signature.
(signing space) MS is the set of elements to which the signature
transformations are applied. The signature transformations are not
applied directly to the set M.
(signature space) S is the set of elements associated to messages in
M. These elements are used to bind the signer to the message.
(indexing set) R is used to identify specific signing transformations.
A classification of digital signature schemes
There are two general classes of digital signature schemes, which can be
briefly summarized as follows:
1. Digital signature schemes with appendix require the original message as
input to the verification algorithm.
2. Digital signature schemes with message recovery do not require the
original message as input to the verification algorithm. In this case, the
original message is recovered from the signature itself.
Definition: A digital signature scheme (with either message recovery or
appendix) is said to be a randomized digital signature scheme if |R| > 1;
otherwise, the digital signature scheme is said to be deterministic.
Figure 10.16 illustrates this classification. Deterministic digital signature
mechanisms can be further subdivided into one-time signature schemes and
multiple-use schemes.
Figure 10.16: A taxonomy of Digital Signature schemes
Digital signature schemes with appendix
Digital signature schemes with appendix, as discussed in this section, are
the most commonly used in practice. They rely on cryptographic hash
functions rather than customized redundancy functions, and are less prone
to existential forgery attacks.
Definition: Digital signature schemes which require the message as input to
the verification algorithm are called digital signature schemes with appendix.
Examples of mechanisms providing digital signatures with appendix are the
DSA, ElGamal, and Schnorr signature schemes.
Algorithm: Key generation for digital signature schemes with appendix
Each entity creates a private key for signing messages, and a
corresponding public key to be used by other entities for verifying
signatures.
1. Each entity A should select a private key which defines a set
SA = {SA,k : k ∈ R} of transformations. Each SA,k is a 1-1 mapping from
Mh to S and is called a signing transformation.
2. SA defines a corresponding mapping VA from Mh × S to {true, false} such
that VA(m̃, s*) = true if SA,k(m̃) = s*, and false otherwise, for all m̃ ∈ Mh
and s* ∈ S.
VA is called a verification transformation and is constructed such that it
may be computed without knowledge of the signer's private key.
3. A's public key is VA; A's private key is the set SA.
Algorithm: Signature generation and verification (digital signature schemes
with appendix)
The entity A produces a signature s* ∈ S for a message m ∈ M, which can
later be verified by any entity B.
1. Signature generation: Entity A should do the following:
a) Select an element k ∈ R.
b) Compute m̃ = h(m) and s* = SA,k(m̃), where h is a one-way hash
function from M to Mh.
c) A's signature for m is s*; both m and s* are made available to entities
which may wish to verify the signature.
2. Verification: Entity B should do the following:
a) Obtain A's authentic public key VA.
b) Compute m̃ = h(m) and u = VA(m̃, s*).
c) Accept the signature if and only if u = true.
Figure 10.17 provides a schematic overview of a digital signature scheme
with appendix. The following properties are required of the signing and
verification transformations:
i. for each k ∈ R, SA,k should be efficient to compute;
ii. VA should be efficient to compute; and
iii. it should be computationally infeasible for an entity other than A to find
an m ∈ M and an s* ∈ S such that VA(m̃, s*) = true, where m̃ = h(m).
Figure 10.17: Overview of a digital signature scheme with appendix
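As a concrete sketch of a scheme with appendix, the example below uses RSA-PSS from the Python cryptography package (an assumption; DSA or Schnorr would serve equally). Because PSS salts each signature, it is also an instance of a randomized mechanism; the parameter choices are illustrative.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # Key generation: the private key signs, the public key verifies.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    # Signature generation: the message is hashed, then transformed.
    message = b"pay Bob 100"
    signature = private_key.sign(message, pss, hashes.SHA256())

    # Verification needs the original message as input (hence "with appendix").
    try:
        public_key.verify(signature, message, pss, hashes.SHA256())
        print("signature accepted")
    except InvalidSignature:
        print("signature rejected")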
Digital signature schemes with message recovery
The digital signature schemes described in this section have the feature that
the message signed can be recovered from the signature itself. In practice,
this feature is of use for short messages.
Definition: A digital signature scheme with message recovery is a digital
signature scheme for which a priori knowledge of the message is not
required for the verification algorithm. Examples of mechanisms providing
digital signatures with message recovery are RSA, Rabin, and Nyberg-
Rueppel public-key signature schemes.
Algorithm: Key generation for digital signature schemes with message
recovery. Each entity creates a private key to be used for signing messages,
and a corresponding public key to be used by other entities for verifying
signatures.
Algorithm: Signature generation and verification for schemes with message
recovery.
The entity A produces a signature s* ∈ S for a message m ∈ M, which can
later be verified by any entity B. The message m is recovered from s*.
1. Signature generation: Entity A should do the following:
a) Select an element k ∈ R.
b) Compute m̃ = R(m) and s* = SA,k(m̃). (R is a redundancy function.)
c) A's signature is s*; this is made available to entities which may wish
to verify the signature and recover m from it.
2. Verification: Entity B should do the following:
a) Obtain A's authentic public key VA.
b) Compute m̃ = VA(s*).
c) Verify that m̃ ∈ MR. (If m̃ ∉ MR, then reject the signature.)
d) Recover m from m̃ by computing R⁻¹(m̃).
Figure 10.18: Overview of a digital signature scheme with message recovery
Figure 10.18 provides a schematic overview of a digital signature scheme
with message recovery. The following properties are required of the signing
and verification transformations:
i. for each k ∈ R, SA,k should be efficient to compute;
ii. VA should be efficient to compute; and
iii. it should be computationally infeasible for an entity other than A to find
any s* ∈ S such that VA(s*) ∈ MR.
10.7 Design Principles
Designers of the security components of a distributed operating system
should observe the following guidelines while designing a secure network:
1. Least Privilege: This principle is also known as the need-to-know
principle. It states that any process should be given only those access
rights that enable it to access, at any time, what it needs to accomplish
its function, nothing more and nothing less. That is, the security
system must be flexible enough to allow the access rights of a process
to grow and shrink with its changing access requirements. This
principle serves to limit the damage when a system's security is
broken.
2. Fail-Safe defaults: Access rights should be acquired by explicit
permission only and the default should be no access. This principle
requires that access control decisions should be based on why an
object should be accessible to a process rather than on why it should
not be accessible.
3. Open design: This principle requires that the design should not be
secret but should be public. It is a mistake on the part of a designer to
assume that intruders will not know how the security mechanism of the
system works.
4. Built into the system: This principle requires that security be
designed into systems at their inception and be built into the lowest
layers of the systems. That is, security should not be treated as an
add-on feature, because security problems cannot be resolved very
effectively by patching the penetration holes detected in an existing
system.
5. Check for current authority: This principle requires that every access
to every object be checked for authority against an access control
database. This is necessary so that the revocation of previously
granted access rights takes immediate effect.
6. Easy granting and revocation of access rights: For greater
flexibility, a security system must allow access rights for an object to
be granted or revoked dynamically. It should be possible to restrict
some of the rights and to grant to a user only those rights that are
sufficient to accomplish its functions. On the other hand, a good
security system should allow immediate revocation with the flexibility of
selective and partial revocation.
7. Never trust other parties: For producing a secured distributed
system, the system components must be designed with the
assumption that other parties (human or programs) are not trustworthy
until they are demonstrated to be trustworthy.
8. Always ensure freshness of messages: To avoid security violations
through the replay of messages, the security of a distributed system
must be designed to always ensure freshness of messages
exchanged between two communicating entities.
9. Build firewalls: To limit the damage in case a system's security is
compromised, the system must have firewalls built into it. One way to
meet this requirement is to allow only short-lived passwords and keys
in the system.
10. Efficient: The security mechanisms used must execute efficiently and
be simple to implement.
11. Convenient to use: To be psychologically acceptable, the security
mechanisms must be convenient to use. Otherwise, they are likely to
be bypassed or incorrectly used by the users.
12. Cost Effective: It is often the case that security needs to be traded off
with other goals of the system, such as performance or ease of use.
10.8 Terminal Questions
1. Discuss the major requirements for security services and with a labeled
diagram explain the Security Model. (Refer to Section 10.1)
2. Discuss about the potential attacks on a computer system. Describe the
four general categories of attacks. (Refer to Section 10.2)
3. Define Cryptography. Describe the components of Cryptography with a
neat labeled diagram. (Refer to Section 10.3)
4. Define Authentication. Explain various methods of implementing
authentication (Refer to Section 10.4)
5. Describe the following two types of Access Control Mechanisms:
Discretionary Access Control
Role based Access Control
(Refer to Section 10.5)
––––––––––––––––––––––––––