Reliable Distributed Systems RPC and Client-Server Computing
Dec 19, 2015
Remote Procedure Call
- Basic concepts
- Implementation issues, usual optimizations
- Where are the costs?
- Firefly RPC, Lightweight RPC, Winsock Direct and VIA
- Reliability and consistency
- Multithreading debate
A brief history of RPC
- Introduced by Birrell and Nelson in 1984
- Pre-RPC: most applications were built directly over the Internet primitives
- Their idea: mask the distributed computing system using a “transparent” abstraction
  - Looks like normal procedure call
  - Hides all aspects of distributed interaction
  - Supports an easy programming model
- Today, RPC is the core of many distributed systems
More history
- Early focus was on RPC “environments”
- Culminated in DCE (Distributed Computing Environment), which standardizes many aspects of RPC
- Then emphasis shifted to performance; many systems improved by a factor of 10 to 20
- Today, RPC is often used from object-oriented systems employing CORBA or COM standards. Reliability issues are more evident than in the past.
The basic RPC protocol
[Diagram, built up over five slides: message flow between client and server]
- Server registers with name service
- Client “binds” to server
- Client prepares, sends request
- Server receives request, invokes handler, sends reply
- Client unpacks reply
Compilation stage
- Server defines and “exports” a header file giving the interfaces it supports and the arguments expected. Uses an “interface definition language” (IDL)
- Client includes this information
- Client invokes server procedures through “stubs”
  - provides interface identical to the server version
  - responsible for building the messages and interpreting the reply messages
  - passes arguments by value, never by reference
  - may limit total size of arguments, in bytes
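
The stub's job can be sketched as follows. This is a minimal illustration, not the output of any real stub compiler: the `add` procedure, `AddStub`, `server_dispatch` and `loopback_transport` are hypothetical names, JSON stands in for the wire format, and the "network" is an in-process function call.

```python
import json

# Hypothetical server procedure and dispatch table (a real system would
# generate these from the IDL description).
def add(a, b):
    return a + b

SERVER_PROCEDURES = {"add": add}

def server_dispatch(request_bytes):
    """Server side: unpack the request, invoke the handler, pack the reply."""
    request = json.loads(request_bytes.decode("utf-8"))
    handler = SERVER_PROCEDURES[request["proc"]]
    result = handler(*request["args"])
    return json.dumps({"result": result}).encode("utf-8")

def loopback_transport(request_bytes):
    """Stand-in for the network: hands the bytes straight to the server."""
    return server_dispatch(request_bytes)

class AddStub:
    """Client stub: exposes the same signature as the server procedure."""
    def __init__(self, transport):
        self._transport = transport

    def add(self, a, b):
        # Arguments are copied into the message, i.e. passed by value.
        request = json.dumps({"proc": "add", "args": [a, b]}).encode("utf-8")
        reply = self._transport(request)
        return json.loads(reply.decode("utf-8"))["result"]

stub = AddStub(loopback_transport)
print(stub.add(2, 3))
```

The caller sees only `stub.add(2, 3)`; building the message and interpreting the reply are hidden inside the stub.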
Binding stage
- Occurs when client and server program first start execution
- Server registers its network address with name directory, perhaps with other information
- Client scans directory to find appropriate server
- Depending on how RPC protocol is implemented, may make a “connection” to the server, but this is not mandatory
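
A toy version of the register/lookup steps, assuming an in-memory directory. `NameDirectory`, the `"calculator"` interface name and the address are invented for illustration; a real name service is itself a network service.

```python
class NameDirectory:
    """Toy name service mapping an interface name to server addresses."""
    def __init__(self):
        self._entries = {}

    def register(self, interface, address, info=None):
        # A server calls this once when it starts.
        self._entries.setdefault(interface, []).append((address, info or {}))

    def lookup(self, interface):
        # A client scans the directory for a server exporting the interface.
        servers = self._entries.get(interface)
        if not servers:
            raise LookupError(f"bind failed: no server exports {interface!r}")
        return servers[0][0]

directory = NameDirectory()
directory.register("calculator", ("10.0.0.5", 4000), {"version": 1})
address = directory.lookup("calculator")
print(address)
```

A failed lookup surfaces as an error the caller must handle, which is one of the new failure cases RPC introduces.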
Data in messages
- We say that data is “marshalled” into a message and “demarshalled” from it
- Representation needs to deal with byte ordering issues (big-endian versus little-endian), strings (some CPUs require padding), alignment, etc.
- Goal is to be as fast as possible on the most common architectures, yet must also be very general
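
A small marshalling sketch using Python's standard `struct` module. The message layout (a procedure id, one integer argument, and a length-prefixed string) is an assumption made up for this example, not a format from any particular RPC system.

```python
import struct

HEADER_FORMAT = "!HiH"   # "!" = network (big-endian) order, no implicit padding

def marshal_call(proc_id, count, name):
    """Pack (proc_id, count, name) into a byte string."""
    encoded = name.encode("utf-8")
    # The string is length-prefixed so the receiver knows where it ends.
    header = struct.pack(HEADER_FORMAT, proc_id, count, len(encoded))
    return header + encoded

def demarshal_call(message):
    """Recover (proc_id, count, name) from a marshalled message."""
    proc_id, count, length = struct.unpack_from(HEADER_FORMAT, message, 0)
    offset = struct.calcsize(HEADER_FORMAT)
    name = message[offset:offset + length].decode("utf-8")
    return proc_id, count, name

message = marshal_call(7, -2, "xyz")
print(message.hex())
print(demarshal_call(message))
```

Fixing the byte order and padding in the format string is what makes the message readable on a machine with a different native representation.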
Request marshalling
- Client builds a message containing arguments, indicates what procedure to invoke
- Due to the need for generality, data representation is a potentially costly issue!
- Performs a send I/O operation to send the message
- Performs a receive I/O operation to accept the reply
- Unpacks the reply from the reply message
- Returns result to the client program
Costs in basic protocol?
- Allocation and marshalling data into message (can reduce costs if you are certain client, server have identical data representations)
- Two system calls, one to send, one to receive, hence context switching
- Much copying all through the O/S: application to UDP, UDP to IP, IP to Ethernet interface, and back up to application
Schroeder and Burrows
- Studied RPC performance in O/S kernel
- Suggested a series of major optimizations
- Resulted in performance improvements of about 10-fold for the DEC Firefly workstation (from 10ms to below 1ms)
Typical optimizations?
- Compile the stub “inline” to put arguments directly into message
- Two versions of stub; if (at bind time) sender and dest. are found to have same data representations, use host-specific rep.
- Use a special “send, then receive” system call (requires O/S extension)
- Optimize the O/S kernel path itself to eliminate copying; treat RPC as the most important task the kernel will do
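
The "two versions of stub" idea can be illustrated as a choice made once at bind time. The sketch below shows only the selection logic; `choose_format` and `make_stub` are hypothetical helpers, and in Python the choice brings no real speed-up.

```python
import struct
import sys

def choose_format(client_order, server_order):
    """Pick the struct byte-order prefix once, at bind time."""
    if client_order == server_order:
        # Same representation on both ends: use that host byte order as is.
        return "<" if client_order == "little" else ">"
    return "!"   # otherwise fall back to canonical network byte order

def make_stub(prefix):
    """Return an argument packer specialised for the chosen representation."""
    fmt = prefix + "ii"
    def pack_args(a, b):
        return struct.pack(fmt, a, b)
    return pack_args

host_specific_stub = make_stub(choose_format(sys.byteorder, sys.byteorder))
general_stub = make_stub(choose_format("little", "big"))
print(general_stub(1, 2).hex())
```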
Fancy argument passing
- RPC is transparent for simple calls with a small amount of data passed
  - “Transparent” in the sense that the interface to the procedure is unchanged
  - But exceptions thrown will include new exceptions associated with network
- What about complex structures, pointers, big arrays? These will be very costly, and perhaps impractical to pass as arguments
- Most implementations limit size, types of RPC arguments. Very general systems less limited but much more costly.
Overcoming lost packets
[Diagram, built up over four slides: client and server timelines]
- Client sends request
- Timeout! Client retransmits
- Server sends ack for request; duplicate request: ignored
- Server sends reply
- Client sends ack for reply
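
The client side of the timeout-and-retransmit loop, as a deterministic sketch. `LossyChannel` is a hypothetical stand-in that drops a fixed number of requests; a real client would wait on a socket with a timeout instead of getting `None` back immediately.

```python
class LossyChannel:
    """Hypothetical channel that drops the first `drops` requests it carries."""
    def __init__(self, server, drops):
        self._server = server
        self._drops = drops

    def send(self, request):
        if self._drops > 0:
            self._drops -= 1
            return None              # models a timeout: no reply arrived
        return self._server(request)

def call_with_retransmit(channel, request, max_attempts=4):
    """Resend the same request until a reply arrives or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        reply = channel.send(request)
        if reply is not None:
            return reply, attempt
    raise TimeoutError("request timed out after %d attempts" % max_attempts)

def echo_server(request):
    return b"reply:" + request

reply, attempts = call_with_retransmit(LossyChannel(echo_server, drops=2), b"xyz")
print(reply, attempts)
```

When the attempts run out, the caller learns only that the call timed out, not whether the server ever processed the request.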
Costs in fault-tolerant version?
- Acks are expensive. Try to avoid them, e.g. if the reply will be sent quickly, suppress the initial ack
- Retransmission is costly. Try to tune the delay to be “optimal”
- For big messages, send packets in bursts and ack a burst at a time, not one by one
Big packets
[Diagram: client and server timelines]
- Client sends request as a burst
- Server acks entire burst
- Server sends reply
- Client sends ack for reply
RPC “semantics”
- At most once: request is processed 0 or 1 times
- Exactly once: request is always processed 1 time
- At least once: request processed 1 or more times
- ... but exactly once is impossible because we can’t distinguish packet loss from true failures! In both cases, RPC protocol simply times out.
Implementing at most/least once
- Use a timer (clock) value and a unique id, plus sender address
- Server remembers recent ids and replies with same data if a request is repeated
- Also uses id to identify duplicates and reject them
- Very old requests detected and ignored by checking time
  - Assumes that the clocks are working
  - In particular, requires “synchronized” clocks
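
A minimal sketch of the server-side bookkeeping, assuming each request carries a sender address, a unique id and a timestamp. `AtMostOnceServer` and its parameters are invented for illustration, and the clock is injected so the example is deterministic.

```python
import time

class AtMostOnceServer:
    def __init__(self, handler, max_age_seconds=60.0, clock=time.time):
        self._handler = handler
        self._max_age = max_age_seconds
        self._clock = clock
        self._replies = {}       # (sender, request_id) -> cached reply
        self.executions = 0      # how many times the handler really ran

    def handle(self, sender, request_id, timestamp, payload):
        if self._clock() - timestamp > self._max_age:
            return None                    # very old request: ignore it
        key = (sender, request_id)
        if key in self._replies:
            return self._replies[key]      # duplicate: resend the same reply
        self.executions += 1
        reply = self._handler(payload)
        self._replies[key] = reply
        return reply

now = [1000.0]                             # fake clock for the example
server = AtMostOnceServer(lambda p: p.upper(), clock=lambda: now[0])
first = server.handle("client-a", 1, 1000.0, "deposit")
second = server.handle("client-a", 1, 1000.0, "deposit")   # retransmission
stale = server.handle("client-a", 2, 900.0, "withdraw")    # too old
print(first, second, stale, server.executions)
```

The time check is what would let a real server discard cached replies older than `max_age_seconds` without risking re-execution; that purge step is omitted here.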
RPC versus local procedure call
- Restrictions on argument sizes and types
- New error cases:
  - Bind operation failed
  - Request timed out
  - Argument “too large” can occur if, e.g., a table grows
- Costs may be very high
- ... so RPC is actually not very transparent!
RPC costs in case of local destination process
- Often, the destination is right on the caller’s machine!
  - Caller builds message
  - Issues send system call, blocks, context switch
  - Message copied into kernel, then out to dest.
  - Dest is blocked... wake it up, context switch
  - Dest computes result
  - Entire sequence repeated in reverse direction
- If scheduler is a process, context switch 6 times!
RPC example: RPC in normal case
[Diagram, built up over four slides: source, O/S, and dest on the same site]
- Source does xyz(a, b, c); destination and O/S are blocked
- Source, dest both block. O/S runs its scheduler, copies message from source out-queue to dest in-queue
- Dest runs, copies in message
- Same sequence needed to return results
Important optimizations: LRPC
- Lightweight RPC (LRPC): for case of sender, dest on same machine (Bershad et al.)
- Uses memory mapping to pass data
- Reuses same kernel thread to reduce context switching costs (user suspends and server wakes up on same kernel thread or “stack”)
- Single system call: send_rcv or rcv_send
LRPC
[Diagram, built up over two slides: source, O/S, and dest on the same site]
- Source does xyz(a, b, c); O/S and dest initially are idle
- Control passes directly to dest; arguments directly visible through remapped memory
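
The data path can be imitated with a memory-mapped region that caller and destination both read and write in place. This is a single-process sketch using Python's `mmap`; it shows only the "arguments visible through mapped memory" idea, not the protection-domain crossing or thread reuse of real LRPC. `caller_xyz` and `dest_xyz` are hypothetical names.

```python
import mmap
import struct

ARGS = struct.Struct("=iii")     # three ints in the host representation
RESULT = struct.Struct("=i")

# Anonymous mapping standing in for the region mapped into both domains.
shared = mmap.mmap(-1, mmap.PAGESIZE)

def dest_xyz():
    a, b, c = ARGS.unpack_from(shared, 0)        # reads arguments in place
    RESULT.pack_into(shared, ARGS.size, a + b + c)

def caller_xyz(a, b, c):
    # The caller writes its arguments straight into the shared region...
    ARGS.pack_into(shared, 0, a, b, c)
    # ...then control passes to the destination (a plain call in this sketch).
    dest_xyz()
    return RESULT.unpack_from(shared, ARGS.size)[0]

result = caller_xyz(1, 2, 3)
print(result)
```

No message is built and nothing is copied through a kernel queue; both sides use the host representation because they are on the same machine.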
LRPC performance impact
- On same platform, offers about a 10-fold improvement over a hand-optimized RPC implementation
- Does two memory remappings, no context switch
- Runs about 50 times faster than standard RPC by same vendor (at the time of the research)
- Semantics stronger: easy to ensure exactly once
Fbufs
- Peterson: tool for speeding up layered protocols
- Observation: buffer management is a major source of overhead in layered protocols (ISO style)
- Solution: uses memory management, protection to “cache” buffers on frequently used paths
- Stack layers effectively share memory
- Tremendous performance improvement seen
Fbufs
[Diagram, built up over five slides: control flows through stack of layers, or pipeline of processes]
- Data copied from “out” buffer to “in” buffer
- Data placed into “out” buffer; shaded buffers are mapped into address space but protected against access
- Buffer remapped to eliminate copy
- In buffer reused as out buffer
- Buffer remapped to eliminate copy
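
A loose application-level analogy, assuming two hypothetical protocol layers inside one process: each layer receives a `memoryview` onto the same buffer instead of a copy. Real Fbufs do this with virtual-memory remapping across protection domains, which this sketch does not attempt.

```python
def strip_header_layer(buf, header_len):
    # Slicing a memoryview gives a new view over the same memory, no copy.
    return buf[header_len:]

def checksum_layer(buf):
    # Reads the payload in place.
    return sum(buf) % 256

packet = bytearray(b"HDR:payload")
view = memoryview(packet)            # a handle on the buffer, not a copy
body = strip_header_layer(view, 4)
check = checksum_layer(body)

# A write to the underlying buffer is visible through the layer's view,
# which shows that the layers really share the memory.
packet[4:5] = b"P"
print(bytes(body), check)
```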
Where are Fbufs used?
- Although this specific system is not widely used
  - Most kernels use similar ideas to reduce costs of in-kernel layering
  - And many application-layer libraries use the same sorts of tricks to achieve clean structure without excessive overheads from layer crossing
Active messages
- Concept developed by Culler and von Eicken for parallel machines
- Assumes the sender knows all about the dest, including memory layout, data formats
- Message header gives address of handler
- Applications copy directly into and out of the network interface
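
The "header names the handler" idea, sketched with an index into a handler table that is assumed to be identical on every node (the index stands in for the handler address). `increment`, `reset` and the message layout are made up for this example.

```python
import struct

counters = [0, 0]

def increment(slot, amount):
    counters[slot] += amount

def reset(slot, _unused):
    counters[slot] = 0

# The same table on every node: the "same program" assumption.
HANDLERS = [increment, reset]
MESSAGE = struct.Struct("=Hii")   # handler index, then two integer arguments

def send(handler_index, arg0, arg1):
    return MESSAGE.pack(handler_index, arg0, arg1)

def receive(message):
    index, arg0, arg1 = MESSAGE.unpack(message)
    # The header selects the handler directly: no name lookup, no
    # marshalling layer, no scheduling of a server thread.
    HANDLERS[index](arg0, arg1)

receive(send(0, 1, 5))
print(counters)
```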
Performance impact?
- Even with optimizations, standard RPC requires about 1000 instructions to send a null message
- Active messages: as few as 6 instructions! One-way latency as low as 35usecs
- But model works only if “same program” runs on all nodes and if application has direct control over communication hardware
U-Net
- Low latency/high performance communication for ATM on normal UNIX machines, later extended to fast Ethernet
- Developed by von Eicken, Vogels and others at Cornell (1995)
- Idea is that application and ATM controller share a memory-mapped region. I/O is done by adding messages to a queue or reading from a queue
- Latency reduced 50-fold relative to UNIX, throughput 10-fold better for small messages!
U-Net concepts
- Normally, data flows through the O/S to the driver, then is handed to the device controller
- In U-Net the device controller sees the data directly in shared memory region
- Normal architecture gets protection from trust in kernel
- U-Net gets protection using a form of cooperation between controller and device driver
U-Net implementation
- Reprogram ATM controller to understand special data structures in memory-mapped region
- Rebuild ATM device driver to match this model
- Pin shared memory pages, leave mapped into I/O DMA map
- Disable memory caching for these pages (else changes won’t be visible to ATM)
U-Net Architecture
- User’s address space has a direct-mapped communication region, organized as an in-queue, out-queue, freelist
- ATM device controller sees whole region and can transfer directly in and out of it
U-Net protection guarantees
- No user can see contents of any other user’s mapped I/O region (U-Net controller sees whole region but not the user programs)
- Driver mediates to create “channels”; user can only communicate over channels it owns
- U-Net controller uses channel code on incoming/outgoing packets to rapidly find the region in which to store them
U-Net reliability guarantees
- With space available, has the same properties as the underlying ATM (which should be nearly 100% reliable)
- When queues fill up, will lose packets
- Also loses packets if the channel information is corrupted, etc.
Minimum U-Net costs?
- Build message in a preallocated buffer in the shared region
- Enqueue descriptor on “out queue”
- ATM immediately notices and sends it
- Remote machine was polling the “in queue”
- ATM builds descriptor for incoming message
- Application sees it immediately: 35usecs latency
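
A toy model of the queue discipline, assuming each communication region holds fixed-size buffers plus an out-queue, an in-queue and a freelist of buffer indices. `Endpoint` and `controller_transfer` are hypothetical names; the function plays the role of the ATM controller, and messages are assumed to fit in one buffer.

```python
from collections import deque

class Endpoint:
    """One communication region: preallocated buffers plus three queues."""
    def __init__(self, n_buffers, buffer_size):
        self.buffers = [bytearray(buffer_size) for _ in range(n_buffers)]
        self.free = deque(range(n_buffers))    # freelist of buffer indices
        self.out_queue = deque()               # descriptors: (index, length)
        self.in_queue = deque()

    def send(self, data):
        index = self.free.popleft()            # IndexError if no buffer is free
        self.buffers[index][:len(data)] = data
        self.out_queue.append((index, len(data)))

    def poll(self):
        if not self.in_queue:
            return None
        index, length = self.in_queue.popleft()
        data = bytes(self.buffers[index][:length])
        self.free.append(index)
        return data

def controller_transfer(src, dst):
    """Stands in for the network interface moving one queued message."""
    index, length = src.out_queue.popleft()
    src.free.append(index)
    if not dst.free:
        return False                           # receiver full: packet is lost
    target = dst.free.popleft()
    dst.buffers[target][:length] = src.buffers[index][:length]
    dst.in_queue.append((target, length))
    return True

a = Endpoint(n_buffers=2, buffer_size=64)
b = Endpoint(n_buffers=1, buffer_size=64)
a.send(b"hello")
delivered = controller_transfer(a, b)
received = b.poll()
print(delivered, received)
```

The application never makes a system call on this path; it only touches queues in its own mapped region, and a full receiver simply drops the packet.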
Protocols over U-Net
- Von Eicken, Vogels support IP, UDP, TCP over U-Net
- These versions run the TCP stack in user space!
- Later in course will look at other complex protocols over U-Net
VIA and Winsock Direct
- Windows consortium (MSFT, Intel, others) commercialized U-Net: Virtual Interface Architecture (VIA)
- Runs in NT Clusters
- But most applications run over UNIX-style sockets (“Winsock” interface in NT)
- Winsock Direct automatically senses and uses VIA where available
- Today is widely used on clusters and may be a key reason that they have been successful
Broad comments on RPC
- RPC is not very transparent
- Failure handling is not evident at all: if an RPC times out, what should the developer do?
  - Reissuing the request only makes sense if there is another server available
  - Anyhow, what if the request was finished but the reply was lost? Do it twice? Try to duplicate the lost reply?
- Performance work is producing enormous gains: from the old 75ms RPC to RPC over U-Net with a 75usec round-trip time: a factor of 1000!
Contents of an RPC environment
- Standards for data representation
- Stub compilers, IDL databases
- Services to manage server directory, clock synchronization
- Tools for visualizing system state and managing servers and applications
Closely Related Topic
- Multithreading is a common performance-enhancing technique
- Idea is that server is often idle while doing I/O for one client, so use extra threads to allow concurrent request processing
- In the limit, leads to database transactional concurrency model, but many non-transactional servers use threads for enhanced performance
Multithreading debate
- Three major options:
  - Single-threaded server: only does one thing at a time, uses send/recv system calls and blocks while waiting
  - Multi-threaded server: internally concurrent, each request spawns a new thread to handle it
  - Upcalls: event dispatch loop does a procedure call for each incoming event, as in X11 or on PCs running Windows
Single threading: drawbacks
- Applications can deadlock if a request cycle forms: I’m waiting for you and you send me a request, which I can’t handle
- Much of system may be idle waiting for replies to pending requests
- Harder to implement RPC protocol itself (need to use a timer interrupt to trigger acks, retransmission, which is awkward)
Multithreading
- Idea is to support internal concurrency as if each process were really multiple processes that share one address space
- Thread scheduler uses timer interrupts and context switching to mimic a physical multiprocessor using the smaller number of CPUs actually available
Multithreaded RPC
- Each incoming request is handled by spawning a new thread
- Designer must implement appropriate mutual exclusion to guard against “race conditions” and other concurrency problems
- Ideally, server is more active because it can process new requests while waiting for its own RPCs to complete on other pending requests
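
A thread-per-request sketch with the mutual exclusion the designer has to add. The `handle_deposit` handler and the shared `balance` are hypothetical; the list of requests stands in for messages arriving from clients.

```python
import threading

balance = 0
balance_lock = threading.Lock()

def handle_deposit(amount):
    global balance
    # Without the lock, two threads could both read the same old balance
    # and one of the updates would be lost (a race condition).
    with balance_lock:
        balance += amount

def serve(requests):
    """Spawn one thread per incoming request, then wait for all of them."""
    threads = [threading.Thread(target=handle_deposit, args=(amount,))
               for amount in requests]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

serve([1] * 200)
print(balance)
```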
Negatives to multithreading
- Users may have little experience with concurrency and will then make mistakes
- Concurrency bugs are very hard to find due to non-reproducible scheduling orders
- Reentrancy can come as an undesired surprise
- Threads need stacks, hence consumption of memory can be very high
- Deadlock remains a risk, now associated with concurrency control
- Stacks for threads must be finite and can overflow, corrupting the address space
Threads: can spawn too many
[Diagram, built up over three slides: events arrive at the scheduler (SCHED)]
- Thread spawned, but blocks
- Eventually, application becomes bloated, begins to thrash. Performance drops and clients may think the server has failed
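
One common way to avoid unbounded spawning is a fixed-size pool: events queue up instead of each getting a new thread. A minimal sketch using the standard `concurrent.futures.ThreadPoolExecutor`; the `handle` function, the pool size of 4 and the `peak` counter are illustrative choices, not something the slides prescribe.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

active = 0
peak = 0
counter_lock = threading.Lock()

def handle(event):
    global active, peak
    with counter_lock:
        active += 1
        peak = max(peak, active)      # track how many handlers run at once
    result = event * 2                # stand-in for real request processing
    with counter_lock:
        active -= 1
    return result

# At most 4 handler threads exist, however many events arrive.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(100)))

print(len(results), peak)
```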
Upcall model
- Common in windowing systems
- Each incoming “event” is encoded as a small descriptive data structure
- User registers event handling procedures
- Dispatch loop calls the procedures as new events arrive, waits for the call to finish, then dispatches a new event
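
The dispatch loop can be sketched in a few lines. `Dispatcher`, the event kinds and the handlers are hypothetical; events are small dictionaries standing in for the descriptive data structure.

```python
from collections import deque

class Dispatcher:
    def __init__(self):
        self._handlers = {}
        self._events = deque()

    def register(self, kind, handler):
        self._handlers[kind] = handler

    def post(self, kind, data):
        self._events.append({"kind": kind, "data": data})

    def run(self):
        # One event at a time: the next upcall starts only after this one returns.
        while self._events:
            event = self._events.popleft()
            self._handlers[event["kind"]](event["data"])

log = []
dispatcher = Dispatcher()
dispatcher.register("key", lambda data: log.append("key:" + data))
dispatcher.register("click", lambda data: log.append("click:%d,%d" % data))
dispatcher.post("key", "a")
dispatcher.post("click", (3, 4))
dispatcher.run()
print(log)
```

Because handlers run one after another on the loop's own thread, they need no locking, but a slow handler delays every event behind it.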
Upcalls combined with threads
- Perhaps the best model for RPC programming
- Each handler can be tagged: needs thread, or can be executed “unthreaded”
- Developer must still be very careful where threads are used
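
A sketch of handler tagging, assuming a `needs_thread` flag supplied at registration time. `TaggedDispatcher` and the two handlers are invented for illustration.

```python
import threading

class TaggedDispatcher:
    def __init__(self):
        self._handlers = {}

    def register(self, kind, handler, needs_thread=False):
        self._handlers[kind] = (handler, needs_thread)

    def dispatch(self, kind, data):
        handler, needs_thread = self._handlers[kind]
        if needs_thread:
            # Potentially blocking handler: give it its own thread.
            worker = threading.Thread(target=handler, args=(data,))
            worker.start()
            return worker          # the caller may join() it later
        handler(data)              # cheap handler: run inline as an upcall
        return None

ran_on = {}

def quick_handler(data):
    ran_on["quick"] = threading.get_ident()

def slow_handler(data):
    ran_on["slow"] = threading.get_ident()

tagged = TaggedDispatcher()
tagged.register("quick", quick_handler)
tagged.register("slow", slow_handler, needs_thread=True)
tagged.dispatch("quick", None)
worker = tagged.dispatch("slow", None)
worker.join()
print(ran_on["quick"] == threading.get_ident())
```

Only the handlers tagged as threaded need concurrency control, which narrows where the developer has to be careful.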
Recent RPC history
- RPC was once touted as the transparent answer to distributed computing
- Today the protocol is very widely used
- ... but it isn’t very transparent, and reliability issues can be a major problem
- Today the strongest interest is in Web Services and CORBA, which use RPC as the mechanism to implement object invocation