Scale and Performance in a Distributed File System
John H. Howard et al.
ACM Transactions on Computer Systems, Vol. 6, No. 1, 1988
Presenter: Changyeon Jo, Jaegyoon Hahm
2013/10/22
Presentation Outline
Andrew File System Prototype
Overview of Andrew File System (AFS)
Performance Evaluation Of AFS Prototype
Changes For Performance
Revised Andrew File System
Effect of Changes For Performance
Performance Comparison of NFS and AFS
Changes for Operability
Conclusion
Andrew File System Prototype
Changyeon Jo
Andrew File System
Distributed file system developed at CMU
Presents a homogeneous, location-transparent file name space
Scalability is the most important design goal of AFS
Unusual Design Features
Whole-file caching
The AFS architecture is based on the following observations:
Shared files are infrequently updated
Files are normally accessed by only a single user
Local caches are likely to remain valid for long periods
Andrew File System Architecture
[Figure: AFS architecture. Vice servers (UNIX kernel, several disks each) are connected over a network to client workstations, each running user programs and Venus on a UNIX kernel with a local disk cache.]
Andrew File System Prototype
Vice
Serves files to Venus
A process running on the server side
A Vice process is dedicated to each Venus client
Venus
Caches files from Vice
Contacts Vice only when a file is opened or closed
Reading and writing are performed directly on the cached copy
[Figure: a client workstation (user program, Venus, UNIX kernel, local disk) communicating over the network with a Vice server (UNIX kernel, several disks)]
Andrew File System Prototype (cont’d)
[Figure: the user program issues open(A); Venus fetches the whole file A from Vice over the network and caches it on the workstation's local disk]
Andrew File System Prototype (cont’d)
[Figure: the user program reads and writes the cached copy of A directly on the local disk, modifying it into A’; no server interaction is needed]
Andrew File System Prototype (cont’d)
[Figure: on close(A), Venus sends the modified file A’ back to Vice, which stores it on the server disks]
Andrew File System Prototype (cont’d)
[Figure: on a later open(A), the up-to-date copy A’ is already in the local cache, so no file data needs to be transferred from Vice]
A sketch of this whole-file open/close flow is given below.
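To make the open/close flow on the preceding slides concrete, here is a minimal Python sketch of a Venus-style client that fetches whole files on open and writes them back on close. It only illustrates the idea; FakeVice and WholeFileCache are hypothetical stand-ins, not the actual Vice/Venus interfaces.

```python
# Illustrative sketch of the prototype's whole-file open/close flow.
# FakeVice and WholeFileCache are hypothetical names, not real AFS code.

class FakeVice:
    """Models a Vice server holding complete files keyed by pathname."""
    def __init__(self):
        self.files = {}          # pathname -> bytes

    def fetch(self, path):
        return self.files[path]  # ship the whole file to the client

    def store(self, path, data):
        self.files[path] = data  # receive the whole file back on close


class WholeFileCache:
    """Models Venus: whole files cached locally; Vice contacted only on open/close."""
    def __init__(self, vice):
        self.vice = vice
        self.cache = {}          # pathname -> locally cached copy

    def open(self, path):
        if path not in self.cache:           # cache miss: fetch the entire file
            self.cache[path] = self.vice.fetch(path)
        return bytearray(self.cache[path])   # reads/writes now happen locally

    def close(self, path, data):
        self.cache[path] = bytes(data)       # update the local copy
        self.vice.store(path, bytes(data))   # write the whole file back to Vice


vice = FakeVice()
vice.files["/a/c/e"] = b"hello"
venus = WholeFileCache(vice)
buf = venus.open("/a/c/e")       # open(A): whole file fetched and cached
buf += b", world"                # modify the cached copy locally
venus.close("/a/c/e", buf)       # close(A): modified file stored back to Vice
```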
Andrew File System Prototype (cont’d)
Vice maintains status information in separate files
A .admin directory stores this configuration data, mirroring the original Vice file structure
A stub directory represents portions of the name space located on other servers
[Figure: a Vice directory tree with a parallel .admin tree; subtree c is a stub pointing to files located on other servers]
Andrew File System Prototype (cont’d)
The Vice-Venus interface names files by their full pathname
There is no notion of a low-level file name such as an inode
A full pathname traversal (directory walk) is therefore required to locate a file
[Figure: resolving /a/c/e requires walking the directory tree from the root through a and c to reach e]
Andrew File System Prototype (cont’d)
Vice creates a dedicated process for each client
The dedicated process persists until its client terminates
This causes excessively frequent context switching
[Figure: one server running a separate Vice process per connected client workstation, each workstation running Venus]
Andrew File System Prototype (cont’d)
Venus verifies timestamps on every open
Each open includes at least one interaction with a server
even if the file is already in the cache and up to date!
[Figure: on open, Venus asks Vice to compare the cached timestamp of A (version 2) with the server's copy before using the cache entry]
A sketch of this per-open validation is shown below.
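The fragment below sketches the prototype's per-open validity check described above: every open costs at least one server interaction, even when the cached copy is already up to date. The function and parameter names are hypothetical, invented for this illustration.

```python
# Hypothetical sketch of the prototype's per-open cache validation.
# Even a cache hit requires asking the server for the current timestamp.

def prototype_open(path, cache, server_timestamps, fetch):
    server_ts = server_timestamps[path]          # one server interaction on every open
    entry = cache.get(path)
    if entry is not None and entry["ts"] == server_ts:
        return entry["data"]                     # hit, but the server was still contacted
    data = fetch(path)                           # stale or missing: fetch the whole file
    cache[path] = {"data": data, "ts": server_ts}
    return data

cache = {}
server_timestamps = {"/a/c/e": 7}
contents = prototype_open("/a/c/e", cache, server_timestamps,
                          fetch=lambda p: b"contents of " + p.encode())
```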
AFS Prototype Benchmark Setup
Load Unit
Load placed on a server by a single client
A Load Unit = about five Andrew users
Read-only source subtree
70 files
Totaling 200 kilobytes
Synthetic Benchmark
Simulate real user actions
MakeDir, Copy, ScanDir, ReadAll, Make
Stand-alone Benchmark Performance
MakeDir: constructs a target subtree identical in structure to the source subtree
Copy: copies every file from the source subtree to the target subtree
ScanDir: recursively traverses the target subtree and examines the status of every file in it
ReadAll: scans every byte of every file in the target subtree once
Make: compiles and links all the files in the target subtree
[Figure: stand-alone benchmark time (seconds) for each phase on Sun2, IBM RT/25, and Sun3/50 workstations]
Stand-alone Benchmark Performance
Hit Ratio of Two Caches
81% for file cache
82% for status cache
Distribution of Vice Calls in Prototype
TestAuth and GetFileStat account for nearly 90% of the total calls!
Caused by the frequency of cache-validity checks
[Figure: number of Vice calls per server (cluster0, cluster1, cmu-0, cmu-1, cmu-2), broken down into TestAuth, GetFileStat, Fetch, Store, SetFileStat, ListDir, and all others]
Prototype Benchmark Performance
Took about 70% longer at a load of 1 than in the stand-alone case
Time per TestAuth call rose rapidly beyond a load of 5
A load of only about 5 to 10 load units per server is the maximum feasible
[Figure: normalized benchmark time and time per TestAuth call versus load units (0 to 10) for the prototype]
Prototype Server Usage
CPU utilization is too high!
Server CPU is the performance bottleneck
Caused by frequent context switching and by traversing full pathnames
Server loads were not evenly balanced
Load balancing between servers is required
[Figure: CPU and disk utilization (%) of the prototype servers cluster0, cluster1, cmu-0, and cmu-1]
Changes For Performance
Cache Management
Caches the contents of directories and symbolic links in addition to files
Requires workstations to do pathname traversals themselves
Callback
Venus assumes that cache entries are valid unless otherwise notified
The server promises to notify Venus before allowing a modification by any other workstation (sketched below)
[Figure: another workstation updates file A from version 2 to version 3; Vice breaks the callback so the client's cached copy is invalidated]
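Below is a minimal, illustrative sketch of callback-based invalidation as described on this slide: clients cache files and treat them as valid until the server breaks the callback. The Server and Client classes and their method names are assumptions for the example, not the real Vice/Venus RPC interface.

```python
# Minimal sketch of callback-based cache validation (hypothetical names).
# The server promises to notify clients before letting anyone else modify a file.

class Server:
    def __init__(self):
        self.files = {}        # path -> data
        self.callbacks = {}    # path -> set of clients holding a callback

    def fetch(self, path, client):
        self.callbacks.setdefault(path, set()).add(client)   # register a callback
        return self.files[path]

    def store(self, path, data, writer):
        for client in self.callbacks.get(path, set()) - {writer}:
            client.break_callback(path)                      # the "callback!" notification
        self.callbacks[path] = {writer}
        self.files[path] = data


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # path -> data, valid while the callback holds

    def open(self, path):
        if path not in self.cache:                 # valid-unless-notified: no check on a hit
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path]

    def close(self, path, data):
        self.cache[path] = data
        self.server.store(path, data, self)

    def break_callback(self, path):
        self.cache.pop(path, None)                 # next open will refetch from the server


server = Server()
server.files["/a/c/e"] = b"v1"
c1, c2 = Client(server), Client(server)
c1.open("/a/c/e")                 # c1 caches the file and holds a callback
c2.open("/a/c/e")
c2.close("/a/c/e", b"v2")         # server breaks c1's callback before accepting the update
assert "/a/c/e" not in c1.cache
```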
Changes For Performance (cont’d)
Name Resolution
Implicit namei operations on full pathnames were costly
fids are introduced to name files (a sketch of one possible representation follows)
Volume number: identifies a collection of files located on one server
Vnode number: used as an index into an array containing the file storage information for the files in a single volume
Uniquifier: allows reuse of vnode numbers
A fid is <32-bit volume number, 32-bit vnode number, 32-bit uniquifier>
fids contain no explicit location information!
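The sketch below models the three-part fid just described. The slide only specifies three 32-bit fields; the Fid type and the byte layout used here are assumptions made for illustration.

```python
# Illustrative representation of a fid:
# <32-bit volume number, 32-bit vnode number, 32-bit uniquifier>.
# The packing format is an assumption for this example only.
import struct
from collections import namedtuple

Fid = namedtuple("Fid", ["volume", "vnode", "uniquifier"])

def pack_fid(fid):
    # Three unsigned 32-bit fields, 12 bytes total; no location information.
    return struct.pack("!III", fid.volume, fid.vnode, fid.uniquifier)

def unpack_fid(blob):
    return Fid(*struct.unpack("!III", blob))

fid = Fid(volume=7, vnode=42, uniquifier=3)   # the uniquifier lets vnode 42 be reused later
assert unpack_fid(pack_fid(fid)) == fid
```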
Changes For Performance (cont’d)
Communication and Server Process Structure
A single process services all clients of a server
Multiple Lightweight Processes (LWPs) are used within that one process
Context switching between LWPs is cheap
[Figure: one Vice server process containing a pool of LWPs that serve all connected Venus clients]
A sketch of this shared worker-pool structure follows.
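The sketch below illustrates the revised structure: a single server process multiplexes all clients with a small pool of lightweight workers. Python threads stand in for the LWP package, and the request format is invented for the example.

```python
# Sketch of the revised server structure: one process, a small pool of
# lightweight workers serving all clients (Python threads stand in for LWPs).
import queue
import threading

requests = queue.Queue()          # one shared queue of requests from all clients

def worker(worker_id):
    while True:
        client, path = requests.get()
        # ... look up the file (by fid/inode) and reply to the client ...
        print(f"LWP {worker_id} served {path} for {client}")
        requests.task_done()

# A fixed pool of workers instead of one heavyweight process per client.
pool = [threading.Thread(target=worker, args=(i,), daemon=True) for i in range(4)]
for t in pool:
    t.start()

for client in ("venus-1", "venus-2", "venus-3"):
    requests.put((client, "/a/c/e"))
requests.join()                   # all requests are handled by the shared pool
```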
Changes For Performance (cont’d)
Low-Level Storage Representation
Files are accessed by their inodes rather than by pathnames
This eliminated nearly all pathname lookups on workstations and servers
[Figure: instead of walking the path /a/c/e component by component, the file is opened directly via its inode number]
The contrast is sketched below.
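This snippet contrasts a namei-style, component-by-component pathname walk with a direct indexed lookup, which is the effect of addressing files by inode (or by vnode index). The tiny in-memory tree and table are invented for the example.

```python
# Contrast between pathname traversal and direct index lookup (illustrative).

# Directory tree for pathname lookup: each component requires a directory read.
tree = {"/": {"a": {"b": "file-b", "c": {"d": "file-d", "e": "file-e"}}}}

def lookup_by_path(path):
    node = tree["/"]
    for component in path.strip("/").split("/"):   # one step per component (namei-style)
        node = node[component]
    return node

# Inode/vnode-style table: the file's storage information is found in one indexed access.
vnode_table = ["file-/", "file-a", "file-b", "file-c", "file-d", "file-e"]

def lookup_by_vnode(vnode_number):
    return vnode_table[vnode_number]                # no directory walk at all

assert lookup_by_path("/a/c/e") == "file-e"
assert lookup_by_vnode(5) == "file-e"
```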
Revised Andrew File System
Jaegyoon Hahm
Revised AFS Benchmark Result
Scalability
Andrew is only 19% slower than a stand-alone workstation at a load of 1 (the prototype was 70% slower)
The benchmark takes less than twice as long at a load of 15 as at a load of 1
The design changes have improved scalability considerably!
[Figure: normalized benchmark time versus load units (1 to 20) for each phase (MakeDir, Copy, ScanDir, ReadAll, Make) in the revised AFS, and relative benchmark time of the prototype versus the revised Andrew File System]
Server utilization during the benchmark
CPU Utilization
Rises from 8% at a load of 1 to over 70% at a load of 20
At a load of 20 the system is still not saturated
But the performance bound is still the CPU
Disk Utilization
Below 20% even at a load of 20
Better performance requires:
More efficient server software
A faster server CPU
[Figure: server CPU and disk utilization (%) versus load units (0 to 20)]
Andrew Server Usage
Measured during an 8-hour period from 9 AM to 5 PM
Most servers show CPU utilizations between 15% and 25%
Vice9
Had the highest load of any server
Serves bulletin boards: a collection of directories that are frequently accessed and modified by many different users
[Figure: CPU and per-disk utilization (%) for each Andrew server]
Distribution of Calls to Andrew Servers
GetTime
Most frequently called
Used by workstations to synchronize their clocks and as an implicit keepalive
FetchStatus
Generated by users listing directories
RemoveCB
Flushes a cache entry
Vice9: frequent RemoveCB calls indicate that the files it stores exhibit poor locality (mostly just one read per bulletin-board file)
Vice8: used by the operations staff; a modification was made to perform RemoveCB on groups of files
[Figure: number of calls per server, broken down into FetchData, FetchStatus, StoreData, StoreStatus, GetStat, RemoveCB, GetTime, VolStats, and other]
Comparison with A Remote-Open File System
Caching of entire files on the local disks of AFS was motivated by:
Locality of file references by typical users makes caching attractive
Whole-file transfer contacts servers only on opens and closes; reads and writes cause no network traffic
Whole-file transfer uses efficient bulk data transfer protocols
The amount of data fetched after a reboot is usually small
Caching of entire files simplifies cache management
Comparison with A Remote-Open File System
Drawbacks of the entire-file caching approach in AFS
Workstations require local disks for acceptable performance
Files that are larger than the local disk cache cannot be accessed at all
Strict emulation of 4.2BSD concurrent read and write semantics across workstations is impossible, because reads and writes are not intercepted
Nevertheless, this approach provides superior performance in large-scale systems
Comparison with A Remote-Open File System
Remote-open file systems: Sun Microsystems NFS, AT&T RFS, Locus
The data in a file are not fetched en masse
Instead, the remote site potentially participates in each individual read and write operation
Buffering and read-ahead are used to improve performance, but the remote site is still conceptually involved in every I/O
Comparison with NFS
Representative of remote-open file systems
A mature product from a successful vendor of distributed computing, and a de facto standard
Sun Network File System
Servers must be identified and mounted individually: no transparent file location facility
Both client and server components are implemented in the kernel and are more efficient than AFS
Page caching: NFS caches inodes and individual pages of a file in memory
Once a file is open, the remote site is treated like a local disk, with read-ahead and write-behind of pages
Consistency semantics of NFS
A new file may not be visible elsewhere for 30 seconds
Two processes writing to the same file could produce different results
Performance
NFS's performance degrades rapidly with increasing load
AFS performs better with a warm cache than with a cold cache
Cold cache: workstation caches were cleared before each trial
Warm cache: caches were left unaltered
[Figure: normalized benchmark time versus load units (0 to 20) for Andrew Warm, Andrew Cold, and NFS]
Performance
[Figure: normalized benchmark time versus load units for the ScanDir and ReadAll phases, comparing NFS, Andrew Cold, and Andrew Warm]
Andrew scales better than NFS, especially on ScanDir and ReadAll
Caching and callbacks account for this performance gain
NFS suffers from the lack of a local disk cache and the need to check with the server on each file open
CPU Utilization
CPU utilization of NFS is much higher than that of AFS
At a load of 1, CPU utilization is about 22% in NFS but 3% in Andrew
At a load of 18, CPU utilization saturates at 100% in NFS, but reaches only 38% in Andrew with a cold cache and 42% with a warm cache
[Figure: server CPU utilization (%) versus load units (0 to 20) for Andrew Warm, Andrew Cold, and NFS]
Disk Utilization
NFS used both disks on the server, with utilizations rising from about 9% and 3% at a load of 1 to nearly 95% and 19% at a load of 18, respectively
Disk 1: system libraries
Disk 2: user data
Andrew used only one of the server disks, with utilization rising from about 4% at a load of 1 to about 33% at a load of 18 in the cold cache case
[Figure: server disk utilization (%) versus load units (0 to 20) for Andrew Warm, Andrew Cold, NFS Disk 1, and NFS Disk 2]
Network Traffic
NFS generates nearly three times as many packets as Andrew at a load of one
[Figure: number of packets exchanged between client and server, in each direction, for Andrew and NFS]
Changes for Operability
Goal: to build a system that would be easy for a small operational staff to run and monitor with minimal inconvenience to users
Problems in the prototype: inflexible mapping of Vice files to server disk storage
Vice was constructed out of collections of files glued together by the 4.2BSD Mount mechanism
Movement of files across servers was difficult (embedded file location info)
It was not possible to implement a quota system
The mechanisms of file location and file replication were cumbersome (consistency problem)
Standard backup utilities were not convenient for use in a distributed environment
Backup required taking an entire disk partition off-line
Changes for Operability
Volumes
Files in Vice can be distributed over disk partitions using volumes
A volume is a collection of files forming a partial subtree of the Vice name space
Volumes are glued together at Mount Points to form the complete name space
A volume resides within a single disk partition on a server
Volumes provide a level of Operational Transparency
Volume Movement
Volume movement is done by creating a clone, a frozen copy-on-write snapshot of the volume (see the sketch below)
During movement, the volume location database is updated
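The sketch below illustrates clone-based volume movement as described above: freeze a snapshot, bulk-transfer it, re-apply any changes made in the meantime, and update the volume location database. All class, function, and server names are hypothetical.

```python
# Sketch of moving a volume via a copy-on-write clone (hypothetical names).
# A clone is a frozen snapshot; updates made during the move are shipped as a small delta.

class Volume:
    def __init__(self, files):
        self.files = dict(files)       # path -> data

    def clone(self):
        # A real clone shares storage copy-on-write; a dict copy models the frozen snapshot.
        return dict(self.files)

def move_volume(name, src_server, dst_server, location_db, new_location):
    volume = src_server[name]
    snapshot = volume.clone()                      # freeze a consistent image
    dst_server[name] = Volume(snapshot)            # bulk-transfer the snapshot
    delta = {p: d for p, d in volume.files.items() if snapshot.get(p) != d}
    dst_server[name].files.update(delta)           # ship updates made during the move
    location_db[name] = new_location               # update the volume location database
    del src_server[name]

src = {"user.jo": Volume({"/u/jo/notes": b"v1"})}
dst = {}
location_db = {"user.jo": "server-A"}
move_volume("user.jo", src, dst, location_db, "server-B")
assert location_db["user.jo"] == "server-B" and "user.jo" in dst and "user.jo" not in src
```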
Changes for Operability
Quotas
Quotas are implemented on a per-volume basis (a sketch follows below)
Read-Only Replication
Improves availability and balances load
No callbacks are needed
The volume location database specifies the server containing the read-write copy of a volume and a list of read-only replication sites
Backup
Volumes form the basis of the backup and restoration mechanism
A read-only clone (a frozen snapshot) of the volume is created and then transferred
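Below is an illustrative sketch of enforcing a quota on a per-volume basis at store time, matching the Quotas bullet above; the QuotaVolume class and its behavior are assumptions made for the example.

```python
# Illustrative per-volume quota check at store time (hypothetical names).

class QuotaExceeded(Exception):
    pass

class QuotaVolume:
    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.files = {}                      # path -> data

    def used(self):
        return sum(len(d) for d in self.files.values())

    def store(self, path, data):
        new_usage = self.used() - len(self.files.get(path, b"")) + len(data)
        if new_usage > self.quota:           # the quota applies to the whole volume
            raise QuotaExceeded(f"volume quota of {self.quota} bytes exceeded")
        self.files[path] = data

vol = QuotaVolume(quota_bytes=10)
vol.store("/u/jo/a", b"12345")
try:
    vol.store("/u/jo/b", b"123456789")       # 5 + 9 > 10: rejected
except QuotaExceeded as e:
    print(e)
```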
Conclusions
Design changes in the Prototype Andrew File System improved scalability considerably
At a load of 20, the system was still not saturated
A server using the revised Andrew File System can serve more than 50 users
Changes in cache management, name resolution, server process structure, low-level storage representation
Volumes provide a level of Operational Transparency that is not supported by any other file system
Quota
Read-only replication
Simple and efficient backup
Thank You!
Questions?