Jan 18, 2016
Administrivia
• Quiz 3: memory management
  – to be posted today, due next Monday in class
• Project 2:
• HW 6-8: punted
• HW 9: file system report
• HW 10: ethics
• Labs 6 & 7:
• Case study:
04/21/23 2
File systems
• The concept of a file system is simple
  – the implementation of the abstraction for secondary storage
    • abstraction = files
  – logical organization of files into directories
    • the directory hierarchy
  – sharing of data between processes, people, and machines
    • access control, consistency, …
Files
• A file is a collection of data with some properties
  – contents, size, owner, last read/write time, protection, …
• Files may also have types
  – understood by the file system
    • device, directory, symbolic link
  – understood by other parts of the OS or by runtime libraries
    • executable, dll, source code, object code, text file, …
• Type can be encoded in the file’s name or contents
  – Windows encodes type in the name
    • .com, .exe, .bat, .dll, .jpg, .mov, .mp3, …
  – old Mac OS stored the name of the creating program along with the file
  – Unix has a smattering of both
    • in content via magic numbers or initial characters
Basic operations
Windows
• CreateFile(name, CREATE)
• CreateFile(name, OPEN)
• ReadFile(handle, …)
• WriteFile(handle, …)
• FlushFileBuffers(handle, …)
• SetFilePointer(handle, …)
• CloseHandle(handle, …)
• DeleteFile(name)
• CopyFile(existing, new)
• MoveFile(existing, new)
Unix
• creat(name, mode)
• open(name, flags)
• read(fd, buf, len)
• write(fd, buf, len)
• fsync(fd)
• lseek(fd, offset, whence)
• close(fd)
• unlink(name)
• rename(old, new)
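The Unix-style calls above can be sketched with Python’s os-module wrappers around the same system calls (the file name demo.txt is just a scratch name for illustration):

```python
import os

# creat/open: os.open is a thin wrapper over open(2)
fd = os.open("demo.txt", os.O_CREAT | os.O_RDWR, 0o644)

os.write(fd, b"hello, file system\n")   # write(fd, buf, len)
os.fsync(fd)                            # flush cached blocks to disk
os.lseek(fd, 0, os.SEEK_SET)            # seek back to the start
data = os.read(fd, 100)                 # read(fd, buf, len)
os.close(fd)                            # close(fd)

os.rename("demo.txt", "demo2.txt")      # rename(old, new)
os.unlink("demo2.txt")                  # unlink(name) removes the file
print(data)                             # b'hello, file system\n'
```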
File access methods
• Some file systems provide different access methods that specify ways the application will access data
  – sequential access
    • read bytes one at a time, in order
  – direct access
    • random access given a block/byte #
  – record access
    • file is an array of fixed- or variable-sized records
  – indexed access
    • FS contains an index to a particular field of each record in a file
    • apps can find a record based on the value in that field (similar to a DB)
• Why do we care about distinguishing sequential from direct access?
  – what might the FS do differently in these cases?
Directories
• Directories provide:
  – a way for users to organize their files
  – a convenient file name space for both users and FS’s
• Most file systems support multi-level directories
  – naming hierarchies (/, /usr, /usr/local, /usr/local/bin, …)
• Most file systems support the notion of a current directory
  – absolute names: fully qualified, starting from the root of the FS
      bash$ cd /usr/local   (absolute)
  – relative names: specified with respect to the current directory
      bash$ cd bin          (relative, equivalent to cd /usr/local/bin)
Directory internals
• A directory is typically just a file that happens to contain special metadata
  – directory = list of (name of file, file attributes)
  – attributes include such things as:
    • size, protection, location on disk, creation time, access time, …
  – the directory list is usually unordered (effectively random)
    • when you type “ls”, the “ls” command sorts the results for you
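A quick sketch of the “unordered directory” point, using Python’s os.scandir (the file names are made up):

```python
import os, tempfile

# Create a scratch directory with three files.
d = tempfile.mkdtemp()
for name in ("b.txt", "a.txt", "c.txt"):
    open(os.path.join(d, name), "w").close()

# os.scandir yields entries in whatever order the FS stores them;
# no sorting is promised -- "ls" sorts for display.
raw = [entry.name for entry in os.scandir(d)]
print(sorted(raw))   # what "ls" would print: ['a.txt', 'b.txt', 'c.txt']
```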
Path name translation
• Let’s say you want to open “/one/two/three”
      fd = open(“/one/two/three”, O_RDWR);
• What goes on inside the file system?
  – open directory “/” (well known, can always find it)
  – search that directory for “one”, get the location of “one”
  – open directory “one”, search for “two”, get the location of “two”
  – open directory “two”, search for “three”, get the location of “three”
  – open file “three”
  – (of course, permissions are checked at each step)
• The FS spends lots of time walking down directory paths
  – this is why open is separate from read/write (session state)
  – the OS will cache prefix lookups to enhance performance
    • /a/b, /a/bb, /a/bbb all share the “/a” prefix
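The walk described above can be sketched with an in-memory toy: directories as dicts mapping names to subdirectories or file contents (all names here are hypothetical):

```python
# Toy path-name translation: dicts are directories, strings are files.
root = {"one": {"two": {"three": "file contents"}}}

def lookup(path):
    node = root                      # start at "/" (well known)
    for component in path.strip("/").split("/"):
        if not isinstance(node, dict):
            raise NotADirectoryError(component)
        node = node[component]       # search directory for next component
    return node

print(lookup("/one/two/three"))      # -> file contents
```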
File protection
• The FS must implement some kind of protection system
  – to control who can access a file (user)
  – to control how they can access it (e.g., read, write, or exec)
• More generally:
  – generalize files to objects (the “what”)
  – generalize users to principals (the “who”, user or program)
  – generalize read/write to actions (the “how”, or operations)
• A protection system dictates whether a given action performed by a given principal on a given object should be allowed
  – e.g., you can read or write your files, but others cannot
  – e.g., you can read /etc/motd but you cannot write to it
Model for representing protection
• Two different ways of thinking about it:
  – access control lists (ACLs)
    • for each object, keep a list of principals and each principal’s allowed actions
  – capabilities
    • for each principal, keep a list of objects and the principal’s allowed actions
• Both can be represented with the following access matrix (rows are principals, columns are objects; a column is the object’s ACL, a row is the principal’s capability list):

              /etc/passwd    /home/gribble    /home/guest
  root        rw             rw               rw
  gribble     r              rw               r
  guest                                       r
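One way to see the duality is to derive both views from the same matrix; a minimal sketch (the guest entry is assumed to be read access to /home/guest):

```python
# The access matrix as a dict keyed by (principal, object).
matrix = {
    ("root",    "/etc/passwd"):   "rw",
    ("root",    "/home/gribble"): "rw",
    ("root",    "/home/guest"):   "rw",
    ("gribble", "/etc/passwd"):   "r",
    ("gribble", "/home/gribble"): "rw",
    ("gribble", "/home/guest"):   "r",
    ("guest",   "/home/guest"):   "r",
}

def acl(obj):
    # A column of the matrix: stored with the object.
    return {p: a for (p, o), a in matrix.items() if o == obj}

def capabilities(principal):
    # A row of the matrix: stored with the principal.
    return {o: a for (p, o), a in matrix.items() if p == principal}

print(acl("/etc/passwd"))      # {'root': 'rw', 'gribble': 'r'}
print(capabilities("guest"))   # {'/home/guest': 'r'}
```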
ACLs vs. Capabilities
• Capabilities are easy to transfer
  – they are like keys: you can hand them off
  – they make sharing easy
• ACLs are easier to manage
  – object-centric, easy to grant and revoke
  – to revoke a capability, you need to keep track of which principals have it
    • hard to do, given that principals can hand off capabilities
• ACLs grow large when an object is heavily shared
  – can simplify by using “groups”
    • put users in groups, put groups in ACLs
    • you are all in the “VMware powerusers” group on Win2K
  – additional benefit
    • changing group membership affects ALL objects that have this group in their ACL
The original Unix file system
• Dennis Ritchie and Ken Thompson, Bell Labs, 1969
• “UNIX rose from the ashes of a multi-organizational effort in the early 1960s to develop a dependable timesharing operating system” – Multics
• Designed for a “workgroup” sharing a single system
• Did its job exceedingly well
  – although it has been stretched in many directions and made ugly in the process
• A wonderful study in engineering tradeoffs
All disks are divided into five parts …
• Boot block
  – can boot the system by loading from this block
• Superblock
  – specifies the boundaries of the next 3 areas, and contains the heads of the freelists of i-nodes and file blocks
• i-node area
  – contains descriptors (i-nodes) for each file on the disk; all i-nodes are the same size; the head of the freelist is in the superblock
• File contents area
  – fixed-size blocks; the head of the freelist is in the superblock
• Swap area
  – holds processes that have been swapped out of memory
So …
• You can attach a disk to a dead system …
• Boot it up …
• Find, create, and modify files …
  – because the superblock is at a fixed place, and it tells you where the i-node area and file contents area are
  – by convention, the second i-node is the root directory of the volume
i-node format
• User number
• Group number
• Protection bits
• Times (file last read, file last written, i-node last written)
• File code: specifies whether the i-node represents a directory, an ordinary user file, or a “special file” (typically an I/O device)
• Size: length of file in bytes
• Block list: locates the contents of the file (in the file contents area)
  – more on this soon!
• Link count: number of directories referencing this i-node
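As a rough sketch, the fields above could be modeled like this (field names are illustrative, not the real V7 struct members):

```python
from dataclasses import dataclass, field

@dataclass
class INode:
    uid: int                 # user number
    gid: int                 # group number
    mode: int                # protection bits + file code (dir/file/special)
    atime: int               # file last read
    mtime: int               # file last written
    ctime: int               # i-node last written
    size: int                # length of file in bytes
    block_list: list = field(default_factory=list)  # the 13 block pointers
    link_count: int = 1      # directories referencing this i-node

# i-node 2 is the root directory by convention
root = INode(uid=0, gid=0, mode=0o40755, atime=0, mtime=0, ctime=0, size=0)
print(root.link_count)       # -> 1
```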
The flat (i-node) file system
• Each file is known by a number, which is the number of its i-node
  – seriously – 1, 2, 3, etc.!
  – why is it called “flat”?
• Files are created empty, and grow when extended through writes
The tree (directory, hierarchical) file system
• A directory is a flat file of fixed-size entries
• Each entry consists of an i-node number and a file name

    i-node number    file name
    152              .
    18               ..
    216              my_file
    4                another_file
    93               oh_my_gosh
    144              a_directory

• It’s as simple as that!
The “block list” portion of the i-node (Unix Version 7)
• Points to blocks in the file contents area
• Must be able to represent very small and very large files. How?
• Each i-node contains 13 block pointers
  – the first 10 are “direct pointers” (pointers to 512B blocks of file data)
  – then single, double, and triple indirect pointers
[Diagram omitted: the 13 block pointers; entries 0–9 point directly at data blocks, entry 10 at a single indirect block, 11 at a double indirect block, and 12 at a triple indirect block.]
So …
• Only occupies 13 x 4B in the i-node
• Can get to 10 x 512B = a 5120B file directly
  – (10 direct pointers; blocks in the file contents area are 512B)
• Can get to 128 x 512B = an additional 65KB with a single indirect reference
  – (the 11th pointer in the i-node gets you to a 512B block in the file contents area that contains 128 4B pointers to blocks holding file data)
• Can get to 128 x 128 x 512B = an additional 8MB with a double indirect reference
  – (the 12th pointer in the i-node gets you to a 512B block in the file contents area that contains 128 4B pointers to 512B blocks in the file contents area, each of which contains 128 4B pointers to 512B blocks holding file data)
• Can get to 128 x 128 x 128 x 512B = an additional 1GB with a triple indirect reference
  – (the 13th pointer in the i-node gets you to a 512B block in the file contents area that contains 128 4B pointers to 512B blocks, each of which contains 128 4B pointers to 512B blocks, each of which contains 128 4B pointers to 512B blocks holding file data)
• Maximum file size is 1GB + a smidge
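The arithmetic can be checked directly: with 512B blocks and 4B pointers, each indirect block holds 128 pointers:

```python
BLOCK = 512
PTRS = BLOCK // 4                      # 128 pointers per indirect block

direct = 10 * BLOCK                    # 5,120 B via 10 direct pointers
single = PTRS * BLOCK                  # +65,536 B via single indirect
double = PTRS * PTRS * BLOCK           # +8 MB via double indirect
triple = PTRS * PTRS * PTRS * BLOCK    # +1 GB via triple indirect

max_file = direct + single + double + triple
print(max_file)                        # 1,082,201,088 bytes: "1GB + a smidge"
```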
• A later version of Bell Labs Unix utilized 12 direct pointers rather than 10
  – Why?
• Berkeley Unix went to 1KB block sizes
  – What’s the effect on the maximum file size?
    • 256 x 256 x 256 x 1K = 17GB + a smidge
  – What’s the price?
• Suppose you went to 4KB blocks?
  – 1K x 1K x 1K x 4K = 4TB + a smidge
• Linux?
• How many i-nodes?
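The same layout generalizes to other block sizes; a small sketch assuming 4B pointers and the 10-direct / single / double / triple structure throughout:

```python
def max_file_size(block):
    # Maximum file size for a given block size, with 4B block pointers.
    ptrs = block // 4
    return block * (10 + ptrs + ptrs**2 + ptrs**3)

for block in (512, 1024, 4096):
    print(block, round(max_file_size(block) / 2**30, 2), "GiB")
# 512B blocks -> ~1 GiB; 1KB -> ~16 GiB ("17 GB" in decimal units);
# 4KB blocks  -> ~4 TiB
```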
File system consistency
• Both i-nodes and file blocks are cached in memory
• The “sync” command forces memory-resident disk information to be written to disk
  – the system does a sync every few seconds
• A crash or power failure between syncs can leave an inconsistent disk
• You could reduce the frequency of problems by reducing caching, but performance would suffer big-time
i-check: consistency of the flat file system
• Is each block on exactly one list?
  – create a bit vector with as many entries as there are blocks
  – follow the free list and each i-node’s block list
  – when a block is encountered, examine its bit
    • if the bit was 0, set it to 1
    • if the bit was already 1:
      – if the block is both in a file and on the free list, remove it from the free list and cross your fingers
      – if the block is in two files, call support!
  – if there are any 0’s left at the end, put those blocks on the free list
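A toy sketch of the bit-vector pass (the block numbers and file names are made up):

```python
# Every block should appear on exactly one list: the free list or
# exactly one file's block list.
NUM_BLOCKS = 8
free_list = [0, 1, 5]
inode_block_lists = {"file_a": [2, 3], "file_b": [4, 6]}  # block 7 is lost

seen = [0] * NUM_BLOCKS
duplicates = []

for blk in free_list + [b for bl in inode_block_lists.values() for b in bl]:
    if seen[blk]:
        duplicates.append(blk)   # on two lists: needs manual repair
    seen[blk] = 1

# Any 0 bits left are unreferenced blocks: reattach to the free list.
lost = [b for b, bit in enumerate(seen) if bit == 0]
print(duplicates, lost)          # -> [] [7]
```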
d-check: consistency of the directory file system
• Do the directories form a tree?
• Does the link count of each file equal the number of directory links to it?
  – I will spare you the details
    • uses a zero-initialized vector of counters, one per i-node
    • walk the tree, then visit every i-node
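The counter-vector idea can be sketched in a few lines (the i-node numbers are made up; i-node 2 is the root by convention):

```python
# Directory i-node -> list of (name, i-node) entries it contains.
directories = {
    2: [("a", 10), ("b", 11)],   # root directory
    11: [("c", 10)],             # "c" is a hard link to the same file as "a"
}
stored_link_counts = {10: 2, 11: 1}

# Walk the tree, counting references to each i-node.
counted = {}
for entries in directories.values():
    for _name, ino in entries:
        counted[ino] = counted.get(ino, 0) + 1

# Then visit every i-node and compare the stored count.
mismatches = {i for i, n in stored_link_counts.items() if counted.get(i, 0) != n}
print(mismatches)                # -> set(): link counts are consistent
```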
File Allocation Table (FAT)
• In March 1983, IBM launched the PC XT computer, which featured a “massive” 10MB hard disk
  – DOS 2.0 supported hierarchical directories
  – FAT12
• In 1984 IBM released the PC AT, which featured a 20MB hard disk
  – DOS 3.0
  – FAT16
• Windows 95 (OSR2): FAT32
  – supports > 4GB partitions
All partitions are divided into …
• Boot Sector
• Reserved Sectors
• FAT #1
• FAT #2
• Root Directory (FAT16)
• Data Region (to the end of the partition)
Sector & Cluster
• Sector size: 512B
• Cluster: equal-size block in the data region, > 2KB
  – FAT16
    • max 2^16 clusters (64K)
    • max cluster size 32KB
    • max partition size 2GB
  – FAT32
    • max 2^28 clusters (256M)
    • max cluster size 32KB
    • max partition size 8TB
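The partition limits follow directly from cluster count times maximum cluster size:

```python
KB, GB, TB = 2**10, 2**30, 2**40
cluster = 32 * KB                 # maximum cluster size

fat16_max = 2**16 * cluster       # 64K clusters x 32KB
fat32_max = 2**28 * cluster       # 256M clusters x 32KB

print(fat16_max // GB, "GB")      # -> 2 GB
print(fat32_max // TB, "TB")      # -> 8 TB
```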
FAT Entries
• Each file occupies one or more clusters
  – chain of clusters: a singly linked list
• One FAT entry per cluster
  – FAT16: up to 2^16 entries
  – FAT32: up to 2^28 entries
• Each entry records one of five things:
  – cluster # of the next cluster in the chain
  – end of cluster chain (EOC)
  – free/unused (0)
  – bad cluster
  – reserved
• No free cluster list !!!
• chkdsk
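Walking a cluster chain is just following a linked list threaded through the FAT; a toy sketch with made-up cluster numbers:

```python
# FAT entries: next-cluster number, "EOC" (end of chain), or 0 (free).
EOC = "EOC"
fat = {3: 7, 7: 4, 4: EOC, 5: 0}   # a file starting at cluster 3

def cluster_chain(start):
    chain, cur = [], start
    while cur != EOC:
        chain.append(cur)
        cur = fat[cur]             # follow the link to the next cluster
    return chain

print(cluster_chain(3))            # -> [3, 7, 4]
```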
Directory
• Special file with a table of file/subdirectory entries
• Directory entries:
  – 32 bytes each
  – starting cluster #
  – max file size 4GB
  – undelete ???
Protection
• Objects: individual files
• Principals: owner/group/world
• Actions: read/write/execute
• This is pretty simple and rigid, but it has proven to be about what we can handle!
Review: Model for representing protection
• Two different ways of thinking about it:
  – access control lists (ACLs)
    • for each object, keep a list of principals and each principal’s allowed actions
  – capabilities
    • for each principal, keep a list of objects and the principal’s allowed actions
• Both can be represented with the following access matrix (rows are principals, columns are objects; a column is the object’s ACL, a row is the principal’s capability list):

              /etc/passwd    /home/gribble    /home/guest
  root        rw             rw               rw
  gribble     r              rw               r
  guest                                       r
Review: ACLs vs. Capabilities
• Capabilities are easy to transfer
  – they are like keys: you can hand them off
  – they make sharing easy
• ACLs are easier to manage
  – object-centric, easy to grant and revoke
  – to revoke a capability, you need to keep track of which principals have it
    • hard to do, given that principals can hand off capabilities
• ACLs grow large when an object is heavily shared
  – can simplify by using “groups”
    • put users in groups, put groups in ACLs
    • you are all in the “VMware powerusers” group on Win2K
  – additional benefit
    • changing group membership affects ALL objects that have this group in their ACL
Advanced file system implementations
• We’ve looked at disks
• We’ve looked at file systems generically
• We’ve looked in detail at the implementation of the original Bell Labs UNIX file system
  – a great simple yet practical design
  – exemplifies engineering tradeoffs that are pervasive in system design
• Now we’ll look at a more advanced file system
  – Berkeley Software Distribution (BSD) UNIX Fast File System (FFS)
    • enhanced performance for the UNIX file system
    • at the heart of most UNIX file systems today
BSD UNIX FFS
• The original (1970) UNIX file system was elegant but slow
  – poor disk throughput
    • far too many seeks, on average
• The Berkeley UNIX project did a redesign in the mid ’80s
  – McKusick, Joy, Fabry, and Leffler
  – improved disk throughput, decreased average request response time
  – the principal idea is that FFS is aware of disk structure
    • it places related things on nearby cylinders to reduce seeks
Recall the UNIX disk layout
• Boot block
  – can boot the system by loading from this block
• Superblock
  – specifies the boundaries of the next 3 areas, and contains the heads of the freelists of i-nodes and file blocks
• i-node area
  – contains descriptors (i-nodes) for each file on the disk; all i-nodes are the same size; the head of the freelist is in the superblock
• File contents area
  – fixed-size blocks; the head of the freelist is in the superblock
• Swap area
  – holds processes that have been swapped out of memory
Recall the UNIX block list / file content structure
• directory entries point to i-nodes – file headers
• each i-node contains a bunch of stuff, including 13 block pointers
  – the first 10 point to file blocks (i.e., 512B blocks of file data)
  – then single, double, and triple indirect indexes
UNIX FS data and i-node placement
• The original UNIX FS had two major performance problems:
  – data blocks are allocated randomly in aging file systems
    • blocks for the same file are allocated sequentially when the FS is new
    • as the FS “ages” and fills, it must allocate blocks freed up when other files are deleted
      – deleted files are essentially randomly placed
      – so, blocks for new files become scattered across the disk!
  – i-nodes are allocated far from data blocks
    • all i-nodes are at the beginning of the disk, far from the data
    • traversing file name paths and manipulating files and directories requires going back and forth from i-nodes to data blocks
• BOTH of these generate many long seeks!
FFS: Cylinder groups
• FFS addressed these problems using the notion of a cylinder group
  – the disk is partitioned into groups of cylinders
  – data blocks from a file are all placed in the same cylinder group
  – files in the same directory are placed in the same cylinder group
  – the i-node for a file is placed in the same cylinder group as the file’s data
• Introduces a free space requirement
  – to be able to allocate according to cylinder group, the disk must have free space scattered across all cylinders
  – in FFS, 10% of the disk is reserved just for this purpose!
    • good insight: keep the disk partially free at all times!
    • this is why it may be possible for df to report >100% full!
FFS: Increased block size, fragments
• The original UNIX FS had 512B blocks
  – even more seeking
  – small maximum file size (~1GB)
• A later version had 1KB blocks
  – still pretty puny
• FFS uses a 4KB block size
  – allows for very large files (4TB)
  – but introduces internal fragmentation
    • on average, each file wastes 2KB! (why?)
    • worse, the average file size is only about 1KB! (why?)
  – fix: introduce “fragments”
    • 1KB pieces of a block
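A back-of-the-envelope sketch of why fragments help, comparing allocation in whole 4KB blocks versus 1KB units (the file sizes are made up, and real FFS uses fragments only for a file’s tail):

```python
def waste(file_size, alloc_unit):
    # Internal fragmentation: allocated space minus actual file size.
    used = -(-file_size // alloc_unit) * alloc_unit  # round up
    return used - file_size

sizes = [100, 1000, 3000, 5000]                   # bytes, illustrative
print(sum(waste(s, 4096) for s in sizes))         # 4KB blocks -> 11380
print(sum(waste(s, 1024) for s in sizes))         # 1KB units  -> 1140
```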
FFS: Awareness of hardware characteristics
• The original UNIX FS was unaware of disk parameters
• FFS parameterizes the FS according to disk and CPU characteristics
  – e.g., it accounts for CPU interrupt and processing time, plus disk characteristics, in deciding where to lay out sequential blocks of a file, to reduce rotational latency
FFS: Performance
• This was a long time ago – look at the relative performance, not the absolute performance!
[Performance table omitted. Annotations from the original figure: the old FS maxed out the CPU doing block allocation; FFS results are labeled by block size / fragment size; 983KB/s was the theoretical disk throughput.]
FFS: Faster, but less elegant (warts make it faster but ugly)
• Multiple cylinder groups
  – effectively, treat a single big disk as multiple small disks
  – additional free space requirement (this is cheap, though)
• Bigger blocks
  – but fragments, to avoid excessive internal fragmentation
• Aware of hardware characteristics
  – ugh!
In our most recent exciting episodes …
• Original Bell Labs UNIX file system
  – a simple yet practical design
  – exemplifies engineering tradeoffs that are pervasive in system design
  – elegant but slow
    • and performance gets worse as disks get larger
• BSD UNIX Fast File System (FFS)
  – solves the throughput problem
    • larger blocks
    • cylinder groups
    • awareness of disk performance details
Review: File system consistency
• Both i-nodes and file blocks are cached in memory
• The “sync” command forces memory-resident disk information to be written to disk
  – the system does a sync every few seconds
• A crash or power failure between syncs can leave an inconsistent disk
• You could reduce the frequency of problems by reducing caching, but performance would suffer big-time
i-check: consistency of the flat file system
• Is each block on exactly one list?
  – create a bit vector with as many entries as there are blocks
  – follow the free list and each i-node’s block list
  – when a block is encountered, examine its bit
    • if the bit was 0, set it to 1
    • if the bit was already 1:
      – if the block is both in a file and on the free list, remove it from the free list and cross your fingers
      – if the block is in two files, call support!
  – if there are any 0’s left at the end, put those blocks on the free list
d-check: consistency of the directory file system
• Do the directories form a tree?
• Does the link count of each file equal the number of directory links to it?
  – I will spare you the details
    • uses a zero-initialized vector of counters, one per i-node
    • walk the tree, then visit every i-node
Both are real dogs when a crash occurs
• Buffering is necessary for performance
• Suppose a crash occurs during a file creation:
  1. Allocate a free i-node
  2. Point a directory entry at the new i-node
• In general, after a crash the disk data structures may be in an inconsistent state
  – metadata updated but data not
  – data updated but metadata not
  – either or both partially updated
• fsck (i-check, d-check) is very slow
  – must touch every block
  – gets worse as disks get larger!
Journaling file systems
• Became popular ~2002
• There are several options that differ in their details
  – Ext3, ReiserFS, XFS, JFS, NTFS
• Basic idea
  – update metadata, or all data, transactionally
    • “all or nothing”
  – if a crash occurs, you may lose a bit of work, but the disk will be in a consistent state
    • more precisely, you will be able to quickly get it to a consistent state by using the transaction log/journal – rather than by scanning every disk block and checking sanity conditions
Where is the Data?
• In the file systems we have seen so far, the data lives in two places:
  – on disk
  – in in-memory caches
• The caches are crucial to performance, but also the source of the potential “corruption on crash” problem
• The basic idea of the solution:
  – always leave the “home copy” of the data in a consistent state
  – make updates persistent by writing them to a sequential (chronological) journal partition/file
  – at your leisure, push the updates (in order) to the home copies and reclaim the journal space
Redo log
• Log: an append-only file containing log records
  – <start t>
    • transaction t has begun
  – <t, x, v>
    • transaction t has updated block x and its new value is v
    • can log block “diffs” instead of full blocks
  – <commit t>
    • transaction t has committed – its updates will survive a crash
• Committing involves writing the redo records – the home data needn’t be updated at this time
If a crash occurs
• Recover the log
• Redo committed transactions
  – walk the log in order and re-execute updates from all committed transactions
  – aside: note that an update (write) is idempotent: it can be done any positive number of times with the same result
• Uncommitted transactions
  – ignore them; it’s as though the crash occurred a tiny bit earlier …
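Redo recovery can be sketched in a few lines: scan for commit records, then replay the updates of committed transactions in log order (the record formats here are simplified):

```python
# A redo log as it might look after a crash: t1 committed, t2 did not.
log = [
    ("start", "t1"), ("t1", "blockA", b"new A"), ("commit", "t1"),
    ("start", "t2"), ("t2", "blockB", b"new B"),   # t2 never committed
]

def recover(log):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    disk = {}
    for rec in log:                   # walk the log in order
        if rec[0] in ("start", "commit"):
            continue
        t, block, value = rec
        if t in committed:
            disk[block] = value       # idempotent: safe to re-apply
    return disk                       # uncommitted t2 is simply ignored

print(recover(log))                   # -> {'blockA': b'new A'}
```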
Managing the Log Space
• A cleaner thread walks the log in order, updating the home locations of the updates in each transaction
  – note that idempotence is important here – we may crash while cleaning is going on
• Once a transaction has been reflected to the home blocks, it can be deleted from the log
Impact on performance
• The log is one big contiguous write
  – very efficient
• And you do fewer synchronous writes, which are very costly in terms of performance
• So journaling file systems can actually improve performance (immensely)
• As well as making recovery very efficient
Want to know more?
• This is a direct ripoff of database system techniques
  – but it is not what Microsoft Windows Vista was supposed to be before they backed off – “the file system is a database”
  – nor is it a “log-structured file system” – that’s a file system in which there is nothing but a log (“the log is the file system”)
• “New-Value Logging in the Echo Replicated File System”, Andy Hisgen, Andrew Birrell, Charles Jerian, Timothy Mann, Garret Swart
  – http://citeseer.ist.psu.edu/hisgen93newvalue.html
NTFS
• http://www.ntfs.com
• http://technet.microsoft.com/en-us/library/cc758691(v=ws.10).aspx
• Master File Table
• Multiple Data Streams
S.M.A.R.T.
• Self-Monitoring, Analysis and Reporting Technology
• http://en.wikipedia.org/wiki/S.M.A.R.T
• HDD Duty Cycle