Sun Microsystems Proprietary / Confidential Need to Know
ZFS: The Zettabyte Filesystem February 10, 2003 Page 1
ZFS: The Zettabyte Filesystem
The Perfect Filesystem
  Write my data
  Keep it safe
  Read it back
  Do it fast
  Don't hassle me
Existing Filesystems
  Write my data?
    limited size (16TB for UFS)
    limited number of files
    limited directory entries
Existing Filesystems
  Keep it safe?
    bit rot causes silent data corruption
    no defense against phantom writes, misdirections, other firmware bugs
    no defense against administrative errors (e.g. swap on active filesystem device)
    no security: spying, tampering, theft
Existing Filesystems
  Read it back?
    no data integrity checks
    no data authentication
    data might be good, might be bad
      don't know
      couldn't fix it if we did
    like running a server without DRAM parity
Existing Filesystems
  Do it fast?
    linear-time directory ops
    linear-time newfs(1M), fsck(1M)
    limited read/write concurrency
    fixed block size
    fixed stripe width
    poor random write performance
    slow mirroring
Existing Filesystems
  Don't hassle me?
    create a partition for every FS
    grow: manual process
    shrink: not possible
    remember a bunch of c0t0d0s0 names
    edit /etc/vfstab by hand
    wait around for fsck(1M)
    take system down to upgrade disks
ZFS Objective
End the suffering
The ZFS Filesystem
  Write my data!
    immense capacity (128-bit)
    there's no SI prefix for this!
      zettabyte = 70-bit (a billion TB)
      ZFS capacity: 256 quadrillion ZB
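The capacity figures on this slide can be checked with a few lines of integer arithmetic. This is a sketch under two assumptions that the slide implies but does not spell out: "128-bit" is read as 2**128 bytes of capacity, and "zettabyte = 70-bit" is read as 1 ZB = 2**70 bytes (the binary convention).

```python
# Assumptions (not stated explicitly in the slides): capacities are in bytes
# and the binary convention is used, so 1 TB = 2**40 and 1 ZB = 2**70.
POOL_BYTES = 2**128
ZETTABYTE = 2**70
TERABYTE = 2**40

zb_in_tb = ZETTABYTE // TERABYTE       # 2**30, roughly a billion TB
pool_in_zb = POOL_BYTES // ZETTABYTE   # 2**58 ZB

# 2**58 = 256 * 2**50, and 2**50 is the binary "quadrillion" (about 1.13e15).
binary_quadrillions = pool_in_zb // 2**50

print(zb_in_tb, pool_in_zb, binary_quadrillions)
```

In decimal terms 2**58 is about 2.88e17, so "256 quadrillion ZB" holds when a quadrillion is taken as 2**50 rather than 10**15.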
The ZFS Filesystem
  Keep it safe!
    self-healing data
    copes with every class of error
      bit rot
      phantom writes
      misdirected reads and writes
      administrative errors
    disk scrubbing
    real-time remote replication
    encryption
The ZFS Filesystem
  Read it back!
    provable data integrity model
    detects and corrects errors
The ZFS Filesystem
  Do it fast!
    write sequentialization
    dynamic striping
    multiple block sizes
    constant-time snapshots
    concurrent, constant-time directory ops
    byte-range locking for concurrent writes
    sync semantics at async speed (critical for good NFS performance)
The ZFS Filesystem
  Don't hassle me!
    FS creation is as easy as mkdir
    grow and shrink are automatic
    no raw device names to remember
    no volumes at all
    no more fsck(1M)
    no more editing /etc/vfstab
    all administration online
Organizing Principles
  Simple administration
  Extensible, modular design
  Provable data integrity model
  Always-available data
  High performance
Simple Administration
  Pooled storage
  Immense capacity
  Quotas and Reservations
  User undo
Volumes vs. Storage Pools
  Traditional volumes
    partition per FS
    FS/volume interface: block-level I/O
  Pooled Storage
    FSes share space
    ZFS/pool interface: object transactions
  [Diagram: three FS, each on its own Volume (Virtual Disk); no space sharing; naming and storage tightly bound]
  [Diagram: three ZFS filesystems on one Storage Pool; all space shared; naming and storage decoupled]
Volumes vs. Storage Pools, cont'd
  Both manage disks and provide mirroring
  Traditional FS/volume model: volume provides space, but FS manages it
    volume doesn't know which blocks are in use
    FS can't easily grow or shrink
    FS creation requires new partition
  ZFS model: SPA provides and manages space
    many filesystems share space
    grow and shrink are implicit
    FS create/delete are just like mkdir/rmdir
    only one pool to manage (vs. volume per FS)
Volumes vs. Storage Pools, cont'd
  Advantages of pooled storage
    reduces fragmentation
    simplifies administration
    decouples logical and physical structure
    filesystems named by default mount point
  Proof of concept: tmpfs
    all tmpfs mounts share common swap space
    administration is trivial: swap -a / swap -d
  FS becomes more powerful administrative point
    no longer tied to physical configuration
    more like a directory with heritable attributes
Immense Capacity
  128-bit storage pools
  128-bit filesystems
  128-bit files, but limited to 64-bit access until we have 128-bit OS support
  64-bit max files per dataset
  64-bit max files per directory
  statvfs128() will be needed
Quotas and Reservations
  Traditional model
    quotas: per-user UFS bolt-on (cred structures all the way down to bmap)
    reservations: no (nothing to reserve against)
  ZFS model
    FS is now the administrative point
    FS per home directory, project, workspace, ...
    quotas: per-FS
    reservations: per-FS
    group quotas, hierarchical quotas almost free
User Undo
  Unlimited snapshots
    recover previous version of a file
  Undelete
    recover recently deleted file
  No sysadmin intervention required
Organizing Principles
  Simple administration
  Extensible, modular design
  Provable data integrity model
  Always-available data
  High performance
ZFS is Object-Based
  An object is a "flat file"
  Everything is stored in objects: user data, znodes, directories, free block lists, etc.
  Arbitrarily complex operations reduce to reads and writes on a set of objects
  Simplifies interfaces, design, and analysis
    single I/O path
    single interposition point
    single object read/write model
ZFS Components
  ZPL  ZFS POSIX Layer: standard POSIX semantics (permission, mode, timestamps); translates vnode ops into object read/write
  ZAP  ZFS Attribute Processor: constant-time, concurrent attribute operations (directories, object properties, etc.)
  DMU  Data Management Unit: transactions, caching, object translations
  SPA  Storage Pool Allocator: space allocation, replication, checksums, resource controls, encryption, compression, fault management
SPA Components
  Gather non-dependent I/O into I/O groups
  Allocate space from metaslab layer
  Apply pluggable modules
    compression
    encryption
    checksum
  Dispatch parallel, async I/O to vdev stack
  Issue disk I/O
  [Diagram: DMU above SPA; SPA stack of IOG, metaslab allocator, compression, encryption, checksum, mirror vdev, two disk vdevs]
Organizing Principles
  Simple administration
  Extensible, modular design
  Provable data integrity model
  Always-available data
  High performance
Provable Data Integrity Model
  All operations are copy-on-write
    never overwrite live data
  All operations are transactional
    related changes succeed or fail as a whole
  All data is checksummed
    no silent data corruption
Copy-on-Write TX Model
  Problem: modify several objects atomically
  DMU provides transactional interface
    ZPL groups work into transactions
    DMU sends whole transactions to SPA
    SPA commits transaction groups
  SPA never modifies active blocks
    entire storage pool is a tree of blocks rooted at the "uberblock"
    transactions COW nodes of the tree
    transaction group is committed when uberblock is rewritten to point to new tree
Copy-on-Write TX Model
  [Diagram sequence, one slide per step]
    1. initial block tree
    2. write: COWs a data block
    3. COW its level-1 indirect block
    4. COW its level-2 indirect block
    5. rewrite the uberblock (atomic)
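The five steps above can be sketched as a toy block store. This is a minimal illustration, not ZFS code: the `Pool` class, its method names, and the representation of indirect blocks as tuples of child addresses are all assumptions made for the example.

```python
class Pool:
    """Toy block store (hypothetical): blocks are only ever newly allocated."""

    def __init__(self):
        self.blocks = {}       # address -> bytes (data) or tuple of child addresses
        self.next_addr = 0
        self.uberblock = None  # address of the root of the live tree

    def alloc(self, contents):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents   # always a fresh address, never an overwrite
        return addr

    def read(self, root, path):
        node = self.blocks[root]
        for index in path:
            node = self.blocks[node[index]]
        return node

    def cow_write(self, root, path, data):
        """Copy the data block and every indirect block above it; return the new root."""
        if not path:
            return self.alloc(data)
        children = list(self.blocks[root])
        children[path[0]] = self.cow_write(children[path[0]], path[1:], data)
        return self.alloc(tuple(children))

    def commit(self, new_root):
        self.uberblock = new_root      # the single atomic step


# Step 1: initial tree with four data blocks, two level-1 and one level-2 indirect block.
pool = Pool()
leaves = [pool.alloc(d) for d in (b"A", b"B", b"C", b"D")]
level1 = [pool.alloc((leaves[0], leaves[1])), pool.alloc((leaves[2], leaves[3]))]
pool.commit(pool.alloc((level1[0], level1[1])))

# Steps 2-4: COW the data block and both indirect blocks on its path.
old_root = pool.uberblock
new_root = pool.cow_write(old_root, (0, 1), b"B2")

# Step 5: rewrite the uberblock.
pool.commit(new_root)
```

Until `commit` runs, readers that start from the uberblock still see the old tree, and the untouched half of the tree is shared between both roots rather than copied.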
Snapshots
  COW TX model enables constant-time snapshots
    snapshot storage pool by copying its uberblock
    snapshot single FS by copying its root block
    snapshot single file by copying its dnode
  Provides data recovery and fixed target for backup
  Snapshot delta = incremental
  Unlimited number of snapshots
    cf. 1 with UFS, 32 with WAFL
Snapshots
  Save old uberblock - describes complete snapshot
  [Diagram: snapshot uberblock and current uberblock]
Checksums
  Traditional model: checksum stored with block
  Fine for detecting bit rot, but:
    can't detect phantom writes, misdirections
    can't validate the checksum itself
    can't protect against tampering
Checksums, cont'd
  SPA model: checksum stored with indirect block
  Self-validating
  Detects bit rot, phantom writes, misdirections, admin error (e.g. swap on active ZFS disk)
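A short sketch may help show why keeping the checksum in the parent (indirect) block catches phantom and misdirected writes. Everything here is hypothetical: the `disk` dictionary, the `(address, checksum)` block pointer, and the use of truncated SHA-256 as a stand-in checksum are assumptions for illustration only.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block does not match the checksum held by its parent."""


def checksum(data: bytes) -> bytes:
    # Assumption: any strong hash serves for the sketch; truncated to 64 bits.
    return hashlib.sha256(data).digest()[:8]


disk = {}  # address -> raw block contents


def write_block(addr: int, data: bytes) -> tuple:
    """Store the data and return the block pointer the parent would keep."""
    disk[addr] = data
    return (addr, checksum(data))


def read_block(pointer: tuple) -> bytes:
    addr, expected = pointer
    data = disk[addr]
    if checksum(data) != expected:
        raise ChecksumError(f"block {addr} failed verification")
    return data


def is_detected(pointer: tuple) -> bool:
    try:
        read_block(pointer)
        return False
    except ChecksumError:
        return True


ptr_v1 = write_block(7, b"version 1")

# Phantom write: the parent records version 2, but the disk silently dropped it.
phantom_ptr = (7, checksum(b"version 2"))

# Misdirected write: version 3 was meant for address 7 but landed at address 9.
disk[9] = b"version 3"
misdirected_ptr = (7, checksum(b"version 3"))

print(is_detected(ptr_v1), is_detected(phantom_ptr), is_detected(misdirected_ptr))
# prints: False True True
```

In both failure cases address 7 still holds an internally intact "version 1", so a checksum stored alongside that block would have validated it; only the parent's copy exposes the stale data.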
Checksums, cont'd
  Physical separation improves fault isolation, yet doesn't require additional I/O
  64-bit strength ensures data integrity
    provides 99.99999999999999999% ("nineteen nines") error detection probability
  Checksum vectoring provides flexibility
    weaker checksums for performance
    faster checksums in the future
    secure checksums for data authentication (uberblock checksum provides unforgeable signature for the entire storage pool)
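The "nineteen nines" figure follows from the checksum width. The sketch below assumes, as a simplification not stated on the slide, that a corrupted block's 64-bit checksum is effectively a uniformly random value, so an error goes unnoticed with probability 2**-64.

```python
import math
from decimal import Decimal, getcontext

CHECKSUM_BITS = 64
getcontext().prec = 30  # enough digits to see past the run of nines

# Assumption: a corrupted block matches the stored checksum with probability 2**-64.
miss = Decimal(1) / Decimal(2**CHECKSUM_BITS)
detect = 1 - miss

# Number of leading nines in the detection probability: floor(64 * log10(2)).
nines = math.floor(CHECKSUM_BITS * math.log10(2))

print(detect)  # 19 nines after the decimal point, then other digits
print(nines)   # 19
```

Written as a percentage, the same value has two nines before the decimal point and seventeen after it, which matches the figure on the slide.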
Organizing Principles
  Simple administration
  Extensible, modular design
  Provable data integrity model
  Always-available data
  High performance
Always-Available Data
  Always-consistent on-disk format
  Elimination of fsck(1M)
  Self-Healing Data
  Failure Prediction and Disk Scrubbing
  Hot Space
  Data Migration
  Real-Time Remote Replication
  User Undo
Always-Consistent On-Disk Format
  ZFS is always self-consistent
    follows from COW transaction model
  Doesn't depend on the intent log
  No more fsck(1M)
    no "clean bit"
    no off-line maintenance
    ZFS is always mountable
Self-Healing Data
  Media error under traditional FS:
    bad user data causes silent data corruption
    bad metadata causes SDC, panic, or both
  Media error under ZFS:
    checksum detects data corruption
    SPA gets valid data from another replica and uses it to repair the damaged one
    SPA returns valid data to application
    no sysadmin intervention required
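The ZFS read path described above can be sketched with a toy two-way mirror. The `Mirror` class and its methods are hypothetical, and SHA-256 stands in for whatever checksum the SPA would use; the point is only the sequence of detect, fetch from another replica, repair, return.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no replica matches the expected checksum."""


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # stand-in checksum (assumption)


class Mirror:
    """Toy mirror (hypothetical): each replica is a dict of address -> bytes."""

    def __init__(self, n_replicas: int = 2):
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, addr: int, data: bytes) -> tuple:
        for replica in self.replicas:
            replica[addr] = data
        return (addr, checksum(data))  # pointer kept by the parent block

    def read(self, pointer: tuple) -> bytes:
        addr, expected = pointer
        damaged = []
        for replica in self.replicas:
            data = replica.get(addr)
            if data is not None and checksum(data) == expected:
                for bad in damaged:
                    bad[addr] = data   # repair copies that failed verification
                return data            # the caller only ever sees valid data
            damaged.append(replica)
        raise ChecksumError(f"no valid replica for block {addr}")


mirror = Mirror()
ptr = mirror.write(3, b"payload")
mirror.replicas[0][3] = b"pay1oad"     # simulate a media error on one side
result = mirror.read(ptr)
```

This sketch repairs only the replicas it tried before finding a good copy; checking the remaining replicas as well is left out for brevity.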
Failure Prediction
  SPA automatically migrates data from failing devices to healthy devices
  Detects health by monitoring error rate
  Employs disk scrubbing to detect latent errors while they're still correctable
Hot Space
  [Diagram: hot spare model vs. "hot space" model]
  No more dedicated hot spares
    "hot space" spread across all devices
  Keeps all devices active
    uses all available I/O bandwidth
    improves drive utilization
    improves failure prediction
    prevents silent atrophy
Data Migration
  Allow transparent disk upgrades and data migration from failing devices
  Apply VM principles to storage
    DMU names blocks by 128-bit DVA (Data Virtual Address)
    high-order 64 bits specify metaslab
    SPA translates metaslab to
    SPA can migrate metaslabs from one vdev to another without affecting any DMU state
  Data remains available during migration
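The virtual-address idea can be sketched with plain integer packing. Only two facts come from the slide: a DVA is 128 bits and its high-order 64 bits name a metaslab. Treating the low 64 bits as an offset within the metaslab, and mapping each metaslab to a vdev name through a table, are assumptions made for this example (the slide's own description of the translation target is cut off).

```python
LOW_BITS = 64
LOW_MASK = (1 << LOW_BITS) - 1


def make_dva(metaslab: int, offset: int) -> int:
    """Pack a 128-bit DVA: high 64 bits = metaslab, low 64 bits = offset (assumed)."""
    return (metaslab << LOW_BITS) | (offset & LOW_MASK)


def split_dva(dva: int) -> tuple:
    return dva >> LOW_BITS, dva & LOW_MASK


# Hypothetical SPA-private table: metaslab id -> vdev currently holding it.
metaslab_to_vdev = {5: "disk0"}


def resolve(dva: int) -> tuple:
    metaslab, offset = split_dva(dva)
    return metaslab_to_vdev[metaslab], offset


dva = make_dva(5, 4096)
before = resolve(dva)

# Migration: only the SPA's table changes; the DVA held by the DMU does not.
metaslab_to_vdev[5] = "disk1"
after = resolve(dva)

print(before, after)
```

As with virtual memory, callers keep using the same address while the mapping underneath is updated.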
Real-Time Remote Replication
  Everything in ZFS is an object
  Every change is just a write to an object
  Writes are always batched into TX groups
  Contents of TX group can be sent async
    Latency insensitive!
    Occasional ACK for remote TX group commit
Organizing Principles
  Simple administration
  Extensible, modular design
  Provable data integrity model
  Always-available data
  High performance
High Performance
  Write Sequentialization
  Dynamic striping
  Parallel three-phase TX groups
  Intelligent prefetch
  Multiple block sizes
  Sync semantics at async speed
  Concurrent, constant-time directory ops
  POSIX-compliant concurrent writes
  Hot space
Write Sequentialization
  Traditional FS: random file writes become random disk writes
  ZFS: random file writes become sequential disk writes
    follows from COW model
    modified blocks are newly allocated
    SPA has complete allocation freedom
    SPA chooses sequential free blocks
  Cost of writing extra ZFS metadata more than offset by improved locality
Dynamic Striping
  Traditional striping: spread data across multiple devices at fixed stride
    Inflexible: can't change stripe width, can't add or remove devices
  Dynamic striping: round-robin allocation
    balances writes across all available devices
    enabled by COW model
  [Diagram: block numbers laid out across five devices]
      0    1    2    3    4
      5    6    7    8    9
     10   11   12   13   14
    ...  ...  ...  ...  ...
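Round-robin allocation across a changing set of devices can be sketched as follows. The `DynamicStripe` class and the device names are hypothetical; the sketch relies on the property the slide attributes to the COW model, namely that a block's location is recorded when it is written rather than computed from its number.

```python
class DynamicStripe:
    """Toy allocator (hypothetical): placement is recorded, not derived."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.cursor = 0
        self.placement = {}  # logical block number -> device it was written to

    def add_device(self, name: str) -> None:
        self.devices.append(name)   # takes effect for the next allocation

    def allocate(self, block: int) -> str:
        device = self.devices[self.cursor % len(self.devices)]
        self.cursor += 1
        self.placement[block] = device
        return device


stripe = DynamicStripe(["d0", "d1", "d2", "d3", "d4"])
for block in range(10):
    stripe.allocate(block)
placed_before = dict(stripe.placement)

# Widen the stripe online: existing blocks stay put, new writes use all six devices.
stripe.add_device("d5")
new_devices = {stripe.allocate(block) for block in range(10, 16)}
```

With a fixed stride, the device would be derived as `block % width`, so changing the width would invalidate the location of every block already written.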
Three-Phase Transaction Groups
  Open: accepting new transactions
  Quiescing: waiting for transactions to finish
  Syncing: pushing changes to disk
  Up to three transaction groups active
    one in each state - prevents burstiness
    uses all available disk bandwidth
  [Diagram: timeline of transaction groups passing through Open, Quiescing, Syncing, Closed]
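The three-stage pipeline can be sketched as a small state machine. The `TxgPipeline` class, the dictionary layout of a transaction group, and the single `advance` step that moves every group forward at once are simplifying assumptions for illustration.

```python
class TxgPipeline:
    """Toy pipeline (hypothetical): at most one group per stage."""

    def __init__(self):
        self.next_id = 1
        self.open = self._new_group()
        self.quiescing = None
        self.syncing = None
        self.closed = []   # groups whose changes have reached disk

    def _new_group(self) -> dict:
        group = {"id": self.next_id, "ops": []}
        self.next_id += 1
        return group

    def add(self, op: str) -> None:
        self.open["ops"].append(op)      # only the open group accepts work

    def advance(self) -> None:
        """Move every group one stage forward and open a fresh group."""
        if self.syncing is not None:
            self.closed.append(self.syncing)
        self.syncing = self.quiescing
        self.quiescing = self.open
        self.open = self._new_group()

    def active(self) -> dict:
        stages = {"open": self.open, "quiescing": self.quiescing, "syncing": self.syncing}
        return {name: group["id"] for name, group in stages.items() if group is not None}


pipeline = TxgPipeline()
pipeline.add("write a")
pipeline.advance()
pipeline.add("write b")
pipeline.advance()
pipeline.add("write c")
snapshot_of_stages = pipeline.active()   # three groups, one per stage
pipeline.advance()                       # group 1 finishes syncing
```

Because a new group opens as soon as the previous one starts quiescing, new transactions are never blocked while an older group is being written out.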
Multiple Block Sizes
  No block size is optimal for everything
    large blocks: less metadata
    small blocks: more efficient for small objects
    record-structured files have natural granularity; we want to match it to avoid read/modify/write
  ZFS supports any power of two block size
  Per-object granularity
    automatic block size selection by default
    manual override
  Enables transparent block-based compression
Multiple Block Sizes, cont'd
  Why not extents?
    extents don't COW: writes force extent breaks
    greater code complexity
  Multiple block sizes combine the simplicity of blocks with the metadata savings of extents
Sync Semantics at Async Speed
  Review: ZFS is always self-consistent on disk
  However: after system crash, ZFS won't contain transactions since last sync
  Use intent log to recover recent transactions
    log metadata only: lose recent writes (UFS)
    log user + metadata: recover everything (NFS)
    log to disk: wait for one sequential disk write
    log to NVRAM on I/O bus: fast (NetApp filers)
    log to NVRAM on main memory bus: blazing
  Ideal configuration: log all ops to NVRAM
    need HW/sales/marketing on board
    big payoff: only a system vendor can do this
Fast Directory Operations
  Large directories: need constant-time operations (lookup, create, delete)
  Hot directories: need concurrent operations
  Solution: extendible hashing
    block-based
    amortized growth cost
    short chains for constant-time ops
    per-block locking for high concurrency
    readdir: returns entries in hash-value order
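Extendible hashing itself is a standard technique, and a compact in-memory version is sketched below. It is not the ZAP implementation: the class names, the bucket capacity of four entries, and the use of CRC-32 as the hash function are assumptions, and locking and readdir are left out.

```python
import zlib


class Bucket:
    """One directory block: holds a bounded number of entries."""

    def __init__(self, depth: int):
        self.depth = depth   # number of low hash bits this bucket is responsible for
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_size: int = 4):
        self.bucket_size = bucket_size
        self.global_depth = 0
        self.table = [Bucket(0)]   # pointer table of size 2**global_depth

    @staticmethod
    def _hash(key: str) -> int:
        return zlib.crc32(key.encode())   # deterministic stand-in hash (assumption)

    def _bucket(self, key: str) -> Bucket:
        return self.table[self._hash(key) & ((1 << self.global_depth) - 1)]

    def lookup(self, key: str):
        return self._bucket(key).items.get(key)

    def delete(self, key: str):
        return self._bucket(key).items.pop(key, None)

    def insert(self, key: str, value) -> None:
        while True:
            bucket = self._bucket(key)
            if key in bucket.items or len(bucket.items) < self.bucket_size:
                bucket.items[key] = value
                return
            self._split(bucket)   # only the full bucket is split

    def _split(self, bucket: Bucket) -> None:
        if bucket.depth == self.global_depth:
            self.table = self.table + self.table   # double the pointer table
            self.global_depth += 1
        bucket.depth += 1
        bit = 1 << (bucket.depth - 1)
        sibling = Bucket(bucket.depth)
        for key in list(bucket.items):
            if self._hash(key) & bit:
                sibling.items[key] = bucket.items.pop(key)
        for index, entry in enumerate(self.table):
            if entry is bucket and index & bit:
                self.table[index] = sibling


directory = ExtendibleHash(bucket_size=4)
for i in range(1000):
    directory.insert(f"file{i}", i)
```

A lookup touches one table slot and one bucket regardless of directory size, and growth splits a single bucket at a time, which is where the constant-time and amortized-growth properties come from. The sketch scans the whole table when repointing after a split, and it assumes fewer than `bucket_size` keys share an identical hash value.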
Concurrent Writes
  Existing filesystems force trade-off between POSIX compliance and write concurrency
  ZFS employs byte-range locking to allow maximum concurrency while satisfying POSIX overlapping write semantics
  [Diagram: parallel read/write vs. serialized]
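The core of byte-range locking is an overlap test on half-open ranges. The sketch below is a non-blocking toy with hypothetical names (`RangeLocks`, `try_lock`); a real implementation would make the second writer wait instead of returning `None`.

```python
class RangeLocks:
    """Toy per-file lock table (hypothetical): ranges are half-open [start, end)."""

    def __init__(self):
        self.held = []

    def try_lock(self, start: int, length: int):
        end = start + length
        for held_start, held_end in self.held:
            if start < held_end and held_start < end:
                return None            # overlaps a held range: must be serialized
        token = (start, end)
        self.held.append(token)
        return token

    def unlock(self, token) -> None:
        self.held.remove(token)


locks = RangeLocks()
first = locks.try_lock(0, 4096)        # bytes 0..4095
second = locks.try_lock(4096, 4096)    # disjoint range: proceeds in parallel
blocked = locks.try_lock(2048, 4096)   # overlaps both: refused for now

locks.unlock(first)
locks.unlock(second)
retry = locks.try_lock(2048, 4096)     # succeeds once the overlapping writers finish
```

Writers to disjoint ranges never contend, while overlapping writes are applied one after another, which is what the POSIX overlapping-write rule requires.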
Compression
  Block-level compression in SPA
    transparent to all other layers
    enabled by multiple block size support
  Per-file, per-filesystem, or per-pool
  Vectoring for different compression functions
  [Diagram: DMU translations are all 8k; SPA block allocations (8k, 4k, 2k, 8k) vary with compression]
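The relationship between fixed logical blocks and variable physical allocations can be sketched with zlib. The 8k logical size comes from the diagram; the 512-byte minimum allocation, the choice of zlib, and the `store`/`load` helpers are assumptions for this example.

```python
import random
import zlib

LOGICAL_SIZE = 8192   # size the upper layers always see (8k, as in the diagram)
MIN_ALLOC = 512       # smallest physical allocation (assumption)


def physical_size(nbytes: int) -> int:
    """Round up to the next power-of-two allocation size."""
    size = MIN_ALLOC
    while size < nbytes:
        size *= 2
    return size


def store(block: bytes) -> tuple:
    """Return (method, payload, allocated size) for one logical block."""
    assert len(block) == LOGICAL_SIZE
    packed = zlib.compress(block)
    allocated = physical_size(len(packed))
    if allocated >= LOGICAL_SIZE:
        return ("raw", block, LOGICAL_SIZE)   # compression did not save a whole size class
    return ("zlib", packed, allocated)


def load(record: tuple) -> bytes:
    method, payload, _ = record
    return payload if method == "raw" else zlib.decompress(payload)


zeros = bytes(LOGICAL_SIZE)
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(LOGICAL_SIZE))

zeros_record = store(zeros)
noise_record = store(noise)
print(zeros_record[0], zeros_record[2], noise_record[0], noise_record[2])
```

The caller always reads back a full 8k block, so the layers above never need to know which blocks were compressed or how much space they occupy.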
Encryption and Data Security
  Block-level encryption in SPA
    transparent to all other layers
    supports any symmetric block cipher mode: DES, AES, IDEA, RC6, Blowfish, SEAL, OCB...
  Per-filesystem or per-pool
  Vectoring for different encryption functions
  Data authentication via secure checksums
  Open issues:
    key management
    larger data security model
Futures
  POSIX isn't the only game in town
    DMU as native Oracle API
  Object-based appliances
    agnostic: NFS, database, volume emulation
  DMU as "foundation classes"
  [Diagram: layer stack of SPA, DMU, then ZPL / NFS / Oracle / zvol, then UFS / *FS / raw]
Case Study: Jurassic on UFS/SVM
  Upgrading disks
    major down time
    significant manual labor
  FS-to-user mapping
    single FS impossible: exceeds 1TB
    FS per user impractical: fragments storage
  /var/mail
    create/delete .lock files: serial and slow
Case Study: Jurassic on UFS/SVM
  Quotas and reservations
    quotas: too expensive and broken to use
    reservations: no such concept
  User error recovery
    restore from tape
    last 24 hours lost
  Reliability / Availability
    several instances of data loss this year
    hours of down time for fsck(1M)
Case Study: Jurassic on ZFS
  Upgrading disks
    add new disks to storage pool
    remove old disks from storage pool (SPA auto-migrates the data)
  FS-to-user mapping
    single FS possible
    FS per user better: enables per-user reservations, snapshots, encryption, etc.
  /var/mail
    create/delete .lock files: parallel and fast
Case Study: Jurassic on ZFS
  Quotas and reservations
    per-filesystem: e.g. per-workspace, per-home directory, per-project
  User error recovery
    user undo
    restore from snapshot
    either way, no sysadmin required
  Reliability / Availability
    no fsck(1M); ZFS is always mountable
    provable data integrity model
Where Are We Now?
  "Hello world" on Oct 31, 2002
    complete POSIX-compliant filesystem
    most key features working: pooled storage, crash resilience, self-healing data
  Full builds of ON10 on ZFS filesystems
  zvol driver used for MTB-UFS test/bringup
  Still plenty to do
    intent log, snapshots, perf work
    internal alpha program
  Phase 1 putback in October
ZFS: The Zettabyte Filesystem
  Please send questions, comments and ideas to: [email protected]
  Want to follow ZFS developments? Join: [email protected]
  For the latest information, visit: http://zfs.eng