Optimizing Ext4 for Low Memory Environments
Theodore Ts'o
November 7, 2012
Agenda
Status of Ext4
Why do we care about Low Memory Environments: Cloud Computing
Optimizing Ext4 for Low Memory Environments
Conclusion
Ext4 Status
Now stable in the most common configurations
Some distributions are planning on replacing ext[23] with ext4
New features recently added to ext4
Punch support (hole punching via fallocate)
Metadata checksumming
Online resizing for > 16TB file systems
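The punch operation deallocates a byte range while leaving the file size unchanged. A minimal sketch using util-linux's fallocate tool; the /tmp path is illustrative, and the blocks only shrink on a filesystem that supports hole punching (ext4 does):

```shell
# Create a 32KB file backed by real blocks
dd if=/dev/zero of=/tmp/punch-demo bs=4096 count=8 2>/dev/null
stat -c 'size=%s blocks=%b' /tmp/punch-demo

# Punch out 16KB starting at offset 4KB; the file size stays the same,
# but the blocks backing that range are deallocated
fallocate --punch-hole --offset 4096 --length 16384 /tmp/punch-demo
stat -c 'size=%s blocks=%b' /tmp/punch-demo
```

The second stat should report the same size but fewer allocated blocks.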
Advantages of ext4
“Modern” file system that is still reasonably simple
Lines of Code as a Proxy for Complexity (as of 3.6.5)
Minix: 2441
Ext2: 9703
Ext3: 19,304
Ext4: 41,249
Btrfs: 88,189
XFS: 94,591
Advantages of ext4
“Modern” file system that is still reasonably simple
Portions of the code base are (relatively) stable and are time-tested
Userspace utilities
Journal Block layer (also used by OCFS2)
Advantages of ext4
Incremental development instead of “rip and replace”
Well understood performance characteristics
Disadvantages of ext4
Incremental development means that certain design decisions are very hard to change:
Fixed inode table
Bitmap based allocations
32-bit inode numbers
Currently RAID support is extremely weak
Lack of sexy new features
Compression
Filesystem-level snapshots (use thin provisioned snapshots instead)
FS-aware RAID and LVM
Common Ext4 Use Cases
Default File System for Desktop / Servers
Distributions may change this choice in the future
Android devices (Honeycomb / Ice Cream Sandwich)
Cloud storage servers
Rise of Cloud Computing
Or Grid Computing, Utility Computing, etc.
Challenges
Usability – How to deliver something useful to the user?
SaaS
PaaS
Custom programming for cloud/grid/utility computing
Security – Public vs. Private Clouds?
Economics – Is it really cheaper at the end of the day?
Rise of Cloud Computing
Or Grid Computing, Utility Computing, etc.
The economics of cloud computing
Really big, efficient data centers
More efficient use of servers
Traditional servers often don't use their resources efficiently:
CPU
Disk
Networking Bandwidth
To make the cloud economics work, it is important to pack a lot of jobs onto a smaller number of servers
Virtualization
Containers
Using resources efficiently in file systems
Restricted memory means less caching available
Data Blocks
Metadata Blocks
Block allocation bitmaps are the big problem
When they get pushed out of memory, unlink() and fallocate() calls take a long time
Surprisingly, CPU can be a problem too
Especially for PCIe attached flash (large IOP/s)
Plenty of other uses for the CPU (transcoding video formats)
Also important for large-scale macro benchmarks (TPC-C)
Restricted Memory is a problem for Copy-on-Write file systems, too
Suggestion from the ZFS Open Solaris list:
“If you are using a laptop and not serving anything and performance is not a major concern and you're free to reboot whenever you want, then you can survive on 2G of ram. But a server presumably DOES stuff and you don't want to reboot frequently. I'd recommend 4G minimally, 8G standard, and if you run any applications (databases, web servers, symantec products) then add more.”
http://permalink.gmane.org/gmane.os.solaris.opensolaris.zfs/44928
A short aside about latency
Avoiding latency makes the users happy
“Fast is better than slow. We know your time is valuable, so when you’re seeking an answer on the web you want it right away–and we aim to please. We may be the only people in the world who can say our goal is to have people leave our homepage as quickly as possible.... And we continue to work on making it all go even faster.”
From Google's “Ten things we know to be true”
A short aside about latency
Avoiding latency makes the users happy
A few slow requests slow the requests behind them
A few slow operations effectively slow down their peers in a distributed computation
Optimizing ext4 for low-memory environments
No Journal Mode
Smarter metadata caching
No Journal Mode for Ext4
General principle: Don't pay for features you don't need
A review of cluster storage at Google
The hardware
Thousands of machines in a data center
Tens of thousands of disks
GFS as a clustered file system
Replication at the clustered file system level (so we can survive loss of machines)
Checksumming done by the clustered file system (the end-to-end principle)
No Journal Mode for Ext4
Journaling is not free
[Chart: FFSB Large File Creates, 2 CPUs using Direct I/O; transactions per second (scale 2000 to 2350) for ext4 with journal vs. ext4 no-journal]
No Journal Mode for Ext4
No journal mode one of the first Google changes to ext4
Wanted the improvements of extents, delayed allocation, etc.
Google had chosen not to use ext3 since journaling had significant costs
Ext4 in no journal mode is the best of both worlds
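Running without a journal is a mkfs-time choice. A sketch on a small loopback image (paths illustrative); tune2fs -O ^has_journal can likewise strip the journal from an existing, unmounted filesystem:

```shell
# Build a small ext4 image with no journal
dd if=/dev/zero of=/tmp/nojournal.img bs=1M count=16 2>/dev/null
mkfs.ext4 -F -q -O '^has_journal' /tmp/nojournal.img

# The has_journal feature flag should be absent from the feature list
dumpe2fs -h /tmp/nojournal.img 2>/dev/null | grep -i features
```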
Improving metadata caching
Small inodes
Ext2 only supported 128 byte inodes
Ext3/ext4 supports larger inodes
256 byte default
Used to store extended attributes
Also used to store subsecond timestamps for ext4
Small inodes means more inodes per block --- makes a huge difference in memory limited environments
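At ext4's common 4k block size, 128-byte inodes pack 32 inodes per block instead of 16, so the same inode working set needs half the page cache; the trade-off is no in-inode room for extended attributes or subsecond timestamps. Inode size is set at mkfs time. A sketch on a small loopback image (paths illustrative; mke2fs picks its own block size for small images):

```shell
dd if=/dev/zero of=/tmp/smallinode.img bs=1M count=16 2>/dev/null
# -I selects the on-disk inode size; 128 matches the old ext2 layout
mkfs.ext4 -F -q -I 128 /tmp/smallinode.img
dumpe2fs -h /tmp/smallinode.img 2>/dev/null | grep 'Inode size'
```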
Effects of 128 byte inodes
[Chart: FFSB Large File Creates, 2 CPUs using Direct I/O; transactions per second (scale 1900 to 2500) for ext4, ext4 with 128-byte inodes, ext4 no-journal, and ext4 with 128-byte inodes plus no journal]
Improving metadata caching
Free block statistics for each block group
Ext4 now caches the size of the largest available free block
This allows a block group to be evaluated without needing to consult the block bitmap
Inode extent information
Ext4's on-disk format uses 12 bytes/extent
4 in inode
340 in a 4k extent tree leaf block
Maximum 128M in an extent
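These figures follow from the on-disk layout: ext4_extent_header and each ext4_extent entry are both 12 bytes, and an extent's length field has 15 usable bits (the high bit marks unwritten extents):

```latex
\left\lfloor \frac{4096 - 12}{12} \right\rfloor = 340 \ \text{extents per 4k leaf block}
\qquad
2^{15} \ \text{blocks} \times 4\,\text{KiB} = 128\,\text{MiB per extent, maximum}
```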
Internal bigextent patch in Google
An in-memory b-tree which collapses adjacent extents
Originally because cache line misses were measurable while searching the on-disk representation on PCIe-attached flash
Takes less memory than a 4k extent block in most cases
Will be going upstream soon
Conclusion
General Purpose File System Myth
General Purpose File System Myth?
“There can only be one!”
Too hard for users to choose
File systems were used for many different purposes at the same time
But.... workloads are different
Design tradeoffs; optimizing for one workload can compromise another
How did this myth survive for so long?
Many workloads did not stress the file system
File systems were simpler – fewer features
Servers were more inefficiently run – more idle resources
Conclusion
Future ext4 work
Extent Status Tree
(provides SEEK_HOLE/SEEK_DATA support)
Inline data
RAID stripe awareness
Can also be used to make ext4 erase block aware for eMMC devices with primitive flash translation layers
Atomic msync()
Terence Kelly and Stan Park at HP
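Some RAID awareness is already available today: the allocator can be told the array geometry at mkfs (or tune2fs) time via extended options. A sketch on a loopback image; the stride and stripe-width values are illustrative, in filesystem blocks:

```shell
dd if=/dev/zero of=/tmp/stripe.img bs=1M count=16 2>/dev/null
# stride = blocks per disk chunk; stripe-width = stride * number of data disks
mkfs.ext4 -F -q -E stride=16,stripe-width=64 /tmp/stripe.img
dumpe2fs -h /tmp/stripe.img 2>/dev/null | grep -i 'RAID'
```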
Conclusion
Remember to optimize the entire storage stack
Functionality at the block device layer
Thin-provisioned snapshots
dm-cache / bcache
Optimizing userspace
The SQLite library
Applications
Improving abstractions up and down the storage stack
Thank You!