Transcript
Page 1

Beyond the File System

Designing Large Scale File Storage and Serving

Cal Henderson

Page 2

Hello!

Page 3

Big file systems?

• Too vague!
• What is a file system?
• What constitutes big?
• Some requirements would be nice

Page 4

1. Scalable – Looking at storage and serving infrastructures

Page 5

2. Reliable – Looking at redundancy, failure rates, on-the-fly changes

Page 6

3. Cheap – Looking at upfront costs, TCO and lifetimes

Page 7

Four buckets

Storage

Serving

BCP

Cost

Page 8

Storage

Page 9

The storage stack

File protocol: NFS, CIFS, SMB
File system: ext, reiserFS, NTFS
Block protocol: SCSI, SATA, FC
RAID: Mirrors, Stripes
Hardware: Disks and stuff

Page 10

Hardware overview

The storage scale

Internal → DAS → SAN → NAS
(lower to higher)

Page 11

Internal storage

• A disk in a computer
– SCSI, IDE, SATA

• 4 disks in 1U is common
• 8 for half-depth boxes

Page 12

DAS

Direct attached storage

Disk shelf, connected by SCSI/SATA

HP MSA30 – 14 disks in 3U

Page 13

SAN

• Storage Area Network
• Dumb disk shelves
• Clients connect via a ‘fabric’
• Fibre Channel, iSCSI, InfiniBand
– Low-level protocols

Page 14

NAS

• Network Attached Storage
• Intelligent disk shelf
• Clients connect via a network
• NFS, SMB, CIFS
– High-level protocols

Page 15

Of course, it’s more confusing than that

Page 16

Meet the LUN

• Logical Unit Number
• A slice of storage space
• Originally for addressing a single drive:
– c1t2d3
– Controller, Target, Disk (Slice)

• Now means a virtual partition/volume
– LVM, Logical Volume Management

Page 17

NAS vs SAN

With a SAN, a single host (initiator) owns a single LUN/volume

With NAS, multiple hosts own a single LUN/volume

NAS head – NAS access to a SAN

Page 18

SAN Advantages

Virtualization within a SAN offers some nice features:

• Real-time LUN replication
• Transparent backup
• SAN booting for host replacement

Page 19

Some Practical Examples

• There are a lot of vendors
• Configurations vary
• Prices vary wildly

• Let’s look at a couple
– Ones I happen to have experience with
– Not an endorsement ;)

Page 20

NetApp Filers

Heads and shelves, up to 500TB in 6 Cabs

FC SAN with 1 or 2 NAS heads

Page 21

Isilon IQ

• 2U Nodes, 3-96 nodes/cluster, 6-600 TB

• FC/InfiniBand SAN with NAS head on each node

Page 22

Scaling

Vertical vs Horizontal

Page 23

Vertical scaling

• Get a bigger box

• Bigger disk(s)
• More disks

• Limited by current tech – size of each disk and total number in appliance

Page 24

Horizontal scaling

• Buy more boxes

• Add more servers/appliances

• Scales forever*

*sort of

Page 25

Storage scaling approaches

• Four common models:

• Huge FS
• Physical nodes
• Virtual nodes
• Chunked space

Page 26

Huge FS

• Create one giant volume with growing space
– Sun’s ZFS
– Isilon IQ

• Expandable on-the-fly?
• Upper limits
– Always limited somewhere

Page 27

Huge FS

• Pluses
– Simple from the application side
– Logically simple
– Low administrative overhead

• Minuses
– All your eggs in one basket
– Hard to expand
– Has an upper limit

Page 28

Physical nodes

• Application handles distribution to multiple physical nodes
– Disks, Boxes, Appliances, whatever

• One ‘volume’ per node
• Each node acts by itself
• Expandable on-the-fly – add more nodes
• Scales forever

Page 29

Physical Nodes

• Pluses
– Limitless expansion
– Easy to expand
– Unlikely to all fail at once

• Minuses
– Many ‘mounts’ to manage
– More administration
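A minimal sketch of what that application-side distribution can look like, assuming numeric file IDs and one mount point per node (the paths and names here are made up for illustration, not from the talk):

    # Sketch: application-level placement across physical nodes.
    NODES = ["/mnt/node0", "/mnt/node1", "/mnt/node2"]

    def node_for(file_id: int) -> str:
        # Simple modulo placement. Note that adding a node remaps most
        # files, which is why real systems usually record the chosen
        # node in a database rather than recomputing it.
        return NODES[file_id % len(NODES)]

    def path_for(file_id: int) -> str:
        return f"{node_for(file_id)}/{file_id % 1000:03d}/{file_id}.dat"

    print(path_for(123456))   # /mnt/node0/456/123456.dat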

Page 30

Virtual nodes

• Application handles distribution to multiple virtual volumes, contained on multiple physical nodes

• Multiple volumes per node
• Flexible
• Expandable on-the-fly – add more nodes
• Scales forever

Page 31

Virtual Nodes

• Pluses
– Limitless expansion
– Easy to expand
– Unlikely to all fail at once
– Addressing is logical, not physical
– Flexible volume sizing, consolidation

• Minuses
– Many ‘mounts’ to manage
– More administration
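The same idea with one level of indirection, as a sketch: files hash to many small virtual volumes, and a separate table maps each volume to whatever physical host currently holds it (the table and host names are hypothetical):

    # Sketch: virtual-node addressing. Far more volumes than hosts, so
    # moving a volume means updating one table row, not re-hashing files.
    NUM_VOLUMES = 1024   # fixed up front, never changes

    # Hypothetical placement table; in real life this lives in a database.
    VOLUME_TO_HOST = {v: f"host{v % 4}.example.com" for v in range(NUM_VOLUMES)}

    def locate(file_id: int) -> tuple[int, str]:
        volume = file_id % NUM_VOLUMES           # logical address: stable
        return volume, VOLUME_TO_HOST[volume]    # physical address: movable

    print(locate(123456))   # (576, 'host0.example.com')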

Page 32

Chunked space

• Storage layer writes parts of files to different physical nodes

• Like RAID striping, at a higher level

• High performance for large files
– read multiple parts simultaneously

Page 33

Chunked space

• Pluses
– High performance
– Limitless size

• Minuses
– Conceptually complex
– Can be hard to expand on the fly
– Can’t manually poke it
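A sketch of the chunking idea itself: split files into fixed-size chunks and stripe them across nodes so big reads can run in parallel (the chunk size and node names are illustrative, not any particular product’s values):

    # Sketch: chunked placement – RAID-0-style striping at the storage layer.
    CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB, for illustration
    NODES = ["node0", "node1", "node2", "node3"]

    def chunk_layout(file_size: int) -> list[tuple[int, str]]:
        # Consecutive chunks land on different nodes, so one large read
        # can pull from all of them simultaneously.
        n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
        return [(i, NODES[i % len(NODES)]) for i in range(n_chunks)]

    print(chunk_layout(200 * 1024 * 1024))
    # [(0, 'node0'), (1, 'node1'), (2, 'node2'), (3, 'node3')]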

Page 34

Real Life

Case Studies

Page 35

GFS – Google File System

• Developed by … Google
• Proprietary
• Everything we know about it is based on talks they’ve given
• Designed to store huge files for fast access

Page 36

GFS – Google File System

• Single ‘Master’ node holds metadata
– SPF – shadow master allows warm swap

• Grid of ‘chunkservers’
– 64-bit filenames
– 64 MB file chunks

Page 37

GFS – Google File System

[Diagram: master node with file chunks 1(a), 1(b), 2(a) spread across chunkservers]

Page 38

GFS – Google File System

• Client reads metadata from master, then file parts from multiple chunkservers

• Designed for big files (>100MB)

• Master server allocates access leases

• Replication is automatic and self-repairing
– Synchronously, for atomicity

Page 39

GFS – Google File System

• Reading is fast (parallelizable)
– But requires a lease

• Master server is required for all reads and writes
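Putting that read path into a sketch (GFS is proprietary, so the client API here is hypothetical – "master.locate" and the replica objects are stand-ins for whatever Google actually uses):

    # Sketch of a GFS-style read: one trip to the master for locations,
    # then data reads go straight to the chunkservers.
    CHUNK_SIZE = 64 * 1024 * 1024

    def read_file(master, filename: str, offset: int, length: int) -> bytes:
        assert offset % CHUNK_SIZE == 0   # simplification: aligned reads
        parts = []
        for pos in range(offset, offset + length, CHUNK_SIZE):
            # The master returns a chunk handle plus replica locations
            # (and the lease); it never touches the file data itself.
            handle, replicas = master.locate(filename, pos // CHUNK_SIZE)
            want = min(CHUNK_SIZE, offset + length - pos)
            parts.append(replicas[0].read(handle, 0, want))
        return b"".join(parts)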

Page 40

MogileFS – OMG Files

• Developed by Danga / SixApart

• Open source

• Designed for scalable web app storage

Page 41

MogileFS – OMG Files

• Single metadata store (MySQL)
– MySQL Cluster avoids SPF

• Multiple ‘tracker’ nodes locate files

• Multiple ‘storage’ nodes store files

Page 42

MogileFS – OMG Files

[Diagram: two tracker nodes backed by a MySQL metadata store]

Page 43

MogileFS – OMG Files

• Replication of file ‘classes’ happens transparently

• Storage nodes are not mirrored – replication is piecemeal

• Reading and writing go through trackers, but are performed directly upon storage nodes
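In outline, a write looks something like this sketch (the tracker object is a stand-in; the real system speaks a small text protocol to the trackers and moves file data over HTTP):

    import urllib.request

    def http_put(url: str, data: bytes) -> None:
        req = urllib.request.Request(url, data=data, method="PUT")
        urllib.request.urlopen(req)

    def store(tracker, key: str, data: bytes, file_class: str = "photo"):
        # 1. Ask a tracker (which consults MySQL) where to write.
        dest = tracker.create_open(key=key, cls=file_class)
        # 2. PUT the bytes directly to a storage node – file data never
        #    flows through the tracker.
        http_put(dest.path, data)
        # 3. Confirm, so background replication can bring the file's
        #    class up to its configured replica count.
        tracker.create_close(key=key, fid=dest.fid)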

Page 44

Flickr File System

• Developed by Flickr

• Proprietary

• Designed for very large scalable web app storage

Page 45

Flickr File System

• No metadata store
– Deal with it yourself

• Multiple ‘StorageMaster’ nodes

• Multiple storage nodes with virtual volumes

Page 46

Flickr File System

[Diagram: multiple StorageMaster (SM) nodes in front of the storage nodes]

Page 47

Flickr File System

• Metadata stored by app
– Just a virtual volume number
– App chooses a path

• Virtual nodes are mirrored
– Locally and remotely

• Reading is done directly from nodes

Page 48

Flickr File System

• StorageMaster nodes only used for write operations

• Reading and writing can scale separately
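A sketch of what the application-side bookkeeping can look like under that design (the URL shape and names are invented for illustration, not Flickr’s actual scheme):

    # Sketch: the app stores just a virtual volume number per file and
    # builds the serving path itself – there is no metadata service.
    def photo_url(farm: int, volume: int, photo_id: int, secret: str) -> str:
        # volume says which mirrored virtual node pair holds the bytes;
        # the app chose the rest of the path when it wrote the file.
        return f"http://farm{farm}.example.com/{volume}/{photo_id}_{secret}.jpg"

    # Writes ask a StorageMaster to pick a volume with free space;
    # reads need nothing but the URL.
    print(photo_url(1, 1234, 567890, "a1b2c3"))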

Page 49

Amazon S3

• A big disk in the sky
• Multiple ‘buckets’
• Files have user-defined keys
• Data + metadata

Page 50

Amazon S3

[Diagram: your servers pushing files to Amazon]

Page 51

Amazon S3

[Diagram: your servers push files to Amazon; users fetch directly from Amazon]

Page 52

The cost

• Fixed price, by the GB

• Store: $0.15 per GB per month
• Serve: $0.20 per GB

Page 53

The cost

[Graph: S3 cost]

Page 54

The cost

[Graph: S3 cost vs regular bandwidth cost]

Page 55

End costs

• ~$2k to store 1TB for a year

• ~$63 a month for 1 Mbps

• ~$65k a month for 1 Gbps
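The arithmetic behind those figures, from the prices above (assuming 1 TB = 1024 GB, a 30-day month, and sustained megabits/gigabits per second):

    STORE = 0.15   # $ per GB-month
    SERVE = 0.20   # $ per GB transferred

    print(f"1 TB for a year: ${1024 * STORE * 12:,.0f}")   # ~$1,843 – call it $2k

    gb_month_at_1mbps = 1e6 / 8 * 86400 * 30 / 1e9         # ~324 GB/month
    print(f"1 Mbps: ${gb_month_at_1mbps * SERVE:,.0f}/month")          # ~$65
    print(f"1 Gbps: ${gb_month_at_1mbps * 1000 * SERVE:,.0f}/month")   # ~$64,800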

Page 56

Serving

Page 57

Serving files

Serving files is easy!

[Diagram: Apache → disk]

Page 58

Serving files

Scaling is harder

[Diagram: many Apache → disk pairs]

Page 59

Serving files

• This doesn’t scale well

• Primary storage is expensive
– And takes a lot of space

• In many systems, we only access a small number of files most of the time

Page 60

Caching

• Insert caches between the storage and serving nodes

• Cache frequently accessed content to reduce reads on the storage nodes

• Software (Squid, mod_cache)
• Hardware (NetCache, CacheFlow)

Page 61

Why it works

• Keep a smaller working set

• Use faster hardware
– Lots of RAM
– SCSI
– Outer edge of disks (ZCAV)

• Use more duplicates
– Cheaper, since they’re smaller

Page 62

Two models

• Layer 4
– ‘Simple’ balanced cache
– Objects in multiple caches
– Good for few objects requested many times

• Layer 7
– URL-balanced cache
– Objects in a single cache
– Good for many objects requested a few times
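A sketch of the layer-7 approach: hash the URL to pick the cache, so each object lives in exactly one place (CARP works along these lines; the cache names here are made up):

    import hashlib

    CACHES = ["cache0", "cache1", "cache2", "cache3"]

    def cache_for(url: str) -> str:
        # Layer 7: the balancer inspects the URL, so the same object
        # always lands on the same cache and is stored only once.
        digest = hashlib.md5(url.encode()).digest()
        return CACHES[int.from_bytes(digest[:4], "big") % len(CACHES)]

    # A layer-4 balancer picks by connection instead (e.g. round robin),
    # so a hot object ends up duplicated in every cache.
    print(cache_for("http://example.com/foo_3.jpg"))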

Page 63

Replacement policies

• LRU – Least recently used
• GDSF – Greedy dual size frequency
• LFUDA – Least frequently used with dynamic aging

• All have advantages and disadvantages
• Performance varies greatly with each
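For concreteness, LRU fits in a dozen lines; a sketch using Python’s OrderedDict (GDSF and LFUDA refine this by also weighting object size and access frequency):

    from collections import OrderedDict

    class LRUCache:
        # Sketch of LRU: evict whatever was touched longest ago.
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None
            self.items.move_to_end(key)        # now most recently used
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # drop least recently used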

Page 64

Cache Churn

• How long do objects typically stay in cache?

• If it gets too short, we’re doing badly
– But it depends on your traffic profile

• Make the cached object store larger

Page 65

Problems

• Caching has some problems:

– Invalidation is hard
– Replacement is dumb (even LFUDA)

• Avoiding caching makes your life (somewhat) easier

Page 66

CDN – Content Delivery Network

• Akamai, Savvis, Mirror Image Internet, etc

• Caches operated by other people
– Already in place
– In lots of places

• GSLB/DNS balancing

Page 67

Edge networks

[Diagram: a lone origin server]

Page 68

Edge networks

[Diagram: origin surrounded by edge caches]

Page 69

CDN Models

• Simple model
– You push content to them, they serve it

• Reverse proxy model
– You publish content on an origin, they proxy and cache it

Page 70

CDN Invalidation

• You don’t control the caches
– Just like those awful ISP ones

• Once something is cached by a CDN, assume it can never change
– Nothing can be deleted
– Nothing can be modified

Page 71

Versioning

• When you start to cache things, you need to care about versioning

– Invalidation & Expiry
– Naming & Sync

Page 72

Cache Invalidation

• If you control the caches, invalidation is possible

• But remember ISP and client caches

• Remove deleted content explicitly
– Avoid users finding old content
– Save cache space

Page 73

Cache versioning

• Simple rule of thumb:
– If an item is modified, change its name (URL)

• This can be independent of the file system!

Page 74

Virtual versioning

• Database indicates version 3 of file
• Web app writes version number into URL (example.com/foo_3.jpg)
• Request comes through cache and is cached with the versioned URL (cached: foo_3.jpg)
• mod_rewrite converts versioned URL to path (foo_3.jpg -> foo.jpg)
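The web-app half of that flow, as a sketch (the URL shape is illustrative; the server-side rewrite does the reverse mapping, so nothing on disk is ever renamed):

    # Sketch: bake the current version into the URL. Any modification
    # produces a brand-new URL, so stale cached copies are simply
    # never requested again.
    def versioned_url(name: str, version: int) -> str:
        # version comes from the database row for this file
        return f"http://example.com/{name}_{version}.jpg"

    print(versioned_url("foo", 3))   # http://example.com/foo_3.jpg
    # On the server, mod_rewrite strips the version before hitting disk:
    #   foo_3.jpg -> foo.jpg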

Page 75

Authentication

• Authentication inline layer
– Apache / perlbal

• Authentication sideline
– ICP (CARP/HTCP)

• Authentication by URL
– FlickrFS

Page 76

Auth layer

• Authenticator sits between client and storage

• Typically built into the cache software

[Diagram: cache with a built-in authenticator, in front of the origin]

Page 77

Auth sideline

• Authenticator sits beside the cache

• Lightweight protocol used for authenticator

[Diagram: cache consulting a sideline authenticator, in front of the origin]

Page 78

Auth by URL

• Someone else performs authentication and gives URLs to client (typically the web app)

• URLs hold the ‘keys’ for accessing files

[Diagram: web server issues keyed URLs; requests flow through the cache to the origin]
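One common shape for those URL keys, sketched with an HMAC over the path plus an expiry time (the exact scheme here is an assumption for illustration, not FlickrFS’s actual one):

    import hashlib, hmac, time

    SECRET = b"not-a-real-secret"   # shared by the web app and the cache

    def signed_url(path: str, ttl: int = 3600) -> str:
        expires = int(time.time()) + ttl
        sig = hmac.new(SECRET, f"{path}:{expires}".encode(),
                       hashlib.sha256).hexdigest()[:16]
        # Whoever holds this URL can fetch the file until it expires;
        # the cache verifies with the shared secret and never needs to
        # see a user database.
        return f"http://cache.example.com{path}?e={expires}&s={sig}"

    def verify(path: str, expires: int, sig: str) -> bool:
        want = hmac.new(SECRET, f"{path}:{expires}".encode(),
                        hashlib.sha256).hexdigest()[:16]
        return time.time() < expires and hmac.compare_digest(want, sig)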

Page 79

BCP

Page 80

Business Continuity Planning

• How can I deal with the unexpected?
– The core of BCP

• Redundancy
• Replication

Page 81

Reality

• On a long enough timescale, anything that can fail, will fail

• Of course, everything can fail

• True reliability comes only through redundancy

Page 82

Reality

• Define your own SLAs

• How long can you afford to be down?
• How manual is the recovery process?
• How far can you roll back?
• How many $node boxes can fail at once?

Page 83

Failure scenarios

• Disk failure
• Storage array failure
• Storage head failure
• Fabric failure
• Metadata node failure
• Power outage
• Routing outage

Page 84

Reliable by design

• RAID avoids disk failures, but not head or fabric failures

• Duplicated nodes avoid host and fabric failures, but not routing or power failures

• Dual-colo avoids routing and power failures, but may need duplication too

Page 85

Tend to all points in the stack

• Going dual-colo: great

• Taking a whole colo offline because of a single failed disk: bad

• We need a combination of these

Page 86

Recovery times

• BCP is not just about continuing when things fail

• How can we restore after they come back?

• Host and colo level syncing
– replication queuing

• Host and colo level rebuilding

Page 87

Reliable Reads & Writes

• Reliable reads are easy
– 2 or more copies of files

• Reliable writes are harder
– Write 2 copies at once
– But what do we do when we can’t write to one?

Page 88

Dual writes

• Queue up data to be written
– Where?
– Needs itself to be reliable

• Queue up a journal of changes
– And then read data from the disk whose write succeeded

• Duplicate whole volume after failure
– Slow!
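A sketch of the journal approach: try both writes, and when one side fails, record the gap so a repair job can copy later from the disk whose write succeeded (the node and journal objects are stand-ins):

    def dual_write(node_a, node_b, key: str, data: bytes, journal) -> None:
        ok_a = try_write(node_a, key, data)
        ok_b = try_write(node_b, key, data)
        if not (ok_a or ok_b):
            raise IOError(f"both replicas failed for {key}")
        if ok_a != ok_b:
            # Exactly one good copy exists; journal the miss so a repair
            # job can re-copy later. The journal itself must be reliable,
            # or we've just moved the problem.
            journal.append({"key": key, "missing": "a" if not ok_a else "b"})

    def try_write(node, key: str, data: bytes) -> bool:
        try:
            node.write(key, data)
            return True
        except IOError:
            return False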

Page 89

Cost

Page 90

Judging cost

• Per GB?

• Per GB upfront and per year

• Not as simple as you’d hope
– How about an example

Page 91

Hardware costs

Single cost = Cost of hardware / Usable GB

Page 92

Power costs

Recurring cost = Cost of power per year / Usable GB

Page 93

Power costs

Single cost = Power installation cost / Usable GB

Page 94

Space costs

Recurring cost = (Cost per U × U’s needed, inc. network) / Usable GB

Page 95

Network costs

Single cost = Cost of network gear / Usable GB

Page 96

Misc costs

Single & recurring costs = (Support contracts + spare disks + bus adaptors + cables) / Usable GB

Page 97

Human costs

Recurring cost = (Admin cost per node × Node count) / Usable GB
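Folding the fractions above into the two TCO buckets, as a sketch (every number is a placeholder to show the shape of the calculation, not a real quote):

    usable_gb = 10_000

    upfront = (20_000        # cost of hardware
             + 2_000         # power installation
             + 3_000)        # network gear
    recurring = (1_500       # power per year
               + 4 * 1_200   # cost per U x U's needed (inc. network)
               + 800         # support contracts, spares, adaptors, cables
               + 2 * 500)    # admin cost per node x node count

    print(f"upfront:   ${upfront / usable_gb:.2f} per GB")
    print(f"recurring: ${recurring / usable_gb:.2f} per GB per year")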

Page 98

TCO

• Total cost of ownership in two parts
– Upfront
– Ongoing

• Architecture plays a huge part in costing
– Don’t get tied to hardware
– Allow heterogeneity
– Move with the market

Page 99

(fin)

Page 100

Photo credits

• flickr.com/photos/ebright/260823954/
• flickr.com/photos/thomashawk/243477905/
• flickr.com/photos/tom-carden/116315962/
• flickr.com/photos/sillydog/287354869/
• flickr.com/photos/foreversouls/131972916/
• flickr.com/photos/julianb/324897/
• flickr.com/photos/primejunta/140957047/
• flickr.com/photos/whatknot/28973703/
• flickr.com/photos/dcjohn/85504455/

Page 101

You can find these slides online:

iamcal.com/talks/