Page 1
PNUTS: Yahoo!’s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
With some additions by S. Sudarshan
Page 2
How do I build a cool new web app?
Option 1: Code it up!
  Make it live!
  Scale it later
  It gets posted to slashdot
  Scale it now!
  Flickr, Twitter, MySpace, Facebook, …
Page 3
How do I build a cool new web app?
Option 2: Make it industrial strength!
  Evaluate scalable database backends
  Evaluate scalable indexing systems
  Evaluate scalable caching systems
  Architect data partitioning schemes
  Architect data replication schemes
  Architect monitoring and reporting infrastructure
  Write application
  Go live
  Realize it doesn’t scale as well as you hoped
  Rearchitect around bottlenecks
  1 year later – ready to go!
Page 4
Example: social network updates
Brian
Sonja Jimi Brandon Kurt
What are my friends up to?
Sonja:
Brandon:
Page 5
Example: social network updates
16 Mike <ph..
6 Jimi <ph..
8 Mary <re..
12 Sonja <ph..
15 Brandon <po..
17 Bob <re..
<photo><title>Flower</title><url>www.flickr.com</url></photo>
Page 6
What do we need from our DBMS?
Web applications need:
  Scalability
    And the ability to scale linearly
  Geographic scope
  High availability
Web applications typically have:
  Simplified query needs
    No joins, aggregations
  Relaxed consistency needs
    Applications can tolerate stale or reordered data
Page 8
What is PNUTS?
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)
Example Parts records:
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure
Page 9
Query model
Per-record operations
  Get
  Set
  Delete
Multi-record operations
  Multiget
  Scan
  Getrange
Web service (RESTful) API
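The operations listed above can be pictured with a minimal in-memory sketch. The class `ToyTable` and its method names are hypothetical and are not the PNUTS API, which the slide describes only as a RESTful web service; the sketch only shows the shape of per-record versus multi-record calls on a single node.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Hypothetical single-node stand-in for a table; not the PNUTS API."""

    def __init__(self):
        self._rows = {}   # key -> value
        self._keys = []   # sorted keys, so scans return records in key order

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, value):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect_left(self._keys, key))

    # Multi-record operations
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        return [(k, self._rows[k]) for k in self._keys]

    def getrange(self, low, high):
        """Records with low <= key < high, in key order (half-open is an assumption)."""
        lo = bisect_left(self._keys, low)
        hi = bisect_left(self._keys, high)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = ToyTable()
for key, stock in [("A", 42342), ("B", 42521), ("C", 66354), ("D", 12352)]:
    table.set(key, stock)
table.delete("D")
print(table.getrange("B", "D"))
```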
Page 10
Detailed architecture
Figure labels: Data-path components, Storage units, Routers, Tablet controller, REST API, Clients, Message broker
Page 11
Detailed architecture
Figure labels: Storage units, Routers, Tablet controller, REST API, Clients, Local region, Remote regions, YMB
Page 12
Tablet splitting and balancing
Each storage unit has many tablets (horizontal partitions of the table)
Tablets may grow over time
Overfull tablets split
Storage unit may become a hotspot
Shed load by moving tablets to other servers
Figure labels: Storage unit, Tablet
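A rough sketch of the two ideas on this slide, splitting an overfull tablet and moving a tablet off a hot storage unit. The function names, the size threshold, and the use of record count as the load measure are all assumptions made for illustration; the slides do not say how PNUTS measures load or picks a split point.

```python
def split_tablet(tablet, max_records):
    """Split a sorted list of keys into two halves if it exceeds max_records."""
    if len(tablet) <= max_records:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]


def shed_load(units):
    """Move one tablet from the most loaded storage unit to the least loaded.

    units maps a storage unit name to its list of tablets. Load is
    approximated by record count, which is an assumption of this sketch.
    """
    def load(name):
        return sum(len(tablet) for tablet in units[name])

    hot = max(units, key=load)
    cold = min(units, key=load)
    if hot != cold and len(units[hot]) > 1:
        tablet = min(units[hot], key=len)   # move the smallest tablet
        units[hot].remove(tablet)
        units[cold].append(tablet)
    return units


units = {"SU1": [["a", "b", "c", "d"]], "SU2": []}
units["SU1"] = split_tablet(units["SU1"][0], max_records=3)
shed_load(units)
print(units)
```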
Page 13
Query processing
Page 14
Range queries
Figure: a router splits a range query across Storage unit 1, Storage unit 2 and Storage unit 3.
Tablets:
  Apple, Avocado, Banana, Blueberry
  Canteloupe, Grape, Kiwi, Lemon
  Lime, Mango, Orange
  Strawberry, Tomato, Watermelon
Router interval map:
  MIN-Canteloupe: SU1
  Canteloupe-Lime: SU3
  Lime-Strawberry: SU2
  Strawberry-MAX: SU1
Query: Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?
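The router's interval map can be sketched as a sorted list of split points plus a binary search. The boundaries and owners below are copied from the slide; the function name `route_range` and the choice of half-open intervals (a boundary key belongs to the tablet above it) are assumptions of this sketch.

```python
from bisect import bisect_right

# Split points between tablets and the owner of each interval, from the slide:
# MIN-Canteloupe -> SU1, Canteloupe-Lime -> SU3, Lime-Strawberry -> SU2, Strawberry-MAX -> SU1
BOUNDARIES = ["Canteloupe", "Lime", "Strawberry"]
OWNERS = ["SU1", "SU3", "SU2", "SU1"]


def route_range(low, high):
    """Split the key range [low, high) into (start, end, storage_unit) pieces."""
    pieces = []
    i = bisect_right(BOUNDARIES, low)   # index of the tablet containing low
    start = low
    while True:
        tablet_end = BOUNDARIES[i] if i < len(BOUNDARIES) else None
        if tablet_end is None or high <= tablet_end:
            pieces.append((start, high, OWNERS[i]))
            return pieces
        pieces.append((start, tablet_end, OWNERS[i]))
        start = tablet_end
        i += 1


pieces = route_range("Grapefruit", "Pear")
print(pieces)
```

With these inputs the query is cut at Lime, giving one piece for SU3 and one for SU2, which matches the split drawn on the slide.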
Page 15
Updates
Figure: numbered steps of a write to key k, passing through routers, storage units (SU) and message brokers.
1 Write key k
2 Write key k
3 Write key k
4
5 SUCCESS
6 Write key k
7 Sequence # for key k
8 Sequence # for key k
Page 16
Yahoo Message Bus
Distributed publish-subscribe service
Guarantees delivery once a message is published
  Logging at site where message is published, and at other sites when received
Guarantees messages published to a particular cluster will be delivered in same order at all other clusters
Record updates are published to YMB by master copy
  All replicas subscribe to the updates, and get them in same order for a particular record
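The ordering guarantee can be illustrated with a toy single-process broker that logs each message on publish and hands every subscriber the same log in publish order. `ToyBroker` and its methods are hypothetical; this is a sketch of the guarantee, not of how YMB is built.

```python
class ToyBroker:
    """Hypothetical in-process broker: one ordered log, one cursor per subscriber."""

    def __init__(self):
        self.log = []       # messages in publish order (logged before delivery)
        self.cursors = {}   # subscriber name -> index of next message to deliver

    def subscribe(self, name):
        self.cursors[name] = 0

    def publish(self, message):
        self.log.append(message)

    def poll(self, name):
        """Return this subscriber's undelivered messages, in publish order."""
        start = self.cursors[name]
        self.cursors[name] = len(self.log)
        return self.log[start:]


broker = ToyBroker()
broker.subscribe("west")
broker.subscribe("east")

# The master copy publishes (key, sequence number, value) updates.
broker.publish(("k", 1, "a"))
broker.publish(("k", 2, "b"))
west = broker.poll("west")          # west catches up early
broker.publish(("k", 3, "c"))
east = broker.poll("east")          # east catches up late
west += broker.poll("west")
print(west == east)
```

Both replicas end up with the same sequence of updates for key k even though they polled at different times.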
Page 17
Asynchronous replication and consistency
Page 18
Asynchronous replication
Page 19
Consistency model
Goal: make it easier for applications to reason about updates and cope with asynchrony
What happens to a record with primary key “Brian”?
Timeline events: Record inserted, Update (×7), Delete
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Page 20
Consistency model
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Read
Figure labels: Stale version, Current version
Page 21
Consistency model
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Read up-to-date
Figure labels: Stale version, Current version
Page 22
Consistency model
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Read-critical(required version): Read ≥ v.6
Figure labels: Stale version, Current version
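The three kinds of read shown on this and the two preceding slides can be sketched against a possibly stale local replica and an up-to-date master copy. The function names mirror the slide wording but are hypothetical, and falling back to the master when the local copy is too old is an assumption of this sketch; the slides do not say where a sufficiently new version is fetched from.

```python
class Replica:
    """One copy of a record; a non-master copy may lag behind the master."""

    def __init__(self, version, value):
        self.version = version
        self.value = value


def read(local):
    """Plain read: returns whatever the local copy has, possibly stale."""
    return local.version, local.value


def read_up_to_date(master):
    """Read up-to-date: returns the current version (served by the master here)."""
    return master.version, master.value


def read_critical(local, master, required_version):
    """Return some version >= required_version.

    Assumption: use the local copy if it is new enough, else ask the master.
    """
    source = local if local.version >= required_version else master
    return source.version, source.value


local = Replica(5, "stale value")
master = Replica(8, "current value")
print(read(local), read_critical(local, master, 6), read_up_to_date(master))
```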
Page 23
Consistency model
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Write
Figure labels: Stale version, Current version
Page 24
Consistency model
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Test-and-set-write(required version): Write if = v.7, ERROR
Figure labels: Stale version, Current version
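A test-and-set write succeeds only if the record is still at the version the caller last saw. The sketch below uses hypothetical names (`MasterRecord`, `VersionMismatch`) and assumes versions are consecutive integers starting at 1; it reproduces the slide's case where the record is at v. 8 and a write conditioned on v. 7 fails.

```python
class VersionMismatch(Exception):
    """Raised when the required version is not the current version."""


class MasterRecord:
    """Hypothetical master copy of one record with an integer version."""

    def __init__(self, value):
        self.version = 1
        self.value = value

    def write(self, value):
        """Unconditional write: always produces the next version."""
        self.version += 1
        self.value = value
        return self.version

    def test_and_set_write(self, required_version, value):
        """Write only if the current version equals required_version."""
        if self.version != required_version:
            raise VersionMismatch(
                f"required v.{required_version}, current v.{self.version}"
            )
        return self.write(value)


record = MasterRecord("inserted")
for i in range(7):                      # seven updates take the record to v. 8
    record.write(f"update {i + 1}")

try:
    record.test_and_set_write(7, "lost update")
    outcome = "written"
except VersionMismatch:
    outcome = "ERROR"
print(outcome, record.version)
```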
Page 25
Consistency model
Versions over time (Generation 1): v. 1, v. 2, v. 3, v. 4, v. 5, v. 6, v. 7, v. 8
Write if = v.7, ERROR
Figure labels: Stale version, Current version
Mechanism: per record mastership
Page 26
Record and Tablet Mastership
Data in PNUTS is replicated across sites
Hidden field in each record stores which copy is the master copy
  Updates can be submitted to any copy
  Forwarded to master, applied in order received by master
Record also contains origin of last few updates
  Mastership can be changed by current master, based on this information
  Mastership change is simply a record update
Tablet mastership
  Required to ensure primary key consistency
  Can be different from record mastership
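Record mastership can be sketched as a hidden `master` field plus a short history of update origins. The names `Record` and `submit_update`, the history length of 3, and the hand-over rule (move mastership once every remembered update came from the same non-master site) are assumptions chosen for illustration; the slide says only that the master may change mastership based on the recorded origins.

```python
from collections import deque


class Record:
    """Hypothetical record with a hidden mastership field."""

    def __init__(self, master, history=3):
        self.master = master                   # site holding the master copy
        self.origins = deque(maxlen=history)   # origins of the last few updates
        self.value = None
        self.version = 0


def submit_update(record, origin_site, value):
    """Submit an update at origin_site; return the site that applied it."""
    applied_at = record.master   # updates at other sites are forwarded to the master
    record.value = value
    record.version += 1
    record.origins.append(origin_site)

    # Assumed policy: if all remembered updates came from one other site,
    # the current master hands mastership to that site. The change is
    # stored in the record itself, like any other field.
    history_full = len(record.origins) == record.origins.maxlen
    if history_full and set(record.origins) == {origin_site} and origin_site != record.master:
        record.master = origin_site
    return applied_at


record = Record(master="west")
applied = [submit_update(record, "east", n) for n in range(3)]
after_move = submit_update(record, "east", 3)
print(applied, after_move)
```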
Page 27
Other Features
Per record transactions
Copying a tablet (e.g., on failure)
  Request copy
  Publish checkpoint message
  Get copy of tablet as of when checkpoint is received
  Apply later updates
Tablet split
  Has to be coordinated across all copies
Page 28
Query Processing
Range scan can span tablets
  Only one tablet scanned at a time
  Client may not need all results at once
    Continuation object returned to client to indicate where range scan should continue
Notification
  One pub-sub topic per tablet
  Client knows about tables, does not know about tablets
    Automatically subscribed to all tablets, even as tablets are added/removed
  Usual problem with pub-sub: undelivered notifications, handled in usual way
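The range-scan behaviour described above, one tablet per call with a continuation telling the client where to resume, can be sketched as follows. Representing the continuation as the next key to read is an assumption of this sketch (the slides do not describe the continuation object's contents), and `scan_range` is a hypothetical name.

```python
def scan_range(tablets, low, high, continuation=None):
    """Scan one tablet's share of [low, high); return (rows, continuation).

    tablets: list of sorted key lists, in key order.
    continuation: key to resume from, or None for a fresh scan.
    The returned continuation is None when the scan is complete.
    """
    start = low if continuation is None else continuation
    for index, tablet in enumerate(tablets):
        rows = [key for key in tablet if start <= key < high]
        if not rows:
            continue
        # Resume at the first key of the next non-empty tablet, if it is in range.
        for later in tablets[index + 1:]:
            if later:
                return rows, (later[0] if later[0] < high else None)
        return rows, None
    return [], None


tablets = [
    ["Apple", "Avocado", "Banana", "Blueberry"],
    ["Canteloupe", "Grape", "Kiwi", "Lemon"],
    ["Lime", "Mango", "Orange"],
    ["Strawberry", "Tomato", "Watermelon"],
]

# Client loop: keep calling until no continuation is returned.
results, calls, token = [], 0, None
while True:
    rows, token = scan_range(tablets, "Grapefruit", "Pear", token)
    results.extend(rows)
    calls += 1
    if token is None:
        break
print(calls, results)
```

A client that needs only the first batch can stop after one call and keep the continuation for later.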
Page 30
Experimental setup
Production PNUTS code
  Enhanced with ordered table type
Three PNUTS regions
  2 west coast, 1 east coast
  5 storage units, 2 message brokers, 1 router
  West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array
  East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk
Workload
  1200-3600 requests/second
  0-50% writes
  80% locality
Page 31
Inserts
Inserts required:
  75.6 ms per insert in West 1 (tablet master)
  131.5 ms per insert into the non-master West 2
  315.5 ms per insert into the non-master East
Page 32
10% writes by default
Page 33
Scalability
Figure: average latency (ms) versus number of storage units (1 to 6), for hash table and ordered table.
Page 34
Request skew
Figure: average latency (ms) versus Zipf parameter (0 to 1), for hash table and ordered table.
Page 35
Size of range scans
Figure: average latency (ms) versus fraction of table scanned (0 to 0.12), for 30 clients and 300 clients.
Page 36
Related work
Distributed and parallel databases
  Especially query processing and transactions
  BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
Distributed filesystems
  Ceph, Boxwood, Sinfonia
Distributed (P2P) hash tables
  Chord, Pastry, …
Database replication
  Master-slave, epidemic/gossip, synchronous…
Page 37
Conclusions and ongoing work
PNUTS is an interesting research product
  Research: consistency, performance, fault tolerance, rich functionality
  Product: make it work, keep it (relatively) simple, learn from experience and real applications
Ongoing work
  Indexes and materialized views
  Bundled updates
  Batch query processing
Page 38
Thanks!
research.yahoo.com