PNUTS: Yahoo!’s Hosted Data Serving Platform Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen,
Post on 18-Dec-2015
PNUTS: Yahoo!’s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
With some additions by S. Sudarshan
2
How do I build a cool new web app?
Option 1: Code it up! Make it live! Scale it later
- It gets posted to Slashdot
- Scale it now!
- Flickr, Twitter, MySpace, Facebook, …
3
How do I build a cool new web app?
Option 2: Make it industrial strength!
- Evaluate scalable database backends
- Evaluate scalable indexing systems
- Evaluate scalable caching systems
- Architect data partitioning schemes
- Architect data replication schemes
- Architect monitoring and reporting infrastructure
- Write application
- Go live
- Realize it doesn’t scale as well as you hoped
- Rearchitect around bottlenecks
- 1 year later – ready to go!
4
Example: social network updates
[Figure: Brian and his friends Sonja, Jimi, Brandon and Kurt. Brian asks "What are my friends up to?"; the answer lists entries for Sonja and Brandon.]
5
Example: social network updates
16 Mike <ph..
6 Jimi <ph..
8 Mary <re..
12 Sonja <ph..
15 Brandon <po..
17 Bob <re..
<photo><title>Flower</title><url>www.flickr.com</url></photo>
6
What do we need from our DBMS?
Web applications need:
- Scalability
  - And the ability to scale linearly
- Geographic scope
- High availability
Web applications typically have:
- Simplified query needs
  - No joins, aggregations
- Relaxed consistency needs
  - Applications can tolerate stale or reordered data
8
What is PNUTS?
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)

- Parallel database
- Geographic replication
- Indexes and views
- Structured, flexible schema
- Hosted, managed infrastructure

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
9
Query model
Per-record operations
- Get
- Set
- Delete
Multi-record operations
- Multiget
- Scan
- Getrange
Web service (RESTful) API
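The six operations listed above can be illustrated with a small in-memory stand-in. This is only a sketch: the class name `TableSketch`, the method signatures and the half-open range convention are assumptions made for the example, not the PNUTS web-service API. The sample rows reuse the Parts table from the "What is PNUTS?" slide.

```python
from bisect import bisect_left, insort


class TableSketch:
    """In-memory stand-in for one table (hypothetical, not the PNUTS API)."""

    def __init__(self):
        self._rows = {}   # primary key -> record (a dict of fields)
        self._keys = []   # primary keys kept sorted for range operations

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = record

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect_left(self._keys, key))

    # Multi-record operations
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        return [(k, self._rows[k]) for k in self._keys]

    def getrange(self, low, high):
        """Records with low <= key < high, in key order (assumed convention)."""
        start = bisect_left(self._keys, low)
        stop = bisect_left(self._keys, high)
        return [(k, self._rows[k]) for k in self._keys[start:stop]]


parts = TableSketch()
parts.set("A", {"StockNumber": 42342, "Status": "E"})
parts.set("C", {"StockNumber": 66354, "Status": "W"})
parts.set("B", {"StockNumber": 42521, "Status": "W"})
parts.delete("C")
```

Keeping the keys sorted is what makes `getrange` cheap here; a hash-organized table would support the per-record calls equally well but not the range call.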
10
Detailed architecture
Data-path components: clients, REST API, routers, tablet controller, storage units, message broker
11
Detailed architecture
[Figure: the local region (clients, REST API, routers, tablet controller, storage units) connected through YMB to remote regions]
12
Tablet splitting and balancing
- Each storage unit has many tablets (horizontal partitions of the table)
- Tablets may grow over time
- Overfull tablets split
- Storage unit may become a hotspot
- Shed load by moving tablets to other servers
[Figure labels: storage unit, tablet]
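A rough sketch of the two ideas on this slide, splitting an overfull tablet and shedding load from a hot storage unit. The threshold `MAX_TABLET_RECORDS`, the midpoint split and the use of record count as a proxy for load are all assumptions chosen for the example; the slides do not state the actual policies.

```python
MAX_TABLET_RECORDS = 4  # assumed threshold, chosen only for this example


def split_if_overfull(tablet):
    """Split a sorted list of keys at its midpoint once it grows too large."""
    if len(tablet) <= MAX_TABLET_RECORDS:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]


def shed_load(units):
    """Move one tablet from the busiest storage unit to the least busy one.

    `units` maps a storage-unit name to its list of tablets. Load is
    approximated by the number of records held (an assumption).
    """
    def load(name):
        return sum(len(tablet) for tablet in units[name])

    hot = max(units, key=load)
    cold = min(units, key=load)
    if hot != cold and units[hot]:
        units[cold].append(units[hot].pop())
    return units
```

For example, a five-key tablet on one unit splits into two tablets, after which `shed_load` moves one of them to an empty second unit.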
16
Range queries
Router interval map and tablet contents:
Key range          Storage unit   Keys
MIN-Canteloupe     SU1            Apple, Avocado, Banana, Blueberry
Canteloupe-Lime    SU3            Canteloupe, Grape, Kiwi, Lemon
Lime-Strawberry    SU2            Lime, Mango, Orange
Strawberry-MAX     SU1            Strawberry, Tomato, Watermelon
Example: the router splits the range query Grapefruit…Pear? into Grapefruit…Lime? (SU3) and Lime…Pear? (SU2).
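The router's interval map lends itself to a binary search over the interval boundaries. The sketch below uses the boundaries and storage units from this slide; the function names, the use of an empty string for MIN, and the half-open range convention are assumptions for illustration.

```python
from bisect import bisect_right

# Lower bound of each interval (MIN is modeled as the empty string),
# paired with the storage unit that holds that interval.
BOUNDARIES = ["", "Canteloupe", "Lime", "Strawberry"]
UNITS = ["SU1", "SU3", "SU2", "SU1"]


def route_key(key):
    """Return the storage unit whose interval contains `key`."""
    return UNITS[bisect_right(BOUNDARIES, key) - 1]


def route_range(low, high):
    """Split [low, high) into per-tablet sub-ranges with their storage units."""
    pieces = []
    i = bisect_right(BOUNDARIES, low) - 1
    while i < len(BOUNDARIES) and BOUNDARIES[i] < high:
        start = max(low, BOUNDARIES[i])
        is_last = i + 1 == len(BOUNDARIES)
        end = high if is_last else min(high, BOUNDARIES[i + 1])
        pieces.append((start, end, UNITS[i]))
        i += 1
    return pieces
```

Calling `route_range("Grapefruit", "Pear")` yields the same two sub-queries as the slide: Grapefruit to Lime on SU3, then Lime to Pear on SU2.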
17
Updates
[Figure: an update passing through routers, message brokers and storage units (SU), with numbered steps]
1. Write key k
2. Write key k
3. Write key k
4. (no label)
5. SUCCESS
6. Write key k
7. Sequence # for key k
8. Sequence # for key k
18
Yahoo Message Bus
- Distributed publish-subscribe service
- Guarantees delivery once a message is published
  - Logging at site where message is published, and at other sites when received
- Guarantees messages published to a particular cluster will be delivered in same order at all other clusters
- Record updates are published to YMB by master copy
  - All replicas subscribe to the updates, and get them in same order for a particular record
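The last two points can be sketched as a master copy that stamps each update with a per-key sequence number and publishes it to a bus, with replicas applying whatever the bus delivers. All class and method names here are hypothetical, and delivery is synchronous for simplicity, whereas the real bus delivers asynchronously across regions.

```python
class MessageBusSketch:
    """Toy publish-subscribe bus: log first, then deliver in publish order."""

    def __init__(self):
        self.log = []
        self.subscribers = []

    def subscribe(self, replica):
        self.subscribers.append(replica)

    def publish(self, message):
        self.log.append(message)          # logged before delivery
        for replica in self.subscribers:
            replica.deliver(message)


class Replica:
    def __init__(self):
        self.records = {}                 # key -> (sequence number, value)

    def deliver(self, message):
        key, seq, value = message
        current_seq = self.records.get(key, (0, None))[0]
        if seq > current_seq:             # ignore stale or duplicate messages
            self.records[key] = (seq, value)


class MasterCopy(Replica):
    def __init__(self, bus):
        super().__init__()
        self.bus = bus

    def write(self, key, value):
        seq = self.records.get(key, (0, None))[0] + 1
        self.records[key] = (seq, value)
        self.bus.publish((key, seq, value))
        return seq
```

Because every update to a key originates at one master and the bus preserves publish order, each subscribed replica ends up with the same final value and sequence number for that key.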
21
Consistency model
Goal: make it easier for applications to reason about updates and cope with asynchrony
What happens to a record with primary key “Brian”?
[Figure: timeline of the record. It is inserted as v. 1, updated seven times (v. 2 to v. 8) and then deleted; these versions make up Generation 1.]
22
Consistency model
Read
[Figure: version timeline v. 1 to v. 8 in Generation 1, marking stale versions and the current version; a plain read may return a stale version or the current version]
23
Consistency model
Read up-to-date
[Figure: version timeline v. 1 to v. 8 in Generation 1, marking stale versions and the current version; the read returns the current version]
24
Consistency model
Read-critical(required version): Read ≥ v.6
[Figure: version timeline v. 1 to v. 8 in Generation 1, marking stale versions and the current version; the read returns v. 6 or a newer version]
25
Consistency model
Write
[Figure: version timeline v. 1 to v. 8 in Generation 1, marking stale versions and the current version; the write is applied at the current version]
26
Consistency model
Test-and-set-write(required version): Write if = v.7
[Figure: version timeline v. 1 to v. 8 in Generation 1, marking stale versions and the current version; the current version is no longer v. 7, so the write returns ERROR]
27
Consistency model
Write if = v.7: ERROR
[Figure: version timeline v. 1 to v. 8 in Generation 1, marking stale versions and the current version]
Mechanism: per record mastership
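The read and write calls from the consistency-model slides can be sketched against a single record whose master holds every version while a local replica lags behind. The class, the method names (derived from the slide labels) and the `replica_version` field are assumptions for illustration, not the PNUTS client API.

```python
class VersionMismatch(Exception):
    """Raised when a test-and-set write names a version that is not current."""


class RecordSketch:
    def __init__(self, value):
        self.versions = [value]     # versions[i] holds version i + 1 at the master
        self.replica_version = 1    # version visible at a lagging local replica

    @property
    def current_version(self):
        return len(self.versions)

    def read(self):
        """Plain read: served locally, so the result may be stale."""
        v = self.replica_version
        return v, self.versions[v - 1]

    def read_up_to_date(self):
        """Read the current version from the master copy."""
        v = self.current_version
        return v, self.versions[v - 1]

    def read_critical(self, required_version):
        """Return a version at least as new as `required_version`."""
        if required_version > self.current_version:
            raise ValueError("required version does not exist yet")
        if self.replica_version >= required_version:
            return self.read()
        return self.read_up_to_date()   # local copy too stale; ask the master

    def write(self, value):
        self.versions.append(value)
        return self.current_version

    def test_and_set_write(self, required_version, value):
        """Write only if the current version equals `required_version`."""
        if self.current_version != required_version:
            raise VersionMismatch(
                f"current is v.{self.current_version}, not v.{required_version}"
            )
        return self.write(value)
```

With eight versions written and the replica at v. 5, `read()` returns v. 5, `read_critical(6)` falls back to the master, and `test_and_set_write(7, ...)` raises because the current version is v. 8, mirroring the ERROR on the slide.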
28
Record and Tablet Mastership
- Data in PNUTS is replicated across sites
- Hidden field in each record stores which copy is the master copy
  - Updates can be submitted to any copy
  - Forwarded to master, applied in order received by master
- Record also contains origin of last few updates
  - Mastership can be changed by current master, based on this information
  - Mastership change is simply a record update
- Tablet mastership
  - Required to ensure primary key consistency
  - Can be different from record mastership
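One way to picture record mastership is a record that remembers where its last few updates came from and hands mastership to a region that produced all of them. The history length of 3 and the "all recent updates from one other region" rule are assumptions chosen for this sketch; the slides say only that the decision is based on the origin of the last few updates.

```python
from collections import deque

HISTORY = 3  # assumed length of "last few updates"


class MasteredRecord:
    def __init__(self, value, master):
        self.value = value
        self.master = master                   # hidden field: region of the master copy
        self.origins = deque(maxlen=HISTORY)   # origin regions of the last few updates
        self.forwarded = 0                     # updates that had to be forwarded

    def update(self, value, origin):
        if origin != self.master:
            self.forwarded += 1                # submitted elsewhere, forwarded to master
        self.value = value                     # applied in the order the master receives it
        self.origins.append(origin)
        self._maybe_move_mastership()

    def _maybe_move_mastership(self):
        # Assumed policy: move mastership when every remembered update
        # came from the same non-master region.
        if len(self.origins) == HISTORY and len(set(self.origins)) == 1:
            region = self.origins[0]
            if region != self.master:
                self.master = region           # itself just a record update
```

After mastership moves, later updates from that region no longer need forwarding, which is the point of tracking update origins.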
29
Other Features
- Per record transactions
- Copying a tablet (e.g., on failure)
  - Request copy
  - Publish checkpoint message
  - Get copy of tablet as of when checkpoint is received
  - Apply later updates
- Tablet split
  - Has to be coordinated across all copies
30
Query Processing
- Range scan can span tablets
  - Only one tablet scanned at a time
  - Client may not need all results at once
    - Continuation object returned to client to indicate where range scan should continue
- Notification
  - One pub-sub topic per tablet
  - Client knows about tables, does not know about tablets
    - Automatically subscribed to all tablets, even as tablets are added/removed
  - Usual problem with pub-sub: undelivered notifications, handled in usual way
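The continuation idea for range scans can be sketched as a function that scans at most one tablet per call and returns a token saying where to resume. Representing the continuation as the index of the next tablet, and tablets as non-empty sorted key lists, are assumptions made for the example.

```python
def scan_range(tablets, low, high, continuation=None):
    """Scan at most one tablet and return (rows, continuation).

    `tablets` is a list of non-empty, sorted key lists in key order.
    `continuation` is the index of the tablet to resume from, or None to
    start from the beginning. A returned continuation of None means the
    scan is complete.
    """
    index = 0 if continuation is None else continuation
    # Skip tablets whose keys all sort below the requested range.
    while index < len(tablets) and tablets[index][-1] < low:
        index += 1
    if index == len(tablets):
        return [], None
    rows = [key for key in tablets[index] if low <= key < high]
    nxt = index + 1
    has_more = nxt < len(tablets) and tablets[nxt][0] < high
    return rows, (nxt if has_more else None)


def scan_all(tablets, low, high):
    """Client-side loop: keep calling until no continuation is returned."""
    results = []
    rows, cont = scan_range(tablets, low, high)
    results.extend(rows)
    while cont is not None:
        rows, cont = scan_range(tablets, low, high, cont)
        results.extend(rows)
    return results
```

A client that needs only the first batch of results can simply stop after the first call and discard the continuation.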
32
Experimental setup
- Production PNUTS code
  - Enhanced with ordered table type
- Three PNUTS regions
  - 2 west coast, 1 east coast
  - 5 storage units, 2 message brokers, 1 router
  - West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array
  - East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk
- Workload
  - 1200-3600 requests/second
  - 0-50% writes
  - 80% locality
33
Inserts
Inserts required
- 75.6 ms per insert in West 1 (tablet master)
- 131.5 ms per insert into the non-master West 2, and
- 315.5 ms per insert into the non-master East.
35
Scalability
[Figure: average latency (ms, axis 0 to 160) versus number of storage units (1 to 6), for hash table and ordered table]
36
Request skew
[Figure: average latency (ms, axis 0 to 100) versus Zipf parameter (0 to 1), for hash table and ordered table]
37
Size of range scans
[Figure: average latency (ms, axis 0 to 8000) versus fraction of table scanned (0 to 0.12), for 30 clients and 300 clients]
38
Related work
- Distributed and parallel databases
  - Especially query processing and transactions
  - BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
- Distributed filesystems
  - Ceph, Boxwood, Sinfonia
- Distributed (P2P) hash tables
  - Chord, Pastry, …
- Database replication
  - Master-slave, epidemic/gossip, synchronous…
39
Conclusions and ongoing work
- PNUTS is an interesting research product
  - Research: consistency, performance, fault tolerance, rich functionality
  - Product: make it work, keep it (relatively) simple, learn from experience and real applications
- Ongoing work
  - Indexes and materialized views
  - Bundled updates
  - Batch query processing