Distributed Data Storage & Access Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems March 23, 2022 Some slide content courtesy Tanenbaum & van Steen
Dec 29, 2015
April 19, 2023
Reminders
Homework 2 Milestone 1 deadline imminent
Homework 2 Milestone 2 due Monday after Spring Break
Wed: Marie Jacob on the Q query answering system
Next week: Spring Break
Building Over a DHT
“Message passing” architecture to coordinate behavior among different nodes in an application
  Send a request to the “owner” of a key
  Request contains a custom-formatted message type
  Each node has an event handler loop
    switch (msg.type) {
      case one:
      case two:
      …
    }
  The request handler may send back a result, as appropriate
    Requires that the message include info about who the requestor was, how to return the data
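The handler loop described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Message` fields, the `PING`/`PONG` message types, and the in-memory `network` dictionary standing in for the real transport are all hypothetical names, not part of any DHT library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Message:
    type: str       # custom message type (hypothetical: "PING" / "PONG")
    payload: Any
    reply_to: str   # who the requestor was, so a result can be returned


class Node:
    def __init__(self, name: str, network: Dict[str, "Node"]):
        self.name = name
        self.network = network            # stand-in for the real transport
        self.inbox: List[Message] = []
        self.received: List[Any] = []
        # The dictionary plays the role of switch (msg.type) { ... }
        self.handlers: Dict[str, Callable[[Message], Optional[Message]]] = {
            "PING": self.on_ping,
            "PONG": self.on_pong,
        }
        network[name] = self

    def send(self, dest: str, msg: Message) -> None:
        self.network[dest].inbox.append(msg)

    def on_ping(self, msg: Message) -> Optional[Message]:
        # Send the payload back to the requestor as a result.
        return Message("PONG", msg.payload, reply_to=self.name)

    def on_pong(self, msg: Message) -> Optional[Message]:
        self.received.append(msg.payload)
        return None

    def run_once(self) -> None:
        """Drain the inbox, dispatching each message on its type."""
        while self.inbox:
            msg = self.inbox.pop(0)
            handler = self.handlers.get(msg.type)
            if handler is None:
                continue                  # unknown type: ignore in this sketch
            result = handler(msg)
            if result is not None:
                self.send(msg.reply_to, result)


network: Dict[str, Node] = {}
a = Node("a", network)
b = Node("b", network)
a.send("b", Message("PING", "hello", reply_to="a"))
b.run_once()   # b handles PING and replies to a
a.run_once()   # a handles PONG
```

After the two `run_once()` calls, `a.received` holds the echoed payload. The `reply_to` field is what lets the handler return data without knowing the requestor in advance.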
Example: How Do We Create a Hash Table (Hash Multiset) Abstraction?
We want the following:
  put (key, value)
  remove (key)
  valueSet = get (key)
How can we use Pastry to do this?
  route()
  deliver()
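One possible answer, sketched as a toy in-process simulation rather than against the real Pastry API: `route()` forwards a message to the node whose ID is numerically closest to the key's hash, and `deliver()` is the callback that runs at that owner. The class names (`ToyOverlay`, `ToyNode`, `DHTMultiset`), the 32-bit ID space, and the synchronous return value from `route()` are assumptions made for brevity; in a real deployment the reply would travel back as a separate message.

```python
import hashlib
from collections import defaultdict

ID_SPACE = 2 ** 32   # assumed toy ID space


def key_id(key: str) -> int:
    """Hash a key into the ID space (deterministic across runs)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % ID_SPACE


class ToyNode:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.store = defaultdict(list)    # key -> list of values (a multiset)

    def deliver(self, key: str, msg: dict):
        """Called at the owner of the key; dispatches on the message type."""
        op = msg["op"]
        if op == "put":
            self.store[key].append(msg["value"])
            return None
        if op == "get":
            return list(self.store.get(key, []))
        if op == "remove":
            self.store.pop(key, None)
            return None
        raise ValueError(f"unknown op {op!r}")


class ToyOverlay:
    def __init__(self, node_ids):
        self.nodes = [ToyNode(i) for i in node_ids]

    def owner(self, key: str) -> ToyNode:
        kid = key_id(key)

        def ring_distance(node: ToyNode) -> int:
            d = abs(node.node_id - kid)
            return min(d, ID_SPACE - d)

        return min(self.nodes, key=ring_distance)

    def route(self, key: str, msg: dict):
        # A real overlay would forward hop by hop; here we jump to the owner.
        return self.owner(key).deliver(key, msg)


class DHTMultiset:
    def __init__(self, overlay: ToyOverlay):
        self.overlay = overlay

    def put(self, key, value):
        self.overlay.route(key, {"op": "put", "value": value})

    def get(self, key):
        return self.overlay.route(key, {"op": "get"})

    def remove(self, key):
        self.overlay.route(key, {"op": "remove"})


table = DHTMultiset(ToyOverlay([10, 2 ** 30, 2 ** 31, 3 * 2 ** 30]))
table.put("penn", "cis455")
table.put("penn", "cis555")
```

Because every operation on a key is routed by the same hash, `put`, `get`, and `remove` for that key always land on the same node, which is what makes the multiset consistent without any global directory.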
An Alternate Programming Abstraction: GFS + MapReduce
Abstraction: Instead of sending messages, different pieces of code communicate through files
  Code is going to take a very “stylized” form; at each stage each machine will get input from files, send output to files
Files are generally persistent, name-able (in contrast to DHT messages, which are transient)
Files consist of blocks, which are the basic unit of partitioning (in contrast to object / data item IDs)
Background: Distributed Filesystems
Many distributed filesystems have been developed:
  NFS, SMB are the most prevalent today
  Andrew File System (AFS) was also fairly popular
Hundreds of other research filesystems, e.g., Coda, Sprite, … with different properties
NFS in a Nutshell
(Single) server, multi-client architecture
  Server is stateless, so clients must send all context (including position to read from) in each request
Plugs into VFS APIs, mostly mimics UNIX semantics
  Opening a file requires opening each dir along the way
  fd = open(“/x/y/z.txt”) will do a
    lookup for x from the root handle
    lookup for y from x’s handle
    lookup for z.txt from y’s handle
Server must commit writes immediately
Client does heavy caching – requires frequent polling for validity, and/or use of external locking service
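The component-by-component lookup can be illustrated with a toy server. This is a sketch under assumptions, not the NFS wire protocol: `ToyNFSServer`, its integer handles, and `open_path` are hypothetical names. It does show the two points above: one lookup per path component, and a read call that carries its own offset because the server keeps no per-client position.

```python
class ToyNFSServer:
    """Stateless toy server: every call carries a handle plus all context."""

    def __init__(self, tree: dict):
        self.objects = {}      # handle -> {name: child handle} or file bytes
        self.next_handle = 0
        self.lookups = 0       # counts lookup requests, for illustration
        self.root = self._add(tree)

    def _add(self, node) -> int:
        handle = self.next_handle
        self.next_handle += 1
        if isinstance(node, dict):
            self.objects[handle] = {
                name: self._add(child) for name, child in node.items()
            }
        else:
            self.objects[handle] = node
        return handle

    def lookup(self, dir_handle: int, name: str) -> int:
        self.lookups += 1
        return self.objects[dir_handle][name]

    def read(self, handle: int, offset: int, count: int) -> bytes:
        # The client supplies the offset; the server remembers nothing.
        return self.objects[handle][offset:offset + count]


def open_path(server: ToyNFSServer, path: str) -> int:
    """Resolve a path with one lookup per component, starting at the root."""
    handle = server.root
    for component in path.strip("/").split("/"):
        handle = server.lookup(handle, component)
    return handle


server = ToyNFSServer({"x": {"y": {"z.txt": b"hello world"}}})
fd = open_path(server, "/x/y/z.txt")
data = server.read(fd, 6, 5)
```

Resolving `/x/y/z.txt` costs three lookups here, which is one reason real clients cache handles and attributes so aggressively.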
The Google File System (GFS)
Goals:
  Support millions of huge (many-TB) files
  Partition & replicate data across thousands of unreliable machines, in multiple racks (and even data centers)
Willing to make some compromises to get there:
  Modified APIs – doesn’t plug into POSIX APIs
    In fact, relies on being built over Linux file system
  Doesn’t provide transparent consistency to apps!
    App must detect duplicate or bad records, support checkpoints
  Performance is only good with a particular class of apps:
    Stream-based reads
    Atomic record appends
GFS Basic Architecture & Lookups
Files broken into 64MB “chunks”
Master stores metadata; 3 chunkservers store each chunk
  A single “flat” file namespace maps to chunks + replicas
As with Napster, actual data transfer goes from chunkservers to client
No client-side caching of file data!
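With fixed 64MB chunks, a client can turn a byte range into chunk indexes with simple arithmetic before asking the master where those chunks live. The sketch below assumes a hypothetical helper `chunk_ranges` and a hypothetical `ToyMaster` lookup table; only the 64MB chunk size comes from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunks, as on the slide


def chunk_ranges(offset: int, length: int):
    """Split the byte range [offset, offset + length) into per-chunk pieces.

    Returns a list of (chunk_index, start_within_chunk, nbytes) tuples.
    """
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE
        start = offset % CHUNK_SIZE
        nbytes = min(CHUNK_SIZE - start, end - offset)
        pieces.append((index, start, nbytes))
        offset += nbytes
    return pieces


class ToyMaster:
    """Hypothetical metadata table: (file, chunk index) -> (handle, replicas)."""

    def __init__(self):
        self.chunks = {}

    def lookup(self, filename: str, index: int):
        return self.chunks[(filename, index)]


master = ToyMaster()
master.chunks[("/logs/a", 0)] = ("chunk-17", ["cs1", "cs4", "cs9"])
master.chunks[("/logs/a", 1)] = ("chunk-18", ["cs2", "cs4", "cs7"])

# A 10-byte read that straddles the boundary between chunk 0 and chunk 1.
pieces = chunk_ranges(CHUNK_SIZE - 4, 10)
plan = [(master.lookup("/logs/a", i), start, n) for i, start, n in pieces]
```

The client would then contact one replica from each returned list directly; the master is involved only in the metadata step.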
The Master: Metadata and Versions
Controls (and locks as appropriate):
  Mapping from files -> chunks within each namespace
  Controls reallocation, garbage collection of chunks
Maintains a log (replicated to backups) of all mutations to the above
Also knows mapping from chunk ID -> <version, {machines}>
  Doesn’t have persistent knowledge of what’s on chunkservers
  Instead, during startup, it polls them
  … Or when one joins, it registers
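A rough sketch of how the location half of that mapping could be rebuilt from chunkserver reports. The names (`ToyMasterState`, `register`) and the report format are hypothetical. Treating a replica with an older version as stale, and ignoring chunks the master does not know about, are simplifying assumptions of this sketch.

```python
class ToyMasterState:
    def __init__(self):
        # Logged (persistent) state: latest known version of each chunk.
        self.chunk_version = {}    # chunk_id -> version
        # Volatile state: rebuilt from chunkserver reports after a restart.
        self.locations = {}        # chunk_id -> set of chunkserver names

    def register(self, server: str, report: dict):
        """Process a chunkserver's report of {chunk_id: version}.

        Returns the chunk IDs whose replica on this server is stale.
        """
        stale = []
        for chunk_id, version in report.items():
            if chunk_id not in self.chunk_version:
                continue                       # unknown chunk: ignored here
            if version < self.chunk_version[chunk_id]:
                stale.append(chunk_id)         # replica missed a mutation
            else:
                self.locations.setdefault(chunk_id, set()).add(server)
        return stale


state = ToyMasterState()
state.chunk_version = {"c1": 3, "c2": 5}
stale_on_cs1 = state.register("cs1", {"c1": 3, "c2": 4})
stale_on_cs2 = state.register("cs2", {"c1": 3, "c2": 5, "c9": 1})
```

Keeping locations out of the log means the master never has to reconcile its records with what is actually on disk; the chunkservers are the source of truth for what they hold.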
Chunkservers
Each holds replicas of some of the chunks
For a given write operation, one of the owners of the chunk gets a lease – it becomes the primary, and all others the secondaries
  Receives requests for mutations
  Assigns an order
  Notifies the secondary nodes
  Waits for all to say they received the message
  Responds with a write-succeeded message
Failure results in inconsistent data!!
A Write Operation
1. Client asks Master for lease-owning chunkserver
2. Master gives ID of primary, secondary chunkservers; client caches
3. Client sends its data to all replicas, in any order
4. Once client gets ACK, it requests primary to do a write of those data items. Primary assigns serial numbers to these operations.
5. Primary forwards write to secondaries (in a chain).
6. Secondaries reply “SUCCESS”
7. Primary replies to client
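Steps 3 through 7 can be sketched as a small in-memory simulation. The class and method names (`Replica`, `Primary`, `client_write`) are hypothetical. The master lookup in steps 1 and 2 is omitted, and the primary here contacts each secondary directly rather than forwarding in a chain, a simplification of this sketch.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.buffered = {}    # data_id -> bytes pushed by the client (step 3)
        self.applied = []     # (serial, bytes) in the order they were applied

    def push_data(self, data_id: str, data: bytes) -> str:
        self.buffered[data_id] = data
        return "ACK"

    def apply(self, serial: int, data_id: str) -> str:
        self.applied.append((serial, self.buffered.pop(data_id)))
        return "SUCCESS"


class Primary(Replica):
    def __init__(self, name: str, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id: str) -> str:
        serial = self.next_serial          # step 4: assign a serial number
        self.next_serial += 1
        self.apply(serial, data_id)
        # Steps 5-6: forward the ordered write and collect replies.
        replies = [s.apply(serial, data_id) for s in self.secondaries]
        ok = all(r == "SUCCESS" for r in replies)
        return "SUCCESS" if ok else "FAILURE"   # step 7


def client_write(primary: Primary, data_id: str, data: bytes) -> str:
    replicas = [primary] + primary.secondaries
    acks = [r.push_data(data_id, data) for r in replicas]   # step 3
    if not all(a == "ACK" for a in acks):
        return "FAILURE"
    return primary.write(data_id)                           # steps 4-7


s1, s2 = Replica("s1"), Replica("s2")
p = Primary("p", [s1, s2])
r1 = client_write(p, "d1", b"first")
r2 = client_write(p, "d2", b"second")
```

Separating the data push from the write request means the serial number, not the arrival order of the data, decides the order in which every replica applies mutations.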
Append
GFS supports atomic append that multiple machines can use at the same time
Primary will interleave the requests in any order
  Each record will be written “at least once”!
Primary determines a position for the write, forwards this to the secondaries
Failures and the Client
If there is a failure in a record write or append, the client will generally retry
  If there was “partial success” in a previous append, there might be more than one copy on some nodes – and inconsistency
Client must handle this through checksums, record IDs, and periodic checkpointing
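A minimal sketch of the reader side of that contract, assuming each record carries an application-chosen ID and a CRC32 checksum. The record layout and the function names are hypothetical; GFS itself does not prescribe them.

```python
import zlib


def make_record(record_id: int, payload: bytes) -> dict:
    """Writer side: attach an ID and a checksum to each appended record."""
    return {"id": record_id, "payload": payload, "crc": zlib.crc32(payload)}


def read_valid_records(records):
    """Reader side: skip corrupt records and duplicates from retried appends."""
    seen = set()
    payloads = []
    for rec in records:
        if zlib.crc32(rec["payload"]) != rec["crc"]:
            continue                 # bad or partially written record
        if rec["id"] in seen:
            continue                 # duplicate left behind by a retry
        seen.add(rec["id"])
        payloads.append(rec["payload"])
    return payloads


r1 = make_record(1, b"alpha")
r2 = make_record(2, b"beta")
damaged = dict(r2)
damaged["crc"] ^= 1                  # simulate a record that fails its checksum

# What a replica might contain after a retried append and a failed write.
log = [r1, r1, damaged, r2]
clean = read_valid_records(log)
```

The checksum catches padding or partial writes, and the ID set turns “at least once” into “exactly once” from the application's point of view.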
GFS Performance
Many performance numbers in the paper
  Not enough context here to discuss them in much detail – would need to see how they compare with other approaches!
But: the results validate high scalability in terms of concurrent reads and concurrent appends, with data partitioned and replicated across many machines
Also show fast recovery from failed nodes
Not the only approach to many of these problems, but one shown to work at industrial strength!
A Popular Distributed Programming Model: MapReduce
In many circles, considered the key building block for much of Google’s data analysis
  A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
    … Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB).
Other similar languages: Yahoo’s Pig Latin and Pig; Microsoft’s Dryad
Cloned in open source: Hadoop, http://hadoop.apache.org/core/
So what is it? What’s it good for?
MapReduce: Simple Distributed Functional Programming Primitives
Modeled after Lisp primitives: map (apply function to all items in a collection) and reduce (apply function to set of items with a common key)
We start with:
  A user-defined function to be applied to all data,
    map: (key, value) -> (key, value)
  Another user-specified operation
    reduce: (key, {set of values}) -> result
  A set of n nodes, each with data
All nodes run map on all of their data, producing new data with keys
This data is collected by key, then shuffled, reduced
  Dataflow is through temp files on GFS
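The two primitives and the shuffle between them fit in a few lines when everything runs in one process. This is a single-machine sketch, not Google's or Hadoop's API: `run_mapreduce` is a hypothetical driver, an in-memory dictionary replaces the temp files, and each inner list stands in for one node's data. Word count serves as the example.

```python
from collections import defaultdict


def run_mapreduce(partitions, map_fn, reduce_fn):
    """Apply map_fn to every (key, value), group by output key, then reduce."""
    intermediate = defaultdict(list)          # stands in for the shuffle
    for partition in partitions:              # one partition per "node"
        for key, value in partition:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
    # Reduce each key's collected values, visiting keys in sorted order.
    return {k: reduce_fn(k, intermediate[k]) for k in sorted(intermediate)}


def wc_map(doc_id, text):
    """map: (doc_id, text) -> (word, 1) for every word in the text."""
    for word in text.split():
        yield word.lower(), 1


def wc_reduce(word, counts):
    """reduce: (word, {counts}) -> total count."""
    return sum(counts)


partitions = [
    [("d1", "the cat")],
    [("d2", "The dog the end")],
]
counts = run_mapreduce(partitions, wc_map, wc_reduce)
```

Only `wc_map` and `wc_reduce` are specific to the task. Everything in `run_mapreduce` is what the framework would distribute across machines.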
Some Example Tasks
Count word occurrences
  Map: output word with count 1
  Reduce: sum the counts
Distributed grep – all lines matching a pattern
  Map: filter by pattern
  Reduce: output set
Count URL access frequency
  Map: output each URL as key, with count 1
  Reduce: sum the counts
For each IP address, get the document with the most in-links
Number of queries by IP address (requires multiple steps)
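Distributed grep and URL access counting can be written against a small sort-based driver. The driver `map_reduce`, the sample log lines, and the pattern `error` are assumptions for illustration; sorting the intermediate pairs by key stands in for the shuffle.

```python
import re
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn):
    """Map every record, sort the pairs by key, reduce each group."""
    pairs = [kv for rec in records for kv in map_fn(rec)]
    pairs.sort(key=itemgetter(0))
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Distributed grep: map filters by pattern, reduce just outputs the matches.
def grep_map(record):
    line_no, line = record
    if re.search(r"error", line):
        yield line_no, line


def grep_reduce(line_no, lines):
    return lines[0]


# URL access frequency: map emits (url, 1), reduce sums the counts.
def url_map(url):
    yield url, 1


def url_reduce(url, ones):
    return url, sum(ones)


lines = [(1, "ok"), (2, "error: disk"), (3, "fine"), (4, "error: net")]
matches = map_reduce(lines, grep_map, grep_reduce)

accesses = ["/a", "/b", "/a"]
frequencies = map_reduce(accesses, url_map, url_reduce)
```

Grep shows that the reduce step can be trivial. The URL count has the same shape as word count, with a different key.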
MapReduce Dataflow Diagram (Default MapReduce Uses Filesystem)
[Figure labels: data partitions by key; map computation partitions; redistribution by output’s key; reduce computation partitions; coordinator]
Some Details
Fewer computation partitions than data partitions
  All data is accessible via a distributed filesystem with replication
  Worker nodes produce data in key order (makes it easy to merge)
  The master is responsible for scheduling, keeping all nodes busy
  The master knows how many data partitions there are, which have completed – atomic commits to disk
Fault tolerance: master triggers re-execution of work originally performed by failed nodes – to make their data available again
Locality: master tries to do work on nodes that have replicas of the data
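The locality and fault-tolerance points can be sketched as two small functions. The greedy policy below (give each task to an idle worker that already holds a replica of its input, otherwise to any idle worker) is an assumption chosen for brevity, not the scheduler of any particular system. The names `assign_tasks` and `tasks_to_rerun` are hypothetical.

```python
def assign_tasks(task_replicas: dict, idle_workers: list) -> dict:
    """Greedy assignment that prefers workers holding the task's input.

    task_replicas maps each task to the set of nodes storing its data.
    Returns {task: worker} for as many tasks as there are idle workers.
    """
    assignment = {}
    free = list(idle_workers)
    for task, replicas in task_replicas.items():
        if not free:
            break
        local = [w for w in free if w in replicas]
        worker = local[0] if local else free[0]   # fall back to a remote read
        free.remove(worker)
        assignment[task] = worker
    return assignment


def tasks_to_rerun(assignment: dict, failed_worker: str) -> list:
    """Work done by a failed node must be redone to make its output available."""
    return [task for task, worker in assignment.items() if worker == failed_worker]


task_replicas = {"t1": {"n1", "n2"}, "t2": {"n3"}, "t3": {"n1"}}
assignment = assign_tasks(task_replicas, ["n1", "n3", "n4"])
redo = tasks_to_rerun(assignment, "n4")
```

In this run `t3` ends up on `n4`, which holds no replica of its input, because `n1` was already taken. A real scheduler would weigh waiting for a local slot against starting immediately.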
Hadoop: A “Modern” Open-Source “Clone” of MapReduce + GFS
Underlying Hadoop: HDFS, a page-level replicating filesystem
  Modeled in part after GFS
  Supports “streaming” page access from each site
  Master/Slave: “Namenode” vs “Datanodes”
Source: Hadoop HDFS architecture documentation
Hadoop HDFS + MapReduce
Source: “Meet Hadoop”, Devaraj Das, Yahoo Bangalore & Apache
Hadoop MapReduce Architecture
“Jobtracker” (Master):
  Accepts jobs submitted by users
  Gives tasks to Tasktrackers – makes scheduling decisions, co-locates tasks to data
  Monitors task, tracker status, re-executes tasks if needed
“Tasktrackers” (Slaves):
  Run Map and Reduce tasks
  Manage storage, transmission of intermediate output
How Does this Relate to DHTs?
Consider replacing the filesystem with the DHT…
What Does MapReduce Do Well?
What are its strengths?
What about weaknesses?
MapReduce is a Particular Programming Model
… But it’s not especially general (though things like Pig Latin improve it)
Suppose we have autonomous application components that wish to communicate
We’ve already seen a few strategies:
  Request/response from client to server
    HTTP itself
  Asynchronous messages
    Router “gossip” protocols
    P2P “finger tables”, etc.
Are there general mechanisms and principles? (Of course!)
… Let’s first look at what happens if we need in-order messaging
Message-Queuing Model (1)
Four combinations for loosely-coupled communications using queues.
Message-Queuing Model (2)
Basic interface to a queue in a message-queuing system.
Primitive   Meaning
Put         Append a message to a specified queue
Get         Block until the specified queue is nonempty, and remove the first message
Poll        Check a specified queue for messages, and remove the first. Never block.
Notify      Install a handler to be called when a message is put into the specified queue.
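The four primitives map fairly directly onto Python's standard `queue.Queue`. The wrapper class `MessageQueue` below is a hypothetical single-process sketch. A real message-queuing system would add persistence and network transfer between queue managers. Passing the queue itself to the handler is a design choice of this sketch, so that the handler can decide whether to remove the message.

```python
import queue
import threading


class MessageQueue:
    def __init__(self):
        self._q = queue.Queue()        # FIFO, safe for use across threads
        self._handlers = []
        self._lock = threading.Lock()

    def put(self, msg) -> None:
        """Put: append a message to the queue, then run installed handlers."""
        self._q.put(msg)
        with self._lock:
            handlers = list(self._handlers)
        for handler in handlers:
            handler(self)

    def get(self):
        """Get: block until the queue is nonempty, and remove the first message."""
        return self._q.get()

    def poll(self):
        """Poll: remove the first message if there is one; never block."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def notify(self, handler) -> None:
        """Notify: install a handler called when a message is put."""
        with self._lock:
            self._handlers.append(handler)


mq = MessageQueue()
empty = mq.poll()          # nothing queued yet, returns immediately
mq.put("a")
mq.put("b")
first = mq.get()           # does not block because a message is waiting
second = mq.poll()

seen = []
mq.notify(lambda q: seen.append(q.poll()))
mq.put("c")                # the handler runs and drains the new message
```

`get` corresponds to synchronous (blocking) use, while `poll` and `notify` cover the asynchronous polling and event-driven styles.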
General Architecture of a Message-Queuing System (1)
The relationship between queue-level addressing and network-level addressing.
General Architecture of a Message-Queuing System (2)
The general organization of a message-queuing system with routers.
Benefits of Message Queueing
Allows both synchronous (blocking) and asynchronous (polling or event-driven) communication
Ensures messages are delivered (or at least readable) in the order received
The basis of many transactional systems
  e.g., Microsoft Message Queuing (MSMQ), IBM MQSeries, etc.