A New Partnership for Cross-Scale, Cross-Domain eScience
Post on 10-May-2015
573 Views
Preview:
DESCRIPTION
Transcript
A New Partnership for eScience
Bill Howe, UW
Ed Lazowska, UW
Garth Gibson, CMU
Christos Faloutsos, CMU
Peter Lee, CMU (DARPA)
Chris Mentzel, Moore
QuickTime™ and a decompressor
are needed to see this picture.
http://escience.washington.edu
3/12/09 Bill Howe, eScience Institute 4
3/12/09 Bill Howe, eScience Institute 5
\
3/12/09 Bill Howe, eScience Institute 6
The University of Washington eScience Institute
Rationale The exponential increase in sensors is transitioning all fields of science and
engineering from data-poor to data-rich Techniques and technologies include
Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive
Mission Help position the University of Washington at the forefront of research both in
modern eScience techniques and technologies, and in the fields that depend upon them
Strategy Bootstrap a cadre of Research Scientists Add faculty in key fields Build out a “consultancy” of students and non-research staff
QuickTime™ and a decompressor
are needed to see this picture.
3/12/09 Bill Howe, eScience Institute 7
Staff and Funding
Funding $1M/year direct appropriation from WA State Legislature $1.5M from Gordon and Betty Moore Foundation (joint with CMU) Multiple proposals outstanding
Staffing Dave Beck, Research Scientist: Biosciences and software eng. Jeff Gardner, Research Scientist: Astrophysics and HPC Bill Howe,Research Scientist: Databases, visualization, DISC Ed Lazowska, Director Erik Lundberg (50%), Operations Director Mette Peters, Health Sciences Liaison Chance Reschke, Research Engineer: large scale computing platforms
…plus a senior faculty search underway …plus a “consultancy” of students and professional staff
3/12/09 Bill Howe, eScience Institute 8
All science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis) New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) Medicine: ubiquitous digital records, MRI, ultrasound Oceanography: high-resolution models, cheap sensors, satellites Biology: lab automation, high-throughput sequencing
“Increase data collection exponentially with FlowCam!”
3/12/09 Bill Howe, eScience Institute 9
The long tail is getting fatter:
notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB) clusters become clouds (PB)
The Long Tailda
ta v
olum
e
rank
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Over-reliance on inadequate but familiar toolsCERN (~15PB/year)
LSST (~100PB)
PanSTARRS (~40PB)
Ocean Modelers <Spreadsheet
users>
SDSS (~100TB)
Seis-mologists
MicrobiologistsCARMEN (~50TB)
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
3/12/09 Bill Howe, eScience Institute 10
Case Study: Armbrust Lab
3/12/09 Bill Howe, eScience Institute 11
Armbrust Lab Tech Roadmap
ClustalW
scala
bili
ty
cluster/cloud
workstation/server
MAQsp
ecif
ic ta
sks
gene
ral t
asks
Excel
NCBI BLAST
Phred/Phrap
CloudBurst
CLC Genomics Machine
Hadoop/Dryad
Parallel Databases
?
Azure, AWS
WebBlast*
RDBMS
R
PPlacer*
AnnoJBioPython
Past
Present
Soon
Other tools
specialization
3/12/09 Bill Howe, eScience Institute 12
What Does Scalable Mean?
Operationally: In the past: “Works even if data doesn’t fit in main memory” Now: “Can make use of 1000s of cheap computers”
Formally: In the past: polynomial time and space. If you have N data
items, you must do no more than Nk operations Soon: logarithmic time and linear space. If you have N data
items, you must do no more than N log(N) operations
Soon, you’ll only get one pass at the data So you better make that one pass count
3/12/09 Bill Howe, eScience Institute 13
A Goal: Cross-Scale Solutions
Gracefully scale up from files to databases to cluster to cloud from MB to GB to TB to PB
“Gracefully” means: logical data independence no expensive ETL migration projects
“Gracefully” means: everyone can use it Hackers / Computational Scientists Lab/Field Scientists The Public K12 Legislators
3/12/09 Bill Howe, eScience Institute 14
Data Model Operations Services
GPL * * None for free
Workflow * arbitrary boxes-and-arrows
typing, provenance, Pegasus-style resource mapping, task parallelism
SQL / Relational Algebra
Relations Select, Project, Join, Aggregate, …
optimization, physical data independence, indexing, parallelism
MapReduce [(key,value)] Map, Reduce massive data parallelism, fault tolerance, scheduling
Pig Nested Relations
RA-like, with Nest/Flatten
optimization, monitoring, scheduling
DryadLINQ IQueryable, IEnumerable
RA + Apply + Partitioning
typing, massive data parallelism, fault tolerance
MPI Arrays/ Matrices
70+ ops data parallelism, full control
3/12/09 Bill Howe, eScience Institute 15
MapReduce
Many tasks process big data, produce big data Want to use hundreds or thousands of CPUs
... but this needs to be easy Parallel databases exist, but require DBAs and $$$$ …and do not easily scale to thousands of computers
MapReduce is a lightweight framework, providing: Automatic parallelization and distribution Fault-tolerance I/O scheduling Status and monitoring
3/12/09 Bill Howe, eScience Institute 16
public class LogEntry { public string user, ip; public string page; public LogEntry(string line) { string[] fields = line.Split(' '); this.user = fields[8]; this.ip = fields[9]; this.page = fields[5]; }}
public class UserPageCount{ public string user, page; public int count; public UserPageCount( string usr, string page, int cnt){ this.user = usr; this.page = page; this.count = cnt; }}
PartitionedTable<string> logs = PartitionedTable.Get<string>(@”file:\\…\logfile.pt”);var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\ulfar") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access; htmAccesses.ToPartitionedTable(@”file:\\…\results.pt”);
slide source: Christophe Poulain, MSR
A complete DryadLINQ program
3/12/09 Bill Howe, eScience Institute 17
Relational DatabasesPre-relational DBMS brittleness: if your data changed, your application often broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
physical data independence
logical data independence
files and pointers
relations
views
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
3/12/09 Bill Howe, eScience Institute 18
Relational Databases
Databases are especially, but exclusively, effective at “Needle in Haystack” problems:
Extracting small results from big datasets Transparently provide “old style” scalability Your query will always* finish, regardless of dataset size.
Indexes are easily built and automatically used when appropriateCREATE INDEX seq_idx ON sequence(seq);
SELECT seq FROM sequence WHERE seq = ‘GATTACGATATTA’;
*almost
3/12/09 Bill Howe, eScience Institute 19
Key Idea: Data Independence
physical data independence
logical data independence
files and pointers
relations
views
SELECT * FROM my_sequences
SELECT seq FROM ncbi_sequences WHERE seq = ‘GATTACGATATTA’;
f = fopen(‘table_file’);fseek(10030440);while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
3/12/09 Bill Howe, eScience Institute 20
Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
3/12/09 Bill Howe, eScience Institute 21
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws: 1. (+) identity: x+0 = x2. (/) identity: x/1 = x3. (*) distributes: (n*x+n*y) = n*(x+y)4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
3/12/09 Bill Howe, eScience Institute 22
Shared Nothing Parallel Databases
Teradata Greenplum Netezza Aster Data Systems DataAllegro Vertica MonetDB
Microsoft
Recently commercialized as “Vectorwise”
Case Study: Astrophysics Simulation
24
N-body Astrophysics Simulation
• 15 years in dev
• 109 particles
• Gravity
• Months to run
• 7.5 million CPU hours
• 500 timesteps
• Big Bang to now
Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska
25
Q1: Find Hot Gas
SELECT id
FROM gas
WHERE temp > 150000
26
Single Node: Query 1
169 MB 1.4 GB 36 GB
27
Multiple Nodes: Query 1
Database Z
28
Multiple Nodes:Query 2
Database Z
29
Q4: Gas Deletion
SELECT gas1.id
FROM gas1
FULL OUTER JOIN gas2
ON gas1.id=gas2.id
WHERE gas2.id=NULL
Particles removed between two timesteps
30
Single Node: Query 4
31
Multiple Nodes: Query 4
3/12/09 Bill Howe, eScience Institute 32
Ease of Use
star43 = FOREACH rawGas43 GENERATE $0 AS pid:long; star60 = FOREACH rawGas60 GENERATE $0 AS pid:long; groupedGas = COGROUP star43 BY pid, star60 BY pid;
selectedGas = FOREACH groupedGas GENERATE FLATTEN((IsEmpty(gas43) ? null : gas43)) as s43, FLATTEN((IsEmpty(gas60) ? null : gas60)) as s60;
destroyed = FILTER selectedGas BY s60 is null;
Visualization and Mashups
Dancing with Data
3/12/09 Bill Howe, eScience Institute 34
Data explosion, again
Data growth is outpacing Moore’s Law Why? Cost of acquisition has dropped through the floor Every pairwise comparison of datasets
generates a new dataset -- N2 growth
So: Scalable analysis is necessary But: Scalable analysis is hard
3/12/09 Bill Howe, eScience Institute 35
It’s not just the size….
Corollary: # of apps scales as N2
Every pairwise comparison motivates a new application
To keep up, we need to entrain new programmers, make existing programmers more productive, or both
3/12/09 Bill Howe, eScience Institute 36
Satellite Images + Crime Incidence Reports
3/12/09 Bill Howe, eScience Institute 37
Twitter Feed + Flickr Stream
3/12/09 Bill Howe, eScience Institute 38
Zooplankton and Temperature
<Vis movie>
QuickTime™ and a decompressor
are needed to see this picture.
3/12/09 Bill Howe, eScience Institute 39
Why Visualization?
High bandwidth of the human visual cortex Query-writing presumes a precise goal Try this in SQL: “What does the salt wedge look like?”
3/12/09 Bill Howe, eScience Institute 40
Data Product Ensembles
source: Antonio Baptista, Center for Coastal Margin Observation and Prediction
3/12/09 Bill Howe, eScience Institute 41
Example: Find matching sequences
Given a set of sequences Find all sequences equal to
“GATTACGATATTA”
3/12/09 Bill Howe, eScience Institute 42
Example System: Teradata
AMP = unit of parallelism
3/12/09 Bill Howe, eScience Institute 43
Example System: Teradata
SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today()
join
select
scan scan
date = today()
o.item = i.item
Order oItem i
Find all orders from today, along with the items ordered
3/12/09 Bill Howe, eScience Institute 44
Example System: Teradata
AMP 1 AMP 2 AMP 3
selectdate=today()
selectdate=today()
selectdate=today()
scanOrder o
scanOrder o
scanOrder o
hashh(item)
hashh(item)
hashh(item)
AMP 1 AMP 2 AMP 3
3/12/09 Bill Howe, eScience Institute 45
Example System: Teradata
AMP 1 AMP 2 AMP 3
scanItem i
AMP 1 AMP 2 AMP 3
hashh(item)
scanItem i
hashh(item)
scanItem i
hashh(item)
3/12/09 Bill Howe, eScience Institute 46
Example System: Teradata
AMP 1 AMP 2 AMP 3
join join joino.item = i.item o.item = i.item o.item = i.item
contains all orders and all lines where hash(item) = 1
contains all orders and all lines where hash(item) = 2
contains all orders and all lines where hash(item) = 3
3/12/09 Bill Howe, eScience Institute 47
MapReduce Programming Model
Input & Output: each a set of key/value pairs Programmer specifies two functions:
Processes input key/value pair Produces set of intermediate pairs
Combines all intermediate values for a particular key Produces a set of merged output values (usually just
one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
3/12/09 Bill Howe, eScience Institute 48
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Document Processing
3/12/09 Bill Howe, eScience Institute 49
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Word length histogram
How many “big”, “medium”, and “small” words are used?
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
Example: Word length histogram
Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of nature'sgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying it's foundation on such principles and organizing it's power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Word length histogram
Split the document into chunks and process each chunk on a different computer
Chunk 1
Chunk 2
(yellow, 20)(red, 71)(blue, 93)(pink, 6 )
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
Map Task 1(204 words)
Map Task 2(190 words)
(key, value)
(yellow, 17)(red, 77)(blue, 107)(pink, 3)
Example: Word length histogram
3/12/09 Bill Howe, eScience Institute 53
(yellow, 17)(red, 77)(blue, 107)(pink, 3)
(yellow, 20)(red, 71)(blue, 93)(pink, 6 )
Reduce tasks
(yellow, 17)(yellow, 20)
(red, 77)(red, 71)
(blue, 93)(blue, 107)
(pink, 6)(pink, 3)
Example: Word length histogram
A Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of nature'sgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying it's foundation on such principles and organizing it's power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.
Map task 1
Map task 2
“Shuffle step”
(yellow, 37)
(red, 148)
(blue, 200)
(pink, 9)
3/12/09 Bill Howe, eScience Institute 54
New Example: What does this do?
map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, 1);
reduce(String output_key, Iterator intermediate_values): // output_key: word // output_values: ???? int result = 0; for each v in intermediate_values: result += v; Emit(result);
slide source: Google, Inc.
3/12/09 Bill Howe, eScience Institute 55
Before RDBMS: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd 1979
Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
Relational Database Management Systems (RDBMS)
3/12/09 Bill Howe, eScience Institute 56
MapReduce is a Nascent Database Engine
Access Methods and Scheduling:
Query Language:
Query Optimizer:
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
Pig Latin
Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
3/12/09 Bill Howe, eScience Institute 57
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
MapReduce and Hadoop
MR introduced by Google Published paper in OSDI 2004
MR: high-level programming model and implementation for large-scale parallel data processing
Hadoop Open source MR implementation Yahoo!, Facebook, New York Times
3/12/09 Bill Howe, eScience Institute 58
operators: • LOAD• STORE • FILTER• FOREACH … GENERATE • GROUP
binary operators: • JOIN• COGROUP• UNION
other support:• UDFs• COUNT• SUM • AVG• MIN/MAX
Additional operators:http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
A Query Language for MR: Pig Latin
High-level, SQL-like dataflow language for MR Goal: Sweet spot between SQL and MR
Applies SQL-like, high-level language constructs to accomplish low-level MR programming.
3/12/09 Bill Howe, eScience Institute 59
New Task: k-mer Similarity
Given a set of database sequences and a set of query sequences
Return the top N similar pairs, where similarity is defined as the number of common k-mers
3/12/09 Bill Howe, eScience Institute 60
Pig Latin program
D = LOAD ’db_sequences.fasta' USING FASTA() AS (did,dsequence);
Q = LOAD ’query_sequences.fasta' USING FASTA() AS (qid,qsequence);
Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence));Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence));
R = JOIN Kd BY kmer, Kq BY kmer
G = GROUP R BY (qid, did);C = FOREACH G GENERATE qid, did, COUNT(kmer) as scoreT = FILTER C BY score > 4
STORE g INTO seqs.txt';
3/12/09 Bill Howe, eScience Institute 61
New Task: Alignment
RMAP alignment implemented in HadoopMichael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009
Goal: Align reads to a reference genome
Overview: Map: Split reads and reference into k-mers Reduce: for matching k-mers, find end-to-end
alignments (seed and extend)
3/12/09 Bill Howe, eScience Institute 62
MapReduce Overhead
QuickTime™ and a decompressor
are needed to see this picture.
3/12/09 Bill Howe, eScience Institute 63
Elastic MapReduce
Custom Jar Java
Streaming Any language that can read/write stdin/stdout
Pig Simple data flow language
Hive SQL
3/12/09 Bill Howe, eScience Institute 64
Taxonomy of Parallel Architectures
Easiest to program, but $$$$
Scales to 1000s of computers
top related