Hadoop as a Service
Boston Azure
29-March-2012
Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license http://creativecommons.org/licenses/by-nc-sa/3.0/
Boston Azure User Group
http://www.bostonazure.org
@bostonazure
Bill Wilder
http://blog.codingoutloud.com
@codingoutloud
Big Data tools for the Windows Azure cloud platform
show tables;
describe hivesampletable;
describe extended hivesampletable;
select count(*) from hivesampletable;
select country from hivesampletable;
select distinct country from hivesampletable;
select avg(querydwelltime) from hivesampletable;
select devicemodel, SUM(querydwelltime) as totaldwelltime
  from hivesampletable
  group by devicemodel
  order by totaldwelltime DESC
  limit 10;
• Batch, not real-time or transactional
• Scale out with commodity hardware
• Big customers like LinkedIn and Yahoo!
  – Clusters with 10s of Petabytes
• (pssst… these fail… daily)
• Import data from Azure Blob, Data Market, S3
  – Or from files, like we will do in our example
Word Frequency Counter – how?
• The “hello world” of Hadoop / MapReduce
  – But we start without Hadoop / MapReduce
• Input: large corpus
  – Wikipedia extract for example
  – Can handle into PB
• Output: list of words, ordered by frequency
the 31415
be 9265
to 3589
of 793
and 238
… …
Simple Word Frequency Counter
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

const string file = @"e:\dev\azure\hadoop\wordcount\davinci.txt";
var text = File.ReadAllText(file);

// Split the text into words, normalized to lower case
var matches = Regex.Matches(text, @"\b[\w]*\b");
var words = (from m in matches.Cast<Match>()
             where !string.IsNullOrEmpty(m.Value)
             orderby m.Value.ToLower()
             select m.Value.ToLower()).ToArray();

// Count the occurrences of each word
var wordCounts = new Dictionary<string, int>();
foreach (var word in words)
{
    if (wordCounts.ContainsKey(word))
        wordCounts[word]++;
    else
        wordCounts.Add(word, 1);
}
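The counter above fills the dictionary but stops before producing the frequency-ordered list described as the output. Here is a minimal Java sketch of the whole single-machine version, including that final sort. The class and method names are hypothetical, and treating a "word" as a run of letters or digits is an assumption that differs slightly from the `\w` regex in the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine word frequency counter (no Hadoop involved).
class WordFrequencySketch {

    // Count lower-cased words; a "word" is a run of letters or digits (assumption).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order entries by descending count, breaking ties alphabetically.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : byFrequency(count("to be or not to be"))) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For the input "to be or not to be" this prints `be 2`, `to 2`, `not 1`, `or 1`, one per line. Everything lives in memory on one machine, which is exactly the limit that MapReduce is meant to remove.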
• Read input, write output – that’s all
  – Function signature: Map(text)
  – Parses the text and returns { key, value } pairs
  – Map(“a b a foo”) returns {a, 1}, {b, 1}, {a, 1}, {foo, 1}
* for Word Frequency Counter!
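The Map(text) signature can be sketched in plain Java without any Hadoop classes. This is only an illustration of the contract; `MapSketch` and its `map` method are hypothetical names, and the pairs are modeled with `Map.Entry` rather than Hadoop's writable types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the Map step: one { word, 1 } pair per token.
class MapSketch {

    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text); // splits on whitespace
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("a b a foo")); // prints [a=1, b=1, a=1, foo=1]
    }
}
```

Note that the mapper does no counting: the repeated "a" is emitted twice, and adding the values up is left to the Reduce step.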
Actual Java Map Function
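// Fields of the enclosing Mapper class:
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException
{
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);  // emit { word, 1 }
  }
}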
Step 3. Shuffle
• Shuffle collects all data from Map, organizes by key, and redistributes data to the nodes for Reduce
Step 3. Shuffle – example
• Mapper input: “the Bruins are the best in the NHL”
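To make the shuffle concrete for this input, here is a minimal in-memory Java sketch that groups the mapper's { word, 1 } pairs by key. It is an illustration, not how Hadoop implements the step: `ShuffleSketch` and its methods are hypothetical names, a sorted `TreeMap` stands in for the sort-and-group phase, and no data actually moves between nodes. The `partitionFor` helper follows the hash-modulo formula used by Hadoop's default hash partitioner to choose a reducer for each key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory model of Map followed by Shuffle.
class ShuffleSketch {

    // Emit { word, 1 } for every token, then collect the values under each key.
    static Map<String, List<Integer>> mapAndShuffle(String text) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // keys kept sorted
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            grouped.computeIfAbsent(tokens.nextToken(), k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Pick which of numReducers reduce tasks receives a key (hash modulo).
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(mapAndShuffle("the Bruins are the best in the NHL"));
        // prints {Bruins=[1], NHL=[1], are=[1], best=[1], in=[1], the=[1, 1, 1]}
    }
}
```

All three values for "the" end up under one key, so a single Reduce call sees them together. Capitalized words sort first because the keys are compared as raw strings; the mapper does not lower-case them.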
• Output from Step 3. Shuffle has been distributed to datanodes
• Your “Reducer” is called on local data
  – Repeatedly, until all complete
  – Tasks run in parallel on nodes
• This is very simple for Word Frequency Counter!
  – Function signature: Reduce(key, values[])
  – Adds up all the values and returns { key, sum }
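The Reduce(key, values[]) signature can be sketched the same way in plain Java. Again this only illustrates the contract under assumed names (`ReduceSketch`, `reduce`); the result pair is modeled with `Map.Entry` instead of Hadoop's writable types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Reduce step: sum the values for one key.
class ReduceSketch {

    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<String, Integer>(key, sum);
    }

    public static void main(String[] args) {
        System.out.println(reduce("the", Arrays.asList(1, 1, 1))); // prints the=3
    }
}
```

Each call handles exactly one key, which is why calls for different keys can run in parallel on different nodes.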
Actual Java Reduce Function
// Field of the enclosing Reducer class:
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException
{
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);  // emit { word, total }
}
Step 5. Celebrate!
• You are done – you have your output
• In a more complex scenario, you might repeat
  – Hadoop tool ecosystem knows how to do this
• There are other projects in the Hadoop ecosystem for …
  – Multi-step jobs
  – Managing a data warehouse
  – Supporting ad hoc querying
  – And more!
www.hadooponazure.com
demo
There’s a LOT MORE to the Hadoops…
• Hadoop streaming interface allows other languages
  – C#