“Best friend” Big Data & Hadoop showcase Dušan Zamurović @codecentricRS
Name: Dušan Zamurović
Where do I come from?
◦ codecentric Novi Sad
What do I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
◦ Ajax with DWR lib
◦ Android
◦ currently Big Data (reporting QA)
Who am I?
me
Big Data
Map/Reduce algorithm
Hadoop platform
Pig language
Showcase
◦ Java Map/Reduce implementation
◦ Pig implementation
Conclusion
What will I talk about?
A revolution that will transform how we live, work, and think.
3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity
Everyday use cases
◦ Beautiful
◦ Useful
◦ Funny
Big Data
The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data were created worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow by a factor of 300
◦ By 2020 the digital universe will amount to 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020)
Big Data - Volume
The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media
>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020
Big Data - Variety
Generated at high speed
Needs real-time processing
Example I
◦ Financial world
◦ Thousands or millions of transactions
Example II
◦ Retail
◦ Analyze click streams to offer recommendations
Big Data – Velocity
Value of Big Data is potentially great but can be released only with the right combination of people, processes and technologies.
…unlock significant value by making information transparent and usable at much higher frequency
Big Data - Value
Measuring heartbeat of a city - Rio de Janeiro
More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation
Beer and diapers
Possible issues?
◦ Privacy, security, intellectual property, liability…
Big Data - Value
"Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key."
- research publication http://research.google.com/archive/mapreduce.html
Map/Reduce
Map/Reduce
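The model from the quote can be sketched in plain Java, without Hadoop. This is only a single-process illustration, not the Hadoop API: the class name `MiniMapReduce` and the word-count task are assumptions picked for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the Map/Reduce model (not the Hadoop API).
class MiniMapReduce {

    // Map: one input record -> a list of intermediate (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Reduce: merge all intermediate values that share the same key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group by key (the "shuffle"), reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, hadoop=1}
        System.out.println(run(Arrays.asList("big data", "big hadoop")));
    }
}
```

On a real cluster the grouping step is done by the framework between the map and reduce phases; here it is just a sorted map in memory.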
In the beginning, there was Nutch
Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally intensive
Hadoop && RDBMS
◦ “Get data to process” or “send code where data is”
◦ Designed to run on a large number of machines
◦ Separate storage
Hadoop
Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX
    ◦ To enable streaming access to file system data
Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data
HDFS
NameNode
◦ Master server, central component
◦ HDFS cluster has a single NameNode
◦ Manages clients’ access
◦ Keeps track of where data is kept
◦ Single point of failure
Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
    ◦ Does not provide any real redundancy
HDFS Architecture
DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes
    ◦ Data replication
TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
◦ Sends heartbeats to the JobTracker ♥__ ♥__ ♥__
HDFS Architecture
JobTracker
◦ Farms tasks out to specific nodes in the cluster
◦ Point of failure for MapReduce
How does it go?
1. Client submits jobs to the JobTracker
2. JobTracker asks the NameNode where the data is
3. JobTracker locates TaskTracker nodes
4. JobTracker submits tasks to the TaskTrackers
5. TaskTrackers send heartbeats ♥__ ♥__ ♥__
    1. Job failed: TaskTracker informs, JobTracker decides
    2. Job done: JobTracker updates status
6. Client can poll JobTracker for information
HDFS Architecture
Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell
Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan
Apache Pig
Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias
Apache Pig – Pig Latin
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Data types
◦ Simple
    int – signed 32-bit integer
    long – signed 64-bit integer
    float – 32-bit floating point
    double – 64-bit floating point
    chararray – UTF-8 string
    bytearray – blob
    boolean – since Pig 0.10
    datetime
◦ Complex
    tuple – an ordered set of fields, e.g. (21,32)
    bag – a collection of tuples, e.g. {(21,32),(32,43)}
    map – a set of key value pairs, e.g. [pig#latin]
Apache Pig – Pig Latin data types
Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?
Apache Pig – Pig Latin schemas
/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}

/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown
Arithmetic: +, -, *, /, %, ? :
Boolean: AND, OR, NOT
Cast
Comparison: ==, !=, <, >, <=, >=, matches
Type construction: (), {}, [] incl. eq. functions
Relational
◦ GROUP
◦ DEFINE
◦ FILTER
◦ FOREACH
◦ JOIN
◦ UNION
◦ STORE
◦ LOAD
◦ SPLIT
Apache Pig – Pig Latin operators
Eval functions
◦ AVG, MAX, MIN, COUNT, SUM, …
Load/Store functions
◦ BinStorage
◦ JsonLoader, JsonStorage
◦ PigStorage
Math functions
◦ ABS, COS, …, EXP, RANDOM, ROUND, …
String functions
◦ TRIM, LOWER, SUBSTRING, REPLACE, …
Datetime functions
◦ *Between, Get*, …
Tuple, Bag, Map functions
◦ TOTUPLE, TOBAG, TOMAP
Apache Pig – Pig Latin built in
User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy
How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc
How to use a UDF?
◦ Register
◦ Define the name of the UDF if you like
◦ Call it
Apache Pig – extend Pig Latin
“Best friend” Hadoop showcase
Imaginary social network
Lots of users…
◦ … with their friends, girlfriends, boyfriends, wives, husbands, mistresses, etc…
A new relationship arises…
◦ … but the new friend is not shown in the news feed
Where are his/her activities?
◦ Hidden, marked as not important
Showcase: a problem
Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Record user’s activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of “important” friends
Showcase: a solution
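The last two steps (calculate values, show only “important” friends) could look roughly like this for a single user. A minimal sketch in plain Java: the class `ImportantFriends`, the `Event` holder and the threshold value are all assumptions, the deck does not say how “important” is decided.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sum event weights per friend and keep the "important" ones.
class ImportantFriends {

    // One logged activity of a single source user: whom it targeted and its weight.
    static class Event {
        final String targetUser;
        final int eventWeight;

        Event(String targetUser, int eventWeight) {
            this.targetUser = targetUser;
            this.eventWeight = eventWeight;
        }
    }

    // Total weight per target user, in order of first appearance.
    static Map<String, Integer> totalWeights(List<Event> events) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Event e : events) {
            totals.merge(e.targetUser, e.eventWeight, Integer::sum);
        }
        return totals;
    }

    // Friends whose total weight reaches the (assumed) threshold.
    static List<String> important(List<Event> events, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totalWeights(events).entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        log.add(new Event("ruby.blue", 1));
        log.add(new Event("jane.doe", 10));
        log.add(new Event("ruby.blue", 3));
        // prints [jane.doe]
        System.out.println(important(log, 5));
    }
}
```

With real log volumes this aggregation does not fit on one machine, which is where Map/Reduce and Pig come in.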
Events recorded in JSON format
{
    "timestamp": 1341161607860,
    "sourceUser": "marry.lee",
    "targetUser": "ruby.blue",
    "eventName": "VIEW_PHOTO",
    "eventWeight": 1
}
Showcase: input data
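The mapper reads `tokens[1]`, `tokens[2]` and `tokens[4]` from `MyJsonParser.parse`, whose code is not part of the deck. A hypothetical stand-in, assuming the parser returns the five field values in the order they appear in the record, might look like this. `FlatEventParser` is an invented name, it takes a `String` instead of a Hadoop `Text`, and the regex approach only copes with flat records like the one above; a real job would more likely use a JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for MyJsonParser: handles only flat event records.
class FlatEventParser {

    // Assumed field order, matching the indices the mapper reads:
    // [1] sourceUser, [2] targetUser, [4] eventWeight.
    static final String[] FIELDS = {
        "timestamp", "sourceUser", "targetUser", "eventName", "eventWeight"
    };

    static String[] parse(String json) {
        String[] values = new String[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            // Matches "name": "text" or "name": 123 and captures the bare value.
            Pattern p = Pattern.compile(
                "\"" + FIELDS[i] + "\"\\s*:\\s*(?:\"([^\"]*)\"|(-?\\d+))");
            Matcher m = p.matcher(json);
            if (!m.find()) {
                throw new IllegalArgumentException("Missing field: " + FIELDS[i]);
            }
            values[i] = m.group(1) != null ? m.group(1) : m.group(2);
        }
        return values;
    }

    public static void main(String[] args) {
        String record = "{ \"timestamp\": 1341161607860, \"sourceUser\": \"marry.lee\", "
            + "\"targetUser\": \"ruby.blue\", \"eventName\": \"VIEW_PHOTO\", \"eventWeight\": 1}";
        String[] tokens = parse(record);
        // prints marry.lee -> ruby.blue weight 1
        System.out.println(tokens[1] + " -> " + tokens[2] + " weight " + tokens[4]);
    }
}
```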
Showcase: input data
public enum EventType {
    VIEW_DETAILS(3),
    VIEW_PROFILE(10),
    VIEW_PHOTO(1),
    COMMENT(2),
    COMMENT_LIKE(1),
    WALL_POST(3),
    MESSAGE(1);
    …
}
Showcase: Java M/R
public static class InteractionMap extends Mapper<LongWritable, Text, Text, InteractionWritable> {
    @Override
    protected void map(LongWritable offset, Text text, Context context) … {
        …
    }
}

// reduce() belongs to a separate Reducer<Text, InteractionWritable, Text, Text> subclass
// (its class declaration is not shown on the slide)
@Override
protected void reduce(Text token, Iterable<InteractionWritable> interactions, Context context) … {
    …
}
Showcase: Java M/R
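void map(LongWritable offset, Text text, Context context) throws IOException, InterruptedException {
    // tokens follow the JSON field order: [1] sourceUser, [2] targetUser, [4] eventWeight
    String[] tokens = MyJsonParser.parse(text);
    String sourceUser = tokens[1];
    String targetUser = tokens[2];
    int eventWeight = Integer.parseInt(tokens[4]);
    // key: source user, value: (target user, weight of this single event)
    context.write(new Text(sourceUser), new InteractionWritable(targetUser, eventWeight));
}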
Showcase: Java M/R
void reduce(Text token, Iterable<InteractionWritable> iActions, Context context) … {
    // per target user: accumulated weight and number of events
    Map<Text, InteractionValuesWritable> iActionsGroup = new HashMap<Text, InteractionValuesWritable>();
    Iterator<InteractionWritable> iActionsIterator = iActions.iterator();
    while (iActionsIterator.hasNext()) {
        InteractionWritable iAction = iActionsIterator.next();
        Text targetUser = new Text(iAction.getTargetUser().toString());
        int weight = iAction.getEventWeight().get();
        int count = 1;
        …
Showcase: Java M/R
        …
        // add this event to whatever was already collected for the target user
        InteractionValuesWritable iActionValues = iActionsGroup.get(targetUser);
        if (iActionValues != null) {
            weight += iActionValues.getWeight().get();
            count = iActionValues.getCount().get() + 1;
        }
        iActionsGroup.put(targetUser, new InteractionValuesWritable(weight, count));
    }

    List<Entry<Text, InteractionValuesWritable>> orderedInteractions = sortInteractionsByWeight(iActionsGroup);
    for (Entry<Text, InteractionValuesWritable> entry : orderedInteractions) {
        InteractionValuesWritable value = entry.getValue();
        String resLine = … // entry.key + value.weight + value.count
        context.write(token, new Text(resLine));
    }
}
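The same reduce logic, stripped of Hadoop types, can be tried out locally. This is a sketch under assumptions: `ReduceSketch` is an invented class, interactions are passed as `{targetUser, weight}` string pairs, and `sortByWeight` only guesses at what `sortInteractionsByWeight` does (highest total weight first, ties broken by name), since that helper is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the reduce step for one source user (no Hadoop types).
class ReduceSketch {

    // Groups by target user: value[0] = total weight, value[1] = event count.
    static Map<String, int[]> group(List<String[]> interactions) {
        Map<String, int[]> grouped = new HashMap<>();
        for (String[] interaction : interactions) {
            int[] totals = grouped.computeIfAbsent(interaction[0], k -> new int[2]);
            totals[0] += Integer.parseInt(interaction[1]);
            totals[1] += 1;
        }
        return grouped;
    }

    // Assumed behaviour of sortInteractionsByWeight: highest total weight first.
    static List<Map.Entry<String, int[]>> sortByWeight(Map<String, int[]> grouped) {
        List<Map.Entry<String, int[]>> entries = new ArrayList<>(grouped.entrySet());
        entries.sort((a, b) -> {
            int byWeight = Integer.compare(b.getValue()[0], a.getValue()[0]);
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    // Produces one output line per target user: source target weight count.
    static List<String> reduce(String sourceUser, List<String[]> interactions) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, int[]> e : sortByWeight(group(interactions))) {
            lines.add(sourceUser + " " + e.getKey() + " "
                + e.getValue()[0] + " " + e.getValue()[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> interactions = Arrays.asList(
            new String[] {"marry.lee", "1"},
            new String[] {"jane.doe", "10"},
            new String[] {"marry.lee", "3"});
        // prints casie.keller jane.doe 10 1, then casie.keller marry.lee 4 2
        for (String line : reduce("casie.keller", interactions)) {
            System.out.println(line);
        }
    }
}
```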
Showcase: M/R result
casie.keller petar.petrovic 97579 32554
casie.keller marry.lee 97284 32094
casie.keller jane.doe 97247 32400
casie.keller domenico.quatro-formaggi 96712 32106
casie.keller esmeralda.aguero 96665 32251
casie.keller jason.bourne 96499 32043
casie.keller jose.miguel 96304 31927
casie.keller steve.smith 95929 32267
casie.keller john.doe 95664 31996
casie.keller swatka.mawa 95421 31785
casie.keller lee.young 95400 31758
casie.keller ruby.blue 95132 32181
domenico.quatro-formaggi jane.doe 97442 32492
domenico.quatro-formaggi ruby.blue 97072 31916
domenico.quatro-formaggi jason.bourne 96967 3223…
Showcase: Pig M/R
class JsonLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    public ResourceSchema getSchema(String location, Job job) … {
        ResourceSchema schema = new ResourceSchema();
        ResourceFieldSchema[] fieldSchemas = new ResourceFieldSchema[SCHEMA_FIELDS_COUNT];
        fieldSchemas[0] = new ResourceFieldSchema();
        fieldSchemas[0].setName(FIELD_NAME_TIMESTAMP);
        fieldSchemas[0].setType(DataType.LONG);
        …
        schema.setFields(fieldSchemas);
        return schema;
    }
}
Showcase: Pig M/R
class JsonLoader extends LoadFunc {
    …
    @Override
    public Tuple getNext() throws IOException {
        try {
            // null tells Pig there are no more records
            boolean notDone = in.nextKeyValue();
            if (!notDone) {
                return null;
            }
            Text jsonRecord = (Text) in.getCurrentValue();
            String[] values = MyJsonParser.parse(jsonRecord);
            Tuple tuple = tuppleFactory.newTuple(Arrays.asList(values));
            return tuple;
        } catch (Exception exc) {
            throw new IOException(exc);
        }
    }
}
Showcase: Pig M/R
class AverageWeight extends EvalFunc<String> {
    …
    @Override
    public String exec(Tuple input) … {
        String output = null;
        if (input != null && input.size() == 2) {
            Integer totalWeight = (Integer) input.get(0);
            Integer totalCount = (Integer) input.get(1);
            BigDecimal average = new BigDecimal(totalWeight)
                .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
            output = average.stripTrailingZeros().toPlainString();
        }
        return output;
    }
}
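The arithmetic inside `exec` can be checked on its own, without Pig. A small sketch: `AverageSketch` is an invented class and `SCALE = 2` is an assumption, the deck does not show the real value.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Stand-alone version of the averaging step used by the UDF above.
class AverageSketch {

    // Assumption: the original SCALE constant is not shown in the deck.
    static final int SCALE = 2;

    static String average(int totalWeight, int totalCount) {
        BigDecimal avg = new BigDecimal(totalWeight)
            .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
        // toPlainString avoids scientific notation after stripping zeros (10, not 1E+1)
        return avg.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(average(10, 4));   // prints 2.5
        System.out.println(average(7, 3));    // prints 2.33
        System.out.println(average(100, 10)); // prints 10
    }
}
```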
Showcase: Pig M/R
REGISTER codingserbia-udf.jar;
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();

interactionRecords = LOAD '/blog/user_interaction_big.json'
    USING com.codingserbia.udf.JsonLoader();

interactionData = FOREACH interactionRecords GENERATE
    sourceUser, targetUser, eventWeight;

groupInteraction = GROUP interactionData BY (sourceUser, targetUser);
…
Showcase: Pig M/R
…
summarizedInteraction = FOREACH groupInteraction GENERATE
    group.sourceUser AS sourceUser,
    group.targetUser AS targetUser,
    SUM(interactionData.eventWeight) AS eventWeight,
    COUNT(interactionData.eventWeight) AS eventCount,
    AVG_WEIGHT(
        SUM(interactionData.eventWeight),
        COUNT(interactionData.eventWeight)) AS averageWeight;

result = ORDER summarizedInteraction BY sourceUser, eventWeight DESC;

STORE result INTO '/results/pig_mr' USING PigStorage();
Conclusion