Coding Serbia

“Best friend” Big Data & Hadoop showcase

Dušan Zamurović

@codecentricRS

Name: Dušan Zamurović
Where I come from?
◦ codecentric Novi Sad
What I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
◦ Ajax with DWR lib
◦ Android
◦ currently Big Data (reporting QA)

Who am I?

me
Big Data
Map/Reduce algorithm
Hadoop platform
Pig language
Showcase
◦ Java Map/Reduce implementation
◦ Pig implementation
Conclusion

What I will talk about?

A revolution that will transform how we live, work, and think.
3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity
Everyday use cases
◦ Beautiful
◦ Useful
◦ Funny

Big Data

The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data were created worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow by a factor of 300
◦ By 2020 the digital universe will amount to 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020)

Big Data - Volume

The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media
>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020

Big Data - Variety

Generated at high speed
Needs real-time processing
Example I
◦ Financial world
◦ Thousands or millions of transactions
Example II
◦ Retail
◦ Analyze click streams to offer recommendations

Big Data – Velocity

Value of Big Data is potentially great but can be released only with the right combination of people, processes and technologies.

…unlock significant value by making information transparent and usable at much higher frequency

Big Data - Value

Measuring the heartbeat of a city - Rio de Janeiro
More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation
Beer and diapers
Possible issues?
◦ Privacy, security, intellectual property, liability…

Big Data - Value

"Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key."
- research publication, http://research.google.com/archive/mapreduce.html

Map/Reduce
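The model in the quote can be tried out without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: the class name `MapReduceSketch`, the word-count task and the in-memory "shuffle" are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the map/shuffle/reduce flow.
public class MapReduceSketch {

    // map: one input pair (offset, line) -> intermediate (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: merge all intermediate values that share the same key
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // driver: run map on every line, group values by key (the "shuffle"), then reduce
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, hadoop=1}
        System.out.println(run(Arrays.asList("big data", "big hadoop")));
    }
}
```

On a real cluster the grouping step is done by the framework across machines; here it is just a sorted map in memory.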


In the beginning, there was Nutch
Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally intensive
Hadoop && RDBMS
◦ “Get data to process” or “send code where data is”
◦ Designed to run on large number of machines
◦ Separate storage

Hadoop

Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX, to enable streaming access to file system data
Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data

HDFS

NameNode
◦ Master server, central component
◦ HDFS cluster has a single NameNode
◦ Manages clients’ access
◦ Keeps track of where data is kept
◦ Single point of failure
Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
◦ Does not provide any real redundancy

HDFS Architecture

DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes (data replication)
TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
◦ Heartbeats to JobTracker: ♥__ ♥__ ♥__ ________ ♥_ ♥ ♥ ♥__________________

HDFS Architecture

JobTracker
◦ Farms tasks out to specific nodes in the cluster
◦ Point of failure for MapReduce
How it goes?
1. Client submits jobs to JobTracker
2. JobTracker asks NameNode where the data is
3. JobTracker locates TaskTracker
4. JobTracker submits tasks to TaskTracker
5. TaskTracker sends heartbeats ♥__ ♥__ ♥__
   1. Job failed: TaskTracker informs, JobTracker decides
   2. Job done: JobTracker updates status
6. Client can poll JobTracker for information

HDFS Architecture
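The "move computation, not data" idea behind steps 2 to 4 can be sketched as a tiny placement rule. This is a toy under stated assumptions, not Hadoop's actual scheduler: the class `LocalitySketch`, the method `pickTracker` and the host names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: prefer a TaskTracker that sits on a node already holding the data.
public class LocalitySketch {

    // blockHosts: hosts that store the block (what the NameNode would report)
    // freeSlots:  free task slots per TaskTracker host
    // returns the chosen host, or null when no slot is free anywhere
    static String pickTracker(List<String> blockHosts, Map<String, Integer> freeSlots) {
        // first choice: a host that holds the data and has a free slot
        for (String host : blockHosts) {
            if (freeSlots.getOrDefault(host, 0) > 0) {
                return host;
            }
        }
        // fallback: any host with a free slot (the data then travels over the network)
        for (Map.Entry<String, Integer> e : freeSlots.entrySet()) {
            if (e.getValue() > 0) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Integer> freeSlots = new LinkedHashMap<>();
        freeSlots.put("node1", 0);
        freeSlots.put("node2", 2);
        freeSlots.put("node3", 1);

        // block stored on node1 and node3; node1 is full, so node3 is chosen
        System.out.println(pickTracker(Arrays.asList("node1", "node3"), freeSlots)); // node3
        // block stored only on node1; fall back to the first host with a free slot
        System.out.println(pickTracker(Arrays.asList("node1"), freeSlots)); // node2
    }
}
```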

Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell
Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan

Apache Pig

Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias

Apache Pig – Pig Latin

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
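For readers coming from Java, the relation/bag/tuple/field vocabulary maps roughly onto nested collections. The sketch below models relation `A` from the example as a list of tuples and imitates a FILTER followed by a FOREACH … GENERATE; the class `PigModelSketch` and its methods are assumptions for illustration and do not use the Pig API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: a relation as a bag (list) of tuples (ordered lists of fields).
public class PigModelSketch {

    // relation A with fields (name, age, gpa), same rows as the DUMP output
    static List<List<Object>> students() {
        return Arrays.asList(
                Arrays.<Object>asList("John", 18, 4.0f),
                Arrays.<Object>asList("Mary", 19, 3.8f),
                Arrays.<Object>asList("Bill", 20, 3.9f),
                Arrays.<Object>asList("Joe", 18, 3.8f));
    }

    // roughly: B = FILTER A BY age > minAge; C = FOREACH B GENERATE name;
    static List<String> namesOlderThan(List<List<Object>> relation, int minAge) {
        return relation.stream()
                .filter(tuple -> (Integer) tuple.get(1) > minAge) // field 1 is age
                .map(tuple -> (String) tuple.get(0))              // field 0 is name
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesOlderThan(students(), 18)); // [Mary, Bill]
    }
}
```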

Data types
◦ Simple
  int – signed 32-bit integer
  long – signed 64-bit integer
  float – 32-bit floating point
  double – 64-bit floating point
  chararray – UTF-8 string
  bytearray – blob
  boolean – since Pig 0.10
  datetime
◦ Complex
  tuple – an ordered set of fields, e.g. (21,32)
  bag – a collection of tuples, e.g. {(21,32),(32,43)}
  map – a set of key/value pairs, e.g. [pig#latin]

Apache Pig – Pig Latin data types

Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?

Apache Pig – Pig Latin schemas

/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}

/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown

Arithmetic: +, -, *, /, %, ? :
Boolean: AND, OR, NOT
Cast
Comparison: ==, !=, <, >, <=, >=, matches
Type construction: (), {}, [] incl. eq. functions
Relational
◦ GROUP
◦ DEFINE
◦ FILTER
◦ FOREACH
◦ JOIN
◦ UNION
◦ STORE
◦ LOAD
◦ SPLIT

Apache Pig – Pig Latin operators

Eval functions
◦ AVG, MAX, MIN, COUNT, SUM, …
Load/Store functions
◦ BinStorage
◦ JsonLoader, JsonStorage
◦ PigStorage
Math functions
◦ ABS, COS, …, EXP, RANDOM, ROUND, …
String functions
◦ TRIM, LOWER, SUBSTRING, REPLACE, …
Datetime functions
◦ *Between, Get*, …
Tuple, Bag, Map functions
◦ TOTUPLE, TOBAG, TOMAP

Apache Pig – Pig Latin built in

User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy
How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc
How to use a UDF?
◦ Register
◦ Define the name of the UDF if you like
◦ Call it

Apache Pig – extend Pig Latin

“Best friend” Hadoop showcase

Imaginary social network
Lots of users…
… with their friends, girlfriends, boyfriends, wives, husbands, mistresses, etc…
New relationship arises…
◦ … but new friend is not shown in news feed
Where are his/her activities?
◦ Hidden, marked as not important

Showcase: a problem

Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Records user’s activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of “important” friends

Showcase: a solution
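Before looking at the distributed versions, the "calculate needed values" step can be sketched in plain Java: sum the event weights and count the events per (source user, target user) pair, then order by source user and descending weight. The class `BestFriendSketch`, the nested `Event` type and the sample events are assumptions for illustration, not the showcase code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: in-memory version of the per-relationship aggregation.
public class BestFriendSketch {

    static class Event {
        final String source;
        final String target;
        final int weight;

        Event(String source, String target, int weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    // returns lines "source target totalWeight count",
    // ordered by source ascending, then total weight descending
    static List<String> summarize(List<Event> events) {
        // key "source<TAB>target" -> {totalWeight, count}
        Map<String, int[]> totals = new HashMap<>();
        for (Event e : events) {
            int[] t = totals.computeIfAbsent(e.source + "\t" + e.target, k -> new int[2]);
            t[0] += e.weight;
            t[1]++;
        }
        List<Map.Entry<String, int[]>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int bySource = a.getKey().split("\t")[0].compareTo(b.getKey().split("\t")[0]);
            return bySource != 0 ? bySource : Integer.compare(b.getValue()[0], a.getValue()[0]);
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, int[]> e : entries) {
            lines.add(e.getKey().replace('\t', ' ') + " " + e.getValue()[0] + " " + e.getValue()[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("marry.lee", "ruby.blue", 1),   // e.g. VIEW_PHOTO
                new Event("marry.lee", "ruby.blue", 10),  // e.g. VIEW_PROFILE
                new Event("marry.lee", "john.doe", 2),    // e.g. COMMENT
                new Event("casie.keller", "jane.doe", 1)); // e.g. MESSAGE
        for (String line : summarize(events)) {
            System.out.println(line);
        }
    }
}
```

With these four sample events the friend with the highest total weight for `marry.lee` is `ruby.blue` (weight 11 from 2 events), which is the kind of ranking the news feed filter needs.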

Events recorded in JSON format

{
  "timestamp": 1341161607860,
  "sourceUser": "marry.lee",
  "targetUser": "ruby.blue",
  "eventName": "VIEW_PHOTO",
  "eventWeight": 1
}

Showcase: input data
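The slides use a helper called `MyJsonParser` whose source is not shown. As a hedged stand-in, the sketch below pulls the five fields out of one flat event record with regular expressions and returns them in the order timestamp, sourceUser, targetUser, eventName, eventWeight, which matches how the mapper later reads `tokens[1]`, `tokens[2]` and `tokens[4]`. The class name `FlatJsonSketch` and the regex approach are assumptions; production code would more likely use a proper JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: extracts known fields from a flat JSON object without nesting.
public class FlatJsonSketch {

    private static final String[] FIELDS = {
        "timestamp", "sourceUser", "targetUser", "eventName", "eventWeight"
    };

    // returns the field values as strings in FIELDS order; null where a field is missing
    static String[] parse(String json) {
        String[] values = new String[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            // matches "name": "text"  or  "name": 123
            Pattern p = Pattern.compile("\"" + FIELDS[i] + "\"\\s*:\\s*(\"([^\"]*)\"|-?\\d+)");
            Matcher m = p.matcher(json);
            if (m.find()) {
                // group 2 is the unquoted text of a string value; group 1 is a bare number
                values[i] = m.group(2) != null ? m.group(2) : m.group(1);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String record = "{ \"timestamp\": 1341161607860, \"sourceUser\": \"marry.lee\", "
                + "\"targetUser\": \"ruby.blue\", \"eventName\": \"VIEW_PHOTO\", \"eventWeight\": 1}";
        String[] tokens = parse(record);
        System.out.println(tokens[1] + " -> " + tokens[2] + " weight " + tokens[4]);
    }
}
```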

Showcase: input data

public enum EventType {
    VIEW_DETAILS(3),
    VIEW_PROFILE(10),
    VIEW_PHOTO(1),
    COMMENT(2),
    COMMENT_LIKE(1),
    WALL_POST(3),
    MESSAGE(1);
    …
}

Showcase: Java M/R

static public class InteractionMap extends Mapper<LongWritable, Text, Text, InteractionWritable> {
    @Override
    protected void map(LongWritable offset, Text text, Context context) … {
        …
    }
}

// reduce() belongs to a separate Reducer subclass; its class declaration is not shown on the slide
@Override
protected void reduce(Text token, Iterable<InteractionWritable> interactions, Context context) … {
    …
}

Showcase: Java M/R

void map(LongWritable offset, Text text, Context context) {
    // tokens: timestamp, sourceUser, targetUser, eventName, eventWeight
    String[] tokens = MyJsonParser.parse(text);
    String sourceUser = tokens[1];
    String targetUser = tokens[2];
    int eventWeight = Integer.parseInt(tokens[4]);
    // emit (sourceUser, (targetUser, eventWeight))
    context.write(new Text(sourceUser), new InteractionWritable(targetUser, eventWeight));
}

Showcase: Java M/R

void reduce(Text token, Iterable<InteractionWritable> iActions, Context context) … {
    // target user -> accumulated weight and event count
    Map<Text, InteractionValuesWritable> iActionsGroup = new HashMap<Text, InteractionValuesWritable>();
    Iterator<InteractionWritable> iActionsIterator = iActions.iterator();
    while (iActionsIterator.hasNext()) {
        InteractionWritable iAction = iActionsIterator.next();
        Text targetUser = new Text(iAction.getTargetUser().toString());
        int weight = iAction.getEventWeight().get();
        int count = 1;

…

Showcase: Java M/R

        …
        // add this event to whatever was already accumulated for the target user
        InteractionValuesWritable iActionValues = iActionsGroup.get(targetUser);
        if (iActionValues != null) {
            weight += iActionValues.getWeight().get();
            count = iActionValues.getCount().get() + 1;
        }
        iActionsGroup.put(targetUser, new InteractionValuesWritable(weight, count));
    } // end of while loop

    List<Entry<Text, InteractionValuesWritable>> orderedInteractions = sortInteractionsByWeight(iActionsGroup);
    for (Entry<Text, InteractionValuesWritable> entry : orderedInteractions) {
        InteractionValuesWritable value = entry.getValue();
        String resLine = … // entry.key + value.weight + value.count
        context.write(token, new Text(resLine));
    }
}

Showcase: M/R result

casie.keller petar.petrovic 97579 32554
casie.keller marry.lee 97284 32094
casie.keller jane.doe 97247 32400
casie.keller domenico.quatro-formaggi 96712 32106
casie.keller esmeralda.aguero 96665 32251
casie.keller jason.bourne 96499 32043
casie.keller jose.miguel 96304 31927
casie.keller steve.smith 95929 32267
casie.keller john.doe 95664 31996
casie.keller swatka.mawa 95421 31785
casie.keller lee.young 95400 31758
casie.keller ruby.blue 95132 32181
domenico.quatro-formaggi jane.doe 97442 32492
domenico.quatro-formaggi ruby.blue 97072 31916
domenico.quatro-formaggi jason.bourne 96967 3223…

Showcase: Pig M/R

class JsonLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    public ResourceSchema getSchema(String location, Job job) … {
        ResourceSchema schema = new ResourceSchema();
        ResourceFieldSchema[] fieldSchemas = new ResourceFieldSchema[SCHEMA_FIELDS_COUNT];
        fieldSchemas[0] = new ResourceFieldSchema();
        fieldSchemas[0].setName(FIELD_NAME_TIMESTAMP);
        fieldSchemas[0].setType(DataType.LONG);
        …
        schema.setFields(fieldSchemas);
        return schema;
    }
}

Showcase: Pig M/R

class JsonLoader extends LoadFunc {
    …
    @Override
    public Tuple getNext() throws IOException {
        try {
            boolean notDone = in.nextKeyValue();
            if (!notDone) {
                return null; // no more records
            }
            Text jsonRecord = (Text) in.getCurrentValue();
            String[] values = MyJsonParser.parse(jsonRecord);
            Tuple tuple = tupleFactory.newTuple(Arrays.asList(values));
            return tuple;
        } catch (Exception exc) {
            throw new IOException(exc);
        }
    }
}

Showcase: Pig M/R

class AverageWeight extends EvalFunc<String> {
    …
    @Override
    public String exec(Tuple input) … {
        String output = null;
        if (input != null && input.size() == 2) {
            Integer totalWeight = (Integer) input.get(0);
            Integer totalCount = (Integer) input.get(1);
            BigDecimal average = new BigDecimal(totalWeight)
                .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
            output = average.stripTrailingZeros().toPlainString();
        }
        return output;
    }
}
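The arithmetic inside `exec` can be tried on its own, without Pig. In this sketch the value `SCALE = 2` is an assumption (the slide does not show the constant), and `AverageSketch` is a hypothetical class name.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative only: the BigDecimal division used by the UDF, isolated from Pig.
public class AverageSketch {

    static final int SCALE = 2; // assumed; the real value is not shown on the slides

    // divides with a fixed scale and HALF_UP rounding, then drops trailing zeros
    static String average(int totalWeight, int totalCount) {
        BigDecimal average = new BigDecimal(totalWeight)
                .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
        return average.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(average(7, 2));    // 3.5
        System.out.println(average(1, 3));    // 0.33
        System.out.println(average(100, 10)); // 10
    }
}
```

The last call shows why `toPlainString()` is used: after `stripTrailingZeros()` the value 10.00 becomes 1E+1, and `toString()` would print it in that scientific form. A count of zero would throw an `ArithmeticException`, so callers need to guard against empty groups.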

Showcase: Pig M/R

REGISTER codingserbia-udf.jar;
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();

interactionRecords = LOAD '/blog/user_interaction_big.json'
    USING com.codingserbia.udf.JsonLoader();

interactionData = FOREACH interactionRecords GENERATE
    sourceUser, targetUser, eventWeight;

groupInteraction = GROUP interactionData BY (sourceUser, targetUser);
…

Showcase: Pig M/R

…
summarizedInteraction = FOREACH groupInteraction GENERATE
    group.sourceUser AS sourceUser,
    group.targetUser AS targetUser,
    SUM(interactionData.eventWeight) AS eventWeight,
    COUNT(interactionData.eventWeight) AS eventCount,
    AVG_WEIGHT(
        SUM(interactionData.eventWeight),
        COUNT(interactionData.eventWeight)) AS averageWeight;

result = ORDER summarizedInteraction BY sourceUser, eventWeight DESC;

STORE result INTO '/results/pig_mr' USING PigStorage();

Conclusion
