Data at Tumblr
Adam Laiacano
NYC Data Science Meetup
@adamlaiacano
adamlaiacano.tumblr.com
Monday, April 8, 13
What I Needed to Learn When I Started My Job
About Me
•Electrical Engineering background
•Worked at CBS to learn more about stats / data
•Joined Tumblr in August 2011: 40th employee, now over 160
About Tumblr
•Blogging platform / social network: 100,000,000 blogs!
•Unique signals: asynchronous following graph; reblogs, likes, replies
About You

Wide format:

Country  March  April  May
USA      10000  12000  14000
Canada   7000   6500   5000
France   1200   1400   2000

Long format:

Country  Month  Value
USA      March  10000
USA      April  12000
USA      May    14000
Canada   March  7000
Canada   April  6500
Canada   May    5000
France   March  1200
France   April  1400
France   May    2000

Pivot Table!
About You

In R (reshape package), the two shapes convert back and forth:

melted <- melt.data.frame(pivoted, id.vars='country')   # wide -> long
pivoted <- cast(melted, country ~ month)                # long -> wide
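The same round trip can be sketched in pandas (an illustrative equivalent of the R calls above; pandas was not part of the talk, and the data here is a small subset of the table):

```python
import pandas as pd

# Long format, as in the table above (USA and Canada rows only)
melted = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada'],
    'month':   ['March', 'April', 'May'] * 2,
    'value':   [10000, 12000, 14000, 7000, 6500, 5000],
})

# long -> wide: one column per month, one row per country
pivoted = melted.pivot(index='country', columns='month', values='value')

# wide -> long: back to (country, month, value) rows
unpivoted = pivoted.reset_index().melt(
    id_vars='country', var_name='month', value_name='value')
```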
Who Cares?
One more question:
Hadoop
What tools we use
What we do with those tools
Plumbing
John D. Cook, "The plumber programmer," November 2011. http://bit.ly/XfcXrt
Pipes

1. Record events / actions
2. Store / archive everything
3. Extract information
   a. Reports / BI
   b. Back to Tumblr application
Step 1: Log Events

GiantOctopus: in-house event logging system.

Built-in variables:
•timestamp
•referring page
•user identifier
•action identifier
•location (city)
•language setting

GiantOctopus::log('posts', array('send_to_fb' => 1, 'send_to_twitter' => 0));
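GiantOctopus itself is in-house PHP, so as a sketch of the idea only (the function name, fields, and sink below are illustrative, not the real API): an event logger stamps each record with the built-in variables and appends it, one line per event, to the category's log.

```python
import json
import time

def log_event(category, data, user_id=None, referrer=None, sink=None):
    """Append one event record, stamped with the built-in variables,
    to the category's log (here: any file-like sink)."""
    record = {
        'category': category,
        'timestamp': int(time.time()),
        'user_id': user_id,
        'referrer': referrer,
        'data': data,
    }
    if sink is not None:
        sink.write(json.dumps(record) + '\n')
    return record

# e.g. the 'posts' event from the slide
rec = log_event('posts', {'send_to_fb': 1, 'send_to_twitter': 0})
```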
Scribe

Web Servers → Scribe Servers: continuously writing
Scribe Servers → HDFS: daily cron
Step 2: Store in Hadoop

One huge computer:
•300TB hard drive
•7.8TB of RAM
•85 x 2 = 170 hex-core processors

One huge PITA:
•awful docs (search-hadoop.com helps)
•Java everywhere
•fragmented community
Hadoop
•Hive
•Pig
•map/reduce
Hive

"Basically SQL"; compiles to Java map/reduce.
About 100 Hive tables; each "table" is really a directory of flat files.

10 most liked posts:

SELECT root_post_id, COUNT(*) AS likes
FROM posts
WHERE action = 'like'
GROUP BY root_post_id
ORDER BY likes DESC
LIMIT 10;
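Since Hive compiles a query like this to map/reduce, here is a rough Python sketch of what that job does (illustrative only; the real job is generated Java running across the cluster):

```python
from collections import Counter

def mapper(rows):
    # WHERE action = 'like': emit (root_post_id, 1) for each like
    for root_post_id, action in rows:
        if action == 'like':
            yield root_post_id, 1

def reducer(pairs):
    # GROUP BY + COUNT(*), then ORDER BY ... DESC LIMIT 10
    counts = Counter()
    for root_post_id, n in pairs:
        counts[root_post_id] += n
    return counts.most_common(10)

# toy input: (root_post_id, action) rows
rows = [(1, 'like'), (2, 'like'), (1, 'like'), (3, 'reblog')]
top10 = reducer(mapper(rows))
```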
Hive Partitions

File location in HDFS       Hive partition value
/posts/2013/03/26/*.lzo     dt='2013-03-26'
/posts/2013/03/27/*.lzo     dt='2013-03-27'
/posts/2013/03/28/*.lzo     dt='2013-03-28'

Filter on the partition column (204 mappers):

SELECT action, COUNT(*) AS views
FROM pageviews
WHERE dt = '2012-03-05'
GROUP BY action;

Filter on a timestamp inside the rows (22,895 mappers):

SELECT action, COUNT(*) AS views
FROM pageviews
WHERE ts > 1330927200 AND ts < 1331013600
GROUP BY action;
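The mapper counts follow from partition pruning: a predicate on the partition column maps straight to directories, while a predicate on a column inside the rows forces a scan of every file. A tiny sketch of that logic (paths follow the layout above; file names are illustrative):

```python
# partition value -> files in that partition's directory
partitions = {
    '2013-03-26': ['/posts/2013/03/26/part-0.lzo'],
    '2013-03-27': ['/posts/2013/03/27/part-0.lzo'],
    '2013-03-28': ['/posts/2013/03/28/part-0.lzo'],
}

def files_to_scan(dt_filter=None):
    """A dt predicate selects whole directories (pruning);
    without one, every file must be read."""
    if dt_filter is not None:
        return partitions.get(dt_filter, [])
    return [f for files in partitions.values() for f in files]

pruned = files_to_scan(dt_filter='2013-03-27')  # one partition's files
full = files_to_scan()                          # a ts filter can't prune: all files
```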
Extending Hive: Streaming

add file helpers.py;

FROM users
SELECT TRANSFORM(id, email) USING 'helpers.py' AS (id_with_gmail)

•Add all .py files you'll need to the query
•Sends each record to the Python script via stdin
•Can be used as a subquery in a "normal" Hive query

#!/usr/bin/python
#
# helpers.py

import sys, re

gmail = re.compile(r'.+@gmail\.com$')
for row in sys.stdin:
    id, email = row.split('\t')
    if gmail.match(email):
        print id
Pig

posts = LOAD 'posts.tsv' AS (root_post_id:int, action:chararray);
likes = FILTER posts BY action == 'like';
grouped = GROUP likes BY root_post_id;
counted = FOREACH grouped GENERATE
    group AS root_post_id,
    COUNT(likes.root_post_id) AS likes;
sorted = ORDER counted BY likes DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10.csv';

"Basically SQL," if you had to explain it piece by piece.
"DataBag" == "DataFrame"
Extending Pig: Python UDFs

Extract word prefixes for type-ahead tag search:

@outputSchema("t:(prefix:chararray)")
def prefixes(input, max_len=3):
    nchar = min(len(input), max_len) + 1
    return [input[:i] for i in range(1, nchar)]

>>> prefixes('museum', max_len=6)
['m', 'mu', 'mus', 'muse', 'museu', 'museum']
Extending Pig: Java UDFs

package com.tumblr.swine;

import java.util.ArrayList;
import java.util.List;

public class Prefixes {

    private int maxTermLen;

    public Prefixes() {
        this.maxTermLen = Integer.MAX_VALUE;
    }

    public Prefixes(int maxTermLen) {
        this.maxTermLen = maxTermLen;
    }

    public List<String> get(String s) {
        int size = s.length() < maxTermLen ? s.length() : maxTermLen;
        ArrayList<String> results = new ArrayList<String>();
        for (int i = 1; i < size + 1; i++) {
            results.add(s.substring(0, i));
        }
        return results;
    }
}

...and the Pig wrapper:

package com.tumblr.swine.pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultBagFactory;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class Prefixes extends EvalFunc<DataBag> {

    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        try {
            DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
            String word = (String) input.get(0);
            int max = Integer.MAX_VALUE;
            if (input.size() == 2) {
                max = (Integer) input.get(1);
            }
            com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
            for (String prefix : prefixes.get(word)) {
                Tuple t = TupleFactory.getInstance().newTuple(1);
                t.set(0, prefix);
                output.add(t);
            }
            return output;
        } catch (Exception e) {
            System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
            return null;
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
        try {
            return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            return null;
        }
    }

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
        // Allow specifying optional max length of prefix
        s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        s.add(new Schema.FieldSchema(null, DataType.INTEGER));
        funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
        return funcSpecs;
    }
}
HUE

•Keeps query history
•Preview tables / results
•Save queries & templates
What tools we use
What we do with those tools
Spam
Classic example of supervised learning
Don't get too clever
Build good tooling!
Spam: Vowpal Wabbit

•Online (continuously learning) system
•Updates parameters with every new piece of information
•Parallelizable, can run as a service, very fast

Loss functions:
•squared
•logistic
•hinge
•quantile
Spam: Vowpal Wabbit

Post:
blog: 'adamlaiacano',
tags: ['free ipad', 'warez'],
location: 'US~NY-New York',
is_suspended: 0 or 1

Model: is_suspended ~ free_ipad + warez + US~NY-New_York + .....

Squared loss function
Very high dimension: L1 regularization to avoid overfitting
Great precision, decent recall
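Vowpal Wabbit's core trick — hash every feature into a fixed-width weight vector and update on each example as it arrives — can be sketched in a few lines of Python. This is an illustration of the technique, not VW itself: VW's model above uses squared loss, while this sketch uses logistic loss for a 0/1 label, and the feature strings are made up from the slide.

```python
import math

D = 2 ** 20            # fixed weight-vector size; features hash into it
weights = [0.0] * D
rate = 0.1             # learning rate

def predict(features):
    # dot product of hashed feature weights, squashed to a probability
    z = sum(weights[hash(f) % D] for f in features)
    return 1.0 / (1.0 + math.exp(-z))

def update(features, label):
    """One online step of logistic loss: nudge each feature's weight."""
    err = predict(features) - label
    for f in features:
        weights[hash(f) % D] -= rate * err

# a spammy post: tags plus location become features
post = ['tag=free_ipad', 'tag=warez', 'loc=US~NY-New_York']
for _ in range(50):
    update(post, 1)    # labeled suspended
```

Because features are hashed rather than stored in a dictionary, memory stays fixed no matter how many distinct tags appear, which is what makes the very high dimensionality workable.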
Type-Ahead Search

Most popular tags for any letter combination
Store daily results in a distributed Redis cluster

m: [me, model, mine]
mu: [muscle, muscles, music video]
mus: [muscle, muscles, music video]
muse: [muse, museum, nine muses]
museu: [museum, metropolitan museum of art, natural history museum]
Type-Ahead Search

Only keep popular prefixes: a tag must occur at least 10 times
Only update keys that have changed:

- muse: [muse, museum, nine muses]
+ muse: [muse, museum, arizona muse]
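The daily update can be sketched in plain Python (dicts stand in for the Redis cluster; the 10-occurrence threshold is from the slide, while the tag counts, prefix length cap, and top-3 cutoff are illustrative):

```python
from collections import Counter

def prefixes(tag, max_len=5):
    # same idea as the Pig UDF above
    return [tag[:i] for i in range(1, min(len(tag), max_len) + 1)]

def build_index(tag_counts, min_count=10, top_n=3):
    """Most popular tags per prefix, keeping only tags seen >= min_count times."""
    by_prefix = {}
    for tag, n in tag_counts.items():
        if n < min_count:
            continue
        for p in prefixes(tag):
            by_prefix.setdefault(p, Counter())[tag] = n
    return {p: [t for t, _ in c.most_common(top_n)] for p, c in by_prefix.items()}

def changed_keys(old, new):
    """Only write keys whose top-tag list actually changed."""
    return {p: tags for p, tags in new.items() if old.get(p) != tags}

yesterday = build_index({'museum': 50, 'muse': 40, 'music video': 30})
today = build_index({'museum': 55, 'muse': 41, 'music video': 30, 'muscle': 45})
delta = changed_keys(yesterday, today)   # only these keys get written to Redis
```

Diffing against the previous day keeps writes to the Redis cluster proportional to what actually changed, not to the size of the index.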
Questions?
@adamlaiacano
http://adamlaiacano.tumblr.com