Hadoop @ eBuddy
eBuddy
Web based chat (started in 2003)
● Initially no statistics, MSN only
● Started basic logging in 2004
● Today
  ○ 34.467.010.693 login records (34 x 10^9)
  ○ It takes about 40 min to select them all
XMS (launched May 23, 2011)
● Today
  ○ 1.334.794.121 records (1,3 x 10^9)
Website (Google Analytics)
Banners (OpenX)
Warehousing needs
● Product owners
  ○ Comparing product versions
    ■ avg duration
    ■ msg sent/received
  ○ Churn analysis
  ○ Feature analysis
● Marketing
  ○ What countries should we focus on?
  ○ What people should we target?
● Sales
  ○ Sell banners in countries/products
● Operations/Dev
  ○ Help solve bugs
  ○ Blocked in countries/providers
Interesting to know
● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 Hadoop nodes, each having
  ○ 16 cores
  ○ 24G RAM
  ○ 4x400G HDs
● We make money with banners
  ○ So don't expect deep pockets
Warehouse timeline
● Traditional RDBMS (2004)
● Custom MapReduce code (2008)
  ○ Joining two files (merge join/map join?)
  ○ Repeating code
  ○ Consider abstraction
  ○ Changing data changing code?
● Pig scripts (2008/2009)
  ○ Much simpler to read but domain specific
● Hive (2009)
  ○ Generic SQL but with some limitations
  ○ Existing tools can be used
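The "merge join/map join?" question from the custom MapReduce period can be illustrated with a minimal sketch in plain Python. This is not Hadoop code and not the code eBuddy ran; the row layout (key, value) and the function names map_join and merge_join are assumptions made for illustration. A map join holds the small input in memory and streams the large one, while a merge join walks two inputs that are already sorted by key.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]                      # hypothetical layout: (join key, value)
Joined = Tuple[int, str, Optional[str]]    # right value is None when unmatched


def map_join(big: Iterable[Row], small: Iterable[Row]) -> Iterator[Joined]:
    """Left outer join: keep the small side in memory, stream the big side."""
    lookup: Dict[int, List[str]] = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    for key, value in big:
        matches = lookup.get(key)
        if matches is None:
            yield (key, value, None)
        else:
            for match in matches:
                yield (key, value, match)


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Joined]:
    """Left outer join of two lists that are both sorted by key."""
    j = 0
    for key, value in left:
        # Skip right-hand rows whose key is smaller than the current left key.
        while j < len(right) and right[j][0] < key:
            j += 1
        # Emit every right-hand row sharing the key; j stays at the start of
        # the group so duplicate left keys see the same matches.
        k = j
        matched = False
        while k < len(right) and right[k][0] == key:
            yield (key, value, right[k][1])
            matched = True
            k += 1
        if not matched:
            yield (key, value, None)
```

The map join only works when one side fits in memory; the merge join needs no lookup table but requires both inputs to be sorted on the join key first.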
Hive
● Hey I already know this:
    select *
    from table1 t1 left outer join table2 t2 on (t1.id = t2.id)
    where t2.id is null;
● Java programmers will like this:
  ○ Spring JdbcTemplates
  ○ Existing JDBC tools (SQuirreL)
  ○ Syntax highlighting
  ○ Code completion
Present
● App servers log to MySQL
  ○ Brittle but it works
● Hive
  ○ SQL (most developers know this)
  ○ Partition pruning issues
  ○ No rollup queries
● ETL
  ○ Star schema
  ○ Fair scheduling (ETL vs BI)
    ■ reserved for ETL pool
    ■ don't start reducers until 90% mappers done
  ○ LZO on all jobs
● MicroStrategy (ODBC)
● SQuirreL (JDBC)
Future
● Look at users from A to Z
  ○ website logs
  ○ banners
● Cassandra handler for Hive
  ○ Looking at contact lists (not just size)
● Streaming ETL
  ○ Flume
    ■ No more MySQL & scripts
    ■ Directly write into the correct partition
  ○ Avro
    ■ Less schema related problems
  ○ Snappy
    ■ Lightweight compression
Questions?
Hive partition pruning
● Won't work
    select count(*)
    from chatsessions cs inner join calendar c on (c.cldr_id = cs.login_cldr_id)
    where c.iso_date = '2012-06-14';
● Will work
    select cldr_id from calendar where iso_date = '2012-06-14';
    select count(*) from chatsessions where login_cldr_id in (1234);
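The two-step workaround can be scripted so the calendar ids never have to be pasted by hand. The following is a minimal sketch, assuming chatsessions is partitioned on login_cldr_id (as the slide implies). The run_query callable is hypothetical: it stands for whatever client sends SQL to Hive and returns rows as tuples; it is not part of any Hive API.

```python
import datetime
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: takes a SQL string, returns the result rows.
Runner = Callable[[str], List[Tuple]]


def calendar_id_query(iso_date: str) -> str:
    """Step 1: look up the calendar ids for one date."""
    datetime.date.fromisoformat(iso_date)  # raises ValueError on malformed input
    return f"select cldr_id from calendar where iso_date = '{iso_date}'"


def pruned_count_query(cldr_ids: Sequence[int]) -> str:
    """Step 2: filter on literal partition values instead of joining."""
    if not cldr_ids:
        raise ValueError("no calendar ids to filter on")
    id_list = ", ".join(str(int(i)) for i in cldr_ids)
    return f"select count(*) from chatsessions where login_cldr_id in ({id_list})"


def count_sessions(run_query: Runner, iso_date: str) -> int:
    """Run both steps and return the session count for the date."""
    ids = [row[0] for row in run_query(calendar_id_query(iso_date))]
    return run_query(pruned_count_query(ids))[0][0]
```

Because the second query contains only literal values in its where clause, the filter on login_cldr_id is visible at planning time, which is what the "will work" variant on the slide relies on.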
Left outer join in Pig
    A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
    B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
    C = COGROUP A BY a1, B BY b1 OUTER;
    X = FILTER C BY IsEmpty(B);
    Z = FOREACH X GENERATE flatten(A.a2);
    DUMP Z;
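To make the COGROUP step easier to follow, here is a rough plain-Python imitation of what the script computes. It is only a sketch of the semantics, not how Pig executes the job; the function names cogroup and unmatched_a2 are invented for this example. COGROUP collects the rows of A and B into two bags per key, the filter keeps the groups whose B bag is empty, and the final step emits a2 from the surviving A rows. The result matches the Hive query with "where t2.id is null".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]


def cogroup(a_rows: List[Row], b_rows: List[Row]) -> Dict[int, Tuple[List[Row], List[Row]]]:
    """Group both inputs by their first field: key -> (bag of A rows, bag of B rows)."""
    groups: Dict[int, Tuple[List[Row], List[Row]]] = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[0]][0].append(row)
    for row in b_rows:
        groups[row[0]][1].append(row)
    return dict(groups)


def unmatched_a2(a_rows: List[Row], b_rows: List[Row]) -> List[str]:
    """Return a2 for every A row whose key has no row in B."""
    result: List[str] = []
    for a_bag, b_bag in cogroup(a_rows, b_rows).values():
        if not b_bag:                                  # FILTER C BY IsEmpty(B)
            result.extend(a2 for _, a2 in a_bag)       # flatten(A.a2)
    return result
```

Keys that appear only in B produce a group with an empty A bag and a non-empty B bag, so they are dropped by the filter. Pig does not guarantee output order; the list order here is just dictionary insertion order.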
● Avro & Hive: https://issues.apache.org/jira/browse/HIVE-895
● Flume: https://cwiki.apache.org/FLUME/