Data Platform and Services
Vipul Sharma and Eyal Reuveni
Transcript
Page 1: Ashu Desc

Data Platform and Services

Vipul Sharma and Eyal Reuveni

Page 2: Ashu Desc

Agenda

Eventbrite

Data Products

Data Platform

Recommendations

Questions

Page 3: Ashu Desc

• A social event ticketing and discovery platform

• 50 millionth ticket sold

• Revenue doubled YOY

• 180 Employees in SOMA SF

• Solving significant engineering problems

• Data

• Data, Infrastructure, Mobile, Web, Scale, Ops, QA

• Firing on all cylinders and hiring blazing fast

www.eventbrite.com/jobs

Page 4: Ashu Desc

Data Products

Page 5: Ashu Desc
Page 6: Ashu Desc
Page 7: Ashu Desc

Analytics

• Ad hoc queries by analysts

Page 8: Ashu Desc

Fraud and Spam

Page 9: Ashu Desc

Data Platform

Page 10: Ashu Desc
Page 11: Ashu Desc

Hadoop Cluster

• 30 persistent EC2 High-Memory Instances

• 30TB disk with replication factor of 2, ext3 formatted

• CDH3

• Fair Scheduler

• HBase

Page 12: Ashu Desc

Infrastructure

• Search

• Solr

• Incremental updates towards event driven

• Recommendation/Graph

• Hadoop

• Native Java MapReduce

• Bash for workflow

• Persistence

• MySQL

• HDFS

• HBase

• MongoDB (Investigating Cassandra and Riak)

Page 13: Ashu Desc

Infrastructure

• Stream

• RabbitMQ

• Internal firehose (investigating Kafka)

• Offline

• MapReduce

• Streaming

• Hive

• Hue

Page 14: Ashu Desc

Infrastructure - Sqoozie

• Workflow for MySQL imports to HDFS

• Generate Sqoop commands

• Run these imports in parallel

• Transparent to schema changes

• Include or exclude on column, data types, table level

• Data type casting: tinyint(1) to Integer

• Distributed Table Imports
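The Sqoozie bullets above can be sketched roughly as follows. This is a minimal illustration, not Eventbrite's code: the table descriptions, the exclusion sets, the JDBC URL, and the target directory are all hypothetical. The only Sqoop options used are `--connect`, `--table`, `--columns`, `--target-dir`, and `--map-column-java`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table metadata; a real workflow would read this from MySQL
# on every run, which is what keeps it transparent to schema changes.
TABLES = {
    "orders": {"id": "int(11)", "event_id": "int(11)",
               "is_refunded": "tinyint(1)", "notes": "text"},
    "events": {"id": "int(11)", "name": "varchar(255)",
               "is_public": "tinyint(1)"},
}

# Hypothetical include/exclude rules at table, data type, and column level.
EXCLUDED_TABLES = set()
EXCLUDED_TYPES = {"blob"}
EXCLUDED_COLUMNS = {("orders", "notes")}


def build_sqoop_command(table, columns,
                        jdbc_url="jdbc:mysql://db.example.com/app",
                        target_root="/warehouse/mysql"):
    """Return one `sqoop import` command (as an argument list) for a table."""
    kept = [name for name, sql_type in columns.items()
            if (table, name) not in EXCLUDED_COLUMNS
            and sql_type not in EXCLUDED_TYPES]
    # tinyint(1) columns are cast to Integer instead of the default mapping.
    casts = [f"{name}=Integer" for name in kept
             if columns[name] == "tinyint(1)"]
    cmd = ["sqoop", "import",
           "--connect", jdbc_url,
           "--table", table,
           "--columns", ",".join(kept),
           "--target-dir", f"{target_root}/{table}"]
    if casts:
        cmd += ["--map-column-java", ",".join(casts)]
    return cmd


def build_all(tables):
    """Generate commands for every table that is not excluded."""
    return {name: build_sqoop_command(name, cols)
            for name, cols in sorted(tables.items())
            if name not in EXCLUDED_TABLES}


def run_in_parallel(commands, runner, max_workers=4):
    """Run every command through `runner` concurrently; return results by table."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(commands, pool.map(runner, commands.values())))
```

Passing `subprocess.run` as `runner` would execute the imports; passing a function that only logs the command gives a dry run.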

Page 15: Ashu Desc

Infrastructure - Blammo

• Raw logs are imported to HDFS via Flume

• Almost real-time – 5 min latency

• Logs are key-value pairs in JSON

• Each log producer publishes its schema in YAML

• Hive schema and schema YAML are kept in sync using Thrift

• Control exclusion and inclusion
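One way to picture the schema sync described above is the sketch below. It is an assumption-heavy illustration: the schema layout (`table`, `fields`, `exclude`) is invented, the schema is shown as an already-parsed dictionary rather than a YAML file, and the sync step only builds a Hive `ALTER TABLE ... ADD COLUMNS` statement instead of talking to Hive over Thrift as the slides describe.

```python
import json

# Hypothetical producer schema, shown as it might look after YAML parsing.
PRODUCER_SCHEMA = {
    "table": "page_views",
    "fields": {"user_id": "bigint", "event_id": "bigint",
               "url": "string", "debug_blob": "string"},
    "exclude": ["debug_blob"],  # fields the producer logs but Hive should skip
}


def hive_columns(schema):
    """Fields that should exist in Hive after applying exclusions."""
    excluded = set(schema.get("exclude", []))
    return {name: hive_type for name, hive_type in schema["fields"].items()
            if name not in excluded}


def alter_statement(schema, existing_columns):
    """Build the DDL that adds columns missing from Hive, or None if in sync."""
    missing = [(name, hive_type)
               for name, hive_type in hive_columns(schema).items()
               if name not in existing_columns]
    if not missing:
        return None
    cols = ", ".join(f"`{name}` {hive_type.upper()}" for name, hive_type in missing)
    return f"ALTER TABLE {schema['table']} ADD COLUMNS ({cols})"


def project_log_line(line, schema):
    """Keep only the schema's included keys from one JSON key-value log line."""
    record = json.loads(line)
    return {name: record.get(name) for name in hive_columns(schema)}
```

Keys that a producer has not logged yet come back as `None`, so a newly published field does not break older log lines.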

Page 16: Ashu Desc

Recommendations

Page 17: Ashu Desc

You will like to attend this event

Page 18: Ashu Desc

Item Hierarchy

(You bought a camera so you need batteries - Amazon)

Collaborative Filtering – User-User Similarity

(People who bought a camera also bought batteries - Amazon)

Collaborative Filtering – Item-Item Similarity

(You like Godfather so you will like Scarface - Netflix)

Social Graph Based

(Your friends like Lady Gaga so you will like Lady Gaga, PYMK - Facebook, LinkedIn)

Interest Graph Based

(Your friends who like rock music like you are attending an Eric Clapton event - Eventbrite)

Recommendation Engines

Page 19: Ashu Desc

Why Interest?

Events are Social

Events are Interest

Dense Graph is Irrelevant

Interests are Changing

Page 20: Ashu Desc

How do we know your Interest?

• We ask you

• Based on your activity

• Events Attended

• Events Browsed

• Facebook Interests

• User Interest has to match Event category

• Static

• Machine Learning

• Logistic Regression using MLE

• Sparse Matrix is generated using MapReduce

• A model for each interest
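The "logistic regression using MLE, one model per interest" idea above can be sketched in a few lines. This is a toy illustration under stated assumptions: the feature names, learning rate, and epoch count are made up, the sparse matrix is represented as one dictionary per user, and plain batch gradient ascent on the log-likelihood stands in for whatever optimizer was actually used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """Probability that a user with these sparse features has the interest."""
    z = weights.get("__bias__", 0.0)
    z += sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def fit(rows, labels, epochs=500, lr=0.5):
    """Maximize the log-likelihood by batch gradient ascent.

    rows   -- one sparse feature dict per user (a row of the sparse matrix)
    labels -- 1 if the user has the interest, else 0
    """
    weights = {}
    n = len(rows)
    for _ in range(epochs):
        grad = {}
        for features, y in zip(rows, labels):
            err = y - predict(weights, features)  # d(log-likelihood)/dz
            grad["__bias__"] = grad.get("__bias__", 0.0) + err
            for name, value in features.items():
                grad[name] = grad.get(name, 0.0) + err * value
        for name, g in grad.items():
            weights[name] = weights.get(name, 0.0) + lr * g / n
    return weights


def fit_per_interest(rows, labels_by_interest):
    """Train one independent model for each interest."""
    return {interest: fit(rows, labels)
            for interest, labels in labels_by_interest.items()}
```

For example, with rows such as `{"attended_rock": 1}` and labels keyed by `"rock"`, `fit_per_interest` returns one weight dictionary per interest, which `predict` can then score against a new user's features.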

Page 21: Ashu Desc

Model Based vs Clustering

Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem

Item-Item vs User-User

Page 22: Ashu Desc

Implicit Social Graph

[Figure: graph with user nodes U1-U5 and event nodes E1-E4]

Page 23: Ashu Desc

Mixed Social Graph

[Figure: graph with user nodes U1-U5, event nodes E1-E3, and nodes labeled FB and LI]

Page 24: Ashu Desc

15M * 260 * 260 ≈ 1.01 Trillion Edges

4 billion edges ranked

Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship

Page 25: Ashu Desc

Feature Generation

• Mixed Features

• A series of map-reduce jobs

• Output on HDFS in flat files; Input to subsequent jobs

• Orders = Event Attendees

• MAP: eid: uid

• REDUCE: eid: [uid]

• Attendees Social Graph

• Input: eid: [uid]

• MAP: uid_i: [uid]

• REDUCE: uid: [neighbors]

• Interest based features, user specific, graph mining etc

• Upload feature values to HBase
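The two chained jobs above (orders to attendee lists, then attendee lists to a per-user neighbor list) can be simulated in memory. This is only a sketch of the data flow: the slides state that the real jobs are native Java MapReduce with flat files on HDFS between stages, and the `run_job` helper and sample IDs here are hypothetical.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job (map, shuffle, reduce)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Job 1: Orders = Event Attendees
def orders_map(order):
    eid, uid = order
    yield eid, uid                      # MAP: eid: uid


def attendees_reduce(eid, uids):
    return sorted(set(uids))            # REDUCE: eid: [uid]


# Job 2: Attendees Social Graph
def coattendee_map(item):
    eid, uids = item                    # Input: eid: [uid]
    for uid in uids:
        # MAP: uid_i: [uid] -- everyone else who attended the same event
        yield uid, [other for other in uids if other != uid]


def neighbors_reduce(uid, lists):
    # REDUCE: uid: [neighbors] -- union across all events the user attended
    return sorted({other for others in lists for other in others})


def build_graph(orders):
    """Chain the jobs: the output of job 1 is the input of job 2."""
    attendees = run_job(orders, orders_map, attendees_reduce)
    return run_job(attendees.items(), coattendee_map, neighbors_reduce)
```

Calling `build_graph` on a list of `(eid, uid)` pairs returns a dictionary from each user to the users who attended at least one event with them.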

Page 26: Ashu Desc

[Figure: graph with user nodes U1, U2, and U3]

Page 27: Ashu Desc

HBase

Page 28: Ashu Desc

HBase

• Collect data from multiple Map Reduce jobs

• Stores entire social graph

• Over one million writes per second

Page 29: Ashu Desc

HBase

rowid     neighbors  events  featureX
2718282   101        3       0.3678795

Page 30: Ashu Desc

HBase

rowid     314159:n  314159:e  314159:fx  161803:n  161803:e  161803:fx
2718282   31        1         0.3183     83        2         0.618
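The second layout above keeps one row per user and one group of columns per neighbor, with the neighbor's ID as a prefix of the column qualifier. A rough sketch of that encoding with plain dictionaries follows; it does not use an HBase client, and reading `n`, `e`, and `fx` as per-edge versions of the neighbors, events, and featureX columns is an assumption.

```python
def to_wide_row(edges):
    """Flatten {neighbor_id: {feature: value}} into '<neighbor_id>:<feature>' columns."""
    row = {}
    for neighbor_id, features in edges.items():
        for name, value in features.items():
            row[f"{neighbor_id}:{name}"] = value
    return row


def from_wide_row(row):
    """Rebuild the per-neighbor feature dicts from the flattened columns."""
    edges = {}
    for qualifier, value in row.items():
        neighbor_id, name = qualifier.split(":", 1)
        edges.setdefault(neighbor_id, {})[name] = value
    return edges


# The example row from the slide, keyed by row ID 2718282.
EDGES_OF_2718282 = {
    "314159": {"n": 31, "e": 1, "fx": 0.3183},
    "161803": {"n": 83, "e": 2, "fx": 0.618},
}
```

With this layout, all edges of one user can be fetched with a single row lookup, at the cost of rows whose width grows with the number of neighbors.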

Page 31: Ashu Desc

Tips & Tricks

• Distributed cache database

• Sped up some Map Reduce jobs by hours

• Be sure to use counters!
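The slide does not say what was placed in the distributed cache, so the following is only one plausible reading: a small lookup table is shipped to every mapper and joined in memory, which avoids a reduce-side join, while counters track how often the lookup hits or misses. The file format, field names, and counter names are hypothetical, and `collections.Counter` stands in for Hadoop job counters.

```python
from collections import Counter


def load_lookup(lines):
    """Parse 'event_id<TAB>category' lines once, as a mapper would at setup."""
    return dict(line.rstrip("\n").split("\t", 1) for line in lines)


def map_with_lookup(orders, lookup, counters):
    """Map-side join: enrich each (eid, uid) order from the in-memory table."""
    for eid, uid in orders:
        category = lookup.get(eid)
        if category is None:
            counters["lookup_miss"] += 1   # visible in the job summary
            continue
        counters["lookup_hit"] += 1
        yield uid, category


def run(orders, lookup_lines):
    """Run the mapper over all orders and return (output, counters)."""
    counters = Counter()
    lookup = load_lookup(lookup_lines)
    output = list(map_with_lookup(orders, lookup, counters))
    return output, counters
```

A miss counter that is unexpectedly high is often the first sign that the cached table is stale or keyed differently from the main input.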

Page 32: Ashu Desc

Tips & Tricks

• Hive (ab)uses

• Almost as many Hive jobs as custom ones

• “flip join”

• Statistical functions using Hive

• UDF

Page 33: Ashu Desc

Tips & Tricks

• Memory Memory Memory

• LZO, WAL

• Combiners are great until

• Shuffle and Sorting stage

• Hadoop ecosystem is still new

Page 34: Ashu Desc

Questions?