SF Hadoop Users Group August 2014 Meetup Slides

Jan 15, 2015

Yash Ranadivé

Slides for Hadoop Users Group Meetup on 13th August 2014
Page 1: SF Hadoop Users Group August 2014 Meetup Slides

Hadoop at Lookout

Aug 13, 2014

Yash Ranadive

@yashranadive

Thursday, August 14, 14

Page 2: SF Hadoop Users Group August 2014 Meetup Slides

BIO

• Data Engineer

• From Mumbai, India

• Lived in 7 different cities in the US

• @yashranadive

• etl.svbtle.com

Page 3: SF Hadoop Users Group August 2014 Meetup Slides

AGENDA

• What we do @Lookout

• Data warehouse

• Evolution from monolithic to micro-services

• Protocol Buffers

• Areas we are exploring

Page 4: SF Hadoop Users Group August 2014 Meetup Slides

WHAT WE DO @LOOKOUT

Page 5: SF Hadoop Users Group August 2014 Meetup Slides

Over 50 million registered users

Page 6: SF Hadoop Users Group August 2014 Meetup Slides

DATA TEAM

• 3 Data Engineers

• 6 data analysts

• Hadoop

• 64 hosts

• 300 TB capacity

Page 7: SF Hadoop Users Group August 2014 Meetup Slides

DATA WAREHOUSE

[Diagram: INTERNAL AND EXTERNAL DATA SOURCES; ingestion tools Chunker and Mudskipper; WAREHOUSE containing a MySQL star schema warehouse and HDFS with HIVE, HBase, and Impala; front ends R, Hue, Shiny, Tableau, and custom apps.]

Page 8: SF Hadoop Users Group August 2014 Meetup Slides

FROM MONOLITHIC TO MICROSERVICES

Page 9: SF Hadoop Users Group August 2014 Meetup Slides

MONOLITHIC APPLICATION

[Diagram: Mobile/Web Clients connect over HTTP to a RAILS APPLICATION (Routing, Controller, Views, ORM), which is backed by a Database (Tables).]

Page 10: SF Hadoop Users Group August 2014 Meetup Slides

DATA INGESTION - MONOLITHIC

[Diagram: Application, master_db, slave_db (MySQL replication), Data Warehouse (MySQL, Hive), External Sources, Reporting; arrows labeled ETL and ELT.]

Ingestion is batch-oriented

Page 11: SF Hadoop Users Group August 2014 Meetup Slides

PROBLEM

• Rails offers fast time to market (TTM) but is hard to scale

• One code base

• Slower Deployments

• Too complex and large to manage

• Solution

• Microservices / service oriented architecture

• Break the app out into smaller services

Page 12: SF Hadoop Users Group August 2014 Meetup Slides

MICROSERVICES ARCHITECTURE

[Diagram: Mobile/Web Clients connect over HTTP to the RAILS APPLICATION (Routing, Controller, Views, ORM) and its Database (Tables), now alongside separate services: Settings Service and PhotoBackup.]

We frequently add new services

Page 13: SF Hadoop Users Group August 2014 Meetup Slides

DATA INGESTION - MICROSERVICES

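[Diagram: the monolithic pipeline (Application, master_db, slave_db with MySQL replication, Data Warehouse with MySQL and Hive, External Sources, Reporting; arrows labeled ETL and ELT), plus Settings Service, Backup Service, and Locate Service connected to a Messaging Layer that is read by a Consumer.]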

Page 14: SF Hadoop Users Group August 2014 Meetup Slides

DATA INGESTION - MONOLITHIC VS MICROSERVICES

Monolithic - Snapshot of a point in time

select * from user_settings;

id | setting_id  | user_id | modified_at
===|=============|=========|===============
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z

Microservices - Events

{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}

{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}
...
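
To make the contrast concrete, here is a minimal sketch of how a consumer could fold a stream of events back into a snapshot like the user_settings table. The field names follow the sample events on this slide. The function name `fold_events`, the third event (guid 4, status "OFF"), and the rule of keeping only the latest status per user and setting are assumptions for illustration, not Lookout's implementation.

```python
def fold_events(events):
    """Reduce an ordered event stream to the latest status per (user_id, setting_id)."""
    snapshot = {}
    for event in events:
        if event["event_type"] != "modify_setting":
            continue  # other event types (e.g. start_backup) carry no setting state
        key = (event["user_id"], event["setting_id"])
        snapshot[key] = {
            "setting_status": event["setting_status"],
            "modified_at": event["timestamp"],
        }
    return snapshot


events = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup", "user_id": "2629",
     "timestamp": "20140709T0400Z"},
    # Hypothetical later event, added to show that the snapshot keeps only the last value.
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0410Z"},
]

snapshot = fold_events(events)
print(snapshot[("2629", "backup")]["setting_status"])  # OFF
```

The snapshot can always be rebuilt from the events, but the history (the setting was ON for ten minutes) cannot be recovered from the snapshot alone.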

Page 15: SF Hadoop Users Group August 2014 Meetup Slides

DESIGN

• We wanted to create an always-on event ingestion framework that:

• Would scale workers on demand

• Would be easy to monitor

Page 16: SF Hadoop Users Group August 2014 Meetup Slides

FIRST STAB - WORKER

Service -> ActiveMQ -> Ruby Worker -> HIVE

• Upstart script that daemonized Ruby process

• Monitoring using Zenoss

• Very easy to set up

• Mapping Files for JSON -> CSV

• Ruby is terse and clean
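
The mapping-file idea can be sketched as follows. The original worker was written in Ruby and its mapping format is not shown in the slides, so everything here is hypothetical: Python is used only for illustration, and the `MAPPING` list, the column names, and `json_to_csv_row` are invented names. The sketch pairs each output column with a JSON key and emits one CSV line per message.

```python
import csv
import io
import json

# Hypothetical mapping "file": ordered list of (CSV column, JSON key).
MAPPING = [
    ("guid", "guid"),
    ("event_type", "event_type"),
    ("user_id", "user_id"),
    ("event_time", "timestamp"),
]


def json_to_csv_row(message, mapping=MAPPING):
    """Convert one JSON message into a CSV line using the column mapping."""
    record = json.loads(message)
    # Missing keys become empty strings so every row has the same column count.
    values = [record.get(json_key, "") for _, json_key in mapping]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()


message = ('{"guid": 3, "event_type": "start_backup", '
           '"user_id": "2629", "timestamp": "20140709T0400Z"}')
print(json_to_csv_row(message), end="")  # 3,start_backup,2629,20140709T0400Z
```

Keeping the mapping as data rather than code means a new event type needs only a new mapping, not a new worker.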

Page 17: SF Hadoop Users Group August 2014 Meetup Slides

PROBLEMS

• ActiveMQ

• ActiveMQ did not scale well - even with multiple machines in the AMQ cluster

• ActiveMQ creates a separate queue for every consumer of the topic

• Monitoring using Zenoss is not ideal especially for multi-process consumers

• The worker ran on a single machine, so it was not fault tolerant

Page 18: SF Hadoop Users Group August 2014 Meetup Slides

CURRENT ARCHITECTURE - WORKER

Service -> Kafka -> Storm -> HIVE

• Monitoring using Storm’s thrift API

• Scaling number of workers is easy

• Kafka has better scalability than ActiveMQ

Service -> ActiveMQ

Page 19: SF Hadoop Users Group August 2014 Meetup Slides

STORM TOPOLOGY

[Diagram: Service -> Kafka -> Storm -> HDFS. Storm contains a Kafka Spout, an ActiveMQ Spout, a Processing Bolt, and a Storm-hdfs bolt; HDFS contains a Landing Directory and a Hive Directory.]

Page 20: SF Hadoop Users Group August 2014 Meetup Slides

JSON PROBLEMS

• Problems with JSON

• No predefined schema

• No enforcement of backward compatibility

• Solution

• Protocol Buffers (also Avro/Thrift)

Page 21: SF Hadoop Users Group August 2014 Meetup Slides

PROTOBUFS

• What?

• Way of encoding structured data

• Binary

• Why?

• Schema

• Backward compatibility

• Smaller in size than JSON
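
The size claim can be illustrated with a hand-rolled encoder for two protobuf wire types (varint and length-delimited). This is a sketch of the wire format only, not the protobuf library: the field numbers, the integer type chosen for user_id, and the helper names are assumptions made for this example.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_uint_field(field_number, value):
    """Wire type 0 (varint): tag, then the value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number, text):
    """Wire type 2 (length-delimited): tag, byte length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical schema: 1=guid (uint64), 2=event_type (string),
# 3=user_id (uint64), 4=timestamp (string).
event = {"guid": 3, "event_type": "start_backup", "user_id": 2629,
         "timestamp": "20140709T0400Z"}

binary = (encode_uint_field(1, event["guid"])
          + encode_string_field(2, event["event_type"])
          + encode_uint_field(3, event["user_id"])
          + encode_string_field(4, event["timestamp"]))
text = json.dumps(event).encode("utf-8")

print(len(binary), len(text))  # the binary form is the shorter of the two
```

The saving comes mostly from replacing repeated key names with one-byte field tags; the schema that maps tags back to names lives in the .proto file rather than in every message.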

Page 22: SF Hadoop Users Group August 2014 Meetup Slides

VERSIONING

• backward compatible changes only

[Diagram: Producer -> Queue -> Consumer, with .proto files at Version 1.4 and Version 1.1.]
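
Why adding fields is a backward compatible change can be sketched with a small decoder. The scenario is an assumption: a consumer built against an older .proto reads a message from a producer on a newer one. The field numbers, the `OLD_SCHEMA` dictionary, and the byte string are invented for illustration, and the decoder handles only the varint and length-delimited wire types. Each field carries its own tag and length, so a reader can step over fields it does not recognize.

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_known_fields(data, known):
    """Parse every field, keep those listed in `known`, discard the rest."""
    result, pos = {}, 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:    # length-delimited
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            result[known[field_number]] = value
    return result


# Message from a newer producer: guid=3, user_id=2629, plus a field 5
# ("ios") that the older consumer's schema does not define.
message = b"\x08\x03" + b"\x18\xc5\x14" + b"\x2a\x03ios"

# Older consumer schema: only fields 1 and 3 are known.
OLD_SCHEMA = {1: "guid", 3: "user_id"}
print(decode_known_fields(message, OLD_SCHEMA))  # {'guid': 3, 'user_id': 2629}
```

Renumbering a field or changing its type would break this property, which is why only additive changes count as backward compatible.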

Page 23: SF Hadoop Users Group August 2014 Meetup Slides

SHARING PROTOBUF SCHEMAS

[Diagram: Producers push Java jars and Ruby gems to Artifactory (Schema Repo); the Data Team Storm Project pulls Java jars.]

Page 24: SF Hadoop Users Group August 2014 Meetup Slides

BUT HOW DO YOU STORE PROTOBUFS IN HDFS?

Page 25: SF Hadoop Users Group August 2014 Meetup Slides

HOW WE STORE PROTOBUFS

• Store raw version

• Raw dump of Kafka topic into HDFS

• Convert them to a tuple using Storm

• Inflate then convert to TSV

• Can query raw protobufs directly from HIVE but we don’t yet

• elephant-bird (difficult to get it working)

Page 26: SF Hadoop Users Group August 2014 Meetup Slides

STORM TOPOLOGY

[Diagram: Service -> Kafka -> Storm -> HDFS. Storm contains a Kafka Spout, an ActiveMQ Spout, a Deserialize Protobuf bolt, and a Storm-hdfs bolt; HDFS contains a Landing Directory and a Hive Directory.]

Page 27: SF Hadoop Users Group August 2014 Meetup Slides

AREAS WE ARE EXPLORING

Page 28: SF Hadoop Users Group August 2014 Meetup Slides

SPARK

• ETL

• Wordcount: ~5 lines of Scala code vs. 58 lines of Java MapReduce code

• Spark Streaming can achieve results similar to Storm's through micro-batching: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

• Machine Learning

• Online learning using MLlib

• Logistic Regression and SVM
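
Micro-batching, mentioned in the ETL bullets above, can be sketched without Spark itself: instead of handling each event as it arrives, the stream is cut into small fixed-interval batches and each batch is processed as a unit. The code below is plain Python, not the Spark Streaming API; the `micro_batches` function, the one-second interval, and the sample stream are assumptions for illustration.

```python
from itertools import groupby


def micro_batches(events, interval_seconds):
    """Group time-ordered (timestamp, payload) pairs into fixed-interval batches."""
    def window(event):
        timestamp, _ = event
        return int(timestamp // interval_seconds)

    for window_index, batch in groupby(events, key=window):
        yield window_index * interval_seconds, [payload for _, payload in batch]


# Hypothetical stream: (seconds since start, event type).
stream = [(0.2, "modify_setting"), (0.9, "start_backup"),
          (1.1, "locate"), (2.7, "wipe")]

for start, batch in micro_batches(stream, interval_seconds=1):
    print(start, batch)
```

The batch interval sets the floor on latency, which is the usual trade-off against per-event processing.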

Page 29: SF Hadoop Users Group August 2014 Meetup Slides

H2O

• In-memory machine learning

• Tight integration with R

• Preferred by Data Scientists

Page 30: SF Hadoop Users Group August 2014 Meetup Slides

OPEN SOURCE PROJECTS

• Currently open sourced

• Pipefish - write from MySQL to HDFS: github.com/lookout/pipefish

• Future

• Mudskipper - capture change-data events from MySQL binlogs.

• Chunker - download MySQL table data in chunks
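
Chunker was not yet released at the time of the talk, so its design is not known from the slides. As one plausible reading of "download table data in chunks", the sketch below pages through a table by primary key, so no single query has to return the whole table. An in-memory SQLite database stands in for MySQL; `read_in_chunks`, the chunk size, and the table contents are assumptions for illustration.

```python
import sqlite3


def read_in_chunks(connection, table, chunk_size):
    """Yield lists of rows ordered by id, at most chunk_size rows at a time."""
    last_id = 0  # assumes positive integer primary keys
    while True:
        # `table` is interpolated directly, so it must be a trusted identifier.
        rows = connection.execute(
            f"SELECT id, setting_id, user_id FROM {table} "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last primary key seen


# In-memory SQLite stands in for MySQL so the sketch runs without a server.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE user_settings "
    "(id INTEGER PRIMARY KEY, setting_id TEXT, user_id INTEGER)")
connection.executemany(
    "INSERT INTO user_settings VALUES (?, ?, ?)",
    [(1, "backup", 2629), (3, "locate", 2682),
     (8, "wipe", 2629), (9, "theft_alert", 2629)],
)

for chunk in read_in_chunks(connection, "user_settings", chunk_size=3):
    print(chunk)
```

Paging on `id > last_id` rather than `OFFSET` keeps each query cheap on large tables, because the database can seek straight to the next key.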

Page 31: SF Hadoop Users Group August 2014 Meetup Slides

Questions
