Page 1: Pinterest hadoop summit_talk

Confidential

Using Hadoop to build data driven Products

50 Billion pins and counting

Krishna Gade

Page 2: Pinterest hadoop summit_talk

What is Pinterest?

A visual bookmarking tool

Discover an inspiring idea

Save it to a board

Go do it

Page 3: Pinterest hadoop summit_talk

Who am I?

Krishna Gade

• Data Engineering at Pinterest
• Search and Data platforms at Twitter and Bing
• Follow @krishnagade

Page 4: Pinterest hadoop summit_talk

Pinterest is a data product

Page 5: Pinterest hadoop summit_talk
Page 6: Pinterest hadoop summit_talk

Why do we care about data?

How is Hadoop helping us to harness the power of the data?

What are some of the tools we built on top of Hadoop Platform?

Page 7: Pinterest hadoop summit_talk

Why do we care about data?

How is Hadoop helping us to harness the power of data?

What are some of the tools we built on top of Hadoop Platform?

Page 8: Pinterest hadoop summit_talk
Page 9: Pinterest hadoop summit_talk

3.375

Page 10: Pinterest hadoop summit_talk

5’10”

Page 11: Pinterest hadoop summit_talk
Page 12: Pinterest hadoop summit_talk
Page 13: Pinterest hadoop summit_talk

< uncertainty

Page 14: Pinterest hadoop summit_talk

> odds of making the best decisions

Page 15: Pinterest hadoop summit_talk

It is a capital mistake to theorize before one has data.

- Sherlock Holmes

Page 16: Pinterest hadoop summit_talk

Why do we care about data?

How is Hadoop helping us to harness the power of the data?

What are some of the tools we built on top of Hadoop Platform?

Page 17: Pinterest hadoop summit_talk

Data at Pinterest

• 50 Billion Pins
• 1 Billion boards
• 40 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 200 engineers

Page 18: Pinterest hadoop summit_talk

Pinterest Data Architecture

App

Page 19: Pinterest hadoop summit_talk

Pinterest Data Architecture

App

events

Kafka

Secor

Singer

Page 21: Pinterest hadoop summit_talk

Pinterest Data Architecture

App

events

Kafka

Secor

Skyline

Pinball

Redshift

Pinalytics

Features

Qubole (Hadoop)

Singer

Page 22: Pinterest hadoop summit_talk

Hadoop Platform Requirements

• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters

Page 23: Pinterest hadoop summit_talk

Design Choices

Page 24: Pinterest hadoop summit_talk

Decoupling compute & storage

Hadoop Cluster 1

Transient HDFS

Hadoop Cluster 2

Transient HDFS

S3 Persistent Store

Page 25: Pinterest hadoop summit_talk

Centralized Hive Metastore

Hive Metastore

Pig

Cascading

Hive

HDFS/S3

Data

Metadata

Page 26: Pinterest hadoop summit_talk

Multi-layered Packaging

Runtime Staging (on S3)
• Mapreduce Jobs
• Hadoop Jars/Libs
• Job/User level Configs

Automated Configuration (Masterless Puppet)
• Software Packages/Libs
• Configs (OS/Hadoop)
• Misc Sys Admin

Baked AMI
• OS
• Bootstrap Script
• Core SW

Page 27: Pinterest hadoop summit_talk

Executor Abstraction Layer

Hive Metastore

HDFS/S3

Qubole

Managed Hadoop

EMR

Executor

Pinball

Dev Server

Page 28: Pinterest hadoop summit_talk

Why Qubole?

• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling

Page 29: Pinterest hadoop summit_talk

Scale of Processing

● Scale:
  o 50 Billion Pins
  o Hundreds of workflows
  o Thousands of jobs
  o 500+ jobs in a workflow
  o 3 petabytes processed daily

● Support:
  o Hadoop, Cascading, Hive, Spark …

job

workflow

Page 30: Pinterest hadoop summit_talk

Pinball

Page 31: Pinterest hadoop summit_talk

Why Pinball?

● Requirements
  o Simple abstractions
  o Extensible in future
  o Reliable stateless computing
  o Easy to debug
  o Scales horizontally
  o Can be upgraded w/o aborting workflows
  o Rich features like auto-retries, per-job emails, overrun policies…

● Options
  o Apache Oozie, Azkaban, Luigi

Page 32: Pinterest hadoop summit_talk

Pinball Design

Page 33: Pinterest hadoop summit_talk

Workflow Model

● Workflow
  o A directed graph of nodes called jobs
● Edge
  o Run-after dependence
● Node
  o Job is a node
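The workflow model above can be illustrated with a small sketch. This is not Pinball's code: the job names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever scheduling logic Pinball actually uses. Each job maps to the set of jobs it must run after, and a topological sort yields one valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a job (a node), each value is the set of
# jobs it must run after (the "run-after" edges).
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(deps):
    """Return one valid execution order for the jobs of a workflow."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(workflow)
print(order)  # ['extract', 'clean', 'aggregate', 'report']
```

A graph with a cycle is not a valid workflow under this model; `TopologicalSorter` raises `graphlib.CycleError` in that case.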

Page 34: Pinterest hadoop summit_talk

Job State

● Job state is captured in a token
● Tokens are named hierarchically

Master

Job Token

version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)

Page 35: Pinterest hadoop summit_talk

Job State Machine

Page 36: Pinterest hadoop summit_talk

Master Worker Interaction

● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable

Worker, Master, Persistent Store

1: request
2: update
3: ack

Page 37: Pinterest hadoop summit_talk

Master

● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread – no concurrency issues
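A minimal sketch of how the token and master ideas might fit together, under stated assumptions. The token fields (`version`, `name`, `owner`, `expiration`, `data`) come from the Job Token slide. The `Master` class, its `put` and `claim` methods, and the reading of `expiration` as an ownership lease are illustrative assumptions, not Pinball's actual API. A plain list stands in for the persistent store.

```python
import time
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Token:
    version: int
    name: str              # hierarchical, e.g. /workflow/w1/job
    owner: Optional[str]   # worker currently holding the token, if any
    expiration: float      # assumed meaning: end of the ownership lease
    data: str              # placeholder for the job payload


class Master:
    """Single-threaded, in-memory token store that persists before replying."""

    def __init__(self) -> None:
        self.tokens: Dict[str, Token] = {}
        self.persisted: List[Token] = []   # stand-in for the persistent store

    def put(self, token: Token) -> Token:
        self.persisted.append(token)       # 2: update the store first
        self.tokens[token.name] = token    # then the in-memory state
        return token                       # 3: ack to the caller

    def claim(self, prefix: str, worker: str, lease_sec: float,
              now: Optional[float] = None) -> Optional[Token]:
        """Hand `worker` the first unowned or expired token under `prefix`."""
        now = time.time() if now is None else now
        for name in sorted(self.tokens):
            tok = self.tokens[name]
            free = tok.owner is None or tok.expiration <= now
            if name.startswith(prefix) and free:
                return self.put(replace(tok, version=tok.version + 1,
                                        owner=worker,
                                        expiration=now + lease_sec))
        return None


master = Master()
master.put(Token(1, "/workflow/w1/job", None, 0.0, "JobTemplate"))
claimed = master.claim("/workflow/w1/", "worker_0", lease_sec=60, now=1000.0)
print(claimed.owner, claimed.version)  # worker_0 2
```

Because the master in this sketch handles one request at a time, a claim needs no locking. A second worker asking for the same token before the lease ends gets `None`, and can take the token over once the lease has expired.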

Page 38: Pinterest hadoop summit_talk

Worker

Page 39: Pinterest hadoop summit_talk

Open Source

Git repo: https://github.com/pinterest/pinball

Mailing list: https://groups.google.com/forum/#!forum/pinball-users

Page 40: Pinterest hadoop summit_talk

Data Driven Products

Page 41: Pinterest hadoop summit_talk

Guided Search

Page 42: Pinterest hadoop summit_talk

Related Pins

Page 43: Pinterest hadoop summit_talk

Why do we care about insights?

How is Hadoop helping us to harness the power of data?

What are some of the tools we built on top of Hadoop Platform?

Page 44: Pinterest hadoop summit_talk

Pinalytics

Scalable Data Analytics Engine

Page 45: Pinterest hadoop summit_talk

Architecture

Main Components

• Backend: Thrift services and HBase databases
• Webapp: Rich UI components
• Reporter: Generates formatted data
• Metrics: Customized optimizations

Page 46: Pinterest hadoop summit_talk

User Interface

Visualizations
• Highcharts
• Time-series updated automatically daily

Customizability
• Dashboards
• Built-in or user-defined reports

Page 47: Pinterest hadoop summit_talk

User Interface

Pinomaly
• Anomalous metric tracking
• Email alerts

Reporting
• Formatted dashboards
• PDF printing
• Duplicated weekly

Metric Manipulation
• Metric Composer
• Global operations (segmentation, rollup/aggregation, etc.)

Page 48: Pinterest hadoop summit_talk

Data Model

Date, seg1, seg2, ... => value
• Store the value for every possible segmentation
• On-the-fly aggregation

E.g.
• 2015-01-01, US, Male => 1
• 2015-01-01, US, Female => 2
• 2015-01-01, UK, Male => 3
• 2015-01-01, UK, Female => 4
• 2015-01-01, UK, * => 7
• 2015-01-01, *, Male => 4
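The data model above can be sketched in a few lines. This is an illustration only, assuming the metric is an additive count and that `*` means "all values of this segment". The function name `with_rollups` is hypothetical. Given the four fully segmented rows from the example, it produces every rollup row, including the two shown on the slide.

```python
from collections import defaultdict
from itertools import product


def with_rollups(rows):
    """Expand (date, seg1, seg2, ...) -> value rows with every '*' rollup."""
    totals = defaultdict(int)
    for (date, *segs), value in rows.items():
        # For each segment, either keep its value or replace it with '*'.
        for combo in product(*[(seg, "*") for seg in segs]):
            totals[(date, *combo)] += value
    return dict(totals)


rows = {
    ("2015-01-01", "US", "Male"): 1,
    ("2015-01-01", "US", "Female"): 2,
    ("2015-01-01", "UK", "Male"): 3,
    ("2015-01-01", "UK", "Female"): 4,
}
table = with_rollups(rows)
print(table[("2015-01-01", "UK", "*")])    # 7
print(table[("2015-01-01", "*", "Male")])  # 4
```

The sketch materializes every rollup up front. The same sums could instead be computed at read time from the fully segmented rows, which is the on-the-fly aggregation the slide mentions.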

Page 49: Pinterest hadoop summit_talk

Backend Architecture

Webapp Server

Pinalytics Thrift Service

HBase
Region Server 1, Region Server 2, Region Server N
Region1 CP, Region2 CP, Region3 CP, Region4 CP, Region5 CP, RegionM CP
Metric table

1. request
2. readMetrics()
3. Scan & Aggregate
4. Region aggregation
5. metrics

Page 50: Pinterest hadoop summit_talk

HBase

Horizontal Scalability
• No app-level sharding

Flexibility in Aggregation
• FuzzyRowFilter
• Coprocessor

Tables
• Report metadata
• Reports

Page 51: Pinterest hadoop summit_talk

Fuzzy Row Filter

Composite row key
• METRIC|TIME|SEG1|SEG2|...

Filters rows given a row key and a fuzzy row
• 0: the byte must match, 1: the byte can be anything

E.g. MAU of male users on 2015-01-01
• Start row: MAU|2015-01-01|
• End row: MAU|2015-01-01||
• Row Key: MAU|2015-01-01|--|M-
• Fuzzy filter: 000|0000000000|11|00
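The matching rule can be shown in plain Python. This is a sketch of the idea, not the HBase `FuzzyRowFilter` API: `fuzzy_match` and `mask_from_text` are hypothetical helper names, and the `|` separators in the slide's mask notation are assumed to be fixed bytes (mask 0). The sample row keys are made up for illustration.

```python
def fuzzy_match(row_key: bytes, pattern: bytes, mask: bytes) -> bool:
    """True if row_key equals pattern at every position whose mask byte is 0."""
    if len(row_key) < len(pattern) or len(pattern) != len(mask):
        return False
    return all(m == 1 or r == p for r, p, m in zip(row_key, pattern, mask))


def mask_from_text(text: str) -> bytes:
    """Convert '000|0000000000|11|00' notation into mask bytes.

    Only '1' marks a wildcard position; digits '0' and the '|' separators
    are treated as fixed positions.
    """
    return bytes(1 if ch == "1" else 0 for ch in text)


pattern = b"MAU|2015-01-01|--|M-"
mask = mask_from_text("000|0000000000|11|00")

print(fuzzy_match(b"MAU|2015-01-01|US|M-", pattern, mask))  # True
print(fuzzy_match(b"MAU|2015-01-01|UK|F-", pattern, mask))  # False
```

The two wildcard bytes cover the country segment, so any country matches while the metric, date, and gender bytes stay fixed. The start and end rows bound the scan to a single metric and date before the filter is applied.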

Page 52: Pinterest hadoop summit_talk

HBase Coprocessor

• Region-local aggregation with coprocessor
• Final aggregation at the Thrift service
• Reduces Network I/O
• Low Latency
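The two-level aggregation can be sketched as follows. This is an in-process illustration, not HBase coprocessor code: the function names, the sample rows, and the use of Python lists to stand in for regions are all assumptions. Each region returns a single partial sum, and the service tier combines the partials.

```python
def region_aggregate(region_rows, matches):
    """Stands in for the coprocessor: sum matching values inside one region."""
    return sum(value for key, value in region_rows if matches(key))


def service_aggregate(regions, matches):
    """Stands in for the Thrift service: combine one partial sum per region."""
    partials = [region_aggregate(rows, matches) for rows in regions]
    return sum(partials), len(partials)


# Hypothetical metric rows spread over two regions.
regions = [
    [("MAU|2015-01-01|US|M-", 1), ("MAU|2015-01-01|US|F-", 2)],
    [("MAU|2015-01-01|UK|M-", 3), ("MAU|2015-01-01|UK|F-", 4)],
]
total, values_sent = service_aggregate(regions, lambda k: k.endswith("|M-"))
print(total, values_sent)  # 4 2
```

Only one number per region crosses the network instead of every matching row, which is the source of the reduced network I/O listed above.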

Page 53: Pinterest hadoop summit_talk

Reporter

Flexible Python client library for generating reports
• Arbitrary metrics and segments

Easy-to-access data
• Data is automatically copied to S3
• Hive external table is generated

Page 54: Pinterest hadoop summit_talk

Reporter Example

• WAU, WARC and MAU segmented by gender and country

class DemoWAUReport(PinalyticsWideReport):
    _METRIC_NAMES = ['wau', 'warc', 'mau']
    _SEGKEY_NAMES = ['gender', 'country']
    _QUERY_TEMPLATE = """
        SELECT dt, gender, country, wau, warc, mau
        FROM activity_metrics WHERE dt>='2015-01-01';
    """

• Sample query output

['2015-01-01', 'male', 'US', 102, 53, 110]
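The talk does not show what `PinalyticsWideReport` does with each query row, so the following is only a guess at the general shape. It assumes a "wide" row holds the date, then the segment values, then one column per metric, and splits it into one record per metric. The function name `wide_to_metric_rows` is hypothetical and not part of the Pinalytics library.

```python
def wide_to_metric_rows(row, segkey_names, metric_names):
    """Split one wide row [dt, seg..., metric...] into per-metric records."""
    dt = row[0]
    n = len(segkey_names)
    segs = dict(zip(segkey_names, row[1:1 + n]))
    values = row[1 + n:]
    return [(name, dt, segs, value) for name, value in zip(metric_names, values)]


sample = ['2015-01-01', 'male', 'US', 102, 53, 110]
records = wide_to_metric_rows(sample, ['gender', 'country'], ['wau', 'warc', 'mau'])
for record in records:
    print(record)
```

Each resulting record has the metric, date, segments, and value layout used by the data model earlier in the deck.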

Page 55: Pinterest hadoop summit_talk

Core Metrics

• Pre-compute a lot of core metrics

• Standard segmentation
  - Gender, Country, App
  - Spam-filtering

• Activity
• Event counts
• Retention
• Signups

Page 56: Pinterest hadoop summit_talk

Outcomes

Page 57: Pinterest hadoop summit_talk

Internal Tools Matter

Solving problems inside of our company

400 Unique users

800 Page views per day

1500 Custom charts created and updated daily

Page 58: Pinterest hadoop summit_talk

Thank You