THE EVOLUTION OF HADOOP AT STRIPE
colin marc @colinmarc

Transcript
Page 1: The Evolution of Hadoop at Stripe

THE EVOLUTION OF HADOOP AT STRIPE
colin marc @colinmarc

Page 2: The Evolution of Hadoop at Stripe

ABOUT STRIPE

• payments for the web

• based in SF

• last time I checked, ~75 people (stripe.com/about)

• main product is an API

Page 3: The Evolution of Hadoop at Stripe

WITH US, DATA WAS AN AFTERTHOUGHT

Page 4: The Evolution of Hadoop at Stripe

A LOT OF OUR DATA IS IN MONGO

• MongoDB is a fantastic application database

• uses BSON - like JSON, but has a binary representation

• MongoDB is schemaless, but has indexed queries and other features that are nice for applications

Page 5: The Evolution of Hadoop at Stripe

APPLICATION DBS SUCK FOR ANALYSIS

• well, sometimes. relational databases are OK

• MongoDB is awful (for this)

• no joins

• scans are painful

• no declarative query language
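
With no joins, the application has to fetch both sides and stitch them together by hand. A minimal sketch of that client-side join, using plain dicts to stand in for Mongo documents (the collection contents here are invented for illustration):

```python
# Without joins, "joining" charges to customers means two queries
# plus a hand-rolled merge in application code.

customers = [
    {"_id": "cus_1", "email": "a@example.com"},
    {"_id": "cus_2", "email": "b@example.com"},
]
charges = [
    {"_id": "ch_1", "customer": "cus_1", "amount": 500},
    {"_id": "ch_2", "customer": "cus_1", "amount": 700},
    {"_id": "ch_3", "customer": "cus_2", "amount": 300},
]

def join_charges_to_customers(charges, customers):
    # Build the index on the "foreign key" ourselves, since the
    # database won't do it for us at query time.
    by_id = {c["_id"]: c for c in customers}
    return [
        {**ch, "email": by_id[ch["customer"]]["email"]}
        for ch in charges
    ]

joined = join_charges_to_customers(charges, customers)
```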

Page 6: The Evolution of Hadoop at Stripe

SOLUTION: PUT THE DATA SOMEWHERE ELSE

Page 7: The Evolution of Hadoop at Stripe

V1: TSV + IMPALA

• threw together a Hadoop cluster on the developer boxes

• “nightly” script dumped models to TSV files in HDFS

• janky script output the schema from our models

• query from Impala
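
The nightly dump amounted to flattening each model document into a row of tab-separated values against a fixed column list. A rough sketch of that translation (field names and documents are invented for illustration):

```python
# Flatten documents into TSV rows against a fixed column list.
# Missing fields become empty cells; everything gets stringified,
# which is exactly where the schema/typing pain comes from.

FIELDS = ["_id", "amount", "currency", "created"]

def to_tsv_row(doc, fields=FIELDS):
    return "\t".join(str(doc.get(f, "")) for f in fields)

docs = [
    {"_id": "ch_1", "amount": 500, "currency": "usd", "created": 1390000000},
    {"_id": "ch_2", "amount": 700, "currency": "usd"},  # no "created"
]

tsv = "\n".join(to_tsv_row(d) for d in docs)
```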

Page 8: The Evolution of Hadoop at Stripe

ASIDE: IMPALA IS PRETTY COOL

• developed by Cloudera

• absurdly fast queries over HDFS

• SQL is great

• most of our questions are ad-hoc


Page 9: The Evolution of Hadoop at Stripe

A NICE EXPERIMENT, BUT...

• schema translation is hard

• SLOW SLOW SLOW

• TSV is not a great format

• script never runs

• not production data

Page 10: The Evolution of Hadoop at Stripe

V2: MONGO -> HBASE

• Impala can query HBase, I think?

• @nelhage wrote MoSQL - let’s do the same thing, but put the data in HBase!

• translating from one k/v store to another is easier

Page 11: The Evolution of Hadoop at Stripe

ZEROWING
http://github.com/stripe/zerowing

Page 12: The Evolution of Hadoop at Stripe

FIRST, SNAPSHOT

• using Mongo-Hadoop, map over your MongoDB database

• HFileOutputFormat, completeBulkLoad

Page 13: The Evolution of Hadoop at Stripe

THEN, STREAM

• tail the MongoDB oplog, like a replica set member

• replicate inserts/updates/deletes by _id
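
The streaming side boils down to applying each oplog entry to the target store, keyed by `_id`. A simplified applier, with a plain dict standing in for HBase and hand-written entries roughly following MongoDB's oplog op codes (`i`/`u`/`d`); real oplog entries carry more fields than shown here:

```python
# Replay MongoDB oplog-style entries against a key/value target.
# In ZeroWing the target is HBase; a dict keyed by _id stands in here.

def apply_op(target, op):
    kind = op["op"]
    if kind == "i":              # insert: "o" is the full document
        target[op["o"]["_id"]] = op["o"]
    elif kind == "u":            # update: "o2" carries the _id, "o" the new doc
        target[op["o2"]["_id"]] = op["o"]
    elif kind == "d":            # delete: "o" carries the _id
        target.pop(op["o"]["_id"], None)

store = {}
oplog = [
    {"op": "i", "o": {"_id": "ch_1", "amount": 500}},
    {"op": "u", "o2": {"_id": "ch_1"}, "o": {"_id": "ch_1", "amount": 700}},
    {"op": "i", "o": {"_id": "ch_2", "amount": 300}},
    {"op": "d", "o": {"_id": "ch_2"}},
]
for entry in oplog:
    apply_op(store, entry)
```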

Page 14: The Evolution of Hadoop at Stripe

HAVING DATA IN HDFS IS GREAT

Page 15: The Evolution of Hadoop at Stripe

THEN, QUERY IT WITH IMPALA...UM

• wait, impala can’t actually query HBase effectively

• 30-40x slower over the same data

• limiting factor is HBase scan speed, I think

Page 16: The Evolution of Hadoop at Stripe

LOST IN TRANSLATION

• our schema problem is still there!

• BSON is typed, but HBase is just strings

• nested hashes still don’t work

• lists???

• what is the canonical schema?
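
The typing problem is easy to demonstrate: once values are flattened to strings, an int and the string spelling of that int become indistinguishable, and nested hashes and lists collapse into unqueryable blobs. A small pure-Python illustration of what a naive typed-to-string translation loses:

```python
# Stringify typed BSON-ish values, the way a naive k/v translation does.
values = [500, 500.0, "500", {"cents": 500}, [5, 0, 0]]
as_stored = [str(v) for v in values]

# The int 500 and the string "500" now collide, and the nested
# hash and list are just their printed form, not structure.
```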

Page 17: The Evolution of Hadoop at Stripe
Page 18: The Evolution of Hadoop at Stripe

V3: PARQUET + THRIFT

• instead of storing k/v pairs, just store the raw BSON blobs

• write your MR jobs against HBase if you want up-to-date data

• also periodically dump out Parquet files

• use thrift definitions to manage schema

Page 19: The Evolution of Hadoop at Stripe

USING THRIFT AS SCHEMA

• thrift is a nice way to define what fields we expect to be in the BSON

• in most cases, we can do the translation automatically

• decode on the backend, instead of during replication

• no information loss
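
In spirit: the thrift struct says which fields to expect, decoding happens only when the data is read back, and the raw blob is kept around, so fields the schema doesn't know about are never lost. A toy version of that backend decode, with a dict of expected field types standing in for a thrift struct (field names and types here are invented):

```python
# Expected fields, as a thrift struct would declare them.
CHARGE_SCHEMA = {"_id": str, "amount": int, "currency": str}

def decode(raw_doc, schema=CHARGE_SCHEMA):
    # Pull out only the declared, correctly-typed fields. The raw
    # doc is left untouched, so undeclared fields stay recoverable.
    out = {}
    for name, typ in schema.items():
        if name in raw_doc and isinstance(raw_doc[name], typ):
            out[name] = raw_doc[name]
    return out

raw = {"_id": "ch_1", "amount": 500, "currency": "usd", "mystery_field": [1, 2]}
decoded = decode(raw)
```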

Page 20: The Evolution of Hadoop at Stripe

GENERATE THRIFT DEFINITIONS?

• thrift still isn’t the canonical schema for our application - that exists in our ODM

• wrote a quick ruby script to generate thrift definitions from our application models
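
The generator just walks each model's fields and emits a thrift struct per model. The original was a Ruby script against the ODM; this sketch does the same thing in Python over a hand-written field map (the type mapping is illustrative, not Stripe's actual one):

```python
# Map model field types to thrift types, then emit a struct definition.
THRIFT_TYPES = {str: "string", int: "i64", float: "double", bool: "bool"}

def to_thrift(struct_name, fields):
    lines = [f"struct {struct_name} {{"]
    for i, (name, typ) in enumerate(fields.items(), start=1):
        # Every field is optional: BSON documents may omit anything.
        lines.append(f"  {i}: optional {THRIFT_TYPES[typ]} {name}")
    lines.append("}")
    return "\n".join(lines)

idl = to_thrift("Charge", {"_id": str, "amount": int, "paid": bool})
```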

Page 21: The Evolution of Hadoop at Stripe

PARQUET <3 THRIFT

• columnar, read-optimized

• with a little bit of glue, serialize any basic thrift struct easily
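
"Columnar" means the file stores each field's values together, so a query that touches two columns never reads the rest. A toy row-to-column pivot shows the idea (Parquet's real encoding adds repetition/definition levels, encodings, and compression on top):

```python
# Pivot row-oriented records into column-oriented arrays,
# the basic layout trick behind Parquet.
rows = [
    {"_id": "ch_1", "amount": 500},
    {"_id": "ch_2", "amount": 700},
    {"_id": "ch_3", "amount": 300},
]

columns = {
    name: [r[name] for r in rows]
    for name in rows[0]
}

# A scan of just "amount" now reads one contiguous array.
total = sum(columns["amount"])
```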

Page 22: The Evolution of Hadoop at Stripe

IMPALA <3 PARQUET

• more glue can automatically import parquet files into Impala

• Impala and parquet are designed to work well with each other

• nested structs don’t work yet =(

Page 23: The Evolution of Hadoop at Stripe

SCALDING <3 PARQUET

• we use scalding for a lot of MapReduce stuff

• added ParquetSource to scalding to make this easy (source and sink)

Page 24: The Evolution of Hadoop at Stripe

THIS WORKS FOR ANY DATA

• use thrift to define an intermediate or derived data type, and you get, for free:

• serialization using parquet

• easy MR jobs with scalding

• ad-hoc querying with Impala

Page 25: The Evolution of Hadoop at Stripe

OVERVIEW

[Diagram: in Application Land, the application writes to MongoDB; ZeroWing replicates across into Hadoop Land, where data lands in HBase, gets dumped to Parquet snapshots, and is consumed by MR jobs and Impala on Hadoop.]

Page 26: The Evolution of Hadoop at Stripe

QUESTIONS?

• meeeee: @colinmarc

• Stripe: stripe.com

• we’re hiring! stripe.com/jobs

• ZeroWing: github.com/stripe/zerowing

• Impala: github.com/cloudera/impala

• Parquet: parquet.github.com