Hadoop at datasift

May 13, 2015

Jairam Chandar

Slides from the presentation at the Hadoop UK User Group meetup in London, as part of BigDataWeek.
Transcript
Page 1: Hadoop at datasift

Hadoop At

Datasift

Page 2: Hadoop at datasift

About me

Jairam Chandar

Big Data Engineer

Datasift

@jairamc

http://about.me/jairam

http://blog.jairam.me

Page 3: Hadoop at datasift

Outline

What is Datasift?

Where do we use Hadoop?

– The Numbers
– The Use Cases
– The Lessons

Page 4: Hadoop at datasift

!! Sales Pitch Alert !!

Page 5: Hadoop at datasift

What is Datasift?

Page 15: Hadoop at datasift

The Numbers

Machines

– 60 machines
  ● Datanode
  ● Tasktracker
  ● RegionServer

– 2 machines
  ● Namenode

– 2 machines
  ● HBase Master

– In the process of doubling our capacity

Page 16: Hadoop at datasift

The Numbers

Machines

– 2 * Intel Xeon E5620 @ 2.40GHz (16 cores total)

– 24GB RAM

– 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)

– 1 Gigabit network links

Page 17: Hadoop at datasift

The Numbers

Data

– Avg load of 3500 interactions/second

– Peak load of 6000 interactions/second

– Highest during the Superbowl – 12000 interactions/second

– Avg size of interaction 2 KB – that's 2 TB a day with replication (RF = 3)

– And that's not it!
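
The 2 TB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes) and that the quoted rate holds for a full day; the function name is illustrative and not taken from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(interactions_per_second, avg_size_bytes, replication_factor):
    """Return the stored volume per day in decimal terabytes."""
    raw_bytes = interactions_per_second * SECONDS_PER_DAY * avg_size_bytes
    # Every block is stored replication_factor times on HDFS.
    return raw_bytes * replication_factor / 1e12


# Figures from the slide: 3500/s average, 6000/s peak, 2 KB each, RF = 3.
average = daily_volume_tb(3500, 2000, 3)
peak = daily_volume_tb(6000, 2000, 3)

print(f"average load:        {average:.2f} TB/day")
print(f"sustained peak load: {peak:.2f} TB/day")
```

At the average rate this comes to roughly 1.8 TB per day, which rounds to the 2 TB quoted on the slide.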

Page 18: Hadoop at datasift

The Use Cases

HBase

– Recordings
– Archive/Ultrahose

Map/Reduce

– Exports
– Historics

Page 19: Hadoop at datasift

The Use Cases

Recordings

– User-defined streams

– Stored in HBase for later retrieval

– Export to multiple output formats and stores

– Row key: <recording-id><interaction-uuid>
  ● Recording-id is a SHA-1 hash
  ● Allows recordings to be distributed by their key without generating hot-spots
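
A minimal sketch of such a key in Python. The slide does not say what is hashed to produce the recording-id or how the interaction UUID is encoded, so hashing a recording name and appending the 16 raw UUID bytes are both assumptions, and `recording_row_key` is a hypothetical helper.

```python
import hashlib
import uuid


def recording_row_key(recording_name: str, interaction_id: uuid.UUID) -> bytes:
    """Build <recording-id><interaction-uuid> as raw bytes (assumed layout)."""
    # SHA-1 yields a 20-byte digest that is effectively uniformly
    # distributed, so different recordings start at unrelated points
    # in the sorted key space.
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()
    return recording_id + interaction_id.bytes


key_a = recording_row_key("recording-a", uuid.uuid4())
key_b = recording_row_key("recording-a", uuid.uuid4())
key_c = recording_row_key("recording-b", uuid.uuid4())

print(len(key_a))                # 20-byte hash + 16-byte UUID
print(key_a[:20] == key_b[:20])  # same recording, same prefix
print(key_a[:20] == key_c[:20])  # different recording, different prefix
```

Because all rows of one recording share the same 20-byte prefix, they sit next to each other in sorted order and can be read back with a single prefix scan, while separate recordings are spread across the table.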

Page 20: Hadoop at datasift

The Use Cases

Recordings continued ...

Page 21: Hadoop at datasift

The Use Cases

Exporter

– Export data from HBase for customers

– Export files of 5–10 GB, or 3–6 million records

– MR over HBase using TableInputFormat

– But the data needs to be sorted
  ● TotalOrderPartitioner
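
The idea behind total-order partitioning can be illustrated outside Hadoop. What follows is a plain-Python sketch of the technique, not Hadoop's `TotalOrderPartitioner` API: split points are chosen from a sample of keys, each key is routed to the partition whose range contains it, and each partition is sorted independently. Concatenating the partitions in order then gives a globally sorted result. All function names and the sample size are illustrative.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 evenly spaced keys from the sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Index of the partition whose key range contains `key`."""
    # Partition i holds keys with split_points[i-1] <= key < split_points[i].
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_partitions, sample_size=100, seed=0):
    """Return a list of partitions, each sorted, ordered by key range."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = choose_split_points(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, split_points)].append(key)
    # Each "reducer" sorts only its own partition.
    return [sorted(partition) for partition in partitions]


rng = random.Random(42)
keys = [rng.randrange(1_000_000) for _ in range(10_000)]
partitions = total_order_sort(keys, num_partitions=4)

# Concatenating the partition outputs in order gives one sorted sequence.
merged = [key for partition in partitions for key in partition]
print(merged == sorted(keys))
print([len(partition) for partition in partitions])
```

Sampling matters because the split points decide how evenly work is spread: a poor sample produces skewed partitions, and one reducer ends up with most of the data.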

Page 22: Hadoop at datasift

The Use Cases

Exporter Continued

Page 23: Hadoop at datasift

!! Sales Pitch Alert !!

Page 24: Hadoop at datasift

Historics

Page 25: Hadoop at datasift

The Use Cases

Archive/Ultrahose

– Not just the Firehose but the Ultrahose

– Stored in HBase as well

– HBase architecture (BigTable) creates hot-spots with time-series data
  ● Leading randomizing bit (see HBaseWD)
  ● Pre-split regions
  ● Concurrent writes
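
The leading-prefix idea can be sketched as follows. This illustrates the general key-salting technique rather than the HBaseWD API; the bucket count, the one-byte prefix, the timestamp-plus-id key layout and the use of CRC32 to derive the bucket are all assumptions made for the example.

```python
import struct
import zlib

NUM_BUCKETS = 16  # assumed; would normally match the number of pre-split regions


def salted_key(timestamp_ms: int, interaction_id: bytes) -> bytes:
    """Prefix a time-ordered key with a bucket byte derived from the key."""
    # Big-endian packing keeps byte order identical to numeric order.
    original = struct.pack(">Q", timestamp_ms) + interaction_id
    bucket = zlib.crc32(original) % NUM_BUCKETS
    return bytes([bucket]) + original


def region_split_keys(num_buckets: int = NUM_BUCKETS):
    """Split points for pre-splitting the table: one region per bucket."""
    return [bytes([bucket]) for bucket in range(1, num_buckets)]


def scan_ranges(start_ms: int, stop_ms: int, num_buckets: int = NUM_BUCKETS):
    """A time-range read becomes one (start, stop) scan per bucket."""
    return [
        (bytes([b]) + struct.pack(">Q", start_ms),
         bytes([b]) + struct.pack(">Q", stop_ms))
        for b in range(num_buckets)
    ]


# Consecutive timestamps no longer map to consecutive row keys.
keys = [salted_key(1_431_500_000_000 + i, b"id-%d" % i) for i in range(1000)]
buckets_used = {key[0] for key in keys}
print(len(buckets_used), "buckets used by 1000 consecutive writes")
```

The trade-off is on the read side: a scan over a time range has to be issued once per bucket and the results merged, which is why the bucket count is usually kept modest.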

Page 26: Hadoop at datasift

The Use Cases

Archive continued …

2 years of Tweets

– 11 TB compressed

– <Number of tweets we got>

Page 27: Hadoop at datasift

The Use Cases

Historics

– Export archive data

– Slightly different from Exporter
  ● Much longer timelines (1–3 months)
  ● Unfiltered input data
  ● Therefore longer processing time
  ● Hence more optimizations required

Page 28: Hadoop at datasift

The Use Cases

Historics continued ...

Page 29: Hadoop at datasift

The Lessons - HBase

Tune Tune Tune (Default == BAD)

Based on use case, tune:

– Heap
– Block size
– Memstore size

Keep number of column families low

Be aware of hot-spotting issues when writing time-series data

Use compression (e.g. Snappy)

Page 30: Hadoop at datasift

The Lessons - HBase

Ops need intimate understanding of system

Monitor metrics (GC, CPU, Compaction, I/O)

Don't be afraid to fiddle with HBase code

Using a distribution is advisable

Page 31: Hadoop at datasift

Questions?