Avoiding big data anti-patterns

grepalex

Jan 20, 2017
Page 1: Avoiding big data antipatterns

Avoiding big data anti-patterns

Page 2: Avoiding big data antipatterns

whoami

• Alex Holmes

• Software engineer

• @grep_alex

• grepalex.com

Page 3: Avoiding big data antipatterns

Why should I care about big data anti-patterns?

Page 4: Avoiding big data antipatterns
Page 5: Avoiding big data antipatterns

Agenda

I’ll cover:

• It’s big data!

• A single tool for the job

• Polyglot data integration

• Full scans FTW!

• Tombstones

• Counting with Java built-in collections

• It’s open

… by:

• Walking through each anti-pattern

• Looking at why it should be avoided

• Considering some mitigations

Page 6: Avoiding big data antipatterns

Meet your protagonists

Alex (the amateur) and Jade (the pro)

Page 7: Avoiding big data antipatterns

It’s big data!

Page 8: Avoiding big data antipatterns

i want to calculate some statistics on some static user data . . .

how big is the data?

it’s big data, so huge, 20GB!!!

i need to order a hadoop cluster!

Page 9: Avoiding big data antipatterns

What’s the problem?

Page 10: Avoiding big data antipatterns

you think you have big data ...

but you don’t!

Page 11: Avoiding big data antipatterns
Page 12: Avoiding big data antipatterns
Page 13: Avoiding big data antipatterns

Poll: how much RAM can a single server support?

A. 256 GB

B. 512 GB

C. 1 TB

Page 14: Avoiding big data antipatterns

http://yourdatafitsinram.com

Page 15: Avoiding big data antipatterns
Page 16: Avoiding big data antipatterns

keep it simple . . .

use MySQL or Postgres

or R/Python/MATLAB

Page 17: Avoiding big data antipatterns

Summary

• Simplify your analytics toolchain when working with small data (especially if you don’t already have an investment in big data tooling)

• Old(er) tools such as OLTP/OLAP/R/Python still have their place for this type of work

Page 18: Avoiding big data antipatterns

A single tool for the job

Page 19: Avoiding big data antipatterns

that looks like a nail!!!

[Image: a hammer labeled “nosql” sizing up a nail labeled “PROBLEM”]

Page 20: Avoiding big data antipatterns

What’s the problem?

Page 21: Avoiding big data antipatterns

Big data tools are usually designed to do one thing well (maybe two)

Page 22: Avoiding big data antipatterns

Types of workloads

• Low-latency data lookups

• Near real-time processing

• Interactive analytics

• Joins

• Full scans

• Search

• Data movement and integration

• ETL

Page 23: Avoiding big data antipatterns

The old world was simple

OLTP/OLAP

Page 24: Avoiding big data antipatterns

The new world … not so much

Page 25: Avoiding big data antipatterns

You need to research and find the best-in-class for your function

Page 26: Avoiding big data antipatterns

Best-in-class big data tools (in my opinion)

If you want …                                              Consider …
Low-latency lookups                                        Cassandra, memcached
Near real-time processing                                  Storm
Interactive analytics                                      Vertica, Teradata
Full scans, system of record data, ETL, batch processing   HDFS, MapReduce, Hive, Pig
Data movement and integration                              Kafka

Page 27: Avoiding big data antipatterns

Summary

• There is no single big data tool that does it all

• We live in a polyglot world where new tools are announced every day - don’t believe the hype!

• Test the claims on your own hardware and data; start small and fail fast

Page 28: Avoiding big data antipatterns

Polyglot data integration

Page 29: Avoiding big data antipatterns

i need to move clickstream data from my application to hadoop

[Diagram: Application -> Hadoop Loader -> Hadoop]

Page 30: Avoiding big data antipatterns

shoot, i need to use that same data in streaming

[Diagram: Application -> Hadoop Loader -> Hadoop, plus a second Application -> JMS path for the streaming consumer]

Page 31: Avoiding big data antipatterns

What’s the problem?

Page 32: Avoiding big data antipatterns

[Diagram: point-to-point integrations between sources (OLTP, OLAP/EDW, HBase, Cassandra, Voldemort, Hadoop) and consumers (Security, Analytics, Rec. Engine, Search, Monitoring, Social Graph)]

Page 33: Avoiding big data antipatterns

we need a central data repository and pipeline to isolate consumers from the source

that way new consumers can be added without any work!

let’s use kafka!

[Diagram: the same sources and consumers, now connected through kafka instead of point-to-point]

Page 34: Avoiding big data antipatterns

Background

• Apache project

• Originated from LinkedIn

• Open-sourced in 2011

• Written in Scala and Java

• Borrows concepts from messaging systems and logs

• Foundational data movement and integration technology

Page 35: Avoiding big data antipatterns

What’s the big whoop about Kafka?

Page 36: Avoiding big data antipatterns

Throughput

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

[Benchmark diagram: three producers and three consumers against a three-node Kafka cluster; 2,024,032 TPS producing, 2,615,968 TPS consuming, ~2ms latency]

Page 37: Avoiding big data antipatterns

O.S. page cache is leveraged

[Diagram: the producer appends writes to the log through the OS page cache; consumers A and B read recent messages straight from the page cache, falling back to disk for older data]

Page 38: Avoiding big data antipatterns

Things to look out for

• Leverages ZooKeeper, which is tricky to configure

• Reads can become slow when the page cache is missed and disk needs to be hit

• Lack of security

Page 39: Avoiding big data antipatterns

Summary

• Don’t write your own data integration

• Use Kafka for lightweight, fast and scalable data integration
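
A minimal sketch of what this looks like with Kafka’s Java producer API (the broker address and topic name are placeholders, not from the deck):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickstreamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // every event is appended to one topic; Hadoop, streaming and any
            // future consumer read independently from the same log
            producer.send(new ProducerRecord<>("clickstream", "user-42", "/products/123"));
        }
    }
}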

Page 40: Avoiding big data antipatterns

Full scans FTW!

Page 41: Avoiding big data antipatterns

i heard that hadoop was designed to work with huge data volumes!

Page 42: Avoiding big data antipatterns

i’m going to stick my data on hadoop . . .

Page 43: Avoiding big data antipatterns

and run some joins

SELECT * FROM huge_table JOIN other_huge_table ON …

Page 44: Avoiding big data antipatterns

What’s the problem?

Page 45: Avoiding big data antipatterns

yes, hadoop is very efficient at batch workloads

files are split into large blocks and distributed throughout the cluster

“data locality” is a first-class concept, where the scheduler pushes compute to storage

Page 46: Avoiding big data antipatterns

but hadoop doesn’t negate all these optimizations we learned when working on

relational databases

Page 47: Avoiding big data antipatterns

so partition your data according to how you will most commonly access it

hdfs:/data/tweets/date=20140929/
hdfs:/data/tweets/date=20140930/
hdfs:/data/tweets/date=20141001/

disk I/O is slow

Page 48: Avoiding big data antipatterns

and then make sure to include a filter in your queries so that only those partitions are read

... WHERE DATE=20151027

Page 49: Avoiding big data antipatterns

include projections to reduce data that needs to be read from disk or pushed over the network

SELECT id, name FROM ...

Page 50: Avoiding big data antipatterns

hash joins require network I/O, which is slow

Page 51: Avoiding big data antipatterns

merge joins are way more efficient

Records in all datasets sorted by join key:

Symbol  Price       Symbol  Headquarters
GOOGL   526.62      GOOGL   Mtn View
MSFT    39.54       MSFT    Redmond
VRSN    65.23       VRSN    Reston

The merge algorithm streams and performs an inline merge of the datasets
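
To make the streaming merge concrete, here is a hedged sketch of the merge algorithm over the two pre-sorted datasets above (the record types are invented for illustration, and unique join keys are assumed):

import java.util.Iterator;
import java.util.List;

public class MergeJoin {
    record Price(String symbol, double price) {}
    record Hq(String symbol, String headquarters) {}

    // both inputs must already be sorted by the join key (symbol)
    static void join(List<Price> prices, List<Hq> hqs) {
        Iterator<Price> p = prices.iterator();
        Iterator<Hq> h = hqs.iterator();
        Price pr = p.hasNext() ? p.next() : null;
        Hq hq = h.hasNext() ? h.next() : null;
        while (pr != null && hq != null) {
            int cmp = pr.symbol().compareTo(hq.symbol());
            if (cmp == 0) {             // keys match: emit the joined record
                System.out.println(pr.symbol() + " " + pr.price() + " " + hq.headquarters());
                pr = p.hasNext() ? p.next() : null;
                hq = h.hasNext() ? h.next() : null;
            } else if (cmp < 0) {       // advance whichever side has the smaller key
                pr = p.hasNext() ? p.next() : null;
            } else {
                hq = h.hasNext() ? h.next() : null;
            }
        }
    }
}

Neither side is buffered or shuffled into memory; each dataset is read once, in order.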

Page 52: Avoiding big data antipatterns

you’ll have to bucket and sort your data and tell your query engine to use a sort-merge-bucket (SMB) join

-- Hive properties to enable an SMB join
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

Page 53: Avoiding big data antipatterns

and look at using a columnar data format like parquet

Column storage:

Column 1 (Symbol): GOOGL, MSFT
Column 2 (Date): 05-10-2014, 05-10-2014
Column 3 (Price): 526.62, 39.54

Page 54: Avoiding big data antipatterns

Summary

• Partition, filter and project your data (same as you used to do with relational databases)

• Look at bucketing and sorting your data to support advanced join techniques such as sort-merge-bucket

• Consider storing your data in columnar form

Page 55: Avoiding big data antipatterns

Tombstones

Page 56: Avoiding big data antipatterns

i need to store data in a highly available persistent queue . . .

and we already have Cassandra deployed . . .

bingo!!!

Page 57: Avoiding big data antipatterns

What is Cassandra?

• Low-latency distributed database

• Apache project modeled after Dynamo and BigTable

• Data replication for fault tolerance and scale

• Multi-datacenter support

• CAP: tunable consistency (favors availability and partition tolerance)

Page 58: Avoiding big data antipatterns

[Diagram: a six-node Cassandra cluster replicated across East and West datacenters]

Page 59: Avoiding big data antipatterns

What’s the problem?

Page 60: Avoiding big data antipatterns

deletes in Cassandra are soft; deleted columns are marked with tombstones

tombstone markers indicate that the column has been deleted

these tombstoned columns slow down reads

[Diagram: rows of key/value columns with tombstone markers scattered among the live columns]

Page 61: Avoiding big data antipatterns

by default tombstones stay around for 10 days

if you want to know why, read up on gc_grace_seconds and reappearing deletes

Page 62: Avoiding big data antipatterns

don’t use Cassandra as a queue - use kafka

Page 63: Avoiding big data antipatterns

design your schema and read patterns to avoid tombstones getting in the way of your reads

Page 64: Avoiding big data antipatterns

keep track of consumer offsets, and add a time or bucket semantic to rows. only delete rows after some time has elapsed, or once all consumers have consumed them.

[Diagram: rows keyed by ID and bucket, each bucket holding a run of messages (offsets 723802-723804); consumers track their own per-bucket offsets]
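
A hedged sketch of that design using CQL via the DataStax Java driver (the keyspace, table and bucket values are hypothetical, and the queues keyspace is assumed to already exist):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class BucketedQueue {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // messages are grouped into coarse buckets so deletes happen at the
            // partition level rather than leaving per-column tombstones behind
            session.execute("CREATE TABLE IF NOT EXISTS queues.messages ("
                    + " queue_id text, bucket bigint, offset bigint, msg text,"
                    + " PRIMARY KEY ((queue_id, bucket), offset))");

            // consumers remember their own (bucket, offset) position and start
            // each read there, so scans never trawl through tombstoned columns
            session.execute("SELECT offset, msg FROM queues.messages"
                    + " WHERE queue_id = 'q1' AND bucket = 20151027 AND offset > 723802");

            // retire a whole bucket only after every consumer has passed it:
            // one partition-level delete instead of millions of column deletes
            session.execute("DELETE FROM queues.messages"
                    + " WHERE queue_id = 'q1' AND bucket = 20151027");
        }
    }
}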

Page 65: Avoiding big data antipatterns

Summary

• Try to avoid use cases that require high volume deletes and slice queries that scan over tombstone columns

• Design your schema and delete/read patterns with tombstone avoidance in mind

Page 66: Avoiding big data antipatterns

Counting with Java’s built-in collections

Page 67: Avoiding big data antipatterns

i’m going to count the distinct number of users that viewed a tweet

Page 68: Avoiding big data antipatterns
Page 69: Avoiding big data antipatterns

What’s the problem?

Page 70: Avoiding big data antipatterns
Page 71: Avoiding big data antipatterns

Poll: what does HashSet<K> use under the covers?

A. K[]

B. Entry<K>[]

C. HashMap<K,V>

D. TreeMap<K,V>
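
(The answer is C: in the OpenJDK sources HashSet is a thin wrapper over HashMap, so every element pays for a full map entry. Abridged from java.util.HashSet:)

public class HashSet<E> /* ... */ {
    // each element is stored as a HashMap key pointing at one shared dummy value
    private transient HashMap<E, Object> map;
    private static final Object PRESENT = new Object();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }
}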

Page 72: Avoiding big data antipatterns
Page 73: Avoiding big data antipatterns

Memory consumption

HashSet = 32 * SIZE + 4 * CAPACITY

(SIZE = the number of elements in the set; CAPACITY = the set’s capacity, i.e. the length of its backing array)

String = 8 * (int) (((no. of chars * 2) + 45) / 8)

An average username of 6 characters = 64 bytes

For 10,000,000 users this is at least 1GiB
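
A sketch of the naive approach this arithmetic describes; with the slide’s 10,000,000 users the set alone occupies on the order of a gibibyte, all to produce a single number:

import java.util.HashSet;
import java.util.Set;

public class NaiveDistinctCount {
    public static void main(String[] args) {
        Set<String> users = new HashSet<>();
        for (long i = 0; i < 10_000_000L; i++) {
            users.add("user" + i);   // ~64 bytes per 6-character string,
                                     // plus 32 bytes of HashSet overhead each
        }
        System.out.println("distinct users: " + users.size());
    }
}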

Page 74: Avoiding big data antipatterns

use HyperLogLog to work with approximate distinct counts @scale

Page 75: Avoiding big data antipatterns

HyperLogLog

• Cardinality estimation algorithm

• Uses (a lot) less space than sets

• Doesn’t provide exact distinct counts (being “close” is probably good enough)

• Cardinality Estimation for Big Data: http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html

Page 76: Avoiding big data antipatterns

1 billion distinct elements = 1.5 KB of memory

standard error = 2%
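
Those two figures hang together: HyperLogLog’s standard error is roughly 1.04/√m, where m is the number of registers, so 1.5 KB of 6-bit registers gives m = 2048 and an error of about 1.04/√2048 ≈ 2.3%.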

Page 77: Avoiding big data antipatterns

https://www.flickr.com/photos/redwoodphotography/4356518997

Page 78: Avoiding big data antipatterns

Hashes

h(entity): 10110100100101010101100111001011

Good hash functions should result in each bit having a 50% probability of occurring

Page 79: Avoiding big data antipatterns

Bit pattern observations

50% of hashed values will look like: 1xxxxxxxxx..x

25% of hashed values will look like: 01xxxxxxxx..x

12.5% of hashed values will look like: 001xxxxxxx..x

6.25% of hashed values will look like: 0001xxxxxx..x

Page 80: Avoiding big data antipatterns

[Diagram: an array of registers, all initialized to zero]

Page 81: Avoiding big data antipatterns

h(entity): 01010...

register index: 4

register value: 1

[Diagram: the hash selects register 4 and records the value 1; estimated cardinality = harmonic_mean of the register estimates]
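
A hedged sketch of that register update in Java (the register count and hash-splitting details are illustrative, not taken from the deck):

public class HllRegisters {
    static final int B = 11;                      // 2^11 = 2048 registers
    static final byte[] registers = new byte[1 << B];

    // split the hash: the top B bits pick a register, and the position of the
    // first 1-bit in the remainder becomes the candidate value for it
    static void add(long hash) {
        int index = (int) (hash >>> (64 - B));
        long rest = hash << B;
        byte value = (byte) (Long.numberOfLeadingZeros(rest) + 1);
        if (value > registers[index]) {           // registers only ever grow
            registers[index] = value;
        }
    }
}

The estimate is then derived from the harmonic mean across all the registers, as the slide shows.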

Page 82: Avoiding big data antipatterns

HLL Java library

• https://github.com/aggregateknowledge/java-hll

• Neat implementation - it automatically promotes its internal data structure to a full HLL representation once it grows beyond a certain size
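
A minimal usage sketch, assuming the java-hll and Google Guava libraries are on the classpath (the log2m/regwidth parameters here are illustrative):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import net.agkn.hll.HLL;

public class HllDistinctUsers {
    public static void main(String[] args) {
        HLL hll = new HLL(13, 5);                 // 2^13 registers, 5 bits each
        HashFunction murmur = Hashing.murmur3_128();
        for (long i = 0; i < 10_000_000L; i++) {
            // the library expects values to be hashed before they are added
            long hash = murmur.hashString("user" + i, StandardCharsets.UTF_8).asLong();
            hll.addRaw(hash);
        }
        // a few kilobytes of registers instead of ~1GiB of HashSet entries
        System.out.println("approximate distinct users: " + hll.cardinality());
    }
}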

Page 83: Avoiding big data antipatterns

Approximate count algorithms

• HyperLogLog (distinct counts)

• CountMinSketch (frequencies of members)

• Bloom Filter (set membership)
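
Bloom filters are the easiest of the three to try; a small sketch using Guava’s BloomFilter (the sizing and false-positive rate are illustrative):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class SeenUsers {
    public static void main(String[] args) {
        // set membership with ~1% false positives and no false negatives
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);
        seen.put("user42");
        System.out.println(seen.mightContain("user42"));   // true
        System.out.println(seen.mightContain("user43"));   // almost certainly false
    }
}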

Page 84: Avoiding big data antipatterns

Summary

• Data skew is a reality when working at Internet scale

• Java’s built-in collections have a large memory footprint and don’t scale

• For high-cardinality data use approximate estimation algorithms

Page 85: Avoiding big data antipatterns

stepping away from the math . . .

Page 86: Avoiding big data antipatterns

It’s open

Page 87: Avoiding big data antipatterns

prototyping/viability - done

coding - done

testing - done

performance & scalability testing - done

monitoring - done

i’m ready to ship!

Page 88: Avoiding big data antipatterns

What’s the problem?

Page 89: Avoiding big data antipatterns

https://www.flickr.com/photos/arjentoet/8428179166

Page 90: Avoiding big data antipatterns

https://www.flickr.com/photos/gowestphoto/3922495716

Page 91: Avoiding big data antipatterns

https://www.flickr.com/photos/joybot/6026542856

Page 92: Avoiding big data antipatterns

How the old world worked

[Diagram: OLTP/OLAP systems fronted by authentication and authorization, with a DBA controlling access]

Page 93: Avoiding big data antipatterns

security’s not my job!!!

infosec: we disagree

Page 94: Avoiding big data antipatterns

Important questions to ask

• Is my data encrypted when it’s in motion?

• Is my data encrypted on disk?

• Are there ACLs defining who has access to what?

• Are these checks enabled by default?

Page 95: Avoiding big data antipatterns

How do tools stack up?

[Table comparing Oracle, Hadoop, Cassandra, ZooKeeper and Kafka on: ACLs, at-rest encryption, in-motion encryption, whether security is enabled by default, and ease of use]

Page 96: Avoiding big data antipatterns

Summary

• Enable security for your tools!

• Include security as part of evaluating a tool

• Ask vendors and project owners to step up to the plate

Page 97: Avoiding big data antipatterns

We’re done!

Page 98: Avoiding big data antipatterns

Conclusions

• Don’t assume that a particular big data technology will work for your use case - verify it for yourself on your own hardware and data early on in the evaluation of a tool

• Be wary of the “new hotness” and vendor claims - they may burn you

• Make sure that load/scale testing is a required part of your go-to-production plan

Page 99: Avoiding big data antipatterns

Thanks for your time!