Measuring Big Data: Understanding data by usage
Charles Smith, Big Data Platform Architecture - Netflix
Transcript
Page 1: OSCON 2015

Measuring Big Data

Understanding data by usage

Charles Smith

Big Data Platform Architecture - Netflix

Page 2: OSCON 2015

About Me

▪ Netflix

- I joined Netflix in 2011

- I spend my time working to make big data easy and efficient

- Usually from the perspective of someone trying to use the platform

▪ University of Florida

- Research in Information Retrieval

- How much information does a document have?

Page 3: OSCON 2015

What would you measure?

Page 4: OSCON 2015

What do you want to know?

Page 5: OSCON 2015

~20 PB of compressed data

~500 billion events a day

~18K data sets

~4200 nodes in our clusters

Page 6: OSCON 2015

Our largest two datasets:

1.4 PB

1.2 PB

Page 7: OSCON 2015

~11K Hive

~3K Pig

~2.5K Presto

Page 8: OSCON 2015
Page 9: OSCON 2015
Page 10: OSCON 2015
Page 11: OSCON 2015

Task Hour Cost = (cost of node)/(tasks per node) * sum(task duration ms)/(60*60*1000)
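As a worked illustration of the formula above, here is a minimal Python sketch. It assumes "cost of node" is an hourly price and "tasks per node" is the number of task slots a node runs concurrently; the slide does not state either, and the function name and sample numbers are hypothetical.

```python
MS_PER_HOUR = 60 * 60 * 1000

def task_hour_cost(node_cost_per_hour, tasks_per_node, task_durations_ms):
    """Cost of one job: hourly cost of a task slot times total task hours."""
    # Assumption: a node's hourly cost is split evenly across its task slots.
    cost_per_task_hour = node_cost_per_hour / tasks_per_node
    # Sum all task durations (milliseconds) and convert to hours.
    task_hours = sum(task_durations_ms) / MS_PER_HOUR
    return cost_per_task_hour * task_hours

# Hypothetical job: three tasks of 30, 60 and 90 minutes on a node that
# costs 1.00 per hour and runs 8 tasks at a time.
durations_ms = [30 * 60 * 1000, 60 * 60 * 1000, 90 * 60 * 1000]
cost = task_hour_cost(1.00, 8, durations_ms)
print(cost)  # 0.375
```

The job uses 3 task hours at 0.125 per task hour, so it costs 0.375 in whatever currency the node price is quoted in.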

Page 12: OSCON 2015
Page 13: OSCON 2015

100 Jobs comprise 86% of the cost
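A figure like this can be derived once every job has a cost attached. The sketch below shows one way to compute the share of total cost held by the N most expensive jobs; the job names and costs are made up for illustration and are not Netflix data.

```python
def top_n_cost_share(job_costs, n):
    """Fraction of total cost attributable to the n most expensive jobs."""
    total = sum(job_costs.values())
    if total == 0:
        return 0.0
    # Sort costs from highest to lowest and keep the first n.
    top = sorted(job_costs.values(), reverse=True)[:n]
    return sum(top) / total

# Hypothetical per-job costs.
job_costs = {
    "job_a": 50.0,
    "job_b": 30.0,
    "job_c": 10.0,
    "job_d": 6.0,
    "job_e": 4.0,
}
share = top_n_cost_share(job_costs, 2)
print(f"{share:.0%}")  # 80%
```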

Page 14: OSCON 2015

What data is important?

Page 15: OSCON 2015
Page 16: OSCON 2015

Make people tell you the answer: tagging.

Page 17: OSCON 2015
Page 18: OSCON 2015

Manual data doesn’t stay current unless it needs to.

Page 19: OSCON 2015
Page 20: OSCON 2015

How do we actually use the data?

Page 21: OSCON 2015

Parse the job (or ask the tool that parses it)
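To make the idea concrete, here is a deliberately naive Python sketch that pulls input table names out of a SQL string with a regular expression. It is not how any of the tools in this talk work; it ignores subqueries, CTEs, comments and quoted identifiers, which is exactly why asking the engine's own parser is preferable. The query text and column names are hypothetical.

```python
import re

# Naive pattern: an identifier (optionally dotted) right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def input_tables(sql):
    """Return the set of table names that follow FROM or JOIN in sql."""
    return {name.lower() for name in TABLE_REF.findall(sql)}

# Hypothetical query; column names are invented for the example.
query = """
    SELECT t.title_id, c.country_name
    FROM dse.ttl_title_d t
    JOIN dse.geo_country_d c ON t.country_id = c.country_id
"""
print(sorted(input_tables(query)))
# ['dse.geo_country_d', 'dse.ttl_title_d']
```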

Page 22: OSCON 2015

Charlotte

Presto

Sql Parser (Hive)

Sql Parser (Teradata)

Lipstick (Pig)

Metacat*

Page 23: OSCON 2015
Page 24: OSCON 2015

Dataset                        Distinct Queries
…                              2000
…                              1052
prodhive/dse/geo_country_d     1009
prodhive/dse/ttl_title_d       580
…                              565
…                              512
…                              466
…                              427
…                              395
…                              317

Page 25: OSCON 2015

Dataset                        Queries
prodhive/dse/geo_country_d     11405
prodhive/dse/ttl_title_d       8194
…                              5928
…                              5451
…                              4849
…                              4654
…                              4334
…                              3620
…                              3046
…                              2823
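The two rankings (distinct queries and total queries per dataset) can both be produced from one query log once each query's input datasets are known. The sketch below assumes a hypothetical log of `(query_text, datasets)` pairs and treats queries with identical text as the same distinct query; the talk does not say how distinctness was actually defined.

```python
from collections import defaultdict

def usage_counts(query_log):
    """Map each dataset to (total query runs, distinct query texts)."""
    total = defaultdict(int)
    distinct = defaultdict(set)
    for query_text, datasets in query_log:
        # Count a dataset once per query run, even if it is read twice.
        for dataset in set(datasets):
            total[dataset] += 1
            distinct[dataset].add(query_text)
    return {ds: (total[ds], len(distinct[ds])) for ds in total}

# Hypothetical log: query "q1" ran twice, "q2" once.
log = [
    ("q1", ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_d"]),
    ("q1", ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_d"]),
    ("q2", ["prodhive/dse/geo_country_d"]),
]
counts = usage_counts(log)
for dataset, (runs, unique) in sorted(counts.items()):
    print(dataset, runs, unique)
```

In this toy log `geo_country_d` has 3 runs from 2 distinct queries, and `ttl_title_d` has 2 runs from 1 distinct query.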

Page 26: OSCON 2015

Related To geo_country_d             Shared Queries
prodhive/dse/ttl_title_country_r     2277
…                                    1697
prodhive/dse/ttl_show_d              1540
prodhive/dse/ttl_season_d            1405
prodhive/dse/ttl_title_d             1392
…                                    926
…                                    817
…                                    743
prodhive/dse/ttl_season_country_r    638
…                                    628
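A "shared queries" ranking like this one can be sketched as a co-occurrence count: for a target dataset, count how many queries read each other dataset alongside it. The input below is a hypothetical list of per-query dataset lists, not the real log.

```python
from collections import Counter

def shared_query_counts(query_inputs, target):
    """Count, per other dataset, the queries that read it together with target."""
    shared = Counter()
    for datasets in query_inputs:
        unique = set(datasets)
        if target in unique:
            # Every other dataset in this query shares one query with target.
            shared.update(unique - {target})
    return shared

# Hypothetical inputs for three queries.
query_inputs = [
    ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_country_r"],
    ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_country_r",
     "prodhive/dse/ttl_show_d"],
    ["prodhive/dse/ttl_show_d"],
]
related = shared_query_counts(query_inputs, "prodhive/dse/geo_country_d")
for dataset, count in related.most_common():
    print(dataset, count)
```

Here `ttl_title_country_r` shares 2 queries with the target and `ttl_show_d` shares 1; the third query is ignored because it never reads the target.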

Page 27: OSCON 2015
Page 28: OSCON 2015

Datasets                      Input Jobs    Queries
prodhive/cdn/occ…             2016          66
teradata/gdw_stg_prod/seg…    1587          36
prodhive/dse/msg…             1527          14
prodhive/dse/msg…             1512          30
teradata/gdw_stg_prod/seg…    1043          50
teradata/gdw_stg_prod/cdn…    970           10
teradata/gdw_tbl_prod/seg…    903           1
prodhive/rpt/pbe…             811           11
prodhive/gps/gro…             904           137
prodhive/cdn/ttl…             631           39

Page 29: OSCON 2015
Page 30: OSCON 2015

Challenges

▪ Knowing what questions you should try to answer.

▪ Getting this data isn’t easy.

▪ The data is noisy.

Page 31: OSCON 2015

Thanks

▪ Charles Smith – Big Data Platform Architecture - Netflix

▪ @charles_s_smith