The evolution of the big data platform @ Netflix (OSCON 2015)


The Evolution of the Big Data Platform @ Netflix
Eva Tse
July 22, 2015

Our biggest challenge is scale

Netflix Key Business Metrics
• 65+ million members
• 50 countries, 1000+ devices supported
• 10 billion hours / quarter
• Global expansion: 200 countries by end of 2016

Big Data Size
• Total ~20 PB data warehouse (DW) on S3
• Read ~10% of the DW daily
• Write ~10% of read data daily
• ~500 billion events daily
• ~350 active users

Our traditional BI stack is our competition

How do we meet the functionality bar and yet make it scale?

How do we make big data bite-size again?

Our North Star
• Infrastructure – no undifferentiated heavy lifting
• Architecture – scalable and sustainable
• Self-serve – ecosystem of tools

[Architecture diagram: cloud apps emit event data through Suro/Kafka and Ursula in 15-minute batches; dimension data flows from Cassandra SSTables through Aegisthus daily. Both data pipelines land in AWS S3 as Parquet files. On top of that storage sit compute (Hadoop clusters), services (Metacat, a federated metadata service, and a federated execution service), and tools: Pig workflow visualization, data movement, data visualization, job/cluster perf visualization, data lineage, and data quality.]
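The event side of this pipeline is, at its core, a publish/subscribe flow. As a rough illustration only, here is a minimal Python sketch of an app publishing an event to Kafka, assuming the kafka-python client; the broker address, topic, and event fields are hypothetical, and Netflix wraps this flow in Suro/Ursula rather than calling Kafka directly.

```python
# Minimal sketch of the event pipeline's first hop: an app publishes
# a JSON event to Kafka. Broker, topic, and fields are hypothetical.
import json
from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Downstream consumers batch events like this to S3 every ~15 minutes.
producer.send("playback_events", {"member_id": 123, "event": "play"})
producer.flush()
```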

Evolving Big Data Processing Needs
• Analytics
• ETL
• Interactive data exploration
• Interactive slice & dice
• RT analytics & iterative/ML algorithms

Evolving Services/Tools Ecosystem

[Diagram: the services/tools layer of the platform: Metacat (federated metadata service), a federated execution service, the Big Data API, the API Portal, and the Big Data Portal, plus tools for Pig workflow visualization, data movement, data visualization, job/cluster perf visualization, data lineage, and data quality.]

AWS S3 as our DW Storage
• S3 as the single source of truth (not HDFS)
• 11 9's durability and 4 9's availability
• Separates compute and storage
• Key enablement of:
  – multiple clusters
  – easy upgrades via red/black deployment
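Because S3, not HDFS, holds the single source of truth, any cluster with credentials can read the warehouse directly, which is what makes multiple clusters and red/black upgrades possible. A minimal PySpark sketch of that idea, assuming an s3a filesystem is configured; the bucket and table path are hypothetical, and the SparkSession API is used for brevity:

```python
# Minimal PySpark sketch: reading warehouse data directly from S3.
# Bucket/path are hypothetical; assumes S3 credentials and the s3a
# filesystem are configured on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-dw-read").getOrCreate()

# No HDFS copy needed: every cluster reads the same S3-backed table,
# so clusters can be added or replaced without moving data.
df = spark.read.parquet("s3a://example-dw-bucket/warehouse/playback_events/")
df.groupBy("event_type").count().show()
```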

Evolution of Big Data Processing Systems

Hive
• Analytics
• HiveQL is close to ANSI SQL syntax
• Hive metastore serves as the single source of truth for big data metadata
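As a rough illustration of driving HiveQL from Python, here is a sketch assuming the PyHive client (one of several Hive Thrift clients); the server host and table names are hypothetical:

```python
# Hedged sketch: an analytics query in HiveQL, which stays close to
# ANSI SQL. Assumes PyHive; server host and tables are hypothetical.
from pyhive import hive

cursor = hive.connect(host="hiveserver.example.com", port=10000).cursor()
cursor.execute(
    "SELECT country, COUNT(*) AS members "
    "FROM dw.members GROUP BY country ORDER BY members DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
```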

Pig
• ETL
• Better language constructs for ETL
• Contributions since 0.11
• Customization:
  – Integration with Metacat / Hive metastore
  – Integration with S3

Presto
• Interactive data exploration and experimentation
• Why we like Presto:
  – Integration with the Hive metastore
  – Easy integration with S3
  – Works at petabyte scale
  – ANSI SQL for usability
  – Fast
• Our contributions:
  – S3 file system
  – Query optimizations
  – Complex types support
  – Parquet file format integration
  – Working on predicate pushdown
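For interactive exploration, the same Hive-metastore tables on S3 can be queried through Presto. A sketch assuming PyHive's presto module (Presto also speaks a plain REST protocol); the coordinator host, table, and partition column are hypothetical:

```python
# Hedged sketch: an ad-hoc Presto query over Hive-metastore tables on
# S3. Assumes PyHive; host, table, and dateint column are hypothetical.
from pyhive import presto

cursor = presto.connect(host="presto.example.com", port=8080).cursor()
# Predicate pushdown into Parquet limits how much S3 data is scanned.
cursor.execute(
    "SELECT event_type, approx_distinct(member_id) AS members "
    "FROM dw.playback_events WHERE dateint = 20150722 GROUP BY event_type"
)
print(cursor.fetchall())
```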

Parquet
• Columnar file format
• Supported across Hive, Pig, Presto, Spark
• Performance benefits across different processing engines
• Working on vectorized read, lazy load, and lazy materialization
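A short PySpark sketch of why the columnar layout pays off: once wide event rows are stored as Parquet, a query that touches two columns reads only those column chunks rather than whole rows. All paths are hypothetical:

```python
# Sketch: convert wide JSON event rows to Parquet, then run a query
# that only needs two columns. Paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

events = spark.read.json("s3a://example-bucket/raw/events/")
events.write.mode("overwrite").parquet("s3a://example-bucket/dw/events/")

# Column pruning: only `duration_sec` and `title_id` chunks are read.
(spark.read.parquet("s3a://example-bucket/dw/events/")
      .where(F.col("duration_sec") > 3600)
      .groupBy("title_id").count()
      .show())
```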

Interactive slice & dice
• Interactive dashboard for slicing and dicing
• Column-based in-memory data store for time-series data
• Serves a specific use case very well
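A toy Python sketch of the core idea behind a column-based in-memory time-series store: each metric lives in its own contiguous array, so a slice-and-dice query scans only the columns it needs. Purely illustrative; this is not the actual tool:

```python
# Toy illustration of column orientation for time-series slice & dice:
# one NumPy array per column, so a query touches only needed columns.
import numpy as np

minutes = np.arange(1440)                       # minute-of-day axis
plays = np.random.randint(0, 5000, size=1440)   # one metric column
errors = np.random.randint(0, 50, size=1440)    # another metric column

# "Slice" a time window, then aggregate a single metric ("dice"):
window = (minutes >= 600) & (minutes < 660)     # 10:00-11:00
print("plays 10-11am:", plays[window].sum())    # `errors` is never scanned
```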

Spark
• ETL, RT analytics, ML algorithms
• Why we like Spark:
  – Cohesive environment – batch and ‘stream’ processing
  – Multiple language support – Scala, Python
  – Performance benefits
  – Runs on top of YARN for multi-tenancy
  – Community momentum
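A minimal PySpark sketch of that cohesion as of the Spark 1.x era: the same transformation runs over a batch RDD and over 60-second micro-batches. The S3 path and socket source are hypothetical:

```python
# Sketch: one function, reused for batch and micro-batch 'streaming'.
# The S3 path and the socket source are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="batch-and-stream")  # on YARN: --master yarn

def count_by_type(rdd):
    return rdd.map(lambda line: (line.split(",")[0], 1)) \
              .reduceByKey(lambda a, b: a + b)

# Batch: historical events from S3.
print(count_by_type(sc.textFile("s3a://example-bucket/dw/events/")).take(10))

# 'Stream': the same logic over 60-second micro-batches.
ssc = StreamingContext(sc, 60)
stream_counts = ssc.socketTextStream("events.example.com", 9999) \
                   .transform(count_by_type)
stream_counts.pprint()
ssc.start()
ssc.awaitTermination()
```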

Evolution of Services/Tools Ecosystem

[Same services/tools diagram as above.]

Genie
• Federated execution engine
• Exposes [your fave big data engine] as a service
• Flexible data model to support future job types
• Cluster configuration management
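As a hedged sketch of "exposing an engine as a service": a job submitted over REST, with the cluster and command chosen by tags rather than hard-wired hosts. The endpoint and payload below are illustrative assumptions, not Genie's exact API; the Netflix OSS Genie docs define the real one.

```python
# Illustrative only: a Genie-style REST job submission. Endpoint,
# tags, and payload shape are assumptions, not the documented API.
import requests

job = {
    "name": "daily-aggregation",
    "clusterCriterias": [{"tags": ["type:yarn", "sched:adhoc"]}],
    "commandCriteria": ["type:hive"],  # pick the engine by tag
    "commandArgs": "-f s3://example-bucket/scripts/daily_agg.hql",
}
resp = requests.post("http://genie.example.com:7001/api/v2/jobs", json=job)
print(resp.status_code, resp.json().get("id"))
```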

Metacat
• Federated metadata catalog for the whole data platform
  – Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization and retention policy …
• Common interface for tools to interact with metadata
• To be open sourced in 2015 on Netflix OSS
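Since Metacat had not yet been open sourced at the time of the talk, the following is a purely hypothetical sketch of what a federated lookup could look like: one REST call, answered by proxying to the appropriate backing store (Hive metastore, RDS, and so on). The endpoint and response fields are invented for illustration:

```python
# Hypothetical sketch of a federated metadata lookup; the URL and
# response fields are invented, not Metacat's published API.
import requests

resp = requests.get(
    "http://metacat.example.com/mds/v1/catalog/hive/database/dw/table/playback_events"
)
table = resp.json()
# One interface surfaces both schema and business metadata
# (ownership, categorization, retention policy, ...).
print(table.get("name"), table.get("metadata", {}).get("retention_policy"))
```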


Big Data API
• Integration layer for our ecosystem of tools and services
• Python library (called Kragle)
• Building block for our ETL workflows
• Building block for the Big Data Portal
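Kragle itself is internal, so here is an entirely hypothetical sketch of the integration-layer idea: a thin Python facade that lets ETL workflows and the portal talk to one interface instead of to each engine's own client. Every name below is invented for illustration:

```python
# Hypothetical illustration of a Big Data API facade: callers pick an
# engine by name; the facade owns the per-engine client details.
import abc

class Engine(abc.ABC):
    @abc.abstractmethod
    def run(self, statement: str) -> str: ...

class HiveEngine(Engine):
    def run(self, statement: str) -> str:
        return f"hive job submitted: {statement[:40]}"

class PrestoEngine(Engine):
    def run(self, statement: str) -> str:
        return f"presto query submitted: {statement[:40]}"

class BigDataAPI:
    """Single entry point for tools, portals, and ETL workflows."""
    def __init__(self) -> None:
        self._engines = {"hive": HiveEngine(), "presto": PrestoEngine()}

    def run(self, engine: str, statement: str) -> str:
        return self._engines[engine].run(statement)

api = BigDataAPI()
print(api.run("presto", "SELECT COUNT(*) FROM dw.playback_events"))
```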

Big Data Portal
• One-stop shop for all big data related tools and services
• Built on top of the Big Data API

Open source is an integral part of our strategy to achieve scale

Big Data Processing Systems

Services/Tools Ecosystem

Why use open source?
• Collaborate with other internet-scale tech companies
• Uncharted area/scale – lock-in is not desirable
• Need the flexibility to achieve scalability
BUT…
• Lots of choices
• White-box approach

Why contribute back?
• Not IP or trade secrets
• Help shape the direction of projects
• Don’t want to fork and diverge
• Attract top talent

Why contribute our own tools?
• Share our goodness
• Set the industry standard
• Community can help evolve the tool

Is open source right for you?

Measuring big data - understanding data by usage
By Charles Smith, Netflix
Tomorrow @ 1:40-2:20pm

Eva Tse
etse@netflix.com
jobs.netflix.com
