Growing a Data Pipeline for Analytics
Roberto Vitillo, Staff Data Engineer @ Mozilla, 26th PyData London Meetup


Jan 13, 2017



Transcript
Page 1: Growing a Data Pipeline for Analytics

Growing a Data Pipeline for Analytics

Roberto Vitillo, Staff Data Engineer @ Mozilla, 26th PyData London Meetup

Page 2: Growing a Data Pipeline for Analytics
Page 3: Growing a Data Pipeline for Analytics
Page 4: Growing a Data Pipeline for Analytics

brew install apache-spark

Page 5: Growing a Data Pipeline for Analytics
Page 6: Growing a Data Pipeline for Analytics

Don’t do it yourself!

Page 7: Growing a Data Pipeline for Analytics

Input → ETL → Output

Storage

Page 8: Growing a Data Pipeline for Analytics

JSON

JSON?
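
The slide questions raw JSON without spelling out why; the closing advice to "use schemas" suggests the concern is that free-form JSON gives no guarantee about which fields exist or what types they hold. A minimal sketch of checking records against a schema follows. The field names, the `PING_SCHEMA` dictionary, and the `validate` helper are all hypothetical and are not taken from the talk.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
PING_SCHEMA = {"client_id": str, "os": str, "session_length": int}


def validate(record, schema):
    """Return a list of problems found in one decoded JSON record."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# Made-up records: the second one drops a field and sends a number as a string.
good = json.loads('{"client_id": "a1", "os": "Linux", "session_length": 42}')
bad = json.loads('{"client_id": "b2", "session_length": "42"}')

print(validate(good, PING_SCHEMA))
print(validate(bad, PING_SCHEMA))
```

Rejecting or flagging malformed records at ingestion keeps such surprises out of every downstream job, which is presumably cheaper than handling them in each analysis.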

Page 9: Growing a Data Pipeline for Analytics
Page 10: Growing a Data Pipeline for Analytics
Page 11: Growing a Data Pipeline for Analytics
Page 12: Growing a Data Pipeline for Analytics
Page 13: Growing a Data Pipeline for Analytics

JSON

Parquet

Spark, Hive, Pig …
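
To illustrate why the pipeline converts JSON into a columnar format before analysis, here is a rough standard-library sketch of the row-to-column pivot. It is not Parquet itself, which also adds typed encodings, compression, and schema metadata; the records and the `to_columns` helper are made up for illustration.

```python
import json

# Made-up records standing in for newline-delimited JSON input.
json_lines = [
    '{"client_id": "a1", "os": "Linux", "session_length": 42}',
    '{"client_id": "b2", "os": "Darwin", "session_length": 7}',
    '{"client_id": "c3", "os": "Linux", "session_length": 19}',
]


def to_columns(lines, fields):
    """Pivot row-oriented JSON records into one list per field."""
    columns = {field: [] for field in fields}
    for line in lines:
        record = json.loads(line)
        for field in fields:
            # Missing fields become None so all columns stay the same length.
            columns[field].append(record.get(field))
    return columns


columns = to_columns(json_lines, ["client_id", "os", "session_length"])

# An aggregate over one field now reads a single list
# instead of decoding every full record again.
total = sum(columns["session_length"])
print(total)  # 68
```

Once the data is laid out by column, a query that touches one field out of many can skip the rest, which is the property the later "exploit columnar storage" advice relies on.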

Page 14: Growing a Data Pipeline for Analytics

JSON

Parquet

Spark, Hive, Pig … ???

Page 15: Growing a Data Pipeline for Analytics

“The easier it is to ask questions, the more questions will be asked”

Page 16: Growing a Data Pipeline for Analytics
Page 17: Growing a Data Pipeline for Analytics

Modern SQL supports Map, Arrays & Structs
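
As a sketch of what querying nested types looks like, the Python below mimics an array unnest followed by a group-by, with a Presto-style query shown in a comment for comparison. The table name `pings`, the column names, and the sample rows are all hypothetical.

```python
from collections import Counter

# Made-up rows with an array column ("addons") and a struct-like column ("environment").
rows = [
    {"client_id": "a1", "environment": {"os": "Linux"}, "addons": ["adblock", "theme"]},
    {"client_id": "b2", "environment": {"os": "Darwin"}, "addons": ["adblock"]},
    {"client_id": "c3", "environment": {"os": "Linux"}, "addons": []},
]


def unnest(rows, array_field):
    """Yield one (row, element) pair per array element, like CROSS JOIN UNNEST."""
    for row in rows:
        for element in row[array_field]:
            yield row, element


# Roughly the question a Presto-style query would ask (hypothetical table):
#   SELECT environment.os, addon, count(*)
#   FROM pings CROSS JOIN UNNEST(addons) AS t(addon)
#   GROUP BY 1, 2
counts = Counter(
    (row["environment"]["os"], addon) for row, addon in unnest(rows, "addons")
)
print(counts)
```

Rows with an empty array contribute nothing after the unnest, which matches the behaviour of a cross join against an empty set.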

Page 18: Growing a Data Pipeline for Analytics
Page 19: Growing a Data Pipeline for Analytics

JSON

Parquet

Spark, Hive, Pig …

Presto, Re:dash

Page 20: Growing a Data Pipeline for Analytics

TL;DR

• Don’t build your own pipeline unless you really have to

• Use schemas

• Exploit columnar storage

• Use SQL