Simplified Data and Process Scheduling in Hadoop
Page 1: Simplified Data Management And Process Scheduling in Hadoop

Simplified Data and Process Scheduling in Hadoop

Page 4: Simplified Data Management And Process Scheduling in Hadoop

Somebody Still Investigates

Do you think we will find the location and the owner of the “streams” dataset today?

Page 5: Simplified Data Management And Process Scheduling in Hadoop

STREAMS{trackId:long, userId:long, ts:timestamp, ...}

hdfs://data/core/streams

avro

etl

official=>true, frequency=>hourly

"UserId started to stream trackId at time ts"

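The values on this slide can be read as a single catalog record for the dataset. The sketch below is a minimal illustration in plain Python, not the HCatalog or Falcon API. The class and its field names are hypothetical labels for the unlabelled values on the slide, and reading "etl" as the owner is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; field names are assumptions, not a real API."""
    name: str
    schema: dict          # column name -> type, as listed on the slide
    location: str
    data_format: str
    owner: str            # assumption: "etl" on the slide is the owning team
    properties: dict = field(default_factory=dict)
    description: str = ""


# Values copied from the slide; the elided columns ("...") are left out.
STREAMS = DatasetRecord(
    name="streams",
    schema={"trackId": "long", "userId": "long", "ts": "timestamp"},
    location="hdfs://data/core/streams",
    data_format="avro",
    owner="etl",
    properties={"official": "true", "frequency": "hourly"},
    description="UserId started to stream trackId at time ts",
)

CATALOG = {STREAMS.name: STREAMS}


def find_location_and_owner(dataset_name: str) -> tuple:
    """Look up where a dataset lives and who owns it."""
    record = CATALOG[dataset_name]
    return record.location, record.owner
```

With such a record in place, finding the location and owner of "streams" becomes one lookup: find_location_and_owner("streams").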
Page 7: Simplified Data Management And Process Scheduling in Hadoop

users = LOAD 'data.user' USING HCatLoader();

val users = hiveContext.hql(
  "FROM data.user SELECT name, country"
)

Non-HCatalog way in Pig:

users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();

ID  NAME  COUNTRY  GENDER
1   JOSH  US       M
2   ADAM  PL       M

Page 8: Simplified Data Management And Process Scheduling in Hadoop

[FALCON-790]

Page 9: Simplified Data Management And Process Scheduling in Hadoop

[FALCON-790]

Email

Page 10: Simplified Data Management And Process Scheduling in Hadoop

HDFS

HDFS

Page 11: Simplified Data Management And Process Scheduling in Hadoop

[FALCON-790]

Page 12: Simplified Data Management And Process Scheduling in Hadoop
Page 13: Simplified Data Management And Process Scheduling in Hadoop
Page 14: Simplified Data Management And Process Scheduling in Hadoop
Page 15: Simplified Data Management And Process Scheduling in Hadoop

Switching to ORC requires reimplementing the reader code in hundreds of production jobs...

Page 16: Simplified Data Management And Process Scheduling in Hadoop

users = LOAD 'data.users' USING HCatLoader();

ORC
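Why the one-line load keeps working after a format change can be sketched roughly as follows. This is a toy illustration in plain Python, not HCatalog code. The CATALOG dictionary, the reader functions, the load helper, and the storage location are all hypothetical, and the readers return placeholder strings instead of parsing files.

```python
# Toy catalog: table name -> where the data lives and how it is stored.
# All names and the location are hypothetical; a real job would ask HCatalog.
CATALOG = {
    "data.users": {"location": "/data/core/users", "format": "avro"},
}


def read_avro(location):
    return f"avro rows from {location}"  # placeholder, no real parsing


def read_orc(location):
    return f"orc rows from {location}"   # placeholder, no real parsing


READERS = {"avro": read_avro, "orc": read_orc}


def load(table):
    """Jobs name the table only; format and path come from the catalog."""
    entry = CATALOG[table]
    return READERS[entry["format"]](entry["location"])


before = load("data.users")

# Switching to ORC is a single metadata change; the job code is untouched.
CATALOG["data.users"]["format"] = "orc"
after = load("data.users")
```

The design point is that the format is looked up at load time, so a job that names only the table never hard-codes a path or a storage class.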

Page 17: Simplified Data Management And Process Scheduling in Hadoop

The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!

Page 18: Simplified Data Management And Process Scheduling in Hadoop

Raw Data -> Cleansed Data -> Conformed Data -> Presented Data

Raw Data -> Presented Data
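The stages named on this slide (raw, cleansed, conformed, presented) form a dependency chain, and a scheduler has to run them in an order that respects it. The sketch below shows that idea with Python's standard graphlib module. It is only an illustration under assumptions: the linear chain is read off the slide, the function names are hypothetical, and Falcon itself describes pipelines through its own feed and process entities, which are not shown here.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on. The linear chain is an
# assumption read off the slide, not taken from a real Falcon definition.
DEPENDS_ON = {
    "raw": set(),
    "cleansed": {"raw"},
    "conformed": {"cleansed"},
    "presented": {"conformed"},
}


def run_order(depends_on):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


def upstream_of(stage, depends_on):
    """Collect every stage that must finish before `stage` can run."""
    seen = set()
    stack = list(depends_on[stage])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on[current])
    return seen
```

The same dependency map also answers lineage questions: upstream_of("presented", DEPENDS_ON) lists every stage whose failure would delay the presented data.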

Page 21: Simplified Data Management And Process Scheduling in Hadoop

Which Elephant Is Yours?

A. Elephantus Dirtus

B. Elephantus Cleanus

Page 23: Simplified Data Management And Process Scheduling in Hadoop

Backup Slides

Page 24: Simplified Data Management And Process Scheduling in Hadoop

Falcon’s Adoption

■ Top Level Project since December 2014
■ 14 contributors from 3 companies
■ Originated and heavily used at inMobi
  ● 400+ pipelines and 2000+ data feeds
■ Also used at Expedia and at some undisclosed companies

Page 25: Simplified Data Management And Process Scheduling in Hadoop

Future Enhancements And Ideas

■ Improved Web UI [FALCON-790]
  ● More extensive search box, more widgets
  ● The “today morning” dashboard [FALCON-994]
  ● Re-running processes
■ Automatic discovery of datasets in HDFS and Hive
■ Streaming feeds and processes, e.g. Storm, Spark Streaming
■ Triage of data processing issues [FALCON-796]
■ HDFS snapshots
■ High availability of the Falcon server

Page 26: Simplified Data Management And Process Scheduling in Hadoop

[FALCON-790]