Data Science lifecycle with Apache Zeppelin http://zeppelin.apache.org

Jan 07, 2017

Hadoop Summit
Transcript
Page 1: Data science lifecycle with Apache Zeppelin

Data Science lifecycle with Apache Zeppelin

http://zeppelin.apache.org

Page 2: Data science lifecycle with Apache Zeppelin

Moon

Creator of Apache Zeppelin

Co-founder NFLabs

Page 3: Data science lifecycle with Apache Zeppelin

Zeppelin

2012. 12 Data analytics solution based on AMP Lab Spark/Shark

Page 4: Data science lifecycle with Apache Zeppelin

Zeppelin

2012. 12 Data analytics solution based on AMP Lab Spark/Shark

2013. 10 Opensource interactive analytics feature as ‘Zeppelin’

2013. 10 2014. 08

Page 5: Data science lifecycle with Apache Zeppelin

Zeppelin

2012. 12 Data analytics solution based on AMP Lab Spark/Shark

2013. 10 Opensource interactive analytics feature as ‘Zeppelin’

2014. 12 ASF incubation

Incubation Status http://incubator.apache.org/projects/zeppelin.html

Page 6: Data science lifecycle with Apache Zeppelin

Zeppelin

2012. 12 Data analytics solution based on AMP Lab Spark/Shark

2013. 10 Opensource interactive analytics feature as ‘Zeppelin’

2014. 12 ASF incubation

2016. 10 157 contributors worldwide, 2071 stars on the GitHub repo, 6 releases

One of the most popular projects in the ASF

Page 7: Data science lifecycle with Apache Zeppelin

Life cycle of big data: Collect, ETL / Process, Analysis, Report, Data Product

Roles: Data Engineer, Data Scientist, Business user, Customer

Page 8: Data science lifecycle with Apache Zeppelin

Zeppelin

A web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Page 9: Data science lifecycle with Apache Zeppelin

Zeppelin

JDBC, Markdown, Shell

Interpreter : pluggable layer for language / processing backend integration

20+ interpreters are supported officially

2016. 03. Interpreters in Zeppelin source tree. Does not include 3rd party interpreters

Page 10: Data science lifecycle with Apache Zeppelin

Zeppelin

Interpreter : pluggable layer for language / processing backend integration

Page 11: Data science lifecycle with Apache Zeppelin

Zeppelin

Interpreter : Easy to extend

public abstract class Interpreter {

  // Must have
  public abstract void open();
  public abstract void close();
  public abstract InterpreterResult interpret(String st, InterpreterContext context);

  // Good to have
  public abstract void cancel(InterpreterContext context);
  public abstract int getProgress(InterpreterContext context);
  public abstract List<String> completion(String buf, int cursor);

  // Advanced
  public abstract FormType getFormType();
  public abstract Scheduler getScheduler();

}
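To make the "must have" part of the contract concrete, here is a minimal sketch of a toy interpreter. It does not use the real Zeppelin classes: SketchInterpreter, UpperCaseInterpreter and InterpreterSketchDemo are hypothetical stand-ins, and plain Strings replace InterpreterResult and InterpreterContext so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-in for the Interpreter contract on the slide.
// Plain Strings are used instead of InterpreterResult / InterpreterContext.
abstract class SketchInterpreter {
  // Must have
  public abstract void open();
  public abstract void close();
  public abstract String interpret(String st);

  // Good to have: defaults are an assumption of this sketch,
  // so that a minimal backend can ignore them.
  public void cancel() {}
  public int getProgress() { return 0; }
  public List<String> completion(String buf, int cursor) { return new ArrayList<>(); }
}

// Hypothetical backend: upper-cases every paragraph it receives.
class UpperCaseInterpreter extends SketchInterpreter {
  private boolean opened = false;

  @Override public void open() { opened = true; }

  @Override public void close() { opened = false; }

  @Override public String interpret(String st) {
    if (!opened) {
      throw new IllegalStateException("interpreter is not open");
    }
    return st.toUpperCase(Locale.ROOT);
  }
}

class InterpreterSketchDemo {
  public static void main(String[] args) {
    SketchInterpreter interpreter = new UpperCaseInterpreter();
    interpreter.open();
    System.out.println(interpreter.interpret("select 1")); // prints SELECT 1
    interpreter.close();
  }
}
```

A real interpreter would extend Zeppelin's own Interpreter class and return an InterpreterResult; the sketch only shows the open / interpret / close life cycle.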

Page 12: Data science lifecycle with Apache Zeppelin

Zeppelin

Notebook Repo : pluggable layer for notebook persistence

5+ Notebook repos are supported officially

2016. 03. Notebook repos in Zeppelin source tree. Does not include 3rd party notebook repos

ZeppelinHub

Page 13: Data science lifecycle with Apache Zeppelin

Zeppelin

Notebook Repo : Easy to extend

public interface NotebookRepo {

  public List<NoteInfo> list() throws IOException;
  public Note get(String noteId) throws IOException;
  public void save(Note note) throws IOException;
  public void remove(String noteId) throws IOException;
  public void checkpoint(String noteId, String checkPointName) throws IOException;
  public void close();

}
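As an illustration of how small a persistence backend can be, the following sketch keeps notes in memory. It is not the Zeppelin API: SketchNote, SketchNotebookRepo and InMemoryNotebookRepo are hypothetical stand-ins, the checked IOException is dropped, and checkpoint() and close() are left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Zeppelin's Note: just an id and some text.
final class SketchNote {
  final String id;
  final String text;

  SketchNote(String id, String text) {
    this.id = id;
    this.text = text;
  }
}

// Hypothetical, trimmed-down version of the NotebookRepo contract.
interface SketchNotebookRepo {
  List<String> list();
  SketchNote get(String noteId);
  void save(SketchNote note);
  void remove(String noteId);
}

// Keeps notes in a map; a real repo would write to a file system, S3, Git, ...
class InMemoryNotebookRepo implements SketchNotebookRepo {
  private final Map<String, SketchNote> notes = new LinkedHashMap<>();

  @Override public List<String> list() { return new ArrayList<>(notes.keySet()); }

  @Override public SketchNote get(String noteId) { return notes.get(noteId); }

  @Override public void save(SketchNote note) { notes.put(note.id, note); }

  @Override public void remove(String noteId) { notes.remove(noteId); }
}

class NotebookRepoSketchDemo {
  public static void main(String[] args) {
    SketchNotebookRepo repo = new InMemoryNotebookRepo();
    repo.save(new SketchNote("noteA", "%md hello"));
    repo.save(new SketchNote("noteB", "%sql select 1"));
    repo.remove("noteA");
    System.out.println(repo.list()); // prints [noteB]
  }
}
```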

Page 14: Data science lifecycle with Apache Zeppelin

Zeppelin

Visualizations : 6 built-in visualizations come with pivot

Table, Bar, Pie, Area, Line, Scatter

Free to draw any customized visualization inside a notebook

Page 15: Data science lifecycle with Apache Zeppelin

Helium

Platform for data analytics application that makes visualization pluggable and more.

Umbrella issue http://issues.apache.org/jira/browse/ZEPPELIN-533

Proposal https://cwiki.apache.org/confluence/display/ZEPPELIN/Helium+proposal

Makes Zeppelin fly!

Page 16: Data science lifecycle with Apache Zeppelin

Helium

RESTful API, Websocket

Interpreter: Spark, Flink, Geode, JDBC, …

Notebook Storage: File System, Amazon S3, Git, …

ZeppelinServer

Interpreters and Notebook storage are pluggable

Page 17: Data science lifecycle with Apache Zeppelin

Helium

Interpreter: Spark, Flink, Geode, JDBC, …

Notebook Storage: File System, Amazon S3, Git, …

ZeppelinServer

Visualizations: Map, WordCloud

We want visualizations to be pluggable

Page 18: Data science lifecycle with Apache Zeppelin

Helium

Interpreter: Spark, Flink, Geode, JDBC, …

Notebook Storage: File System, Amazon S3, Git, …

Application: Visualizations (Map, WordCloud), Analytics, …

Resource Pool: SparkContext, Flink Environment, JDBC connection, …, User object

Extend pluggable visualization to pluggable analytics application

Page 19: Data science lifecycle with Apache Zeppelin

Helium Application: Easy to extend

public abstract class Application {

  private final ApplicationContext context;

  public Application(ApplicationContext context) {
    this.context = context;
  }

  public abstract void run(ResourceSet args);

  public abstract void unload();

}

Helium

Page 20: Data science lifecycle with Apache Zeppelin

Launcher: Suggest application according to data type in ResourcePool

Helium
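The launcher idea can be sketched as a lookup from the types found in the resource pool to the applications that can consume them. This is only an illustration under stated assumptions: SketchLauncher and its register() and suggest() methods are hypothetical names, the resource pool is modelled as a plain Map, and matching by Java class is an assumption, not the Helium implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical launcher: suggests applications whose input type matches
// at least one object currently held in the resource pool.
class SketchLauncher {
  // application name -> type of resource the application can consume
  private final Map<String, Class<?>> applications = new LinkedHashMap<>();

  void register(String appName, Class<?> consumes) {
    applications.put(appName, consumes);
  }

  List<String> suggest(Map<String, Object> resourcePool) {
    List<String> suggestions = new ArrayList<>();
    for (Map.Entry<String, Class<?>> app : applications.entrySet()) {
      for (Object resource : resourcePool.values()) {
        if (app.getValue().isInstance(resource)) {
          suggestions.add(app.getKey());
          break; // one matching resource is enough to suggest the application
        }
      }
    }
    return suggestions;
  }
}

class LauncherSketchDemo {
  public static void main(String[] args) {
    SketchLauncher launcher = new SketchLauncher();
    launcher.register("WordCloud", String.class);
    launcher.register("Histogram", double[].class);

    // Stand-in for the ResourcePool: resource name -> object.
    Map<String, Object> pool = new HashMap<>();
    pool.put("text", "hello zeppelin");

    System.out.println(launcher.suggest(pool)); // prints [WordCloud]
  }
}
```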

Page 21: Data science lifecycle with Apache Zeppelin

& Enterprise

Page 22: Data science lifecycle with Apache Zeppelin

Jongyoul Lee

PMC of Apache Zeppelin

Software Development Engineer at NFLabs

Page 23: Data science lifecycle with Apache Zeppelin

& Enterprise

More than 1000 employees use Apache Zeppelin

Supports Apache Zeppelin as an internal service

Recommendation team uses Apache Zeppelin

Monitors their infrastructures via Apache Zeppelin

Page 24: Data science lifecycle with Apache Zeppelin

& Enterprise

Page 25: Data science lifecycle with Apache Zeppelin

& Enterprise History

~ 0.6
• NOTHING!!!

0.6.x
• Authentication & Authorization
• Note level permission
• Note level isolation
• Partially supported by Livy

Page 26: Data science lifecycle with Apache Zeppelin

& Enterprise Future

0.7.0
• Enterprise Support
  • Multi users environment
  • Impersonation on Spark/JDBC interpreter
  • Job management
• Interpreter
  • Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization

Page 27: Data science lifecycle with Apache Zeppelin

& Enterprise Future

0.7.0
• Enterprise Support
  • Multi users environment
  • Impersonation on Spark/JDBC interpreter
  • Job management
• Interpreter
  • Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization

Page 28: Data science lifecycle with Apache Zeppelin

& Enterprise

RESTful API, Websocket

Interpreter: Spark, Flink, Geode, JDBC, …

Notebook Storage: File System, Amazon S3, Git, …

ZeppelinServer

Multi-tenancy

Page 29: Data science lifecycle with Apache Zeppelin

& Enterprise

RESTful API, Websocket

Interpreter: Spark, Flink, Geode, JDBC, …

Notebook Storage: File System, Amazon S3, Git, …

ZeppelinServer: NO USER

Multi-tenancy

Page 30: Data science lifecycle with Apache Zeppelin

Shared, Isolated, Scoped

Page 31: Data science lifecycle with Apache Zeppelin

& Enterprise

ZeppelinServer

SparkInterpreter

Run P1 on NoteA

Run SparkInterpreter for P1

User1

Multi-tenancy

Page 32: Data science lifecycle with Apache Zeppelin

& Enterprise

ZeppelinServer

SparkInterpreter

Run P1 on NoteA

Run SparkInterpreter for P1

User1

User2

Run P2 on NoteB

Run SparkInterpreter for P2

Multi-tenancy

Page 33: Data science lifecycle with Apache Zeppelin

& Enterprise

• Originally implemented
• Pros
  • Simple structure
  • Predictable behavior
• Cons
  • All resources shared
  • Interference among users

Multi-tenancy

Shared

Page 34: Data science lifecycle with Apache Zeppelin

& Enterprise

ZeppelinServer

SparkInterpreter

Run P1 on NoteA

Run SparkInterpreter for P1

User1

User2

Run P2 on NoteB

Run SparkInterpreter for P2

SparkInterpreter

Multi-tenancy

Page 35: Data science lifecycle with Apache Zeppelin

& Enterprise

• Pros
  • No pending
  • No resources shared
• Cons
  • Lots of memory
  • Inefficiency of using memory
  • Limited by resources

Multi-tenancy

Isolated

Page 36: Data science lifecycle with Apache Zeppelin

& Enterprise

ZeppelinServer

SparkInterpreter

Run P1 on NoteA

Run SparkInterpreter for P1

User1

User2

Run P2 on NoteB

Run SparkInterpreter for P2

SparkInterpreter

Multi-tenancy

Page 37: Data science lifecycle with Apache Zeppelin

& Enterprise

ZeppelinServer

JDBCInterpreter

Run P2 on NoteA

Run SparkInterpreter for P2

User1

User2

Run P3 on NoteB

Run SparkInterpreter for P3

JDBCInterpreter

Multi-tenancy

Page 38: Data science lifecycle with Apache Zeppelin

& Enterprise

ZeppelinServer

JDBCInterpreter

Run P2 on NoteA

Run SparkInterpreter for P2

User1

User2

Run P3 on NoteB

Run SparkInterpreter for P3

Multi-tenancy

JDBCInstance User1

JDBCInstance User2

Page 39: Data science lifecycle with Apache Zeppelin

& Enterprise

• Pros
  • Less memory
  • Some resources Isolated
• Cons
  • Some resources shared
  • Big single process

Multi-tenancy

Scoped
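The difference between the three modes can be summarised by how many interpreter processes and interpreter instances end up serving a set of users. The sketch below is a simplified reading of the slides, not Zeppelin code: TenancyMode and TenancySketch are hypothetical names, and keying by user alone (rather than by note or by user and note) is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

enum TenancyMode { SHARED, SCOPED, ISOLATED }

class TenancySketch {
  // Isolated: every user gets a separate interpreter process.
  // Shared and scoped: all users run inside one process.
  static String processKey(TenancyMode mode, String user) {
    return mode == TenancyMode.ISOLATED ? "process-" + user : "process-shared";
  }

  // Shared: one interpreter instance for everybody.
  // Scoped and isolated: one interpreter instance per user.
  static String instanceKey(TenancyMode mode, String user) {
    return mode == TenancyMode.SHARED ? "instance-shared" : "instance-" + user;
  }

  static int countProcesses(TenancyMode mode, String... users) {
    Set<String> keys = new LinkedHashSet<>();
    for (String user : users) {
      keys.add(processKey(mode, user));
    }
    return keys.size();
  }

  static int countInstances(TenancyMode mode, String... users) {
    Set<String> keys = new LinkedHashSet<>();
    for (String user : users) {
      keys.add(instanceKey(mode, user));
    }
    return keys.size();
  }

  public static void main(String[] args) {
    for (TenancyMode mode : TenancyMode.values()) {
      System.out.println(mode
          + ": processes=" + countProcesses(mode, "User1", "User2")
          + ", instances=" + countInstances(mode, "User1", "User2"));
    }
  }
}
```

With two users this gives one process and one instance for shared, one process and two instances for scoped, and two processes and two instances for isolated, which matches the memory trade-offs listed on the slides.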

Page 40: Data science lifecycle with Apache Zeppelin

& Enterprise Future

0.7.0
• Enterprise Support
  • Multi users environment
  • Impersonation on Spark/JDBC interpreter
  • Job management
• Interpreter
  • Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization

Page 41: Data science lifecycle with Apache Zeppelin

& Enterprise Impersonation

Page 42: Data science lifecycle with Apache Zeppelin

What if all users use different credentials?

Page 43: Data science lifecycle with Apache Zeppelin

& Enterprise Impersonation

Page 44: Data science lifecycle with Apache Zeppelin

& Enterprise Impersonation

Credentials

• Already merged by Twitter in Mar. 2016

• Never used in any interpreter

Page 45: Data science lifecycle with Apache Zeppelin

& Enterprise Impersonation

Page 46: Data science lifecycle with Apache Zeppelin

• JDBC

• Set user and password in properties

• https://issues.apache.org/jira/browse/ZEPPELIN-1567

• Spark

• Adopt ugi.doAs()

• https://issues.apache.org/jira/browse/ZEPPELIN-1572

& Enterprise Impersonation
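For the JDBC case, "set user and password in properties" can be sketched with the standard JDBC connection properties. This is an illustration only: JdbcImpersonationSketch and connectionProperties() are hypothetical names, the URL and credentials are placeholders, and how Zeppelin looks up the per-user credentials is not shown.

```java
import java.util.Properties;

class JdbcImpersonationSketch {
  // Copies the interpreter-wide settings and overrides the credentials
  // with those of the user who runs the paragraph.
  // "user" and "password" are the standard JDBC connection property names.
  static Properties connectionProperties(Properties interpreterDefaults,
                                         String user, String password) {
    Properties props = new Properties();
    props.putAll(interpreterDefaults);
    props.setProperty("user", user);
    props.setProperty("password", password);
    return props;
  }

  public static void main(String[] args) {
    Properties defaults = new Properties();
    defaults.setProperty("url", "jdbc:postgresql://localhost:5432/demo"); // placeholder URL
    defaults.setProperty("user", "zeppelin");                             // shared default account

    Properties forUser1 = connectionProperties(defaults, "user1", "secret1");
    System.out.println(forUser1.getProperty("user")); // prints user1

    // With a JDBC driver on the classpath, the connection would be opened with
    // java.sql.DriverManager.getConnection(url, forUser1).
  }
}
```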

Page 47: Data science lifecycle with Apache Zeppelin

& Enterprise Future

0.7.0
• Enterprise Support
  • Multi users environment
  • Impersonation on Spark/JDBC interpreter
  • Job management
• Interpreter
  • Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization

Page 48: Data science lifecycle with Apache Zeppelin

& Enterprise Job mgmt

Page 49: Data science lifecycle with Apache Zeppelin

& Enterprise Future

0.7.0
• Enterprise Support
  • Multi users environment
  • Impersonation on Spark/JDBC interpreter
  • Job management
• Interpreter
  • Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization

Page 50: Data science lifecycle with Apache Zeppelin

• JDBC

• Connection pool

• Stabilization for BI

• Python

• Matplotlib

• Support for Python users

& Enterprise Interpreters

Page 51: Data science lifecycle with Apache Zeppelin

& Enterprise Future

0.7.0
• Enterprise Support
  • Multi users environment
  • Impersonation on Spark/JDBC interpreter
  • Job management
• Interpreter
  • Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization

Page 52: Data science lifecycle with Apache Zeppelin

• Frontend

• Fine-grained WebSocket broadcast

• Better DOM rendering

• Pluggable visualization

• Helium

& Enterprise Frontend

Page 53: Data science lifecycle with Apache Zeppelin

Zeppelin

Homepage http://zeppelin.apache.org/

Mailing list [email protected] [email protected]

Issue tracker https://issues.apache.org/jira/browse/ZEPPELIN

Github repository http://github.com/apache/zeppelin

Join the community

Page 54: Data science lifecycle with Apache Zeppelin

Thank you

Moon soo Lee [email protected]

https://twitter.com/issuefreaks

Jongyoul Lee [email protected]

https://twitter.com/madeng