
Cooperative Data Exploration with IPython Notebook

Apr 15, 2017

Transcript
Page 1: Cooperative Data Exploration with IPython Notebook

Piotr Lusakowski

Cooperative Data Exploration with IPython Notebook

Page 2: Cooperative Data Exploration with IPython Notebook

Motivation

● Big Data computations require lots of resources

○ CPU
○ RAM

● Sharing the results is difficult in most current setups

○ Precomputed datasets
○ Trained models
○ Insights

Page 3: Cooperative Data Exploration with iPython Notebook

Solution Created for the Seahorse 1.0 release

● Single Spark application as the backend○ Results of other team members easily accessible in-memory○ No unnecessary duplication of data

● Multiple IPython Notebooks as clients

2

Page 4: Cooperative Data Exploration with IPython Notebook

Challenges

● How to use the SparkContext and SQLContext of an application running on a cluster?

● How to execute Python code on the cluster?

Page 5: Cooperative Data Exploration with IPython Notebook

Py4J

A library for Python-Java communication

● “Wraps” JVM-based objects

● Exposes their API in Python

● Internally, uses a custom TCP client/server communication

● In the JVM: a Gateway Server

● On the Python side: a client called the Java Gateway
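To make this concrete, here is a minimal Py4J sketch (not from the talk): a Python process connects to a Gateway Server already running in a JVM and then calls Java objects as if they were local.

    # Minimal Py4J sketch, assuming a GatewayServer is already
    # listening in the JVM on Py4J's default port (25333).
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()                  # connect to the JVM's Gateway Server
    random = gateway.jvm.java.util.Random()  # instantiate a Java object
    print(random.nextInt(100))               # call its method from Python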

Page 6: Cooperative Data Exploration with IPython Notebook

Using an Existing SparkContext

● The Spark application exposes its SparkContext and SQLContext

○ It’s actually quite easy, once you know what you’re doing

● The Notebook connects to the Spark application via Py4J on startup

○ The sc and sqlContext variables are added to the user’s environment
○ This setup is completely transparent to the user
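A hedged sketch of what the notebook-side setup could look like follows; the entry-point method getJavaSparkContext is an assumption for illustration, not the talk's actual API. The key idea is to wrap the JVM's existing contexts in PySpark's Python classes rather than starting new ones.

    # Hedged sketch: attach a notebook to a SparkContext that already
    # lives in a remote JVM. getJavaSparkContext() is a hypothetical
    # Py4J entry-point method, not Seahorse's real API.
    from py4j.java_gateway import JavaGateway
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    gateway = JavaGateway()                          # the app's Gateway Server
    jsc = gateway.entry_point.getJavaSparkContext()  # JVM JavaSparkContext (assumed)
    conf = SparkConf(_jvm=gateway.jvm, _jconf=jsc.getConf())

    # Wrap the existing JVM contexts instead of creating new ones.
    sc = SparkContext(gateway=gateway, jsc=jsc, conf=conf)
    sqlContext = SQLContext(sc)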

Page 7: Cooperative Data Exploration with IPython Notebook

Notebook Architecture Overview

● User’s code is executed by kernels: processes spawned by the Notebook Server

● Kernels execute user’s code on the Notebook Server host
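For reference, here is a minimal illustration of that standard architecture using the stock jupyter_client API (not from the talk): the server spawns a kernel process and talks to it over messaging channels.

    # Standard Jupyter machinery: spawn a kernel process and send it
    # a code cell over its ZeroMQ channels.
    from jupyter_client import KernelManager

    km = KernelManager(kernel_name="python3")
    km.start_kernel()                 # spawn the kernel process locally
    client = km.client()
    client.start_channels()           # open the messaging channels
    msg_id = client.execute("1 + 1")  # send a code cell for execution

This local spawning is exactly the assumption the next slides break: the code must run on the Spark driver, not on the Notebook Server host.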

Page 8: Cooperative Data Exploration with IPython Notebook

Requirements

● User’s code is executed on the Spark driver

● No assumptions about the driver being visible from the Notebook Server

Page 9: Cooperative Data Exploration with IPython Notebook

Custom Kernel

● Forwarding Kernel

● Executing Kernel

● Message Queue
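The slide names only the components; a hedged sketch of the forwarding idea follows (the queue technology and all names are assumptions, not the talk's implementation). Because both kernels connect out to the queue, the Spark driver never needs to be directly reachable from the Notebook Server.

    # Hedged sketch of the Forwarding Kernel idea. RabbitMQ (pika) and
    # every name below are assumptions; the talk does not specify them.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("mq-host"))
    channel = connection.channel()
    channel.queue_declare(queue="to_executing_kernel")

    def forward(code):
        # Relay a code cell to the Executing Kernel on the Spark driver.
        channel.basic_publish(exchange="",
                              routing_key="to_executing_kernel",
                              body=code.encode("utf-8"))

    forward("df = sqlContext.table('something_interesting')")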

Page 10: Cooperative Data Exploration with IPython Notebook

The Interaction Between Users

● Storage object accessible via Py4J

○ Each client connected to the Spark application can reuse any entity from the storage

■ DataFrames
■ Models
■ Even code snippets

○ Access control

■ Sharing with only selected colleagues
■ Private storage

○ Notifications: “Hey, look, Susan published a new result!”
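A hedged sketch of what using such a storage object from a notebook might look like; the entry point and the publish/get method names are assumptions, not Seahorse's real API.

    # Hedged sketch of the shared storage; getStorage(), publish() and
    # get() are hypothetical names. df and sqlContext come from the
    # notebook session set up earlier.
    from py4j.java_gateway import JavaGateway
    from pyspark.sql import DataFrame

    gateway = JavaGateway()
    storage = gateway.entry_point.getStorage()      # hypothetical accessor

    # Publish the JVM DataFrame backing a PySpark DataFrame.
    storage.publish("something_interesting", df._jdf)

    # A colleague's notebook reuses it without recomputing anything.
    jdf = storage.get("something_interesting")
    shared_df = DataFrame(jdf, sqlContext)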

Page 11: Cooperative Data Exploration with IPython Notebook

Cooperative Data Exploration

● John defines a DataFrame: “Something Interesting”

● Alex explores it

● Susan bases her models on it

● John uses a model shared by Susan
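Continuing the hypothetical storage API from the previous sketch, the scenario could read roughly like this across the team's notebooks:

    # John publishes his DataFrame for the team.
    storage.publish("something_interesting", df._jdf)

    # Susan trains a model on it and shares the result.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import DataFrame
    training = DataFrame(storage.get("something_interesting"), sqlContext)
    model = LogisticRegression(maxIter=10).fit(training)
    storage.publish("susans_model", model._java_obj)  # share the JVM model

    # John fetches Susan's model; wrapping the JVM object back into a
    # Python model mirrors the DataFrame case above.
    jmodel = storage.get("susans_model")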

Page 12: Cooperative Data Exploration with IPython Notebook

Thank you!

Piotr Lusakowski
Senior Software Engineer

[email protected]