@MargrietGr Margriet Groenendijk, PhD Developer Advocate for IBM Cloud Data Services O’Reilly Software Architecture Conference San Francisco 16 November 2016 Cloud Architectures for Data Science
Cloud architectures for data science

Apr 16, 2017

Transcript
Page 1: Cloud architectures for data science

@MargrietGr

Margriet Groenendijk, PhD
Developer Advocate for IBM Cloud Data Services

O’Reilly Software Architecture Conference
San Francisco
16 November 2016

Cloud Architectures for Data Science

Page 2: Cloud architectures for data science

About me

• Developer Advocate at IBM Cloud Data Services, UK
  • Data science
  • Python, Spark, R, Cloudant, dashDB

• Research Fellow at University of Exeter, UK
  • Worked with very large observational datasets and the output of global scale climate models

• PhD at Vrije Universiteit Amsterdam, the Netherlands
  • Explored large observational datasets of carbon uptake by forests

Page 3: Cloud architectures for data science

A Brief History of Data Science

• Computer Science
• Data Technology
• Visualization
• Mathematics
• Statistics

http://www.datascienceassn.org/content/history-data-science

Page 4: Cloud architectures for data science

1781

http://visual.ly/exports-and-imports-scotland

Page 5: Cloud architectures for data science

1821

https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png

Page 6: Cloud architectures for data science

1855

http://visual.ly/diagram-causes-mortality-army-east

Page 7: Cloud architectures for data science

1960s

http://www.computerhistory.org/collections/catalog/102630767

Page 8: Cloud architectures for data science

1960s

http://www.climatecentral.org/news/first-climate-model-video-19007

Page 9: Cloud architectures for data science

2016

Page 10: Cloud architectures for data science

2016

Page 11: Cloud architectures for data science

https://blog.rjmetrics.com/2015/10/05/how-many-data-scientists-are-there/

How many Data Scientists are there?

Page 12: Cloud architectures for data science

https://whatsthebigdata.com/2015/11/08/top-skills-and-backgrounds-of-data-scientists-on-linkedin/

Page 13: Cloud architectures for data science

https://whatsthebigdata.com/2015/11/08/top-skills-and-backgrounds-of-data-scientists-on-linkedin/

Page 14: Cloud architectures for data science

Toolbox

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

Page 15: Cloud architectures for data science

Data Engineers

Data Scientists

Business Analysts

App Developers

Data Science is a Team Effort

Data

Page 16: Cloud architectures for data science

Page 17: Cloud architectures for data science

Data Science Workflow

Page 18: Cloud architectures for data science

Discover Data

Use Data

Publish Data

Socialize Data

Data Science Workflow

Page 19: Cloud architectures for data science

Data Science Workflow

Define Question

Find Data

Explore Data

Clean Data

Visualize and Summarize Data

Create Predictive Models

Present Results

Page 20: Cloud architectures for data science

Collect Data

APIs

Open Data

Maps

Web Scraping

Time Series

Page 21: Cloud architectures for data science

Store Data

Object Store - binary files

Relational database

Document store - JSON

Page 22: Cloud architectures for data science

Explore Data

Page 23: Cloud architectures for data science

Explore Data

Clean Data

Store Data

Page 24: Cloud architectures for data science

Spark on a Cluster

Page 25: Cloud architectures for data science

The Spark Stack

from Karau et al.: Learning Spark

Page 26: Cloud architectures for data science

RDDs: Resilient Distributed Datasets

• Data does not have to fit on a single machine
• Data is separated into partitions

• Creation of RDDs
  • Load an external dataset
  • Distribute a collection of objects

• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
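
The difference between lazy transformations and actions can be sketched in plain Python. This is a toy stand-in, not Spark itself: the `ToyRDD` class and its methods are hypothetical names written for illustration, loosely mirroring the `parallelize`, `map`, `filter`, `collect` and `count` calls of the real RDD API.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus a chain of pending transformations."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per (imaginary) machine
        self.ops = ops                # transformations recorded but not yet applied

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        # Distribute a collection of objects round-robin over the partitions.
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformation: returns a new ToyRDD, computes nothing.
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new ToyRDD, computes nothing.
        return ToyRDD(self.partitions, self.ops + (("filter", pred),))

    def _run(self, partition):
        # Apply the recorded transformations to one partition.
        items = iter(partition)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: triggers the computation and gathers all results.
        return [x for p in self.partitions for x in self._run(p)]

    def count(self):
        # Action: triggers the computation and counts the results.
        return sum(1 for p in self.partitions for _ in self._run(p))


calls = []  # records when the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

rdd = ToyRDD.parallelize(range(6))
even_squares = rdd.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)  # still empty: transformations are lazy
result = even_squares.collect()    # the action runs the whole chain
print(calls_before_action, result)
```

In real Spark the partitions live on different machines and the recorded chain is shipped to them; the toy version only shows why nothing runs until an action is called.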

Page 27: Cloud architectures for data science

Run Spark locally in a Python notebook

https://www.continuum.io/downloads

http://spark.apache.org/downloads.html

Create a new kernel to use in a Jupyter notebook

Page 28: Cloud architectures for data science

Jupyter Notebooks!

• Server-client application to edit and run notebook documents via a web browser

• Cells with:
  • Code
  • Figures and tables
  • Rich text elements

• Different kernels: Python, R, Scala, Spark

In the Cloud:

Page 29: Cloud architectures for data science

http://datascience.ibm.com/

Page 30: Cloud architectures for data science

Page 31: Cloud architectures for data science

Page 32: Cloud architectures for data science

Page 33: Cloud architectures for data science

Weather Data

Page 34: Cloud architectures for data science

Define Question

What will the weather be next weekend?

https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A

Page 35: Cloud architectures for data science

Find Data

https://console.ng.bluemix.net/

Page 36: Cloud architectures for data science

Explore Data

Python packages

• requests and json
  • API credentials and latitude/longitude of San Francisco
  • JSON data returned

• pandas, numpy and datetime
  • convert JSON to pandas DataFrame (table with multiple indices)
  • add time as index
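
A minimal sketch of the "JSON in, time-indexed table out" step, using only the standard library so that it runs anywhere. The payload and its field names (`forecasts`, `time`, `temp`, `pop`) are made up for illustration; the real API response has its own structure, and the values are not real forecasts.

```python
import json
from datetime import datetime

# Hypothetical payload standing in for the JSON returned by the weather API.
raw = """{"forecasts": [
  {"time": "2016-11-16T09:00:00", "temp": 14, "pop": 10},
  {"time": "2016-11-16T08:00:00", "temp": 13, "pop": 20},
  {"time": "2016-11-16T10:00:00", "temp": 16, "pop": 5}
]}"""

def index_by_time(payload):
    """Parse the JSON text and return the rows keyed by timestamp, in time order."""
    records = json.loads(payload)["forecasts"]
    table = {}
    for rec in records:
        stamp = datetime.strptime(rec["time"], "%Y-%m-%dT%H:%M:%S")
        # Everything except the timestamp becomes the row for that time.
        table[stamp] = {k: v for k, v in rec.items() if k != "time"}
    return dict(sorted(table.items()))

forecast = index_by_time(raw)
for stamp, row in forecast.items():
    print(stamp.isoformat(), row)
```

In the notebook itself, pandas would do the same job: build a DataFrame from the list of records, convert the time column with `pd.to_datetime`, and call `set_index` on it.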

Page 37: Cloud architectures for data science

Weather forecast for San Francisco

https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Visualize Data

Python packages

• pandas - rolling mean
• matplotlib
• Basemap
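
What a rolling mean does to a forecast series can be shown without any packages. The sketch below is a plain-Python illustration, not the pandas implementation; the function name `rolling_mean` and the temperatures are made up. It follows the pandas default of giving no value until the window is full.

```python
from collections import deque

def rolling_mean(values, window):
    """Mean of the trailing `window` values; None until the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to drop out of the window
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out

temps = [12, 14, 13, 17, 15, 16]  # made-up hourly temperatures
smoothed = rolling_mean(temps, window=3)
print(smoothed)
```

On a pandas Series the equivalent call is `series.rolling(3).mean()`, which returns NaN where this sketch returns None.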

Page 38: Cloud architectures for data science

Weather map - example for UK

https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Python packages

• matplotlib
• Basemap
• itertools
• urllib

Page 39: Cloud architectures for data science

Run as a daily cron job

Cloudant

Page 40: Cloud architectures for data science

Page 41: Cloud architectures for data science

Page 42: Cloud architectures for data science

Weather, Twitter and Sentiment

Page 43: Cloud architectures for data science

Weather, Twitter and Sentiment

• Where to find the data?
• Where to store the data?
• Where to analyse the data?

• Quick tools to explore

Page 44: Cloud architectures for data science

Insights for Twitter

Page 45: Cloud architectures for data science

Add sentiment - example

Page 46: Cloud architectures for data science

• Watson Tone Analyzer
  • Emotion
  • Language style
  • Social propensities

Analyze how you are coming across to others

Page 47: Cloud architectures for data science

Simpler Workflow

Weather Company Data

crontab -e

0 23 * * * /path/to/file/do_something.sh

python do_something.py

Tweets

Weather

Sentiment

Watson Tone Analyzer

Insights for Twitter

Cloudant NoSQL
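
The crontab entry above runs the shell script every day at 23:00, and the script in turn calls `do_something.py`. The slides do not show what that script contains, so the following is only a guess at its shape: it gathers the day's weather, tweets and sentiment into one JSON document of the kind a document store like Cloudant holds. The three `fetch_*`/`score_*` functions are hypothetical stubs with made-up return values, and the upload step is left out.

```python
import json
from datetime import date

def fetch_weather(day):
    """Hypothetical stub for the Weather Company Data call (made-up values)."""
    return {"temp_max": 17, "precip_mm": 0.0}

def fetch_tweets(day):
    """Hypothetical stub for the Insights for Twitter query (made-up tweets)."""
    return ["Lovely sunny day in SF", "Fog again this morning"]

def score_sentiment(text):
    """Hypothetical stub for the Watson Tone Analyzer call (made-up score)."""
    return {"joy": 0.5}

def build_daily_document(day):
    """Combine the three sources into one JSON-serializable document."""
    tweets = fetch_tweets(day)
    return {
        "_id": day.isoformat(),  # one document per day, keyed by date
        "weather": fetch_weather(day),
        "tweets": [{"text": t, "tone": score_sentiment(t)} for t in tweets],
    }

doc = build_daily_document(date(2016, 11, 16))
print(json.dumps(doc, indent=2))
```

Using the date as the document `_id` means a rerun for the same day addresses the same document rather than creating a second one.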

Page 48: Cloud architectures for data science

PixieDust

https://github.com/ibm-cds-labs/pixiedust

Simpler Workflow

Page 49: Cloud architectures for data science

PixieDust: an Open Source Library that simplifies and improves Jupyter Python Notebooks

• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps

https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/

@DTAIEB55

Page 50: Cloud architectures for data science

Install Spark packages or plain jars in your Notebook Python kernel without needing to modify a configuration file

Uses the GraphFrame Python APIs

Install GraphFrames Spark Package

Page 51: Cloud architectures for data science

One simple API: display()

Call the Options dialog

Panning/Zooming options

Performance statistics

Page 52: Cloud architectures for data science

Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage

Page 53: Cloud architectures for data science

Scala Bridge

Define a Python variable

Use the Python var in Scala

Define a Scala variable

Use the Scala var in Python

Page 54: Cloud architectures for data science

Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript

Customized Visualization for GraphFrame Graphs

Page 55: Cloud architectures for data science

Encapsulate your analytics into compelling User Interfaces better suited for Line of Business Users

Page 56: Cloud architectures for data science

Page 57: Cloud architectures for data science

https://github.com/ibm-cds-labs/ibmseti/

SETI

Page 58: Cloud architectures for data science

• Mission: To explore, understand and explain the origin and nature of life in the universe

• Origins: Started in 1959 by two physicists at Cornell

• NASA became interested in 1970, started working with SETI in 1988, funding cut in 1993

SETI@IBMCloud

http://www.seti.org/node/861

Page 59: Cloud architectures for data science

• The Allen Telescope Array
  • 198 million radio events detected in the last decade
  • 400,000 candidate signals identified
  • 5 TB data generated in 10 hours

• No modern analysis or machine learning has been performed on this data
• 5 TB of special observations on IBM Object Store

SETI@IBMCloud - the Data

https://github.com/ibm-cds-labs/ibmseti/

Page 60: Cloud architectures for data science

Public Spark@SETI

4 TB of SETI Data stored in Object Storage

Web API provides Bluemix users access to download SETI data

[Diagram labels: Object Storage, Web API, Spark, Object Storage; Public Spark@SETI Bluemix Account, My Bluemix Account]

Spark using Jupyter Notebook and IBM SETI Python Library

Goal: Amateur scientists/data scientists download and analyze SETI data

Page 61: Cloud architectures for data science

IBM Watson Data Platform

• Data Science Experience
• Watson Data Platform
• Machine Learning

• Sign up for beta: http://datascience.ibm.com/features#machinelearning

Page 62: Cloud architectures for data science

Data Science in the Cloud

• Flexible and quick to iterate, play and explore data
• APIs
  • Streaming data
  • Cloud databases
  • Watson

• Scaling up - add storage or Spark kernels
• Easy collaboration and presentation
  • Store Data
  • Share your analyses in notebooks

• Some useful packages: pandas, pyspark, requests, matplotlib, cloudant
• Notebooks can be extended! PixieDust

Page 63: Cloud architectures for data science

https://developer.ibm.com/clouddataservices/author/mgroenen/

Thanks!

Slides will be here: http://www.slideshare.net/MargrietGroenendijk