Introduction to Spark
IBM Strategy for Spark

Jan 21, 2018

Mark Kerzner
Page 1: IBM Strategy for Spark

Introduction to Spark

Page 2: IBM Strategy for Spark

Introductions

Garrett Young ([email protected])

1) Introduction to Spark (10 mins)

2) IBM's Commitment to Spark (5 mins)

3) How Predictive Analytic Lifecycles Typically Work (10 mins)

4) Using Spark to Predict Hospital Readmissions (15 mins)

5) How you can get a free-trial Spark environment from IBM (5 mins)

6) Q&A (15 mins)

Page 3: IBM Strategy for Spark

What is Spark?

• In-memory data processing engine

• Open Source Apache Project

• Cluster Computing Framework

• Can use Scala, Python or R Languages

• Horizontally/Vertically Scalable

• Not a data store

Page 4: IBM Strategy for Spark

IBM | SPARK – The Analytics Operating System

“Enabling New Classes of Intelligent Applications Embedded with Analytics”

• Spark unifies data, enabling real-time insights

• Spark processes and analyzes data from any data source

• Spark is complementary to Hadoop, but faster with in-memory performance

• Build models quickly. Iterate faster. Apply intelligence.

Page 5: IBM Strategy for Spark

How Spark Works

• Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result]

Page 6: IBM Strategy for Spark

How Spark Works

• Solution: Keep more data in-memory with a new distributed execution engine

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → Iteration 2 (CPU, Memory). Callouts: memory is faster than network & disk; zero read/write disk bottleneck; chain job output into new job input]

Page 7: IBM Strategy for Spark

General Spark Architecture Overview

• Driver Uses Spark Context to talk to the Cluster Manager

• Executors run their own JVM Processes

• Cluster manager distributes the workload based on information from the Worker

Page 8: IBM Strategy for Spark

Key Reasons for the Interest in Spark

Performant
• In-memory architecture greatly reduces disk I/O
• Anywhere from 20-100x faster for common tasks

Productive
• Concise and expressive syntax, especially compared to prior approaches
• Single programming model across a range of use cases and steps in data lifecycle
• Integrated with common programming languages – Java, Python, Scala, R
• New tools continually reduce skill barrier for access (e.g. SQL for analysts)

Leverages existing investments
• Works well within existing Hadoop ecosystem

Improves with age
• Large and growing community of contributors continuously improves the full analytics stack and extends capabilities

Page 9: IBM Strategy for Spark

What is SparkML?

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

1. ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering

2. Featurization: feature extraction, transformation, dimensionality reduction, and selection

3. Persistence: saving and loading algorithms, models, and Pipelines

4. Utilities: linear algebra, statistics, data handling, etc.

Page 10: IBM Strategy for Spark

What is scikit-learn?

• Used for Data Mining and Data Analysis

• Open Source

• Various classification, regression and clustering algorithms

Page 11: IBM Strategy for Spark

Watson Machine Learning

• Uses both Spark ML and Scikit-Learn plus others

• Built on SPSS platform

• Can pull from many different data sources

• Integrates with DSX (Beta)

Page 12: IBM Strategy for Spark

IBM DSX Machine Learning

Data Access:
• Easily connect to Behind-the-Firewall and Public Cloud Data
• Catalogued and Governed Controls through Watson Data Platform

Creating Models:
• Single UI and API for creating ML Models on various Runtimes
• Auto-Modelling and Hyperparameter Optimization

Web Service:
• Real-time, Streaming, and Batch Deployment
• Continuous Monitoring and Feedback Loop

Intelligent Apps:
• Integrate ML models with apps, websites, etc.
• Continuously Improve and Adapt with Self-Learning

[Diagram labels: Web Service, IMS]

Page 13: IBM Strategy for Spark

IBM Machine Learning in Data Science Experience

[Screenshots: API for Jupyter Notebooks; Wizard GUI]

IBM Machine Learning is provisioned by default in Data Science Experience
• Enables Data Scientists to deploy machine learning models as web services
• Single UI for creating, collaborating, deploying, monitoring, and feedback
• Accessible via API, Wizard GUI, and Canvas

Page 14: IBM Strategy for Spark

IBM's Commitment to Spark

Page 15: IBM Strategy for Spark

Spark Tech Center (STC): IBM’s Commitment to Spark

[Bar chart: Top 7 Contributing Companies to Spark 2.0.0 (Databricks, IBM, Hortonworks, Cloudera, Intel, IVU Traffic Technologies, Tencent); vertical axis from 0 to 1000]

25,600 Spark LOC

606 Spark JIRAs

253 SystemML JIRAs

64 Speakers at events

… and all that with 1 Team

1.5 Years

[Chart legend: Databricks, Hortonworks, Cloudera, Intel, Tencent, NTT, Other]

Page 16: IBM Strategy for Spark

IBM Spark Technology Center – San Francisco, CA

As of March 10, 2016

See what we’re up to … IBM Spark Technology Center

http://www.spark.tc/blog/

Fixing lots of issues reported by others

Page 17: IBM Strategy for Spark

Using Spark to Predict Hospital Readmissions & How Predictive Analytic Lifecycles Typically Work

Page 18: IBM Strategy for Spark

Reducing Hospital Readmissions with Predictive Analytics

An Example ‘Proof of Concept’ Using Open Data

Page 19: IBM Strategy for Spark

Outline

Problem

Solution

Details

Results

Summary

Page 20: IBM Strategy for Spark

Problem

Page 21: IBM Strategy for Spark

Problem: 30-Day Hospital Readmissions Cost $41B Annually

Source: http://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.pdf

Page 22: IBM Strategy for Spark

Medicare HRRP – Penalties to Hospitals

Source: Kaiser Family Foundation http://kff.org/medicare/issue-brief/aiming-for-fewer-hospital-u-turns-the-medicare-hospital-readmission-reduction-program/

Page 23: IBM Strategy for Spark

Solution

Page 24: IBM Strategy for Spark

Get Data: Diabetes Readmissions Dataset
• University of California Irvine – Machine Learning Repository
• Open Data
  • 130 Hospitals, 1999-2008
  • 101,766 rows, 50 columns of data
• Diabetes Readmissions
  • Top ten for Medicaid, Private Insurance and Uninsured
  • Not in top ten for Medicare

https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
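Before a dataset like this can be used to predict 30-day readmissions, the outcome column usually has to be turned into a yes/no label. The sketch below shows one way that step might look. The sample rows are made up for illustration (they are not real records), and the column names and the "<30" / ">30" / "NO" label values are assumptions about the file layout that should be checked against the downloaded CSV.

```python
import csv
import io

# Tiny made-up sample in the rough shape of the UCI file; NOT real patient data.
# Column names and label values below are assumptions, not taken from the slides.
SAMPLE_CSV = """encounter_id,time_in_hospital,num_medications,readmitted
1,3,12,NO
2,7,21,<30
3,2,8,>30
4,5,16,<30
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per hospital encounter."""
    return list(csv.DictReader(io.StringIO(text)))


def to_binary_label(readmitted_value):
    """Return 1 for a readmission within 30 days, otherwise 0."""
    return 1 if readmitted_value.strip() == "<30" else 0


rows = load_rows(SAMPLE_CSV)
labels = [to_binary_label(row["readmitted"]) for row in rows]
readmission_rate = sum(labels) / len(labels)
print(labels, readmission_rate)
```

Treating only "<30" as the positive class matches the 30-day framing of the problem; a different project might reasonably count any readmission as positive.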

Page 25: IBM Strategy for Spark

Build a Predictive Model: Conceptual View

Step 1: Model Development
[Diagram: Historical Data → Machine Learning (Mathematical Algorithm) → Model]

Step 2: Perform Predictions
[Diagram: New Case → Model → Prediction]
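The two steps of the conceptual view (develop a model from historical data, then apply it to a new case) can be sketched in a few lines. This is a minimal illustration using a hand-written logistic regression, not the algorithm or code used in the presentation. The feature meanings and all data values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=1000):
    """Step 1: fit weights and a bias to historical data by gradient descent."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_probability(model, new_case):
    """Step 2: apply the trained model to a new case."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, new_case)) + bias)


# Made-up historical data: two scaled features per patient (hypothetical meanings:
# length of stay and number of prior admissions), with 1 = readmitted.
historical_features = [[0.2, 0.0], [0.3, 0.2], [0.8, 0.8], [0.9, 0.6], [0.1, 0.2], [0.7, 1.0]]
historical_labels = [0, 0, 1, 1, 0, 1]

model = train_logistic_regression(historical_features, historical_labels)
high_risk = predict_probability(model, [0.9, 0.8])
low_risk = predict_probability(model, [0.1, 0.0])
print(round(high_risk, 3), round(low_risk, 3))
```

The model is trained once and can then score any number of new cases, which is why the two steps are drawn separately.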

Page 26: IBM Strategy for Spark

IBM Bluemix
• Bluemix
  • Infrastructure, Watson, software and services on Bluemix Cloud Platform
  • Services such as BigInsights (Hadoop), Data Connect (ETL), and Spark can be almost instantly provisioned

Page 27: IBM Strategy for Spark

Data Science Experience (DSX)
• Data Science Experience (DSX)
  • Easily execute Scala, Python and R notebooks
  • Share notebooks with your data science team

Page 28: IBM Strategy for Spark

Bluemix Services Architecture in the Cloud

[Architecture diagram. Components: BigInsights HDFS (Hadoop), Data Connect, DashDB, Data Science Experience, Cloudant, Node.js Web Form. Flow labels: Training Data, Convert to CSV, Predictions, New Records, Predictions]

Page 29: IBM Strategy for Spark

Details

Page 30: IBM Strategy for Spark

A Look at The Raw Data

Page 31: IBM Strategy for Spark

Data Science Experience – Python Code

Page 32: IBM Strategy for Spark

Results

Page 33: IBM Strategy for Spark

First Pass Results – Are they any good?

AUC = Area Under the Curve

AUC Score 0.6514

0.50 = Random Guessing

1.00 = Perfect Prediction
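One way to read the AUC scale: it is the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted patient. The sketch below computes it directly from that definition. The function name and the toy labels and scores are made up for illustration, and it is not the code used to produce the score on this slide.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
perfect = auc_score(labels, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
constant = auc_score(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores, all ties
partial = auc_score(labels, [0.1, 0.6, 0.4, 0.9])   # one of the four pairs is misordered
print(perfect, constant, partial)
```

On these toy inputs the three calls give 1.0, 0.5 and 0.75. The pairwise loop is quadratic, so a rank-based formula or a library routine would be a better fit for a dataset of this size.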

Page 34: IBM Strategy for Spark

2nd Pass Results – Are they any good?

AUC = Area Under the Curve

AUC Score 0.6750

0.50 = Random Guessing

1.00 = Perfect Prediction

Page 35: IBM Strategy for Spark

How Do Other Readmission Models Perform?

“A comparison of models for predicting early hospital readmissions”

Journal of Biomedical Informatics, Volume 56, August 2015, Pages 229–238

Source: http://www.sciencedirect.com/science/article/pii/S1532046415000969

Page 36: IBM Strategy for Spark

Which Factors Affect Diabetes Readmission?

Data: Feature Importance from Random Forest Algorithm

The Algorithm can tell us which features (columns) it found important during the training process.

22 columns from original 50
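Tree-based importance scores are commonly built from how much a split on a feature reduces label impurity, averaged over many trees. The sketch below shows only that building block for a single best split per feature, so it is a simplification and not a random forest. The column names and values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease from one threshold split on a feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must put at least one row on each side
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


# Made-up columns: one related to the outcome, one essentially noise.
columns = {
    "num_prior_visits": [0, 1, 4, 5, 0, 6],
    "day_of_week": [1, 5, 2, 4, 3, 3],
}
readmitted = [0, 0, 1, 1, 0, 1]

gains = {name: best_split_gain(vals, readmitted) for name, vals in columns.items()}
total = sum(gains.values())
importance = {name: gain / total for name, gain in gains.items()}
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked, importance)
```

A real random forest repeats this over many trees, many splits and random feature subsets, which makes the resulting ranking far more stable than a single split.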

Page 37: IBM Strategy for Spark

Summary

Page 38: IBM Strategy for Spark

Summary

• Readmissions Prediction is an important area of research for using Predictive Analytics in Healthcare

• Patient: Improved Outcome

• Hospital Providers: Avoid Penalties

• Payers: Reduce Costs

• In a short amount of time we were able to develop results comparable to leading research studies

Page 39: IBM Strategy for Spark

How you can get a free-trial Spark Cluster from IBM

Page 40: IBM Strategy for Spark

Sign Up for Free Account

Data Science Experience with IBM ML

https://ibm.box.com/s/y2zvpzk8pje56lto0oja0372tnbydbom

http://datascience.ibm.com/

Notebook Samples