R at Microsoft David Smith @revodavid Revolution Analytics, a Microsoft company
Transcript
Page 1: R at Microsoft

R at Microsoft
David Smith
@revodavid
Revolution Analytics, a Microsoft company

Page 2: R at Microsoft

Agenda

• Introduction to R
• Applications of R at Microsoft
• R Products at Microsoft
• What’s coming for R at Microsoft
• Q&A

Page 3: R at Microsoft

April 6, 2015

“This acquisition will help customers use advanced analytics within Microsoft data platforms.”

Page 4: R at Microsoft

INTRODUCTION TO R

Page 5: R at Microsoft

What is R?

• Most widely used data analysis software
  • Used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
  • Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualizations
  • As seen in New York Times, The Economist and FlowingData
• Thriving open-source community
  • Leading edge of analytics research
• Fills the talent gap
  • New graduates prefer R

www.revolutionanalytics.com/what-is-r

Page 6: R at Microsoft

Innovation: The CRAN R Community

Page 7: R at Microsoft

A brief history of R

• 1993: Research project in Auckland, NZ
  • Ross Ihaka and Robert Gentleman
• 1995: Released as open-source software
  • Generally compatible with the “S” language
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2003: R Foundation formed in Austria
• 2004: First international user conference
• 2007: Revolution Analytics founded
• 2009: New York Times article on R
• 2013: Revolution R Open released
• 2015: Microsoft acquires Revolution Analytics

Photo credit: Robert Gentleman

Page 8: R at Microsoft

R’s popularity is growing rapidly
More at blog.revolutionanalytics.com/popularity

• R Usage Growth: Rexer Data Miner Survey, 2007-2013
• Language Popularity: IEEE Spectrum Top Programming Languages, July 2014 (#9: R)

Page 9: R at Microsoft

Rapid development

New York Times, June 25, 2009 (3 hours after Michael Jackson’s death)

Page 10: R at Microsoft

R AT MICROSOFT

Page 11: R at Microsoft

Advanced Analytics with Data Science
Beyond business intelligence

Figure (source: Gartner): value plotted against difficulty, moving from information to optimization and from hindsight through insight to foresight. Traditional BI sits at the low end and Advanced Analytics at the high end.

• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?

Page 12: R at Microsoft

Microsoft Azure uses R for Reliability

• System monitoring & alerting
  • Understanding user behavior (how users configure monitoring platform)
  • Visualizing infrastructure utilization data
  • Abnormal login detection
  • Custom R packages to analyze monitoring data (time series anomaly detection)
• Capacity Planning
  • Forecasting hardware purchase requirements (forecast package)
  • Also RAM requirements for Microsoft IT
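
To make the capacity-planning bullet concrete, here is a minimal sketch of a demand forecast in R. It is an illustration only: the demand numbers are simulated, and it swaps in a plain linear model with trend and monthly seasonality from base R in place of the forecast package named on the slide, so that it runs without extra installs. The variable names (history, future, peak_needed) are hypothetical.

```r
# Simulated monthly hardware demand (made-up numbers, for illustration only).
set.seed(42)
n <- 48
history <- data.frame(
  t = 1:n,
  month = factor((0:(n - 1)) %% 12 + 1)
)
history$demand <- 100 + 3 * history$t +
  10 * sin(2 * pi * history$t / 12) + rnorm(n, sd = 4)

# Assumption: a linear trend plus a monthly seasonal effect is adequate here.
fit <- lm(demand ~ t + month, data = history)

# Forecast the next 12 months with 95% prediction intervals.
future <- data.frame(
  t = (n + 1):(n + 12),
  month = factor((n:(n + 11)) %% 12 + 1, levels = levels(history$month))
)
fc <- predict(fit, newdata = future, interval = "prediction", level = 0.95)

# Plan purchases against the upper bound rather than the point forecast.
peak_needed <- ceiling(max(fc[, "upr"]))
print(peak_needed)
```

Sizing against the upper prediction bound is one possible design choice; a real capacity plan would also weigh lead times and cost.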

Page 13: R at Microsoft

Xbox uses R for a great gaming experience

• TrueSkill Matchmaking System
• Player Churn
• Game design
• Difficulty curve
• Level trouble-spots
• In-game purchase optimization
• Fraud detection
• Player communities

Page 14: R at Microsoft

MICROSOFT PRODUCTS WITH R

Page 15: R at Microsoft

Revolution R Open

• Enhanced Open Source R distribution
• Compatible with all R-related software
• Multi-threaded for performance
• Focus on reproducibility
• Open source (GPLv2 license)
• Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
• Download from mran.revolutionanalytics.com

Page 16: R at Microsoft

RRO: 100% Compatibility

• Built on latest R engine
  • Currently R 3.2.0
  • Updates released 3 weeks after R
  • Drop-in replacement for R
• 100% compatible with
  • R scripts
  • R packages
  • Applications with R connections
• Designed to work with RStudio
  • No configuration required

Page 17: R at Microsoft

Multi-threaded performance

• Multithreaded library replaces standard BLAS/LAPACK algorithms
  • Intel MKL on Windows/Linux; Accelerate on Mac
• High-performance algorithms
  • Sequential → Parallel
• Uses as many threads as there are available cores
• No need to change any R code
• Included with RRO binary distributions


More at Revolutions blog
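
The "no need to change any R code" point can be illustrated with ordinary linear algebra. The sketch below uses only base R; the matrix sizes are arbitrary assumptions. The same script runs on any R installation, and whatever BLAS/LAPACK library R is linked against does the heavy lifting, so a multi-threaded library would be picked up without edits. No timings are quoted here because they depend entirely on the machine and the library.

```r
# Arbitrary problem size, chosen only so the work is non-trivial.
set.seed(1)
x <- matrix(rnorm(1000 * 200), nrow = 1000, ncol = 200)
y <- rnorm(1000)

# Least squares via the normal equations; each step is a BLAS or LAPACK call.
timing <- system.time({
  xtx  <- crossprod(x)      # t(x) %*% x, computed by BLAS
  xty  <- crossprod(x, y)   # t(x) %*% y, computed by BLAS
  beta <- solve(xtx, xty)   # linear solve, computed by LAPACK
})

# Elapsed seconds on this machine with this BLAS; the value will vary.
print(timing[["elapsed"]])
```

Running the identical script under standard R and under a build linked to a multi-threaded library is one way to compare the two on your own hardware.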

Page 18: R at Microsoft

An R Reproducibility Problem

Adapted from http://xkcd.com/234/ CC BY-NC 2.5

Page 19: R at Microsoft

Reproducible R Toolkit in RRO

• Static CRAN mirror
  • CRAN packages fixed with each Revolution R Open update
• Daily CRAN snapshots
  • Storing every package version since September 2014
  • Binaries and sources
  • At mran.revolutionanalytics.com/snapshot
• Easily write and share scripts synced to a specific snapshot
  • “checkpoint” package installed with RRO

Diagram labels:
• CRAN
• CRAN mirror: http://cran.revolutionanalytics.com/
• checkpoint server: daily snapshots at midnight UTC, http://mran.revolutionanalytics.com/snapshot/
• checkpoint package:
  library(checkpoint)
  checkpoint("2014-09-17")

Page 20: R at Microsoft

Using checkpoint

• Easy to use: add 2 lines to the top of each script
  library(checkpoint)
  checkpoint("2014-09-17")
• For the package author:
  • Use package versions available on the chosen date
  • Installs packages local to this project
  • Allows different package versions to be used simultaneously
• For a script collaborator:
  • Automatically installs required packages
  • Detects required packages (no need to manually install!)
  • Uses same package versions as script author to ensure reproducibility
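
The "detects required packages" step can be pictured as a scan of the script for library() and require() calls. The sketch below is a simplified illustration in base R and is not checkpoint's own implementation; the function name detect_packages and the sample script are hypothetical.

```r
# Find the packages a script attaches via library() or require().
# Simplified illustration only: it ignores pkg::fun references, comments, etc.
detect_packages <- function(script_lines) {
  pattern <- "(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(script_lines, gregexpr(pattern, script_lines, perl = TRUE))
  calls <- unlist(hits)
  # Keep only the captured package name from each matched call.
  pkgs <- sub(pattern, "\\1", calls, perl = TRUE)
  sort(unique(pkgs))
}

# A made-up script that starts with the two checkpoint lines from the slide.
script <- c(
  'library(checkpoint)',
  'checkpoint("2014-09-17")',
  'library(ggplot2)',
  'suppressMessages(require("dplyr"))',
  'fit <- lm(mpg ~ wt, data = mtcars)'
)

pkgs <- detect_packages(script)
print(pkgs)
```

Once the package list is known, a tool can install each one from the snapshot for the chosen date, which is what keeps author and collaborator on the same versions.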


Page 21: R at Microsoft

MRAN: The Managed R Archive Network
http://mran.revolutionanalytics.com

• Download Revolution R Open
• Learn about R and RRO
• Daily CRAN snapshots
• Explore Packages
  • and dependencies
• Explore Task Views

Page 22: R at Microsoft

Transformational Trends

• Cloud computing: 5x increase, 2011 to 2016
• Emerging data science talent: universities filling 300,000 US talent gap
• Data explosion: 90% of the data in the world today has been created in the last two years alone
• Open source: e.g. R and Python

Page 23: R at Microsoft

R FOR BIG DATA

Page 24: R at Microsoft

R Packages: RHadoop and ParallelR

• Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms
  • ParallelR: parallel programming for multi-CPU servers and grids
  • RHadoop: map-reduce programming in R language
• Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data
• Big Data Predictive Analytics mostly not embarrassingly parallel
  • 80+ pre-built “parallel external memory algorithms” included with Revolution R Enterprise
  • Azure ML Studio includes many ML algorithms

Details at projects.revolutionanalytics.com
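
An “embarrassingly parallel” problem is one where each task runs independently on a small amount of data and the results are combined at the end. The sketch below shows the pattern with the parallel package that ships with base R, not with ParallelR or RHadoop; the task (a Monte Carlo estimate of pi), the seeds, and the core count are assumptions made for illustration.

```r
library(parallel)

# One independent task: estimate pi from n random points, seeded per task
# so that the result does not depend on which worker runs it.
estimate_pi <- function(seed, n = 1e5) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

seeds <- 1:8

# mclapply forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
parallel_est <- unlist(mclapply(seeds, estimate_pi, mc.cores = cores))

# The same tasks run one after another give the same per-task results.
serial_est <- vapply(seeds, estimate_pi, numeric(1))

# Combine the independent estimates.
combined <- mean(parallel_est)
print(combined)
```

Fitting a single model to a data set that does not fit in memory does not decompose this way, because every step needs information from all of the data. That is the gap the pre-built external memory algorithms on the slide are meant to fill.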


Page 25: R at Microsoft

Revolution R Enterprise

is the only big data, big analytics platform based on open source R, the de facto statistical computing language for modern analytics

• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics

Page 26: R at Microsoft

ScaleR: Dramatic Performance and Capacity

Page 27: R at Microsoft

ScaleR Functions & Algorithms

Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Combination
• PEMA-R API
• rxDataStep
• rxExec

Slide callouts: New in v7.3; Coming in v7.4
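
The idea behind computing descriptive statistics on data too large for memory can be sketched in a few lines: read one chunk at a time and keep only running totals. The example below is a base R illustration of that external-memory pattern and is not ScaleR code; the function name chunked_summary, the read_chunk callback, and the simulated data are all hypothetical.

```r
# Summarize a data set chunk by chunk, holding only three totals in memory.
# read_chunk(i) is a hypothetical callback returning the i-th chunk as a vector.
chunked_summary <- function(read_chunk, n_chunks) {
  n <- 0
  s <- 0
  ss <- 0
  for (i in seq_len(n_chunks)) {
    x <- read_chunk(i)
    n <- n + length(x)    # count of observations
    s <- s + sum(x)       # running sum
    ss <- ss + sum(x^2)   # running sum of squares
  }
  mean_x <- s / n
  var_x <- (ss - n * mean_x^2) / (n - 1)
  c(n = n, mean = mean_x, var = var_x)
}

# Simulated data split into 10 chunks standing in for blocks of a large file.
set.seed(7)
full <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(full, rep(1:10, each = 1000))

res <- chunked_summary(function(i) chunks[[i]], length(chunks))
print(res)
```

Because the per-chunk totals simply add up, the chunks could also be processed on different cores or nodes and merged afterwards. The sum-of-squares formula used here is the simplest choice and can lose precision on badly scaled data; a production implementation would likely use a more stable update.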

Page 28: R at Microsoft

Marketing Attribution Modeling

Page 29: R at Microsoft

• ETL
• Marketing channel data
• Behavioral variables
• Promotional data
• Overlay data

• Exploratory data analysis
• Time-to-event models
• GAM survival models

• Scoring for inference
• Scoring for prediction

• 5 billion scores per day per retailer

CUSTOM DATA FORMAT

CUSTOM VARIABLES (PMML)

Page 30: R at Microsoft

R IN THE CLOUD

Page 31: R at Microsoft

• Exposing the expertise of data scientists as APIs

• Bringing the utility of data science to applications

• Addressing the Data Science talent gap

The Opportunity: Data Science as a Service

Page 32: R at Microsoft

Azure: Huge infrastructure scale

19 Regions ONLINE… huge datacenter capacity around the world… and we’re growing

• 100+ datacenters
• One of the top 3 networks in the world (coverage, speed, connections)
• 2x AWS and 6x Google number of offered regions
• G Series: largest VM available in the market (32 cores, 448GB RAM, SSD…)

Regions (map legend: Operational, Announced)
• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney

* Operated by 21Vianet

Page 33: R at Microsoft


Microsoft Azure Machine Learning – Custom Modules in R

Get started for free at gallery.azureml.net

Page 34: R at Microsoft
Page 35: R at Microsoft
Page 36: R at Microsoft

http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/

Page 37: R at Microsoft

WHAT’S COMING FOR R AT MICROSOFT

Page 38: R at Microsoft

WODA: Write Once, Deploy Anywhere


Page 39: R at Microsoft

SQL Server 2016
Built-in in-database analytics

• Data Scientist: Interact directly with data
• Data Developer/DBA: Manage data and analytics together

Built-in to SQL Server
• Relational Data
• Analytic Library
• T-SQL Interface
• Extensibility
• R Integration

Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance

Diagram labels: Microsoft Azure Machine Learning Marketplace; New R scripts

Page 40: R at Microsoft

In-Database Acceleration
5+ hours to 40 seconds: Recommendation is that this now become the de facto productionalization process

Chart axes: rows, minutes
Chart labels: R on a server pulling data via SQL; R on a server invoking RRE ScaleR inside the EDW

Page 41: R at Microsoft

Wrap-up

R is strategic for Microsoft:
• Widespread internal use
• Enhanced open source R: Revolution R Open
• Big Data R: Revolution R Enterprise
• R in the Cloud: Azure ML Studio
• In-Database R: SQL Server 2016
… and more to come!

Page 42: R at Microsoft

Thank you

Download Revolution R Open: mran.revolutionanalytics.com

More at: blog.revolutionanalytics.com

David Smith
R Community Lead
Revolution Analytics
@revodavid
[email protected]

Page 43: R at Microsoft

© 2015 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Page 44: R at Microsoft


DeployR

• Goal: embed results from R scripts into existing applications, in real time
• Problem:
  • Exposing arbitrary R functions is unwise
  • Need to handle concurrent R sessions
• Solution: DeployR
  • R, on a server, behind a firewall
  • Repository Manager defines entry points
  • Expose only authorized R functions
  • Automatically creates Web Services APIs
  • Manages and monitors pool of R sessions
  • Separates roles for R and app developer
• DeployR Open: for prototyping integrations
• Revolution R Enterprise adds grid-scaling and enterprise authentication

More at deployr.revolutionanalytics.com
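
The "expose only authorized R functions" idea can be sketched as a registry of named entry points with a dispatcher that rejects everything else. This is a conceptual illustration in base R and does not use or describe the DeployR API; the registry, the function names, and the toy scoring rule are all hypothetical.

```r
# Registry of entry points that callers are allowed to invoke (hypothetical).
entry_points <- list(
  score_customer = function(age, spend) {
    # Toy scoring rule standing in for a real model.
    round(0.01 * age + 0.002 * spend, 3)
  },
  summarize = function(values) {
    c(mean = mean(values), sd = sd(values))
  }
)

# Dispatch a request by name; anything outside the registry is refused.
call_entry_point <- function(name, args = list()) {
  if (!is.character(name) || length(name) != 1 ||
      !name %in% names(entry_points)) {
    stop("unknown or unauthorized entry point: ", name)
  }
  do.call(entry_points[[name]], args)
}

# An authorized call goes through.
score <- call_entry_point("score_customer", list(age = 40, spend = 250))
print(score)

# A request for an arbitrary function such as system() is rejected.
rejected <- tryCatch(
  call_entry_point("system", list("ls")),
  error = function(e) conditionMessage(e)
)
print(rejected)
```

Keeping the registry separate from the dispatcher mirrors the role split on the slide: the R developer decides what goes into the registry, and the application developer only ever sees the named entry points.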

Page 45: R at Microsoft

App Integration