
Introduction to R for Data Mining (Feb 2013)

Presented: Thursday, February 14, 2013
Presenter: Joseph Rickert, Technical Marketing Manager, Revolution Analytics

We at Revolution Analytics are often asked, “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:

Find a way of orienting yourself in the open source R world
Have a definite application area in mind
Set an initial goal of doing something useful and then build on it
In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

Provide an orientation to R’s data mining resources
Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
Show the simple R commands to accomplish these same tasks without the GUI
Demonstrate how to build on these fundamental skills to gain further competence in R
Move away from small test data sets and show how, with the same level of skill, one can analyze some fairly large data sets with RevoScaleR
Data scientists and analysts using other statistical software as well as students who are new to data mining should come away with a plan for getting started with R.
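As a rough illustration of the "simple R commands" mentioned above, the following sketch explores a data set, builds a classification tree on a training set, and classifies held-out data. It is not taken from the webinar scripts: the built-in iris data, the 100/50 split, and the variable names are assumptions chosen for illustration, and rpart is used because it ships with standard R installations.

```r
# Minimal sketch (assumptions: iris data, 100/50 split, rpart classifier).
library(rpart)

set.seed(42)                              # make the random split reproducible
idx   <- sample(nrow(iris), 100)          # rows used for training
train <- iris[idx, ]
test  <- iris[-idx, ]                     # remaining 50 rows act as "new" data

# Explore: numeric summaries of the training data
print(summary(train))

# Build a classification tree on the training set
fit <- rpart(Species ~ ., data = train, method = "class")

# Classify the held-out rows and compare with the true labels
pred <- predict(fit, newdata = test, type = "class")
print(table(predicted = pred, actual = test$Species))

accuracy <- mean(pred == test$Species)
print(accuracy)
```

The same three steps (explore, fit on training data, predict on new data) carry over to other classifiers by swapping the model-fitting call.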
Transcript
Page 1: Introduction to R for Data Mining (Feb 2013)

Revolution Confidential

Introduction to R for Data Mining

2013 Webinar Series

Joseph B. Rickert

February 14, 2013

Page 2: Introduction to R for Data Mining (Feb 2013)

First Polling Question

What is your favorite data mining software tool?
1. R
2. SAS
3. MapReduce
4. Weka
5. Other

Page 3: Introduction to R for Data Mining (Feb 2013)

My goal for today’s webinar is to convince you that:

R is a serious platform for data mining

Revolution R Enterprise is the platform for serious data mining

Seriously, it is not difficult to learn enough R to do some serious data mining

Page 4: Introduction to R for Data Mining (Feb 2013)

A word about Data Mining

We assume that you know a little bit about data mining and that this is your context for learning R

Page 5: Introduction to R for Data Mining (Feb 2013)

Data Mining

Applications

Credit Scoring

Fraud Detection

Ad Optimization

Targeted Marketing

Gene Detection

Recommendation systems

Social Networks

Actions

Acquire Data

Prepare

Classify

Predict

Visualize

Optimize

Interpret

Algorithms

CART

Random Forests

SVM

KMeans

Hierarchical clustering

Ensemble Techniques

Page 6: Introduction to R for Data Mining (Feb 2013)

WHAT IS R? Getting Orientated

Page 7: Introduction to R for Data Mining (Feb 2013)

R Is:

The way to do statistical computing
A full blown programming language
The home of nearly every data mining algorithm known to data science
A vibrant world-wide community

R was written in the early 1990s by Robert Gentleman and Ross Ihaka

Since 1997 a core group of ~20 developers guides the evolution of the language

Page 8: Introduction to R for Data Mining (Feb 2013)

R is organized into libraries of functions called packages

CRAN
R download: Base, Recommended packages
User contributed packages

R Package Growth: 4,332 packages as of 2/13/13

Page 10: Introduction to R for Data Mining (Feb 2013)

THE STRUCTURE OF R FACILITATES LEARNING

Learning R

Page 11: Introduction to R for Data Mining (Feb 2013)

Learning R?

Levels of R Skill

Level            Skill
R developer      Write production grade code
R contributor    Write an R package
R programmer     Write code and algorithms
R user           Use R functions
R aware          Use a GUI

Hours of use: 10 to 10,000

The Malcolm Gladwell “Outlier” Scale

Page 12: Introduction to R for Data Mining (Feb 2013)

Basic Machine Learning Functions

Function        Library         Description

Cluster
hclust          stats           Hierarchical cluster analysis
kmeans          stats           Kmeans clustering

Classifiers
glm             stats           Logistic Regression
rpart           rpart           Recursive partitioning and regression trees
ksvm            kernlab         Support Vector Machine
apriori         arules          Rule based classification

Ensemble
ada             ada             Stochastic boosting
randomForest    randomForest    Random Forests classification and regression
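A short sketch of how a few of the functions listed above might be called, using only the stats package that ships with R. The iris data, the choice of three clusters, and the two-class subset used for glm are assumptions made for illustration; none of this is taken from the webinar scripts.

```r
# Minimal sketch (assumptions: iris data, k = 3, two-class subset for glm).
set.seed(123)
x <- scale(iris[, 1:4])                   # standardize the four measurements

# Hierarchical clustering, then cut the tree into 3 groups
hc        <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# K-means with 3 centers and several random starts
km <- kmeans(x, centers = 3, nstart = 10)

# Cross-tabulate the two cluster assignments
print(table(hclust = hc_groups, kmeans = km$cluster))

# Logistic regression with glm: versicolor vs. virginica
two   <- droplevels(subset(iris, Species != "setosa"))
logit <- glm(Species ~ Petal.Length + Petal.Width,
             data = two, family = binomial)

# For a factor response, glm models the probability of the second level
prob <- predict(logit, type = "response")
pred <- ifelse(prob > 0.5, levels(two$Species)[2], levels(two$Species)[1])
acc  <- mean(pred == two$Species)
print(acc)
```

The functions from contributed packages in the table (ksvm, apriori, ada, randomForest) follow a similar fit-then-predict pattern but need their packages installed first.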

Page 13: Introduction to R for Data Mining (Feb 2013)

Noteworthy Data Mining Packages

Package    Comment
caret      Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
rattle     A very intuitive GUI for data mining that produces useful R code

Page 14: Introduction to R for Data Mining (Feb 2013)

TIME TO RUN SOME CODE
Doing a lot with a little R

Script
1. GETTING STARTED.R
2. ROLL with RATTLE.R
3. IN THE TREES.R
4. INTRO to CARET.R
5. BIG DATA with RevoScaleR.R
6. WORDCLOUD.R

The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529

Page 15: Introduction to R for Data Mining (Feb 2013)

Second Polling Question

What are your favorite data mining techniques?
1. Clustering techniques such as K-means
2. Single model classifiers such as decision trees or SVMs
3. Ensemble classifiers such as Random Forests or boosting models
4. Text mining techniques
5. Other

Page 16: Introduction to R for Data Mining (Feb 2013)

Third Polling Question (insert after running script IN THE TREES)

What kind of data do you analyze?
1. Financial data
2. Customer data (e.g. for recommendations)
3. Website data (e.g. for ads)
4. Health Care data
5. Other

Page 17: Introduction to R for Data Mining (Feb 2013)

Working with Big Data

RevoScaleR and Revolution R Enterprise

Page 18: Introduction to R for Data Mining (Feb 2013)

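Too Big for Open Source R

# Pull the entire XDF data set into an in-memory data frame
mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
# Fit a logistic regression on all columns with open source glm()
model <- glm(default ~ ., data = mortDF, family = "binomial")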

Page 19: Introduction to R for Data Mining (Feb 2013)

RevoScaleR brings the power of Big Data to R

Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform

Communications Framework: Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database)

Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks

R Language Interface: Familiar, high-productivity programming paradigm for R users

Page 20: Introduction to R for Data Mining (Feb 2013)

RevoScaleR PEMAs: Parallel External Memory Algorithms

[Diagram: an XDF file divided into blocks (Block 1, Block 2, ..., Block i, Block i + 1, Block i + 2). Read blocks and compute intermediate results in parallel, iterating as necessary (1st pass, 2nd pass, 3rd pass). Labels: Block 1 results, Block i results, Block i+1 results, Block i+2 results, Results from last block.]

R based algorithms
Work on blocks of data
Inherently parallel and distributed
Do not require all data to be in memory at one time
Can deal with distributed and streaming data
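The block-wise idea described above can be sketched in base R. The example below is not RevoScaleR code and does not use XDF files: it is a simplified, single-pass, sequential illustration in which a CSV file stands in for the data on disk, and the block size, variable names, and simulated data are all assumptions. It fits a linear regression by accumulating the intermediate results X'X and X'y one block at a time, so the full data set is never loaded at once during the fit.

```r
# Sketch of an external memory algorithm in base R (not RevoScaleR).
set.seed(1)
n <- 10000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n)

# Write the simulated data to disk; this file plays the role of the XDF file
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)

block_rows <- 1000                        # hypothetical block size
con <- file(path, open = "r")
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

xtx <- matrix(0, 3, 3)                    # running sum of X'X
xty <- matrix(0, 3, 1)                    # running sum of X'y
repeat {
  lines <- readLines(con, n = block_rows) # read the next block of rows
  if (length(lines) == 0) break           # no rows left
  block <- read.csv(text = lines, header = FALSE, col.names = col_names)
  X <- cbind(1, block$x1, block$x2)       # design matrix for this block
  xtx <- xtx + crossprod(X)               # intermediate result for this block
  xty <- xty + crossprod(X, block$y)
}
close(con)

# Combine the accumulated results into regression coefficients
beta_chunked <- drop(solve(xtx, xty))

# Check against an ordinary in-memory fit
beta_full <- unname(coef(lm(y ~ x1 + x2, data = d)))
print(rbind(chunked = beta_chunked, full = beta_full))

unlink(path)
```

Because each block contributes an independent partial sum, the per-block work could in principle be done in parallel and combined afterwards. Algorithms such as logistic regression need several passes over the blocks, which corresponds to the "iterating as necessary" note in the diagram.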

Page 21: Introduction to R for Data Mining (Feb 2013)

WHERE TO GO FROM HERE?
More than code, R is a community

Page 22: Introduction to R for Data Mining (Feb 2013)

Continuing to Learn R

Resources
RevoJoe: How to Learn R
More R Documentation
The R Journal
Books
Reference Card and more

Classes
Coursera
Revolution Analytics

Examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)

Page 23: Introduction to R for Data Mining (Feb 2013)

Some Books

Page 24: Introduction to R for Data Mining (Feb 2013)

The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529