CERTIFICATE PROGRAM – USING R Statistics.com THE INSTITUTE FOR STATISTICS EDUCATION

Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Jun 12, 2015

The Institute for Statistics Education at Statistics.com offers a graduate-level certificate program in R for those who want to use the R statistical programming environment for statistical analysis, visualization and modeling. The Institute offers continuing education credits as well as a Program completion certificate. Courses are offered year-round (there is no semester system) on a flexible schedule. The content of the Program is the equivalent of 18 credits in the US academic system. Faculty include R core development team members, package developers and authors of books on R: Paul Murrell, Hadley Wickham, Thomas Lumley, Sudha Purohit, Luis Torgo, John Verzani and others.

Join this webinar to learn about the structure of the certificate and available courses through the Institute, which are offered in 3 categories:

Basic programming skills in R
Statistical methods implemented in R
R applied to specific domains
Transcript
Page 1: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

CERTIFICATE PROGRAM – USING R

Statistics.com THE INSTITUTE FOR STATISTICS EDUCATION

Page 2: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

About Statistics.com

• First course 2002 (resampling methods)
• 2003-2004 added courses in data mining, modeling, intro stats
• Now 100+ courses
• Hybrid model between
    • Professional development (topic centered, scheduling accommodates working professionals)
    • Academic (homework and assessment)
• Taught by noted authorities
• Statistics, predictive modeling, data mining, R, optimization, risk modeling, clinical trials…

Page 3: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Why learn R?

Why take classes?

Page 4: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

The spread of R:

Phase 1: R started in 1993 by academics, and gained popularity in universities around the world – open source & free!

Phase 2: PhD statisticians who used R in university took it to their complex quant modeling jobs in industry.

Phase 3 (now): R is ubiquitous:
• Industry – now that R is seeded by the PhD statisticians, other analysts in their companies need to know it.
• Academic – researchers in a variety of fields who do statistics but are not primarily statisticians use R.

Page 5: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Why learn R? Let’s look at what employers are looking for:

[Chart comparing SAS, SPSS and R] Relative proportion of mentions of statistical tools in “job requirements” section of job postings. A single job may mention more than one tool.

Source: Statistics.com survey of approx. 4000 analytics/statistics job postings on various job sites, May, 2012

Page 6: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

ConAgra Foods’ Human Capital Analytics/Reporting (HCA/R) program is searching for a project manager/statistician ….development of predictive modeling processes to answer different business issues. Excellent computer skills specifically with advanced Excel (v-lookups, pivot tables, macros), R, and other open source software. Experience in configuration of data to support complex data mining & statistical analysis.

SAS- is seeking a Research Statistician to apply cutting-edge econometric models ... demonstrated experience or knowledge of computer programming; ... particularly with applied econometric modeling or time series analysis; the SAS system; statistical software products, such as WinBugs, R, Stata, EViews, OxMetrics, or S-Plus.

AmazonLocal, Applied Machine Learning Scientist · Run sampling, clustering, classification, etc on large datasets using a variety of analytics software (e.g. SAS, Python, R, etc).

SRA International, ...* Use of statistical algorithms, techniques and models to define data for data integrity and process analytics* Use of data mining techniques to define data...* Experience and demonstrated expertise with at multiple data mining tools including SAS, SPSS, R, Weka, etc.

Need for R goes hand-in-hand with need for higher-level stats skills.

Page 7: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

How is R being used in the real world?

Page 8: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Who are your fellow students?

Industry

Government

Academia

Database marketer, international bank

Survey researcher

Project manager, large consulting firm

Statistical geneticist

Casualty actuary

Web developer

Forecaster, Walt Disney

Circulation manager, Countryside Pubs.

Farmer, Calif. Central Valley

Commodities analyst, hedge fund

Digital Marketer J&J

Risk analyst, agriculture dept.

CDC epidemiologist

Team coordinator, aerospace medicine

Forest monitor

Researcher, K-12 school dist.

PhD candidate in epidemiology

Prof. of medicine

Health researcher

PhD student in animal embryology

Plant ecologist

Anthropologist, human remains

Bioinformaticist

Australia

Germany

Brazil

Canada

China

Denmark

UK

India

Netherlands

Page 9: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Executive and an assistant professor in an academic medical center: I have extensive experience with SPSS …I see R as the future for quantitative work and need to begin doing more of my work in R.

Analyst with state government natural resource agency: We have survey designs for regional monitoring that we continually need to evaluate and improve. Currently, I program in C and rely heavily on Monte Carlo modeling. I plot in Excel and have wanted to learn R to get greater flexibility.

Analyst with health and human service agency: My job is mostly data analysis and some statistical modeling which is handled via SAS and PL/SQL. Other agencies have incorporated R. I am looking to be prepared should our agency adopt R as well as understand how R compares with SAS with the hope of drawing from the strengths of both in the future.

Marketing analyst, international banking: Since we are manipulating tons of data at customer level for more than 27 countries, R would be the perfect complement tool (we have been using SAS) for customer analytics.

Analyst with non-profit organization: We do quite a bit of data analysis (mostly descriptive work and GIS mapping) and I started teaching myself R a few years ago in order to automate our routine data cleaning.

Database marketer, banking: I have used SAS for 8 years, also have experience in FICO Model Builder, but am new to R and want to learn those comprehensive packages which are not available in base SAS to do more advance analytics.

Commodities analyst at a hedge fund: I'm looking to use R to build more robust, stable and dynamic econometric models.

Page 10: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Why take classes? Why not learn on your own?

• R is not like SAS, SPSS
• SAS has two very distinct user types:
    • Programmer
    • Statistical modeler & analyst
• SPSS: the latter
• R is powerful, but has more programming and “messiness,” even when used purely in analysis/modeling mode.
• Often it is helpful to have an expert on hand while learning R
• 4-week courses allow an iterative process – a short intensive learning period, ask lots of questions. Apply what you learn. Come back later for another 4-week class. Learn more. Apply. Repeat.

Page 11: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Certificate Program Content

PREREQUISITES: None for entry into program, but introductory statistics is a prerequisite for some courses.

6 REQUIRED

• Intro to R – Data Handling
• Intro to R – Statistical Analysis
• Programming in R
• Programming in R – Adv.
• Modeling in R
• Graphics in R

6 ELECTIVES

R-Specific:

• Data mining
• Spatial
• Microarray
• SVM
• Clinical Trials Apps
• ggplot2
• Smoothing with P-splines
• Survey Analysis

Include R:

• Probability Distributions
• Resampling
• Bootstrap
• Logistic Regression
• GLM
• Count data

Page 12: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

1. The principles of R programming:

• Introduction to R – Data Handling (Paul Murrell) introduces basic expressions, symbols, assignment, functions, packages, use of code editors (Emacs), workspace, data types & structures, subsetting, accessor functions, classes, type coercion, text files, binary files, large files, memory management, apply function, tabulate, aggregation, merging and splitting data, reshape, text processing.

• Programming in R (2 courses with Hadley Wickham) covers lexical scoping, dynamic scoping, frames, environments, namespaces, active bindings, quoting, evaluation, calling from other functions, string processing (stringr), dates and times (lubridate), regular expressions, XML and XPath, extracting data with SQL, executing SQL in R, writing compact and efficient code (helper functions, lapply), anonymous functions, first-class functions, object-oriented programming, S3, tips for producing reliable code, functions and options to help debug, speed, testing.
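A few of the topics listed above can be sketched in base R. This is only an illustration, not course material: the data frames `scores` and `sites` and the helper names `make_counter` and `new_reading` are made up for the example.

```r
# Made-up data for illustration only (not from the courses).
scores <- data.frame(id = c(1, 2, 3, 4),
                     group = c("a", "a", "b", "b"),
                     score = c(10, 14, 7, 9))
sites <- data.frame(group = c("a", "b"), site = c("north", "south"))

# Subsetting: keep rows with score above 8 and two of the columns.
high <- scores[scores$score > 8, c("id", "score")]

# Aggregation and merging: mean score per group, joined to site names.
means <- aggregate(score ~ group, data = scores, FUN = mean)
merged <- merge(means, sites, by = "group")

# lapply with an anonymous function.
squares <- lapply(1:3, function(i) i^2)

# Lexical scoping: the inner function keeps access to `count`,
# which lives in the environment where the function was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
counter()
counter()

# S3: a constructor plus a print method chosen by the object's class.
new_reading <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "reading")
}
print.reading <- function(x, ...) {
  cat("Reading:", x$value, x$unit, "\n")
  invisible(x)
}
print(new_reading(3.2, "kg"))
```

Each call to `make_counter()` creates a separate counter with its own `count`, which is the practical consequence of lexical scoping.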

Page 13: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

2. Plotting and visualizing data in R:

Graphics in R (Paul Murrell) covers the core R capabilities for graphing and teaches you to produce key statistical plots such as scatterplots.

R ggplot2 (Hadley Wickham) teaches how to create graphs with his ggplot2 package, which has its own language that rests on R.
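As a rough idea of what the core graphics capabilities look like, here is a minimal base R scatterplot. It uses the built-in `cars` dataset rather than any course data, and it writes to a temporary PDF file only so that the sketch also runs in a non-interactive session; both choices are assumptions made for the example.

```r
# Built-in dataset `cars` (speed and stopping distance); not course data.
out <- tempfile(fileext = ".pdf")   # temporary file, chosen for this sketch
pdf(out)

# Scatterplot with axis labels and a title.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed")

# Overlay the least-squares line.
abline(lm(dist ~ speed, data = cars), lty = 2)

invisible(dev.off())
```

In an interactive session the `pdf()` and `dev.off()` lines can be dropped and the plot appears in the default graphics window.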

Page 14: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

3. Application/method/domain specific:

Other classes are application oriented, where syntax and programming are discussed, as necessary, on the path to getting R to accomplish something specific. Intro stats, statistical modeling, microarray analysis, data mining, survey analysis.

In the most basic of these, Introduction to R – Statistical Analysis, some familiarity with statistical procedures is assumed and you learn R by executing these procedures (t-tests, chi-square, correlation, regression, etc.) in R. In other cases, the emphasis is on learning the method and R is simply the chosen tool.

Let’s see an example from the Statistical Analysis course.

Page 15: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Snapshot: Regression. The instructions are given step-by-step in Lesson 3 of “Introduction to R – Statistical Analysis.”

The lm function will estimate the regression parameters for the simple linear regression model. For the two models specified above we have:

> lm(total ~ w.class, data = d)

Call:
lm(formula = total ~ w.class, data = d)

Coefficients:
(Intercept)      w.class
    159.815        2.732

which gives estimates b̂0 = 159.815 and b̂1 = 2.732.
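The same steps can be tried without the course data. The sketch below assumes the built-in `cars` dataset as a stand-in for the data frame `d` from the lesson, so the numbers it produces differ from those on the slide.

```r
# Stand-in data: built-in `cars`, not the course data frame `d`.
fit <- lm(dist ~ speed, data = cars)

# Extract the estimated intercept (b0-hat) and slope (b1-hat) by name.
b <- coef(fit)
b["(Intercept)"]
b["speed"]

# A fuller report: standard errors, t statistics, R-squared.
summary(fit)
```

Storing the result of `lm` in an object (here `fit`) rather than just printing it makes the later steps, such as residual plots, straightforward.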

Page 16: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

KIM ASKS

Hi John,

I was plotting the residuals from a linear regression (example on page 19 of lesson 3), and there was a delay before the plots would show. The message on the R console was "Waiting to confirm page change." By clicking on the graphics, I could switch from one plot to the next. Is there any way to make them tile so I can see all of them at once, or any way to go back and forth once they've 'printed' on the graphics page?

JOHN VERZANI REPLIES:

A couple of possibilities exist:

You can partition your graphics device so that more than one graphic will appear. For example, par(mfrow=c(2,2)) will set up a 2 by 2 grid, perfect for the plot function called on the output of the lm function.

On some implementations you can record plots and scroll back through them. For Windows users, the RGui application (your basic interface) allows you to turn on recording, I think by right clicking on a plot (if I'm wrong let me know, and I'll check).

For RStudio, the graphs are already recorded. There are arrows to scroll.

Hope one of those works for you. --J

SABINA CHIMES IN

Where do you type it? In the plot command? I have tried:

> plot(res.pipeline, par(mfrow=c(2,2)))

and get

Error in plot.lm(res.pipeline, par(mfrow = c(2, 2))) : 'which' must be in 1:6

....How do you keep track of all these different ways of doing things. I find that your comments are amazing...

JOHN REPLIES

The par settings are done in their own command (well some are). Try:

par(mfrow=c(2,2))
plot(res.pipeline)

The ".lm" extra bit isn't necessary (though doesn't hurt), as R will use the class of res.pipeline to find that function in most usual cases.

Let me know if that doesn't help.
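A self-contained version of this exchange might look like the following. The model object `res.pipeline` from the lesson is not available here, so the sketch fits a model to the built-in `cars` dataset instead; the `pdf(NULL)` line is an assumption added only so the code also runs outside an interactive session without writing a file.

```r
# Stand-in for res.pipeline: a linear model on the built-in `cars` data.
fit <- lm(dist ~ speed, data = cars)

# Outside an interactive session, draw to a throwaway device.
if (!interactive()) pdf(NULL)

# Set the 2 by 2 grid in its own command, keeping the old settings.
op <- par(mfrow = c(2, 2))

# plot() dispatches to plot.lm and draws its four default diagnostic plots.
plot(fit)

# Restore the previous layout so later plots fill the whole device again.
par(op)

if (!interactive()) invisible(dev.off())
```

Saving the value returned by `par()` and passing it back afterwards is a common way to undo a temporary graphics setting.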

Page 17: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

ALTA ASKS

John, what does masked in the following error message mean? and what is '.GlobalEnv'? thnx in advance

> attach(kid.weights)
The following object(s) are masked _by_ '.GlobalEnv':
age

JOHN REPLIES

R looks for objects by traversing a series of nested environments. In this case, when you attach(kid.weights) it includes a variable 'age'. However you already have a variable 'age' in your global workspace (.GlobalEnv is the secret name for that). Which one do you want? Well, R is answering which one it will find. In this case the one in the global workspace, not that in kid.weights. For that one, you will need to work harder (using $ or with or ...)

Does that help?

gotcha! very helpful--thnx
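The masking described in this exchange can be reproduced with a small made-up data frame. The values below are invented for illustration and are not the course's kid.weights data.

```r
# Made-up stand-in for the course data frame.
kid.weights <- data.frame(age = c(24, 36, 48), weight = c(12, 14, 17))

# A variable with the same name already in the global workspace.
age <- 99

# attach() reports that `age` is masked by the global workspace.
attach(kid.weights)

# A bare `age` finds the workspace variable first.
found <- age

# To reach the column, name the data frame explicitly...
from_dollar <- kid.weights$age

# ...or evaluate the expression inside the data frame with with().
mean_age <- with(kid.weights, mean(age))

# Undo the attach when finished.
detach(kid.weights)
```

Here `found` holds the workspace value 99, while `from_dollar` and `mean_age` are computed from the data frame column.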

Page 18: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

• Polling question #3 – how many analytics professionals

Page 19: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

How courses work

Readings, notes, videos

Homework

Discussion Forum

Page 20: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Weekly Course Schedule ~ March 2013 ~

Fri, March 1: Lesson 1 opens
Fri, March 8: Lesson 2 opens
Sun, March 10: Homework 1 due
Tue, March 12: Feedback Homework 1
Fri, March 15: Lesson 3 opens
Sun, March 17: Homework 2 due
Tue, March 19: Feedback Homework 2
Fri, March 22: Lesson 4 opens
Sun, March 24: Homework 3 due
Tue, March 26: Feedback Homework 3
Sun, March 31: Homework 4 due
Tue, April 2: Feedback Homework 4

Most courses are 4 weeks.
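The rhythm of the calendar above can be written as a little date arithmetic in R. The offsets used here (homework due 9 days after a lesson opens, feedback 2 days after that) are read off the March 2013 calendar and are an assumption about the general pattern, not a stated policy.

```r
# Lessons open on four consecutive Fridays, starting March 1, 2013.
lesson_opens <- seq(as.Date("2013-03-01"), by = "week", length.out = 4)

# Offsets inferred from the calendar (assumption, see lead-in).
homework_due <- lesson_opens + 9
feedback <- homework_due + 2

schedule <- data.frame(lesson = 1:4,
                       opens = lesson_opens,
                       homework_due = homework_due,
                       feedback = feedback)
print(schedule)
```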

Page 21: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Time Required

• Estimate 15 hours per week
• Don’t need to be online at particular times or days
• Time zone does not matter
• Best not to leave all work until the end of the week
• Materials remain open for a couple of weeks after end-of-course
• Most students are working professionals, take courses one at a time

Page 23: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Typical Course Contents – R Programming

• “Headquarters” Page
• Lesson Page
• Readings/notes/videos
• Homework
• Discussion Forum

Page 28: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Equiv. to 18 credits, US system

$5900 approx.

Page 29: Getting Up to Speed with R: Certificate Program in R for Statistical Analysis, Visualization and Modeling

Next Step.

For certificate program application, contact [email protected] or call 1-855-GET-REVO (1-855-438-7386)

• Application fee will be waived (until July 30th)
• Up to 50% discount offered for Revolution Analytics software when purchased in combination with training