CERTIFICATE PROGRAM – USING R Statistics.com THE INSTITUTE FOR STATISTICS EDUCATION
Jun 12, 2015
About Statistics.com
• First course 2002 (resampling methods)
• 2003-2004: added courses in data mining, modeling, intro stats
• Now 100+ courses
• Hybrid model between
  • Professional development (topic centered, scheduling accommodates working professionals)
  • Academic (homework and assessment)
• Taught by noted authorities
• Statistics, predictive modeling, data mining, R, optimization, risk modeling, clinical trials…
Why learn R?
Why take classes?
The spread of R:
Phase 1: R started in 1993 by academics, and gained popularity in universities around the world – open source & free!
Phase 2: PhD statisticians who used R in university took it to their complex quant modeling jobs in industry.
Phase 3 (now): R is ubiquitous:
• Industry – now that R is seeded by the PhD statisticians, other analysts in their companies need to know it.
• Academic – researchers in a variety of fields who do statistics but are not primarily statisticians use R.
Why learn R? Let’s look at what employers are looking for:
[Chart: relative proportion of mentions of statistical tools (SAS, SPSS, R) in the “job requirements” section of job postings. A single job may mention more than one tool.]
Source: Statistics.com survey of approx. 4000 analytics/statistics job postings on various job sites, May 2012
ConAgra Foods’ Human Capital Analytics/Reporting (HCA/R) program is searching for a project manager/statistician ….development of predictive modeling processes to answer different business issues. Excellent computer skills specifically with advanced Excel (v-lookups, pivot tables, macros), R, and other open source software. Experience in configuration of data to support complex data mining & statistical analysis.
SAS is seeking a Research Statistician to apply cutting-edge econometric models ... demonstrated experience or knowledge of computer programming; ... particularly with applied econometric modeling or time series analysis; the SAS system; statistical software products, such as WinBugs, R, Stata, EViews, OxMetrics, or S-Plus.
AmazonLocal, Applied Machine Learning Scientist: Run sampling, clustering, classification, etc. on large datasets using a variety of analytics software (e.g. SAS, Python, R, etc.).
SRA International, ...
* Use of statistical algorithms, techniques and models to define data for data integrity and process analytics
* Use of data mining techniques to define data...
* Experience and demonstrated expertise with multiple data mining tools including SAS, SPSS, R, Weka, etc.
Need for R goes hand-in-hand with need for higher-level stats skills.
How is R being used in the real world?
Who are your fellow students?
Industry
Government
Academia
Database marketer, international bank
Survey researcher
Project manager, large consulting firm
Statistical geneticist
Casualty actuary
Web developer
Forecaster, Walt Disney
Circulation manager, Countryside Pubs.
Farmer, Calif. Central Valley
Commodities analyst, hedge fund
Digital Marketer J&J
Risk analyst, agriculture dept.
CDC epidemiologist
Team coordinator, aerospace medicine
Forest monitor
Researcher, K-12 school dist.
PhD candidate in epidemiology
Prof. of medicine
Health researcher
PhD student in animal embryology
Plant ecologist
Anthropologist, human remains
Bioinformaticist
Australia
Germany
Brazil
Canada
China
Denmark
UK
India
Netherlands
Executive and an assistant professor in an academic medical center: I have extensive experience with SPSS …I see R as the future for quantitative work and need to begin doing more of my work in R.
Analyst with state government natural resource agency: We have survey designs for regional monitoring that we continually need to evaluate and improve. Currently, I program in C and rely heavily on Monte Carlo modeling. I plot in Excel and have wanted to learn R to get greater flexibility.
Analyst with health and human service agency: My job is mostly data analysis and some statistical modeling which is handled via SAS and PL/SQL. Other agencies have incorporated R. I am looking to be prepared should our agency adopt R as well as understand how R compares with SAS with the hope of drawing from the strengths of both in the future.
Marketing analyst, international banking: Since we are manipulating tons of data at customer level for more than 27 countries, R would be the perfect complement tool (we have been using SAS) for customer analytics.
Analyst with non-profit organization: We do quite a bit of data analysis (mostly descriptive work and GIS mapping) and I started teaching myself R a few years ago in order to automate our routine data cleaning.
Database marketer, banking: I have used SAS for 8 years, also have experience in FICO Model Builder, but am new to R and want to learn those comprehensive packages which are not available in base SAS to do more advanced analytics.
Commodities analyst at a hedge fund: I'm looking to use R to build more robust, stable and dynamic econometric models.
Why take classes? Why not learn on your own?
• R is not like SAS or SPSS
• SAS has two very distinct user types:
  • Programmer
  • Statistical modeler & analyst
• SPSS users are mostly the latter
• R is powerful, but has more programming and “messiness,” even when used purely in analysis/modeling mode.
• Often it is helpful to have an expert on hand while learning R
• 4-week courses allow an iterative process: a short, intensive learning period in which you ask lots of questions. Apply what you learn. Come back later for another 4-week class. Learn more. Apply. Repeat.
Certificate Program Content
PREREQUISITES: None for entry into program, but introductory statistics is a prerequisite for some courses.
6 REQUIRED
• Intro to R – Data Handling
• Intro to R – Statistical Analysis
• Programming in R
• Programming in R – Adv.
• Modeling in R
• Graphics in R
6 ELECTIVES
R-Specific:
• Data mining
• Spatial
• Microarray
• SVM
• Clinical Trials Apps
• ggplot2
• Smoothing with P-splines
• Survey Analysis
Include R:
• Probability Distributions
• Resampling
• Bootstrap
• Logistic Regression
• GLM
• Count data
1. The principles of R programming:
• Introduction to R – Data Handling (Paul Murrell) introduces basic expressions, symbols, assignment, functions, packages, use of code editors (Emacs), workspace, data types & structures, subsetting, accessor functions, classes, type coercion, text files, binary files, large files, memory management, apply function, tabulate, aggregation, merging and splitting data, reshape, text processing.
• Programming in R (2 courses with Hadley Wickham) covers lexical scoping, dynamic scoping, frames, environments, namespaces, active bindings, quoting, evaluation, calling from other functions, string processing (stringr), dates and times (lubridate), regular expressions, XML and XPath, extracting data with SQL, executing SQL in R, writing compact and efficient code (helper functions, lapply), anonymous functions, first-class functions, object-oriented programming, S3, tips for producing reliable code, functions and options to help debug, speed, testing.
2. Plotting and visualizing data in R:
• Graphics in R (Paul Murrell) covers the core R capabilities for graphing and teaches you to produce key statistical plots such as scatterplots.
• R ggplot2: Hadley Wickham teaches how to use his ggplot2 package, which has its own language that rests on R, to create graphs.
3. Application/method/domain specific:
Other classes are application oriented, where syntax and programming are discussed, as necessary, on the path to getting R to accomplish something specific. Examples: intro stats, statistical modeling, microarray analysis, data mining, survey analysis.
In the most basic of these, Introduction to R – Statistical Analysis, some familiarity with statistical procedures is assumed and you learn R by executing these procedures (t-tests, chi-square, correlation, regression, etc.) in R. In other cases, the emphasis is on learning the method and R is simply the chosen tool.
Let’s see an example from the Statistical Analysis course.
Snapshot: Regression. The instructions are given step-by-step in Lesson 3 of “Introduction to R – Statistical Analysis.”
The lm function will estimate the regression parameters for the simple linear regression model. For the two models specified above we have:

> lm(total ~ w.class, data = d)

Call:
lm(formula = total ~ w.class, data = d)

Coefficients:
(Intercept)      w.class
    159.815        2.732

which gives estimates b̂0 = 159.815 and b̂1 = 2.732.
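The snapshot above can be tried without the course data. The following is only a sketch: the data frame d is simulated here as a stand-in (the lesson's real data are not reproduced in these slides), so the estimates will differ from 159.815 and 2.732. It shows how to pull the two estimates out of the fitted object and use them.

```r
# Simulated stand-in for the lesson's data frame `d` (an assumption, not the real data).
set.seed(1)
d <- data.frame(w.class = seq(10, 60, by = 5))
d$total <- 160 + 2.7 * d$w.class + rnorm(nrow(d), sd = 5)

fit <- lm(total ~ w.class, data = d)   # same call as in the lesson

coef(fit)                              # named vector: (Intercept), w.class
b0 <- coef(fit)[["(Intercept)"]]       # intercept estimate
b1 <- coef(fit)[["w.class"]]           # slope estimate

# Fitted value at w.class = 30, computed by hand and via predict(); the two agree.
by_hand <- b0 + b1 * 30
by_predict <- predict(fit, newdata = data.frame(w.class = 30))
```

Storing the result of lm in a variable (here fit) rather than just printing it is what makes the later steps possible, including the residual plots discussed in the forum thread.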
KIM ASKS
Hi John,
I was plotting the residuals from a linear regression (example on page 19 of Lesson 3), and there was a delay before the plots would show. The message on the R console was "Waiting to confirm page change." By clicking on the graphics, I could switch from one plot to the next. Is there any way to make them tile so I can see all of them at once, or any way to go back and forth once they've 'printed' on the graphics page?
JOHN VERZANI REPLIES:
A couple of possibilities exist:
You can partition your graphics device so that more than one graphic will appear. For example, par(mfrow=c(2,2)) will set up a 2 by 2 grid, perfect for the plot function called on the output of the lm function.
On some implementations you can record plots and scroll back through them. For Windows users, the RGui application (your basic interface) allows you to turn on recording, I think by right clicking on a plot (if I'm wrong let me know, and I'll check).
For RStudio, the graphs are already recorded. There are arrows to scroll.
Hope one of those works for you. --J
SABINA CHIMES IN
Where do you type it? In the plot command? I have tried:
> plot(res.pipeline, par(mfrow=c(2,2)))
and get
Error in plot.lm(res.pipeline, par(mfrow = c(2, 2))) : 'which' must be in 1:6
....How do you keep track of all these different ways of doing things? I find that your comments are amazing...
JOHN REPLIES
The par settings are done in their own command (well, some are). Try:
par(mfrow=c(2,2))
plot(res.pipeline)
The ".lm" extra bit isn't necessary (though it doesn't hurt), as R will use the class of res.pipeline to find that function in most usual cases.
Let me know if that doesn't help.
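John's two points (par is its own command, and plot finds the lm method from the object's class) can be sketched as follows. The model here is a stand-in fitted to the cars data set that ships with R, since res.pipeline and its data are not shown in these slides.

```r
# Stand-in for res.pipeline (an assumption): any lm fit behaves the same way.
fit <- lm(dist ~ speed, data = cars)   # `cars` is a built-in data set

class(fit)                        # "lm", which is why plot(fit) uses the lm plot method

old_par <- par(mfrow = c(2, 2))   # 2 by 2 grid; par() returns the previous settings
plot(fit)                         # the four default diagnostic plots, one per cell
par(old_par)                      # restore the single-plot layout
```

Saving and restoring the old par settings is optional, but it keeps later plots from being squeezed into the 2 by 2 grid.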
ALTA ASKS
John, what does masked in the following error message mean? And what is '.GlobalEnv'? Thnx in advance
> attach(kid.weights)
The following object(s) are masked _by_ '.GlobalEnv':
age
JOHN REPLIES
R looks for objects by traversing a series of nested environments. In this case, when you attach(kid.weights) it includes a variable 'age'. However you already have a variable 'age' in your global workspace (.GlobalEnv is the secret name for that). Which one do you want? Well, R is answering which one it will find. In this case the one in the global workspace, not that in kid.weights. For that one, you will need to work harder (using $ or with or ...)
Does that help?
gotcha! very helpful--thnx
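The masking John describes can be reproduced with a small made-up example. The kid.weights data frame below is a hypothetical stand-in with invented values, not the course data set.

```r
# Hypothetical stand-in for kid.weights (values are made up for illustration).
kid.weights <- data.frame(age = c(24, 36, 48), weight = c(12, 14, 17))
age <- 99                      # a variable with the same name in the global workspace

attach(kid.weights)            # R reports that `age` is masked by .GlobalEnv
found <- age                   # the global `age` is found first, not the column
column <- kid.weights$age      # $ asks for the column explicitly
col_mean <- with(kid.weights, mean(age))  # inside with(), the column is found first
detach(kid.weights)            # undo the attach
rm(age)                        # remove the global variable that caused the masking
```

Using $ or with() avoids the ambiguity altogether, which is why many R users skip attach in scripts.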
• Polling question #3 – how many analytics professionals
How courses work
Readings, notes, videos
Homework
Discussion Forum
Weekly Course Schedule (March 2013)
Fri, March 1: Lesson 1 opens
Fri, March 8: Lesson 2 opens
Sun, March 10: Homework 1 due
Tue, March 12: Feedback on Homework 1
Fri, March 15: Lesson 3 opens
Sun, March 17: Homework 2 due
Tue, March 19: Feedback on Homework 2
Fri, March 22: Lesson 4 opens
Sun, March 24: Homework 3 due
Tue, March 26: Feedback on Homework 3
Sun, March 31: Homework 4 due
Tue, April 2: Feedback on Homework 4
Most courses are 4 weeks.
Time Required
• Estimate 15 hours per week
• Don’t need to be online at particular times or days
• Time zone does not matter
• Best not to leave all work until the end of the week
• Materials remain open for a couple of weeks after end-of-course
• Most students are working professionals who take courses one at a time
Faculty
Paul Murrell, John Verzani, Hadley Wickham, Sudha Purohit,
Luis Torgo, David Unwin, Thomas Lumley, Din Chen,
Karl Peace, Garrett Grolemund, Brian Marx, Paul Eilers
Typical Course Contents – R Programming
• “Headquarters” Page
• Lesson Page
• Readings/notes/videos
• Homework
• Discussion Forum
Equiv. to 18 credits, US system
$5900 approx.
Next Step.
For certificate program application, contact [email protected] or call 1-855-GET-REVO (1-855-438-7386)
• Application fee will be waived (until July 30th)
• Up to 50% discount offered for Revolution Analytics software when purchased in combination with training