R - scripted data
History, Language, Packages, Tools, RPubs, Slidify, Shiny
A Brief History of R
– 1976 S - Bell Labs; Fortran – John Chambers
– 1988 S Version 3; C language
● 1991 R Created – Ross Ihaka and Robert Gentleman
● 1993 R Announced
– 1993 S licensed to StatSci (now Insightful)
● 2000 R Version 1.0.0 released
– 2004 S purchased from Lucent ($2 million)
– 2008 TIBCO acquires Insightful ($25 million)
Other “Stats” Tools
● R – additional, commercial support
  – Oracle: “Big Data Appliance” - R + Hadoop + Linux + NoSQL + Exadata (H/W)
  – IBM: R executing in Hadoop (massively parallel in-database analytics)
● SAS (SAS Institute) development began 1966, first release 1972
● SPSS (IBM) first release 1968
Model Development and Execution Comparison
http://inside-bigdata.com/2014/06/25/revolution-r-enterprise-vs-sas-performance/
Oracle + INTEL Libraries
https://blogs.oracle.com/R/entry/oracle_r_distribution_performance_benchmark
Language
● Derivative of S (S-PLUS)
● Portable (includes PlayStation 3)
● Interpreted, calls into C libraries
● Functional!
● GPL
● 40 year old technology
● Open Source (you want it, you do it)
Data Types
● Symbols refer to objects
● Object attributes
  – names
  – dimnames
  – dimensions
  – class
  – length
  – user defined attributes/metadata
Data Types
● Object types – single class, except list
  – List (may have mixed classes)
  – Vectors (scalar is a vector of length 1)
  – Matrices (vector with 'dimension' attribute, column major order)
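The object types above can be sketched in a few lines. This is a minimal illustration; the variable names and values are invented for the example.

```r
# A vector holds elements of a single class; a "scalar" is a vector of length 1.
v <- c(1.5, 2.5, 3.5)
scalar_length <- length(5)

# A list may mix classes.
l <- list(name = "a", value = 1L, flag = TRUE)
list_classes <- sapply(l, class)

# A matrix is a vector with a 'dim' attribute, filled in column-major order.
m <- 1:6
dim(m) <- c(2, 3)
first_column <- m[, 1]   # the first two elements of the underlying vector
attributes(m)            # lists the 'dim' attribute
```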
Data Types
● Object types
  – Factors
    ● Categorical data (like an enumeration)
  – Data frames
    ● Special list, each element has same length
    ● Elements are columns; their length is the number of rows
    ● Each element (column) has its own type
    ● row.names attribute names the rows (get or set with row.names())
    ● Convert to matrix with data.matrix()
    ● Load with read.table(), read.csv()
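A small sketch of factors and data frames, using hypothetical data invented for illustration (the column names and values are not from any real dataset):

```r
# Hypothetical example data.
df <- data.frame(
  id    = 1:3,
  group = factor(c("a", "b", "a")),   # categorical column
  score = c(2.5, 3.0, 4.5)
)
row.names(df) <- c("r1", "r2", "r3")

group_levels <- levels(df$group)      # the categories of the factor
column_classes <- sapply(df, class)   # each column keeps its own class

# data.matrix() converts every column to numeric;
# factors are replaced by their integer codes.
dm <- data.matrix(df)
```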
Data Types
● Object “atomic” classes
  – character
  – numeric (double precision real)
  – integer
  – complex
  – logical (booleans)
Numeric (double) values include Inf, -Inf and NaN; integers have only NA
1 / Inf == 0 !
Any class can be NA
NaN is NA, NA is not NaN (is.na(NaN) is TRUE, is.nan(NA) is FALSE)
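The special values can be checked directly at the prompt. A minimal sketch; the variable names are just for illustration:

```r
inf_reciprocal <- 1 / Inf == 0    # division by infinity gives exactly zero
not_a_number   <- 0 / 0           # NaN

nan_is_na  <- is.na(NaN)          # NaN counts as missing
na_is_nan  <- is.nan(NA)          # but NA is not NaN

# Every atomic class has its own NA.
class(NA_integer_)
class(NA_character_)

# Inf and NaN live in doubles: "/" on integers returns a double.
int_division <- 1L / 0L
```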
Data Types
● Dates
  – “Date” class
  – Days since epoch (1970-01-01)
● Times
  – “POSIXct” or “POSIXlt” class
  – Seconds since epoch
● Coerce from a string with as.Date(); convert back to a string with format()
● Generic functions include 'weekdays()', 'months()', 'quarters()'
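A short sketch of the date and time classes. The date used here is arbitrary, and the time zone is fixed to UTC so the result does not depend on the local machine:

```r
d <- as.Date("2014-07-01")                  # character -> Date
days_since_epoch <- as.numeric(d)           # days since 1970-01-01

t_ct <- as.POSIXct("2014-07-01 12:00:00", tz = "UTC")
secs_since_epoch <- as.numeric(t_ct)        # seconds since epoch

t_lt <- as.POSIXlt(t_ct, tz = "UTC")        # list-like, with fields such as $hour
hour_of_day <- t_lt$hour

date_string <- format(d, "%Y/%m/%d")        # Date -> character

weekdays(d)                                 # name depends on the locale
months(d)
quarter <- quarters(d)
```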
Operators
● Grouping: ()
● Assignment: to<-from AND from->to
● Vectorized: + - ! * / ^ %% & |
● ~ ? : %/% %*% %o% %x% %in% < > == >= <= && ||
● Element access: [[]] [] $
● Function argument types:
  – symbol, symbol=default, ...
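A few of the operators in use. This is an illustrative sketch; `scale_by` is a hypothetical function written only to show the three kinds of formal argument:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
5 -> z                           # right-hand assignment

elementwise <- x + y             # arithmetic is vectorized
both <- x > 1 & y < 40           # & and | work element by element
single <- (x[1] > 0) && (y[1] > 0)   # && and || expect single values

quotient  <- 7 %/% 2             # integer division
remainder <- 7 %% 2              # modulus
member    <- 3 %in% x
inner     <- drop(x %*% y)       # matrix product, here an inner product
outer_xy  <- x %o% y             # outer product

l <- list(a = 1:3, b = "text")
sub_list <- l["a"]               # [ ] returns a list
element  <- l[["a"]]             # [[ ]] and $ return the element itself

# Formal arguments: plain symbol, symbol with default, and "..."
scale_by <- function(v, factor = 2, ...) v * factor
scaled <- scale_by(x)
```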
Apply
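● apply – apply a function over the margins (rows or columns) of an array
● lapply – apply a function over a list / vector, returns a list
● sapply – like lapply, but simplifies the result to a vector or matrix where possible
● tapply – apply a function over a ragged array (groups defined by a factor)
● mapply – apply a function to multiple objects in parallel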
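One call per member of the apply family, as a minimal sketch on small made-up inputs:

```r
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)             # margin 1 = rows

lst <- list(a = 1:3, b = 4:6)
means_list <- lapply(lst, mean)          # always returns a list
means_vec  <- sapply(lst, mean)          # simplified to a named vector

values <- c(1, 2, 3, 4)
groups <- c("x", "y", "x", "y")
group_sums <- tapply(values, groups, sum)   # one result per group

pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)   # walks both inputs together
```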
Functions
● Functions are objects
● Functional closure consists of:
  – Formal argument list
  – Function body (definition)
  – Environment
● Each of these can be assigned to
● Assigning to the environment can eliminate unwanted environment capture
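The three parts of a closure can be inspected and reassigned. A sketch with hypothetical functions (`make_counter`, `make_fn`) written only for this illustration:

```r
# A closure that keeps state in its enclosing environment.
make_counter <- function() {
  count <- 0
  function(step = 1) {
    count <<- count + step
    count
  }
}
counter <- make_counter()
c1 <- counter()
c2 <- counter()

formals(counter)        # formal argument list
body(counter)           # function body
environment(counter)    # environment holding 'count'

# Each part can be assigned to.
f <- function(x) x + 1
formals(f) <- alist(x = , y = 10)
body(f) <- quote(x + y)
f_result <- f(1)

# A closure captures its defining environment, including unused objects.
make_fn <- function() {
  big <- numeric(1e6)   # captured even though the returned function ignores it
  function(x) x * 2
}
g <- make_fn()
environment(g) <- globalenv()   # drop the reference to 'big'
g_result <- g(3)
```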
Packages
● CRAN (Comprehensive R Archive Network)
  – Main site, includes R download
● Bioconductor
  – Analysis of genomic data
  – Next generation high-throughput sequencing
● R-Forge
● GitHub and personal repositories
Packages
● Analysis
  – Statistical analysis (stats, linprog)
    ● Linear (and general linear) modeling
    ● Tree models
    ● Analysis of variance
  – Machine learning (caret, kernlab)
    ● Classification and clustering (random forests, k-means, knn, etc.)
    ● Training and predictions
    ● Cross validation and error analysis
Packages
● Graphics
  – Base graphics
    ● Plot: plot, hist, ...
    ● Annotate: text, lines, points, axis, ...
  – Lattice
    ● Single command: xyplot, bwplot, ...
  – ggplot2
    ● Single command: qplot
    ● Defining objects: aesthetics, geoms
    ● Chain commands: ggplot, geom_*, ...
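A minimal base graphics sketch: one plotting call followed by annotation calls. It uses the built-in `mtcars` dataset and writes to a temporary PDF so that it also runs non-interactively; the choice of variables and labels is only for illustration.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)                                   # draw to a file instead of the screen

# Plot: create the figure.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "mtcars: mpg vs weight")

# Annotate: add to the existing figure.
lines(lowess(mtcars$wt, mtcars$mpg), col = "red")
points(mean(mtcars$wt), mean(mtcars$mpg), pch = 19, col = "blue")
text(4.5, 30, "lowess trend")

# A second figure on a new page.
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

invisible(dev.off())                       # close the device to finish the file
```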
Packages
● Data visualization
  – rCharts (GitHub), converts visualizations to JavaScript (e.g. d3.js)
http://www.google.com/trends/explore#q=R%20language%2C%20Data%20Visualization%2C%20D3.js%2C%20Processing.js&cmpt=q
Tools
● Command line
● RStudio (can run on remote Linux server)
● RKWard
● R Commander (tcl/tk)
● JGR – Java (GUI for R)
● Rattle – RGtk2
Tools
● Debugging
  – Print statements!
  – Interactive tools:
    ● traceback() – stack trace on error
    ● debug() – flags function for stepping
    ● browser() – stops function and enters debug mode
    ● trace() – insert trace statements
    ● recover() – modify error behavior, can browse call stack
Tools
● Profiling
  – “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”
    – Donald Knuth
  – system.time() – CPU, wall times
  – Rprof() – use summaryRprof() to see results
    ● Do not use Rprof() and system.time() together
    ● Calls to C/Fortran libraries not profiled
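A small timing sketch with system.time(), comparing an explicit loop with a vectorized call. `slow_sum` is a hypothetical function written for the comparison; actual timings depend on the machine, so none are shown.

```r
# Hypothetical loop-based sum, written only to have something to time.
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

timing_loop <- system.time(r1 <- slow_sum(1e6))
timing_vec  <- system.time(r2 <- sum(as.numeric(seq_len(1e6))))

# user.self and sys.self are CPU times, elapsed is wall-clock time.
timing_loop[c("user.self", "sys.self", "elapsed")]
timing_vec[c("user.self", "sys.self", "elapsed")]
```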
Data Exploration
● Script it!
  – If you can't repeat it, it didn't happen
● Get the data (ingest)
  – Functions to download, uncompress, unarchive, store, read, and organize
● Clean the data
  – Handle missing and incomplete data, impute values, identify outliers
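A scripted cleaning step might look like the sketch below. The data are hypothetical, and median imputation and the 1.5 x IQR rule are just one possible convention, chosen here for brevity.

```r
# Hypothetical measurements, invented for illustration.
raw <- data.frame(
  id    = 1:8,
  value = c(10.2, 9.8, NA, 10.5, 55.0, 10.1, NA, 9.9)
)

n_missing <- sum(is.na(raw$value))          # how much is missing?

# Impute missing values with the median, and remember which rows were imputed.
clean <- raw
clean$imputed <- is.na(clean$value)
clean$value[clean$imputed] <- median(raw$value, na.rm = TRUE)

# Flag outliers: values more than 1.5 * IQR beyond the quartiles.
q <- unname(quantile(clean$value, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
clean$outlier <- clean$value < q[1] - fence | clean$value > q[2] + fence
```

Keeping `raw` untouched and recording the `imputed` and `outlier` flags means the cleaning decisions stay visible in the script rather than being lost in a hand-edited file.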
Data Exploration
● Look at the data (models, visualization)
  – Model – regressions (linear, logistic), clustering, ANOVA
  – Refine models and plot the result
    ● Look for systematic issues – unexpected trends, bias, unexplained variance, error estimates, residual analysis
    ● Explore complexity – number of explanatory factors
  – Plot the models
    ● What does it look like?
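The model, refine, and inspect loop can be sketched with the built-in `mtcars` dataset. The choice of response and explanatory variables is only for illustration.

```r
# Start with one explanatory factor, then add a second.
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Explained variance for each model.
r2_simple <- summary(fit1)$r.squared
r2_larger <- summary(fit2)$r.squared

# Residual analysis: look for structure the model missed.
res <- residuals(fit1)
summary(res)

# Does the extra factor improve the fit? (ANOVA comparison of nested models)
comparison <- anova(fit1, fit2)
```

Plotting `fitted(fit1)` against `res` is a common next step for spotting unexpected trends.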
Reproducible Research
● Allows others to validate the work
● Makes it more likely that the results are accepted
● Reduces the chance of errors propagating
  – http://youtu.be/7gYIs7uYbMo
  – 2010 Anil Potti resigns from Duke after research was found flawed (off by 1!)
    ● Clinical trials based on the flawed research were finally cancelled
    ● Closed data, non-reproducible research exacerbated the problem
Reproducible Research
● Don't do things by hand – especially editing spreadsheets to “clean up” data (removing outliers, validating, editing) or downloading files
● Actions taken by hand need very detailed documentation to reproduce – such as the download sites and which files were downloaded where
● GUIs are convenient, but can't be repeated
Reproducible Research
● Capture the steps in a script:
  – download.file("http://...", "localfile.zip")
    ● Can be repeated as long as the link is available. Can keep and manage the downloaded file if that is an issue
  – Use version control
    ● Capture small steps at a time (git is good for this!)
    ● Can track changes and revert if needed
    ● Can use GitHub, BitBucket, SourceForge to publish the results as well
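One way to script the download step is sketched below. `fetch_once` is a hypothetical helper, not part of R: it downloads only when the local file is missing and records the source URL and time next to it. The URL in the example call is elided as on the slide, so the call is left commented out.

```r
# Hypothetical helper: download a file once and note where it came from.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
    # Record the source and the download time alongside the file.
    writeLines(
      c(url, format(Sys.time(), tz = "UTC", usetz = TRUE)),
      paste0(dest, ".source")
    )
  }
  dest
}

# Example call (URL elided in the original, so not run here):
# fetch_once("http://...", "localfile.zip")
# unzip("localfile.zip", exdir = "data")
```

Keeping the downloaded file means the rest of the script can still be rerun if the link later disappears.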
Reproducible Research
● Capture environment – OS, tools, versions
● Don't save outputs – regenerate
  – Ok to cache results while in use, but don't store the results, just the code+data that produced it
  – If you keep intermediate files, document how they were created
● Set random seed
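Setting the seed and capturing the environment both take a line or two. A minimal sketch; the seed value is arbitrary.

```r
# The same seed gives the same "random" numbers on every run.
set.seed(20140701)
a <- rnorm(3)
set.seed(20140701)
b <- rnorm(3)
same_draws <- identical(a, b)

# Capture the environment: R version, OS, attached packages.
info <- sessionInfo()
r_version <- R.version.string
os_name <- Sys.info()[["sysname"]]
```

Printing `info` at the end of a report records the versions the results were produced with.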
Sharing Research
● R Markdown – markdown with embedded R
  – knitr package executes the R fragments and embeds the code and results into markdown, which can be converted to HTML or PDF
  – Literate programming!
● Hosted documentation
  – RPubs (rpubs.com)
  – GitHub gh-pages (github.io)
Sharing Research
● Embedded presentations
  – Author using slidify package
  – R Markdown with embedded R code
  – Creates HTML5 presentation slide deck
  – Can include inline quizzes