Page 1: R - the language

R - scripted data

History

Language

Packages

Tools

RPubs

Slidify

Shiny

Page 2: R - the language

A Brief History of R

– 1976 S - Bell Labs; Fortran (John Chambers)

– 1988 S Version 3; C language

● 1991 R Created (Ross Ihaka and Robert Gentleman)

● 1993 R Announced

– 1993 S licensed to StatSci (now Insightful)

● 2000 R Version 1.0.0 released

– 2004 S purchased from Lucent (2MM)

– 2008 TIBCO acquires Insightful (25MM)

Page 3: R - the language

Other “Stats” Tools

● R – additional, commercial support

– Oracle: “Big Data Appliance” - R + Hadoop + Linux + NoSQL + Exadata (H/W)

– IBM: R executing in Hadoop (massively parallel in-database analytics)

● SAS (SAS Institute) dev. 1966, 1st rel 1972

● SPSS (IBM) 1st rel 1968

Page 4: R - the language

Model Development and Execution Comparison

http://inside-bigdata.com/2014/06/25/revolution-r-enterprise-vs-sas-performance/

Page 5: R - the language

Oracle + INTEL Libraries

https://blogs.oracle.com/R/entry/oracle_r_distribution_performance_benchmark

Page 6: R - the language

Language

● Derivative of S (S-PLUS)

● Portable (includes PlayStation 3)

● Interpreted, calls into C libraries

● Functional!

● GPL

● 40-year-old technology

● Open Source (you want it, you do it)

Page 7: R - the language

Data Types

● Symbols refer to objects

● Object attributes

– names

– dimnames

– dimensions

– class

– length

– user defined attributes/metadata
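
A minimal sketch of the attributes listed above, using base R only. The object names and the "units" attribute are made up for illustration.

```r
# A named numeric vector; the symbol `x` refers to the object.
x <- c(a = 1.5, b = 2.5, c = 3.5)
names(x)    # the names attribute
length(x)   # number of elements

# A user-defined attribute ("units" is a hypothetical name).
attr(x, "units") <- "cm"

# Setting the dim attribute on a vector makes it a matrix.
m <- 1:6
dim(m) <- c(2, 3)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))

# List every attribute currently attached to each object.
attributes(x)
attributes(m)
```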

Page 8: R - the language

Data Types

● Object types – single class, except list

– List (may have mixed classes)

– Vectors (scalar is a vector of length 1)

– Matrices (vector with 'dimension' attribute, column major order)
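
A short sketch of these object types; all variable names are illustrative.

```r
# A "scalar" is just a vector of length 1.
s <- 42
length(s)

# A vector holds a single class, so mixed input is coerced (here to character).
v <- c(1, "two", TRUE)
class(v)

# A matrix is filled column by column (column major order).
m <- matrix(1:6, nrow = 2)
first_column <- m[, 1]   # holds 1 and 2

# A list keeps each element's own class.
l <- list(1, "two", TRUE)
element_classes <- sapply(l, class)
```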

Page 9: R - the language

Data Types

● Object types

– Factors

  ● Categorical data (like an enumeration)

– Data frames

  ● Special list, each element has the same length

  ● Elements are columns; their length is the number of rows

  ● Each element (column) has its own type

  ● row.names() attribute to name the rows

  ● Convert to matrix with data.matrix()

  ● Load with read.table(), read.csv()
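
A small sketch of a factor and a data frame. The inline CSV text stands in for a file on disk, and the column names are invented for the example.

```r
# Inline text stands in for a CSV file (contents are made up).
csv_text <- "name,group,score
ann,a,90
bob,b,85
cy,a,70"

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Turn the categorical column into a factor and name the rows.
df$group <- factor(df$group)
row.names(df) <- df$name

group_levels <- levels(df$group)   # the distinct categories
column_classes <- sapply(df, class) # each column has its own type

# data.matrix() converts to a numeric matrix; factors become integer codes.
dm <- data.matrix(df[, c("group", "score")])
```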

Page 10: R - the language

Data Types

● Object “atomic” classes

– character

– numeric (double precision real)

– integer

– complex

– logical (booleans)

Numeric includes Inf and NaN; integer has neither

1 / Inf == 0 !

Any class can be NA

NaN is NA, NA is not NaN: is.na(NaN) is TRUE, is.nan(NA) is FALSE
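
The special values can be checked directly at the prompt. A brief sketch; variable names are illustrative.

```r
inf_check <- 1 / Inf == 0   # TRUE
nan_value <- 0 / 0          # NaN
nan_is_na <- is.na(NaN)     # TRUE:  NaN counts as missing
na_is_nan <- is.nan(NA)     # FALSE: NA is not NaN

# Each atomic class has its own NA.
class(NA_character_)
class(NA_integer_)

# NA propagates through arithmetic unless it is removed.
x <- c(1, NA, 3)
with_na <- mean(x)                    # NA
without_na <- mean(x, na.rm = TRUE)   # 2
```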

Page 11: R - the language

Data Types

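● Dates

– “Date” class

– Days since epoch (1970-01-01)

● Times

– “POSIXct” or “POSIXlt” class

– Seconds since epoch (POSIXct); POSIXlt stores broken-down fields (year, month, day, ...)

● Coerce from string with as.Date(); convert back to string with format()

● Generic functions include 'weekdays()', 'months()', 'quarters()'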
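
A sketch of the date and time classes. The dates are arbitrary examples, and the output of weekdays() and months() depends on the locale.

```r
# Dates are stored as days since 1970-01-01.
d <- as.Date("2014-06-25")
day_after_epoch <- as.numeric(as.Date("1970-01-02"))   # 1

# format() converts a Date back to a string.
iso_text <- format(d, "%Y-%m-%d")

# POSIXct is seconds since the epoch; POSIXlt exposes the fields.
ct <- as.POSIXct("2014-06-25 12:00:00", tz = "UTC")
seconds <- as.numeric(ct)
lt <- as.POSIXlt(ct)
hour_of_day <- lt$hour   # 12

# Generic helpers (weekdays and months are locale dependent).
weekdays(d)
months(d)
quarter <- quarters(d)   # "Q2"
```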

Page 12: R - the language

Operators

● Grouping: ()

● Assignment: to<-from AND from->to

● Vectorized: + - ! * / ^ %% & |

● ~ ? : %/% %*% %o% %x% %in% < > == >= <= && ||

● Element access: [[]] [] $

● Function argument types:

– symbol, symbol=default, ...
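
A few of these operators in use. A sketch only; the function `power` and the variable names are hypothetical.

```r
x <- c(1, 5, 10)
10 -> y                               # right assignment

doubled <- x * 2                      # vectorized: 2 10 20
remainder <- x %% 3                   # modulo: 1 2 1
quotient <- 7 %/% 2                   # integer division: 3
elementwise <- x > 2 & x < 8          # & works element by element
single <- (x[1] > 2) && (x[2] < 8)    # && compares single values
member <- 5 %in% x                    # set membership
inner <- x %*% x                      # matrix product, a 1x1 matrix

# Element access on a list: [ ] keeps the list, [[ ]] and $ extract.
l <- list(a = 1:3, b = "text")
sub_list <- l["a"]
element <- l[["a"]]
by_name <- l$b

# Arguments: plain symbol, symbol with default, and ...
power <- function(base, exponent = 2, ...) base^exponent
squared <- power(3)
cubed <- power(3, exponent = 3)
```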

Page 13: R - the language

Control Structures

● if, else

● for

● while

● repeat

● break, next, return
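
Each control structure in a tiny example. The helper `first_negative` is invented for illustration.

```r
# for + if + next: sum the odd numbers from 1 to 10.
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  total <- total + i
}

# while: loop until the condition fails.
n <- 0
while (n < 3) {
  n <- n + 1
}

# repeat + break: loop until told to stop.
k <- 1
repeat {
  k <- k * 2
  if (k > 100) break
}

# return: leave a function early.
first_negative <- function(v) {
  for (val in v) {
    if (val < 0) return(val)
  }
  NA
}
```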

Page 14: R - the language

Apply

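● apply – apply functions over array margins (rows, columns)

● lapply – apply function over list / vector, returns a list

● sapply – like lapply, but simplifies the result to a vector or matrix where possible

● tapply – apply function over ragged array (groups defined by a factor)

● mapply – apply function to multiple list / vector arguments in parallel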
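
One call per member of the apply family, as a rough sketch with made-up data.

```r
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   # over rows:    9 12
col_sums <- apply(m, 2, sum)   # over columns: 3 7 11

values <- list(a = 1:3, b = 4:6)
as_list <- lapply(values, mean)     # always a list
as_vector <- sapply(values, mean)   # simplified to a named vector

# tapply: group the scores by label, then sum each group.
scores <- c(10, 20, 30, 40)
labels <- c("x", "y", "x", "y")
group_sums <- tapply(scores, labels, sum)

# mapply: walk several vectors in parallel.
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)
```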

Page 15: R - the language

Functions

● Functions are objects

● Functional closure consists of:

– Formal argument list

– Function body (definition)

– Environment

● Each of these can be assigned to

● Assigning to the environment can eliminate unwanted environment capture
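
The three parts of a closure can be inspected with formals(), body() and environment(). A sketch with hypothetical names (`make_counter`, `counter`, `f`, `big`):

```r
# A closure that captures `count` from its defining environment.
make_counter <- function() {
  count <- 0
  function(step = 1) {
    count <<- count + step
    count
  }
}

counter <- make_counter()
first <- counter()     # 1
second <- counter(5)   # 6

formals(counter)                        # formal argument list
body(counter)                           # function body
captured <- environment(counter)$count  # value held in the environment

# The formals can be assigned to: change the default step.
formals(counter)$step <- 10
third <- counter()     # 16

# A function defined next to a large object captures it.
f <- local({
  big <- numeric(1e6)
  function(x) x + 1
})
# Reassigning the environment drops the reference to `big`.
environment(f) <- globalenv()
```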

Page 16: R - the language

Packages

● CRAN (Comprehensive R Archive Network)

– Main site, includes R download

● Bioconductor

– Analysis of genomic data

– Next generation high-throughput sequencing

● R-Forge

● GitHub and personal repositories

Page 17: R - the language

Packages

● Analysis

– Statistical analysis (stats, linprog)

  ● Linear (and general linear) modeling

  ● Tree models

  ● Analysis of variance

– Machine learning (caret, kernlab)

  ● Classification and clustering (random forests, k-means, knn, etc.)

  ● Training and predictions

  ● Cross validation and error analysis
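
A minimal sketch using only the stats package and the built-in mtcars and iris datasets. The caret and kernlab packages are not used here, so this only hints at their workflow.

```r
# Linear model: fuel economy as a function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
variance_table <- anova(fit)   # analysis of variance for the fit

# Prediction for a hypothetical car (weight value is made up).
predicted <- predict(fit, newdata = data.frame(wt = 3))

# k-means clustering on the four numeric iris columns.
set.seed(1)   # k-means starts from random centers
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)
```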

Page 18: R - the language

Packages

● Graphics

– Base graphics

  ● Plot: plot, hist, ...

  ● Annotate: text, lines, points, axis, ...

– Lattice

  ● Single command: xyplot, bwplot, ...

– ggplot2

  ● Single command: qplot

  ● Defining objects: aesthetics, geoms

  ● Chain commands: ggplot, geom_*, ...
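
A base graphics sketch of the plot-then-annotate pattern, written to a temporary PDF so it also runs without a display. Lattice and ggplot2 are left out because ggplot2 needs a separate install.

```r
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Plot: create the figure.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "mtcars: mpg vs weight")

# Annotate: add to the existing figure.
lines(lowess(mtcars$wt, mtcars$mpg))
text(mtcars$wt[1], mtcars$mpg[1], labels = rownames(mtcars)[1], pos = 4)

# A second page with a histogram.
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

dev.off()
```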

Page 19: R - the language

Packages

● Data visualization

– rCharts (GitHub), converts visualizations to JavaScript (e.g. d3.js)

http://www.google.com/trends/explore#q=R%20language%2C%20Data%20Visualization%2C%20D3.js%2C%20Processing.js&cmpt=q

Page 20: R - the language

Tools

● Command line

● RStudio (can run on remote Linux server)

● RKWard

● R Commander (tcl/tk)

● JGR – Java (GUI for R)

● Rattle - RGtk2

Page 21: R - the language

Tools

● Debugging

– Print statements!

– Interactive tools:

  ● traceback() – stack trace on error

  ● debug() – flags function for stepping

  ● browser() - stops function and enters debug

  ● trace() - insert trace statements

  ● recover() - modify error behavior, can browse call stack
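
Most of these tools only make sense in an interactive session, so this sketch just sets and clears the debug flag; the interactive calls are shown as comments. The function `scale_values` is invented for the example.

```r
# Hypothetical function used only for illustration.
scale_values <- function(x, by) {
  x / by
}

# debug() flags the function; the next call would be stepped through.
debug(scale_values)
flagged <- isdebugged(scale_values)   # TRUE while the flag is set
undebug(scale_values)                 # clear the flag again

# In an interactive session:
#   traceback()               # call stack of the last uncaught error
#   options(error = recover)  # browse the call stack when an error occurs
#   browser()                 # place inside a function to stop there
```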

Page 22: R - the language

Tools

● Profiling

– “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil” – Donald Knuth

– system.time() - CPU, wall times

– Rprof() - use summaryRprof() to see results

  ● Do not use Rprof() and system.time() together

  ● Calls to C/Fortran libraries not profiled
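
A sketch of timing two ways of computing the same sum with system.time(). The function `slow_sum` is made up, and the Rprof() pattern is shown only as comments with a placeholder file name.

```r
# A deliberately slow loop (hypothetical example function).
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# system.time() reports user CPU, system CPU, and elapsed (wall) time.
t_loop <- system.time(r_loop <- slow_sum(1e6))
t_vec <- system.time(r_vec <- sum(as.numeric(seq_len(1e6))))

t_loop[["user.self"]]   # CPU seconds spent in the loop
t_loop[["elapsed"]]     # wall-clock seconds
t_vec[["elapsed"]]

# Sampling profiler, run separately from system.time():
#   Rprof("profile.out")     # start sampling ("profile.out" is a placeholder)
#   slow_sum(1e7)
#   Rprof(NULL)              # stop sampling
#   summaryRprof("profile.out")
```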

Page 23: R - the language

Data Exploration

● Script it!

– If you can't repeat it, it didn't happen

● Get the data (ingest)

– Functions to download, uncompress, unarchive, store, read, and organize

● Clean the data

– Handle missing and incomplete data, impute values, identify outliers
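
A scripted cleaning step, as a sketch. The inline text replaces a downloaded file, the data is invented, and median imputation is just one possible choice.

```r
# Inline text stands in for an ingested file (values are made up).
raw_text <- "id,value
1,10
2,NA
3,12
4,11
5,500"

dat <- read.csv(text = raw_text)

# Handle missing data: find the gaps and impute the median.
missing <- is.na(dat$value)
dat$value[missing] <- median(dat$value, na.rm = TRUE)

# Identify outliers with the boxplot rule (1.5 times the hinge spread).
outliers <- boxplot.stats(dat$value)$out
```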

Page 24: R - the language

Data Exploration

● Look at the data (models, visualization)

– Model – regressions (linear, logistic), clustering, ANOVA

– Refine models and plot the result

  ● Look for systematic issues – unexpected trends, bias, unexplained variance, error estimates, residual analysis

  ● Explore complexity – number of explanatory factors

– Plot the models

  ● What does it look like?
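
A sketch of refining a model on the built-in mtcars data: add an explanatory factor, compare the fits, and look at the residuals. The choice of variables is arbitrary.

```r
# Start simple, then add an explanatory factor.
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Does the extra term explain more variance?
comparison <- anova(fit1, fit2)
r2_simple <- summary(fit1)$r.squared
r2_refined <- summary(fit2)$r.squared

# Residual analysis: residuals should show no systematic trend.
res <- residuals(fit1)
mean_residual <- mean(res)   # close to zero for least squares

# Logistic regression: transmission type as a function of weight.
logit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit)
```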

Page 25: R - the language

Reproducible Research

● Allows others to validate the work

● Helps the results gain acceptance

● Reduces the chance of errors propagating

– http://youtu.be/7gYIs7uYbMo

– 2010 Anil Potti resigns from Duke after research was found flawed (off by 1!)

  ● Clinical trials based on the flawed research were finally cancelled

  ● Closed data and non-reproducible research exacerbated the problem

Page 26: R - the language

Reproducible Research

● Don't do things by hand – especially editing spreadsheets to “clean up” data (removing outliers, validating, editing) or downloading files

● Actions taken by hand need very detailed documentation to reproduce – such as download sites, which files were downloaded, and where they were saved

● GUIs are convenient, but can't be repeated

Page 27: R - the language

Reproducible Research

● Capture the steps in a script:

– download.file("http://...", "localfile.zip")

  ● Can be repeated as long as the link is available. Keep and manage the downloaded file if link availability is a concern

– Use version control

  ● Capture small steps at a time (git is good for this!)

  ● Can track changes and revert if needed

  ● Can use GitHub, Bitbucket, SourceForge to publish the results as well
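
One way to script the download step so a kept local copy is reused. A sketch only: `fetch_once` is a hypothetical helper, and the URL in the usage comment is the elided placeholder from the slide, not a real address.

```r
# Download `url` to `dest` unless a local copy already exists.
# (fetch_once is a made-up helper, not part of any package.)
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")   # binary mode keeps archives intact
  }
  dest
}

# Usage, with the real link filled in:
#   zip_path <- fetch_once("http://...", "localfile.zip")
#   unzip(zip_path, exdir = "data")
```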

Page 28: R - the language

Reproducible Research

● Capture environment – OS, tools, versions

● Don't save outputs – regenerate

– Ok to cache results while in use, but don't store the results, just the code+data that produced it

– If you keep intermediate files, document how they were created

● Set random seed
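
A short sketch of setting the seed and capturing the environment. The seed value is arbitrary.

```r
# The same seed gives the same "random" numbers on every run.
set.seed(2014)
first_draw <- rnorm(5)
set.seed(2014)
second_draw <- rnorm(5)
same <- identical(first_draw, second_draw)   # TRUE

# Record the R version, OS, and loaded packages alongside the results.
info <- sessionInfo()
r_version <- info$R.version$version.string
print(info)
```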

Page 29: R - the language

Sharing Research

● R Markdown – markdown with embedded R

– knitr package executes the R fragments and embeds the code and results into markdown, which can convert to HTML or PDF

– Literate programming!

● Hosted documentation

– RPubs (rpubs.com)

– GitHub gh-pages (github.io)

Page 30: R - the language

Sharing Research

● Embedded presentations

– Author using slidify package

– R Markdown with embedded R code

– Creates HTML5 presentation slide deck

– Can include inline quizzes

Page 31: R - the language

Data Products

● Interactive visualizations

– shiny, shinyapps packages

– RStudio includes interactive display of shiny applications during development

– Generates Bootstrap + HTML5 + JavaScript + d3 application

● Hosted!

– Hosted at shinyapps.io

– Private? Server images available (for purchase)