Advanced Statistics Using R
Stephen Cox [email protected]
Spring 2007

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
H. G. Wells

Data analysis is an aid to thinking and not a replacement for it.
Richard Shillington

“Organic chemist!”, said Tilley disdainfully. “Probably knows no statistics whatever.”
Nigel Balchin, The Small Back Room

Statistics means never having to say you’re certain.
Philip Stark

Before the curse of statistics fell upon mankind we lived a happy, innocent life, full of merriment and go, and informed by fairly good judgment.
Hilaire Belloc, The Silence of the Sea
Why R?
• An open source environment for statistical computing and visualization
– GNU/GPL version of the S Language from Bell Laboratories
– Highly extensible (i.e., customizable)
• Integrated suite of software facilities for data manipulation, calculation, analysis, and graphical display
– Effective data handling and storage facility
– Large, coherent, integrated collection of tools for data analysis
– Graphical facilities for data analysis and display
– A well-developed, simple, and powerful programming language

“The term "environment" is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.”

R is free :)
Binaries available for Windows, Mac, Linux, Unix, …
– R Scripts (text file with name file_name.R)
• Save a long list of commands in a text file
• Run the script using source()
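As a rough illustration of the script workflow described above, the sketch below writes a two-line script to a temporary file and then runs it with source(). The file contents and the variable names (x, x_mean) are made up for this example; only source() itself comes from the slide.

```r
# Write a tiny script to a temporary file (hypothetical name and contents)
script_path <- file.path(tempdir(), "file_name.R")
writeLines(c(
  "x <- c(2, 4, 6, 8)   # some data",
  "x_mean <- mean(x)    # a derived quantity"
), script_path)

# Run every command in the script, in order, in the current session
source(script_path)

# Objects created by the script are now available
x_mean
```

In everyday use the script would simply be saved from a text editor, and source("file_name.R") would be called with its path.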
# calculates confidence interval based upon Fieller's thm.
# modified version of EC50.calc found in P&B Fig 7.22
# now allows other link functions, using the calculations
# found in dose.p (MASS)
# SBC 19 May 05
R Libraries (aka Packages)
• Suites of predefined R code
• Available for a wide variety of topics and specific analyses
• Useful examples
– drc: Analysis of dose-response curves
– survival: Survival analysis, including penalised likelihood
– nlme: Linear and nonlinear mixed effects models
– NADA: Nondetects And Data Analysis for environmental data
– ade4: Analysis of Environmental Data : Exploratory and Euclidean method
– Rcmdr: R Commander (GUI)
• Install the ‘base’ R package
– Self-extracting installer
• Find, install R libraries (i.e., extensions)
– Listing of many contributed packages
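Installing and loading a package might look like the sketch below. It assumes that contributed packages are downloaded with install.packages() (left commented out here because it needs a network connection), and it uses MASS for the loading step only because MASS usually ships with R as a recommended package. The variable name has_mass is illustrative.

```r
# install.packages("drc")   # one-time download of a contributed package

# Check whether MASS is installed before trying to attach it
has_mass <- requireNamespace("MASS", quietly = TRUE)

if (has_mass) {
  library(MASS)             # attach the package for this session
  print(exists("dose.p"))   # dose.p() is the MASS function named in the script header
}
```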
Getting data in / out…
• Generally, two import/export options
– Exchange via delimited ASCII file
• R method read.table() (and variants)
– Exchange with external file formats via add-on R package
• RDBMS
– ROracle: Oracle database interface for R
– RODBC: ODBC database access
• Commercial Statistics Packages
– RODBC: ODBC database access
– foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, …
– R.matlab: Read and write of MAT files together with R-to-Matlab …
Getting data in / out…
• ASCII data import: the read.*() family
– read.table(): reads a delimited ASCII file (whitespace-delimited by default), creates a data frame
• read.csv(), read.delim()... also create a data frame
• But have different default input parameters (e.g., comma or tab as the separator)
– read.fwf(): reads a fixed-width format ASCII file
– scan(): reads data into a vector or list from the console OR a file
• ASCII data export
– write.table(): writes data to an ASCII text file
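A small round trip can make the import/export functions above concrete. This is only a sketch: the data frame, its column names, and the file name are invented for illustration.

```r
# A small made-up data frame (values are illustrative only)
dat <- data.frame(conc = c(0, 10, 20), enzyme = c(5.1, 4.2, 3.3))

# Export: comma-delimited text file, without row names
csv_path <- file.path(tempdir(), "example.csv")
write.table(dat, csv_path, sep = ",", row.names = FALSE)

# Import: read.table() needs the header and separator spelled out...
dat1 <- read.table(csv_path, header = TRUE, sep = ",")

# ...whereas read.csv() assumes a header row and a comma separator
dat2 <- read.csv(csv_path)

str(dat1)   # inspect the column types R inferred on import
```

The two import calls differ only in their defaults, which is why read.csv() is usually the shorter choice for comma-delimited files.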
Models in R
• R is built on the notion that statistical analysis can be viewed as an exercise in statistical modeling, an exercise that is tightly linked to the original scientific question.
• This view provides a coherent framework for
– conducting standard hypothesis tests
– dealing with data that contain complexities that restrict the use of standard hypothesis tests
– estimating effect sizes
– prediction
Example model
• We think that the concentration of a blood enzyme (Y) is the result of exposure to Pb. We design an experiment and expose organisms to a series of concentrations of Pb ($\alpha$).
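$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad \varepsilon_{i\cdot} \sim N(0, \sigma^2)$
– $\mu$: grand mean of all $Y_{ij}$
– $\alpha_i$: effect of concentration $i$
– $\varepsilon_{ij}$: random variability in Y after accounting for Pb concentration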
Errors within each level of $\alpha$ are normally distributed with mean = 0 and variance = $\sigma^2$
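One way to see this model in action is to simulate data that follow it and then fit it, as in the sketch below. The grand mean, the three concentration effects, the error standard deviation, and the level names are all assumed values chosen for illustration, not numbers from the slides.

```r
set.seed(1)                                   # reproducible simulated data

# Three Pb concentration levels, 10 organisms each (hypothetical design)
conc <- factor(rep(c("low", "mid", "high"), each = 10),
               levels = c("low", "mid", "high"))

mu    <- 10                                   # grand mean (assumed)
alpha <- c(low = 2, mid = 0, high = -2)       # effects of each level (assumed)

# Response = grand mean + level effect + normal error with mean 0, sd 1
y <- mu + unname(alpha[as.character(conc)]) + rnorm(length(conc), mean = 0, sd = 1)

fit_aov <- lm(y ~ conc)                       # Pb treated as a factor
anova(fit_aov)                                # F test for the concentration effect
```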
An alternative model
• We think that the concentration of a blood enzyme (Y) is the result of exposure to Pb. We design an experiment and expose organisms to a series of concentrations of Pb. Let’s consider Pb as a continuous variable (X).
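Treating Pb as a continuous variable leads to a regression rather than an ANOVA. The slide's own equation did not survive extraction, so the sketch below assumes the usual simple linear regression form, $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. The concentrations, the intercept, and the slope are made-up values.

```r
set.seed(2)                                   # reproducible simulated data

# Pb concentration as a number (hypothetical units), 6 organisms per dose
pb <- rep(c(0, 5, 10, 20, 40), each = 6)

# Assumed true intercept (12) and slope (-0.1), plus normal error
enzyme <- 12 - 0.1 * pb + rnorm(length(pb), sd = 1)

fit_reg <- lm(enzyme ~ pb)                    # Pb treated as a continuous predictor
coef(fit_reg)                                 # estimated intercept and slope
```

Compared with the factor version, this model spends one parameter on Pb (a slope) instead of one per concentration level.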
Dummy Variables
• We could rewrite the ANOVA model using the regression “terminology” via dummy variables. For example, assume 3 concentrations.
• Strategy
– Recode the independent variables ($X_i$) using 0 or 1 to represent treatment levels.
Analysis of Variance (ANOVA)
$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i$

Contrast Matrix:
Concentration   X1   X2
1               0    0
2               1    0
3               0    1
The way we perform the coding of dummy variables determines how to interpret model parameters. This coding scheme is called “Treatment Contrasts” - the default in R
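The treatment-contrast coding can be inspected directly in R. The sketch below uses a hypothetical factor with three concentration levels and two replicates each. contr.treatment() and model.matrix() are base R functions.

```r
# Three concentration levels, two replicates each (hypothetical design)
conc <- factor(c(1, 1, 2, 2, 3, 3))

# Contrast matrix for a 3-level factor: rows are levels, columns are dummy variables
contr.treatment(3)

# Design matrix actually used by lm(): an intercept column plus the two dummies
X <- model.matrix(~ conc)
X

# The contrast scheme currently in effect for unordered and ordered factors
options("contrasts")
```

Under this coding the intercept estimates the mean of level 1, and each remaining coefficient estimates the difference between its level and level 1.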
A further complication
• We think that the concentration of a blood enzyme (Y) is the result of exposure to Pb. We design an experiment and expose organisms to a series of concentrations of Pb ($\alpha$). Assume we also want to get rid of the possibly confounding effects of body size (S).
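The model equation for this slide did not survive extraction. One common way to adjust for a covariate such as body size is to add it as a continuous term alongside the concentration factor, for example $Y_{ij} = \mu + \alpha_i + \beta S_{ij} + \varepsilon_{ij}$; that form is an assumption here, not necessarily the one shown in the lecture. The sketch below simulates data under it with made-up parameter values.

```r
set.seed(3)                                   # reproducible simulated data

conc <- factor(rep(c("low", "mid", "high"), each = 10),
               levels = c("low", "mid", "high"))
size <- runif(length(conc), min = 5, max = 15)   # body size (hypothetical units)

# Assumed level effects (2, 0, -2) and assumed size slope (0.5)
y <- 10 + c(2, 0, -2)[as.integer(conc)] + 0.5 * size + rnorm(length(conc))

# Covariate first, so the sequential ANOVA tests Pb after adjusting for size
fit_ancova <- lm(y ~ size + conc)
anova(fit_ancova)
```

Because anova() on an lm fit uses sequential sums of squares, the order of terms in the formula matters when predictors are correlated.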
The general linear model
• Forms the basis for most classical statistics
• Implemented in R through lm()
> m1 = lm(y ~ x, data)   # fit the model and save output as “m1”
> summary(m1)            # print a table summary of model information
> anova(m1)              # summarize results in an ANOVA table
$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon_i$
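The same three calls work for any number of predictors. The sketch below fits a model with p = 2 predictors to simulated data. The data frame, its column names (x1, x2, y), and the true coefficients are assumptions made for illustration.

```r
set.seed(4)                                   # reproducible simulated data

# Two made-up predictors and a response built from assumed coefficients
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
dat$y <- 1 + 2 * dat$x1 - 1 * dat$x2 + rnorm(40, sd = 0.5)

m1 <- lm(y ~ x1 + x2, data = dat)   # fit the model and save the output as m1
summary(m1)                         # coefficient table, R-squared, residual SE
anova(m1)                           # sequential ANOVA table
confint(m1)                         # 95% confidence intervals for the coefficients
```

Naming the data argument explicitly (data = dat) is optional but makes it easier to see which data frame the formula's variables come from.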