Getting Data into R & Bioconductor Aedín Culhane http://www.hsph.harvard.edu/research/aedin-culhane/
Jan 17, 2016
Simple Excel Spreadsheet data
• Already described
– read.table()
– read.csv()
– scan()
• There are other formats, e.g. NetCDF
• However, these are more datatype-specialized.
– Look at Technologies on BiocViews.
– http://www.bioconductor.org/packages/release/BiocViews.html
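A minimal sketch of the three base R readers listed above, assuming the spreadsheet has been exported from Excel as a comma-separated file. The gene names and values are made up for illustration, and the file is written to a temporary location so the example runs on its own.

```r
# Hypothetical table standing in for a spreadsheet exported as CSV
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene,sample1,sample2",
             "TP53,7.1,6.8",
             "BRCA1,5.4,5.9",
             "MYC,9.2,8.7"), csv_path)

# read.csv() assumes a header row and comma separators
expr_csv <- read.csv(csv_path, row.names = 1)

# read.table() is the general form; header and separator are given explicitly
expr_tab <- read.table(csv_path, header = TRUE, sep = ",", row.names = 1)

# scan() returns a flat vector of fields rather than a data frame
vals <- scan(csv_path, what = character(), sep = ",", quiet = TRUE)

print(expr_csv)
print(length(vals))
```

read.csv() and read.table() both return a data frame with genes as row names; scan() is lower level and leaves the reshaping to you.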
Some common data types
• Microarray
• SNP
• Increasingly NGS
May 2011
A Microarray Overview
Reading Affymetrix Data
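library(affy)   # or require(affy), which returns FALSE rather than stopping if the package is missing
# Read all CEL files in a directory into an AffyBatch object
affybatch <- ReadAffy(celfile.path="[Location of your data]")
# Or read and RMA-normalize in one step, returning an ExpressionSet
# (justRMA() looks in the working directory unless celfile.path is given)
eSet <- justRMA()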
Sample R code
ExpressionSet Class in R
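A minimal sketch of how an ExpressionSet bundles an expression matrix with sample annotation, assuming the Biobase package is installed. The probe names, sample names, and group labels are invented for illustration.

```r
library(Biobase)

# Hypothetical expression matrix: 4 probes (rows) x 3 samples (columns)
set.seed(1)
exprs_mat <- matrix(rnorm(12), nrow = 4,
                    dimnames = list(paste0("probe", 1:4),
                                    paste0("sample", 1:3)))

# Sample annotation; row names must match the matrix column names
pheno <- data.frame(group = c("control", "control", "treated"),
                    row.names = colnames(exprs_mat))

eset <- ExpressionSet(assayData = exprs_mat,
                      phenoData = AnnotatedDataFrame(pheno))

dim(exprs(eset))   # expression values: probes x samples
pData(eset)        # sample annotation (phenoData)
```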
Assessing Data Quality
Public Microarray Data
ArrayExpress • 21,997 Studies (622,617 profiles)
GEO • 22,735 Studies (558,074 profiles)
Statistics May 2011
>500,000 arrays x $500 = $250,000,000
Cancer Studies account for >14% of all studies in databases…
R Code
More on GEOquery
require(GEOquery)
Let's try to load the GDS810 dataset, which contains data on Alzheimer's disease at various stages of severity.
GDS810 <- getGEO("GDS810")
The getGEO function returns an object of class GEOData. You can get a description of this class like this:
help("GEOData-class")
Meta(GDS810)
Columns(GDS810)
head(Table(GDS810))
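A possible next step, sketched on the assumption that GEOquery and Biobase are installed and GEO is reachable: GEOquery's GDS2eSet() converts a GDS object into an ExpressionSet, so the usual accessors apply. Whether GDS810 needs a log2 transform is not stated here, so do.log2 is left at FALSE.

```r
library(GEOquery)
library(Biobase)

# Download the curated dataset (requires an internet connection)
gds <- getGEO("GDS810")

# Convert to an ExpressionSet; do.log2 = FALSE leaves the values as deposited
eset <- GDS2eSet(gds, do.log2 = FALSE)

dim(exprs(eset))    # probes x samples
head(pData(eset))   # sample annotation taken from the GDS columns
```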
Affy SNP Arrays
Process – Affy SNP Arrays (Oligo package)
Other Arrays
• Illumina
– lumi package
• 2 color spotted arrays
– limma package
• Other arrays
– http://www.bioconductor.org/help/workflows/oligo-arrays/
Next Generation Sequencing Data
R Code
Exercise
• Bring down a GEO series (GSE) from GEO
• Download the dataset GSE1297 using getGEO
• These data will be downloaded as an eSet, so to see the expression data and phenoData, use exprs and pData, respectively
• Use arrayQualityMetrics to assess the quality of these data
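One way the exercise might be approached, as a sketch rather than a model answer. It assumes GEOquery, Biobase, and arrayQualityMetrics are installed and that GEO is reachable; the output directory name "GSE1297_QC" is an arbitrary choice.

```r
library(GEOquery)
library(Biobase)
library(arrayQualityMetrics)

# For a GSE accession, getGEO() returns a list of ExpressionSets (one per platform)
gse <- getGEO("GSE1297")
eset <- gse[[1]]

dim(exprs(eset))    # expression data: probes x samples
head(pData(eset))   # phenoData: one row per sample

# Writes an HTML quality report into the named directory.
# If the values are not on a log scale, do.logtransform = TRUE may be appropriate.
arrayQualityMetrics(expressionset = eset,
                    outdir = "GSE1297_QC",
                    force = TRUE)
```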
• With thanks to
• www.bioconductor.org/help/course.../Bioconductor-Introduction-lab.pdf
Quick Aside: Interpreting hierarchical clustering trees
Hierarchical analysis results viewed using a dendrogram (tree)
• Distance between nodes (Scale)
• Ordering of nodes not important (like baby mobile)
[Figure: two dendrograms, A and B]
Tree A and B are equivalent
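The equivalence can be checked in base R. This is a small sketch on made-up random data: flipping the branches of a dendrogram changes the left-to-right leaf order but leaves every pairwise (cophenetic) distance on the tree unchanged.

```r
# Hypothetical data: 5 samples measured on 4 variables
set.seed(1)
x <- matrix(rnorm(20), nrow = 5,
            dimnames = list(paste0("s", 1:5), NULL))

hc <- hclust(dist(x), method = "average")
treeA <- as.dendrogram(hc)
treeB <- rev(treeA)   # flip the branches at every node, like spinning a mobile

labels(treeA)         # leaf order in tree A
labels(treeB)         # reversed leaf order in tree B

# Cophenetic distance = height at which two leaves are first joined
copA <- as.matrix(cophenetic(treeA))
copB <- as.matrix(cophenetic(treeB))

# Compare with rows and columns put in the same sample order
ids <- rownames(x)
same_distances <- isTRUE(all.equal(copA[ids, ids], copB[ids, ids]))
print(same_distances)
```

Plotting both with plot(treeA) and plot(treeB) shows mirror-image trees with identical branch heights.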