Discrimination Models and Variance Stabilizing Transformations of Metabolomic NMR Data
Institute on Research and Statistics, Sacramento, 04/08/04
Parul Vora Purohit
Mar 27, 2015
Biodata and ‘omics
Genome Project
Genomics - Study of genes
Proteomics - Study of proteins
Metabolomics - Study of metabolites *
cellomics, CHOmics, chromonoics, etc.
Analytical techniques
Microarray Spectroscopy
Mass Spectroscopy
NMR Spectroscopy *
NMR Spectroscopy
Courtesy ~ Joseph Medendorp / Public Information / University of Kentucky
Intense, homogeneous magnetic field
High-powered RF transmitter capable of delivering short pulses (~500 MHz) that stimulate 1H nuclear spin transitions
Probe containing the coils used to excite and detect the signal
Plot of signal vs shift in frequency from original pulse
Measured in ppm (ratio from the original signal)
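As a worked illustration of the ppm scale, here is a minimal sketch assuming the usual convention that the shift is the frequency offset from a reference resonance divided by the reference frequency. The function name and the example frequencies are hypothetical and are not taken from the slides.

```python
def chemical_shift_ppm(freq_hz, ref_freq_hz):
    """Chemical shift in ppm: offset from the reference, relative to the reference."""
    return (freq_hz - ref_freq_hz) / ref_freq_hz * 1e6

# Hypothetical example: a resonance 1070 Hz above a reference at 500 MHz
# lands close to 2.14 ppm.
shift = chemical_shift_ppm(500e6 + 1070.0, 500e6)
print(round(shift, 2))
```

Because the shift is a ratio, the same compound appears at the same ppm value regardless of the field strength of the instrument.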
NMR Data
Allows detection of compounds with H content
Shift characterizes the chemicals (metabolites)
Examples:
2.14 ppm – glutamine – γ CH2 group
2.27 ppm – valine – β CH group
6.91 ppm – tyrosine – C3, 5H ring
~65,000 points (variables) per sample
Questions
Classification ~ Can we distinguish sick organisms from the healthy ones?
Identification ~ Which metabolites play a role in the disease (biomarker)?
DIFFERENCES IN THE DETAILS!
Abalone Data
A set of 18 abalone: 8 healthy, 5 stunted, 5 sick
Tissue from muscle
Questions:
Can we classify the abalone accurately?
Can we detect any metabolites that are markers?
Problems / Solutions Multivariate Techniques
Matrix of 65,000 (variables) x 18 (samples)
Too many variables compared to the number of samples
Dimension reduction by binning
Classification and metabolite marker identification using PCA and Cluster Analysis
Methods assume that the data is normally distributed with a constant variance
Generalized Log Transformation improves results!
NMR Data Pre-Processing
Background Subtraction
TMSP Peak (standard at 0 ppm) removed
Water Peak Removal (4.72-4.96 ppm removed)
Normalization: Integrated Intensity normalized to 1.0 to remove the effects of systematic intensity changes between abalone
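The removal and normalization steps above can be sketched as follows. This is a minimal illustration, not the author's code: the function names are hypothetical, the width of the TMSP window (±0.1 ppm) is an assumption since the slides only say the peak at 0 ppm is removed, and the integrated intensity is approximated by a plain sum over evenly spaced points.

```python
def drop_regions(ppm, intensity, regions):
    """Keep only points whose shift lies outside every (low, high) region."""
    kept = [(p, v) for p, v in zip(ppm, intensity)
            if not any(low <= p <= high for low, high in regions)]
    return [p for p, _ in kept], [v for _, v in kept]

def normalize_total(intensity):
    """Scale intensities so that their sum (integrated intensity) is 1.0."""
    total = sum(intensity)
    if total == 0:
        raise ValueError("cannot normalize a spectrum with zero total intensity")
    return [v / total for v in intensity]

# Toy four-point spectrum (made-up numbers, for illustration only)
ppm = [0.0, 1.0, 4.8, 5.4]
intensity = [10.0, 2.0, 50.0, 6.0]

regions = [(-0.1, 0.1),    # TMSP window: assumed width
           (4.72, 4.96)]   # water region, as given on the slide
kept_ppm, kept_intensity = drop_regions(ppm, intensity, regions)
normalized = normalize_total(kept_intensity)
print(kept_ppm, normalized)
```

Normalizing after the TMSP and water regions are dropped keeps those large peaks from dominating the total.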
Binning / Size
Binned Spectrum
Bin Size Range = 0.00125 ppm – 0.7 ppm
Intensity of Bin = Integrated Intensity of all points in Bin
Restricted Region of interest to 0.2 ppm – 10.0 ppm
Bin Size = 0.04 ppm
239 Bins
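The binning step can be sketched as below. The layout is an assumption: bins are laid out upward from 0.2 ppm, numbered from 1, and the bins falling inside the 4.72-4.96 ppm water region are dropped. Under that assumption the count comes to 239, and bins 76 and 124 are centred at 3.22 and 5.38 ppm, matching the bin positions listed later under "Significant Bins". The function names are hypothetical and this is one layout consistent with the slides, not necessarily the procedure actually used.

```python
def make_bins(lo=0.2, hi=10.0, width=0.04, exclude=(4.72, 4.96)):
    """Return (left, right) edges of the retained bins, lowest shift first."""
    n_bins = round((hi - lo) / width)
    bins = []
    for k in range(n_bins):
        left = lo + k * width
        center = left + width / 2
        if exclude[0] < center < exclude[1]:
            continue  # bin lies inside the excluded (water) region
        bins.append((left, left + width))
    return bins

def bin_spectrum(ppm, intensity, bins):
    """Intensity of a bin = sum of the intensities of all points inside it."""
    totals = [0.0] * len(bins)
    for p, v in zip(ppm, intensity):
        for i, (left, right) in enumerate(bins):
            if left <= p < right:
                totals[i] += v
                break
    return totals

bins = make_bins()
centers = [round((left + right) / 2, 2) for left, right in bins]
print(len(bins))                   # 239
print(centers[75], centers[123])   # bins 76 and 124 (1-based): 3.22 5.38
```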
Principal Component Analysis (PCA)
Technique that explains the variance-covariance structure of the variables in terms of a few linear combinations of them
$X = t_1 p_1^T + t_2 p_2^T + \dots + t_k p_k^T + E$
$p_i$ - eigenvectors
Projections of the original data matrix on these components give the relations between the samples – Scores Plot
A plot of the eigenvectors of the covariance matrix gives a relationship between the variables – Loadings Plot
Reduces the dimension of the problem; a few components suffice to explain the variance
* Courtesy Wise, B. M. and Gallagher, N. B., PLS_Toolbox 2.1
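To make the decomposition concrete, here is a minimal sketch that extracts the first loading vector p1 and the first score vector t1 from a small samples-by-variables matrix. It uses power iteration on the mean-centered data, which is a swapped-in technique chosen to keep the example dependency-free; it is not the PLS_Toolbox routine cited above, and the function name and toy data are hypothetical.

```python
import math

def first_principal_component(X, iterations=200):
    """Return (loading vector p1, scores t1) for rows-as-samples data X."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]  # mean-center

    def scores(p):
        # Project each centered sample onto the direction p.
        return [sum(x * w for x, w in zip(row, p)) for row in Xc]

    # Power iteration on Xc^T Xc, which is proportional to the covariance matrix.
    start = [float(j + 1) for j in range(m)]  # arbitrary non-zero start vector
    start_norm = math.sqrt(sum(v * v for v in start))
    p = [v / start_norm for v in start]
    for _ in range(iterations):
        t = scores(p)
        q = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0.0:
            break  # the data have no variance along this direction
        p = [v / norm for v in q]
    return p, scores(p)

# Toy data: almost all of the variance lies along the first variable.
X = [[2.0, 0.0], [0.0, 0.1], [-2.0, 0.0], [0.0, -0.1]]
p1, t1 = first_principal_component(X)
print(p1, t1)
```

Plotting t1 against t2 for all samples would give a scores plot, and plotting the entries of p1 against the variable (bin) index would give a loadings plot.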
PCA Results
Scores Plot Loadings Plot
Cluster Analysis - Hierarchical
Transformed Data – Groups Clearly Identified
Untransformed Data
Generalized Log Transformation
Shown* that a transformation of the form
$f(y) = \ln\left( y + \sqrt{y^2 + c} \right)$
can have a variance-stabilizing effect on the data
The parameter c can be obtained by Maximum Likelihood or ANOVA methods and is approximately
$c \approx \sigma^2 / S^2$
where $\sigma^2$ is the variance of the noise and $S^2$ the variance of the high peaks
* Durbin, B., Hardin, J., Rocke, D. M., Bioinformatics, 2002, 18, S105-S110
* Sue Geller, Jeff Gregg, Paul Hagerman, David Rocke, Transformation and Normalization of Oligonucleotide Microarray Data, 2003
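A minimal sketch of the generalized log transformation, assuming c > 0. It relies on the identity ln(y + sqrt(y^2 + c)) = asinh(y / sqrt(c)) + 0.5 ln(c), which avoids cancellation when y is large and negative. The function name is hypothetical. The value c = 2.7e-7 is the one quoted later in the slides, and the sample intensities are made up.

```python
import math

def glog(y, c):
    """Generalized log f(y) = ln(y + sqrt(y^2 + c)), evaluated via asinh."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.asinh(y / math.sqrt(c)) + 0.5 * math.log(c)

c = 2.7e-7
for y in (0.0, 1e-4, 1e-2, 1.0):
    print(y, glog(y, c))
```

For y much larger than sqrt(c) the transformation behaves like ln(2y), while near zero it is approximately linear, so small and negative baseline values remain defined.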
Maximum Likelihood*
Need replicates to determine SSE(c) accurately
Find the c that gives the minimum SSE
Step through c using Newton's method or educated intervals
* Box, G. and Cox, D. R. (1964) An analysis of transformations. J. Roy. Stat. Soc., Series B (Methodological), 26, 211.
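The search for c can be sketched as a grid search over the error sum of squares computed from replicate groups. This is a sketch under assumptions: following the Box and Cox approach, the transformed values are rescaled by the geometric mean of sqrt(y^2 + c) (the inverse of the derivative of the transformation) so that SSE values are comparable across different c. The slides do not spell out this scaling. The function names and replicate values are hypothetical, and the grid reuses the c values shown on the plot axis.

```python
import math

def glog(y, c):
    """Generalized log f(y) = ln(y + sqrt(y^2 + c)), evaluated via asinh."""
    return math.asinh(y / math.sqrt(c)) + 0.5 * math.log(c)

def scaled_sse(groups, c):
    """SSE of Jacobian-scaled glog values about each replicate group's mean."""
    ys = [y for group in groups for y in group]
    # Geometric mean of sqrt(y^2 + c), computed in log space.
    log_gm = sum(0.5 * math.log(y * y + c) for y in ys) / len(ys)
    scale = math.exp(log_gm)
    sse = 0.0
    for group in groups:
        z = [scale * glog(y, c) for y in group]
        mean = sum(z) / len(z)
        sse += sum((v - mean) ** 2 for v in z)
    return sse

def best_c(groups, grid):
    """Return the grid value of c with the smallest scaled SSE."""
    return min(grid, key=lambda c: scaled_sse(groups, c))

# Made-up replicate intensities for two bins (illustration only)
groups = [[1.0e-4, 1.2e-4, 0.9e-4], [5.0e-3, 5.5e-3, 4.6e-3]]
grid = [2.2e-7, 2.4e-7, 2.6e-7, 2.8e-7, 3.0e-7, 3.2e-7]
c_hat = best_c(groups, grid)
print(c_hat, scaled_sse(groups, c_hat))
```

A finer grid around the minimum, or a Newton step on the SSE curve, would refine the estimate.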
[Figure: Error Sum of Squares versus c; c axis from 2.2*10^-7 to 3.2*10^-7, Error Sum of Squares axis from 1.886*10^-9 to 1.889*10^-9]
Transformed Spectrum
Bin Size = 0.04 ppm, 239 Bins, c = 2.7e-7
Calculate 'c' using the replicate data by maximum likelihood methods
Use a transformation of the form $f(y) = \ln\left( y + \sqrt{y^2 + c} \right)$
Transform data to stabilize the variance
Stabilized Variance
Bin Size = 0.04 ppm
c = 2.7e-7
Scores Plot – Transformation Effects
Untransformed Data Transformed Data
Loadings Plot – Transformation Effects
Untransformed Data Transformed Data
Cluster Analysis - Hierarchical
Transformed Data – Groups Clearly Identified
Untransformed Data
Raw Spectra – Significant Bins
Bin 124 – 5.38 ppm
Bin 125 – 5.42 ppm
Bin 126 – 5.46 ppm
Bin 76 – 3.22 ppm
Bin 77 – 3.26 ppm
Bin 78 – 3.30 ppm
[Two panels, each comparing Healthy, Stunted, and Sick abalone]
Glycogen, Sucrose, Fructose ?
Conclusions
Demonstrated the use of data reduction and multivariate techniques for studying NMR and Mass Spectrometer data
Demonstrated the use of these techniques to identify metabolite and protein biomarkers
Showed that variance-stabilizing transformations make the data easier to classify
Acknowledgements
David M. Rocke, CIPIC
David L. Woodruff, CIPIC
Mark R. Viant, U. of Birmingham, U. K.