Advanced Strategies for Metabolomic Data Analysis
Dmitry Grapov, PhD

Lecture 2 Multivariate Data Analysis and Visualization

Oct 28, 2015

west coast metabolomics center, data analysis, metabolomics, summer sessions in metabolomics 2013, dmitry grapov
Transcript
Page 1: Lecture 2 Multivariate Data Analysis and Visualization

Advanced Strategies for Metabolomic Data Analysis

Dmitry Grapov, PhD

Metabolomic Data Analysis

Page 2: Lecture 2 Multivariate Data Analysis and Visualization

Analysis at the Metabolomic Scale

Page 3: Lecture 2 Multivariate Data Analysis and Visualization

Multivariate Analysis

[Figure: data matrix of samples and variables]

Page 4: Lecture 2 Multivariate Data Analysis and Visualization

Multivariate Analysis

• Visualization
• Clustering
• Projection
• Modeling
• Networks

Simultaneous analysis of many variables

Page 5: Lecture 2 Multivariate Data Analysis and Visualization

Clustering

Identify
• patterns
• group structure
• relationships

• Evaluate/refine hypothesis
• Reduce complexity

Artist: Chuck Close

Page 6: Lecture 2 Multivariate Data Analysis and Visualization

Cluster Analysis

Use the concept of similarity/dissimilarity to group a collection of samples or variables

Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self-organizing maps (SOM)

[Figure panels: Linkage, k-means, Distribution, Density]

Page 7: Lecture 2 Multivariate Data Analysis and Visualization

Hierarchical Cluster Analysis

• similarity/dissimilarity defines “nearness” or distance

[Figure: distance between points X and Y under Euclidean, Manhattan, Mahalanobis, and non-Euclidean measures]
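Two of the distance measures named on this slide can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the example points are assumptions made here, not material from the lecture. Mahalanobis distance is left out because it also needs the inverse covariance matrix of the data.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical example points (not from the slides).
x = [0.0, 0.0]
y = [3.0, 4.0]

d_euc = euclidean(x, y)
d_man = manhattan(x, y)
print(d_euc, d_man)
```

The same pair of points is farther apart under the Manhattan measure than under the Euclidean one, which is why the choice of distance can change which samples end up grouped together.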

Page 8: Lecture 2 Multivariate Data Analysis and Visualization

Hierarchical Cluster Analysis

Agglomerative/linkage algorithm defines how points are grouped

[Figure panels: single, complete, centroid, and average linkage]
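How the linkage rule changes the grouping can be sketched with a naive agglomerative procedure. The code below is an illustrative, unoptimized sketch in standard-library Python; the function names, the Euclidean distance choice, and the example points are assumptions, and centroid linkage is omitted for brevity.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def cluster_distance(points, c1, c2, linkage):
    """Distance between two clusters given as lists of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(points, clusters[a], clusters[b], linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
groups = agglomerate(points, n_clusters=2, linkage="single")
print(groups)
```

On well-separated data such as this, all three linkage rules recover the same two groups; differences between them show up when clusters are elongated or overlapping.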

Page 9: Lecture 2 Multivariate Data Analysis and Visualization

Hierarchical Cluster Analysis (cont.)

[Figure: dendrogram; vertical axis: similarity]

Page 10: Lecture 2 Multivariate Data Analysis and Visualization

Hierarchical Cluster Analysis (cont.)

How does my metadata match my data structure?

[Figure panels: Overview, Confirmation]

Page 11: Lecture 2 Multivariate Data Analysis and Visualization

Multidimensional Scaling

PLoS ONE 7(11): e48852. doi:10.1371/journal.pone.0048852

Page 12: Lecture 2 Multivariate Data Analysis and Visualization

Projection of Data

The algorithm defines the position of the light source

Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)

Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
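The "maximize covariance" idea can be made concrete for the simplest case of a single response. For column-centered X and centered y, the unit-length weight vector whose scores have the largest covariance with y is proportional to X'y. The sketch below computes only that first weight vector; it is not a full PLS implementation, and the function names and data are assumptions made for illustration.

```python
import math

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def pls1_first_weights(X, y):
    """Unit-length weights proportional to X'y (X, y centered internally)."""
    cols = [center(list(c)) for c in zip(*X)]   # one list per variable
    yc = center(y)
    w = [sum(xi * yi for xi, yi in zip(col, yc)) for col in cols]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

# Hypothetical data: rows = samples, columns = variables.
# Variable 2 has much larger variance but no covariance with y.
X = [[1.0, 10.0], [2.0, -10.0], [3.0, -10.0], [4.0, 10.0]]
y = [1.0, 2.0, 3.0, 4.0]

w = pls1_first_weights(X, y)
print(w)
```

In this constructed example the weights fall entirely on variable 1, because variable 2 carries no information about y. An unsupervised projection that maximizes variance alone would instead favor variable 2.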

Page 13: Lecture 2 Multivariate Data Analysis and Visualization

PCA: Goals

Principal Components (PCs)

• unsupervised

• projections of the data that maximize the variance explained

Results

1. eigenvalues = variance explained by each PC

2. scores = new coordinates for samples (rows)

3. loadings = weights of the original variables in the linear combination that defines each PC

James X. Li, 2009, VisuMap Tech.
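The three PCA results listed on this slide can be illustrated with a small sketch. It uses only the Python standard library and power iteration to find the first principal component; the data values, function names, and the choice of power iteration are assumptions made for illustration, not the method used in the lecture. Power iteration as written assumes the starting vector is not orthogonal to the leading eigenvector.

```python
import math

def column_center(X):
    """Subtract each column mean (rows = samples, columns = variables)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_pc(C, iters=500):
    """Leading eigenvalue and eigenvector of symmetric C by power iteration."""
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Cv = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, Cv))
    return eigenvalue, v

# Hypothetical data: five samples, two strongly correlated variables.
X = [[2.0, 1.9], [0.5, 0.7], [1.5, 1.6], [3.0, 2.9], [1.0, 0.9]]
Xc = column_center(X)
C = covariance(Xc)

eigenvalue, loadings = first_pc(C)                    # variance along PC1, variable weights
scores = [sum(x * l for x, l in zip(row, loadings))   # sample coordinates on PC1
          for row in Xc]
explained = eigenvalue / sum(C[i][i] for i in range(len(C)))
print(round(explained, 3), loadings, scores)
```

The scores are the centered data multiplied by the loadings, and the variance of the scores equals the eigenvalue, which ties the three outputs together.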

Page 14: Lecture 2 Multivariate Data Analysis and Visualization

Interpreting PCA Results

Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings

Page 15: Lecture 2 Multivariate Data Analysis and Visualization

PCA Example

*no scaling or centering

[Figure labels: glucose, 219021]

Page 16: Lecture 2 Multivariate Data Analysis and Visualization

How are scores and loadings related?

Page 17: Lecture 2 Multivariate Data Analysis and Visualization

Centering and Scaling

van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7: 142.

Page 18: Lecture 2 Multivariate Data Analysis and Visualization

Data scaling is very important!

*autoscaling (unit variance and centered)

[Figure labels: glucose (GC/TOF), glucose (clinical), 219021]
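Autoscaling, as defined on this slide, centers each variable and divides it by its standard deviation so that variables measured on very different scales contribute equally. A minimal sketch follows; the function name and the numbers are made up for illustration, and a variable with zero variance would need special handling that is not shown.

```python
import statistics

def autoscale(X):
    """Center each column to mean 0 and scale it to unit sample variance."""
    cols = list(zip(*X))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]   # sample standard deviation (n - 1)
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]

# Hypothetical values: column 1 on a large instrument-count scale,
# column 2 on a small concentration scale.
X = [[200000.0, 5.0], [180000.0, 4.2], [250000.0, 6.1], [210000.0, 4.9]]
X_scaled = autoscale(X)

for row in X_scaled:
    print([round(v, 3) for v in row])
```

Without this step, the column with the larger raw numbers would dominate any variance-based projection such as PCA.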

Page 19: Lecture 2 Multivariate Data Analysis and Visualization

Use PLS to test a hypothesis

Loadings on the first latent variable (x-axis) can be used to interpret the multivariate changes in metabolites which are correlated with time

time = 0 120 min.

Page 20: Lecture 2 Multivariate Data Analysis and Visualization

Modeling multifactorial relationships

Dynamic changes among groups (analogous to a two-way ANOVA)

Page 21: Lecture 2 Multivariate Data Analysis and Visualization

“goodness” of the model is all about the perspective

Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model

•permutation tests

•training/testing
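The permutation-test idea can be sketched as follows: compute a fit statistic for the real data, recompute it many times after shuffling the response, and report how often the shuffled ("random") models do at least as well. In this sketch the squared Pearson correlation stands in for a model performance measure such as Q2; that substitution, the function names, and the data are all assumptions made to keep the example self-contained.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation, used here as a stand-in for model fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Fraction of label-shuffled fits that match or beat the observed fit."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)                 # break the x-y relationship
        if r_squared(x, y_perm) >= observed:
            hits += 1
    # The +1 terms count the observed data as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data with a strong linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

observed, p_value = permutation_p_value(x, y)
print(round(observed, 3), p_value)
```

A small p-value indicates that the observed fit is unlikely to arise from a random pairing of samples and responses. For a real multivariate model, the same loop would wrap the full fit-and-cross-validate procedure rather than a single correlation.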

Page 22: Lecture 2 Multivariate Data Analysis and Visualization

Biological Interpretation

• Visualization
• Enrichment
• Networks
  – biochemical
  – structural
  – spectral
  – empirical

Projection or mapping of analysis results into a biological context.

Page 23: Lecture 2 Multivariate Data Analysis and Visualization

Ingredients for Network Analysis

1. Determine connections
• biochemical (substrate/product)
• chemical similarity
• spectral similarity
• empirical dependency (correlation)

2. Determine vertex properties
• magnitude
• importance
• direction
• relationships

Page 24: Lecture 2 Multivariate Data Analysis and Visualization

Making Connections Based on Biochemistry

Organism-specific biochemical relationships and information

Multiple organism DBs
• KEGG
• BioCyc
• Reactome

Human
• HMDB
• SMPDB

Page 25: Lecture 2 Multivariate Data Analysis and Visualization

Biochemical Networks

Page 26: Lecture 2 Multivariate Data Analysis and Visualization

Making Connections Based on Structural Similarity

• Use structure to generate molecular fingerprint
• Calculate similarities between metabolites based on fingerprint
• PubChem service for similarity calculations
  – http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
• Online tools
  – http://uranus.fiehnlab.ucdavis.edu:8080/MetaMapp/homePage

BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99
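Fingerprint-based similarity can be sketched with the Tanimoto coefficient, a commonly used measure for binary molecular fingerprints. Treating it as the measure behind the services listed here is an assumption, and the fingerprints below are made-up sets of "on" bit positions rather than real PubChem fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:          # two empty fingerprints: define similarity as 0
        return 0.0
    return len(a & b) / len(union)

# Hypothetical fingerprints given as sets of 'on' bit positions.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}

sim = tanimoto(fp1, fp2)
print(sim)
```

Metabolite pairs whose similarity exceeds a chosen cutoff would then be connected by an edge in the structural similarity network; the cutoff itself is a tuning choice.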

Page 27: Lecture 2 Multivariate Data Analysis and Visualization

Structural Similarity Network

Page 28: Lecture 2 Multivariate Data Analysis and Visualization

Making Connections Based on spectral similarity

Watrous J et al. PNAS 2012;109:E1743-E1752

• Connect molecules based on EI or MS/MS spectral similarity

• Useful for linking annotated (known) analytes to unknowns
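One simple way to score spectral similarity is the cosine of the angle between two spectra treated as intensity vectors over shared m/z values. The sketch below is an assumption about how such a score could be computed, not the scoring used in the cited work; spectra are represented as dictionaries from nominal m/z to intensity, and the values are made up.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine score between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical spectra: 'known' and 'unknown' share the same peak pattern
# at different absolute intensity; 'other' shares no peaks.
known = {73: 3.0, 147: 4.0}
unknown = {73: 6.0, 147: 8.0}
other = {205: 10.0}

print(cosine_similarity(known, unknown), cosine_similarity(known, other))
```

Because the score depends on the peak pattern rather than on absolute intensity, an unknown with the same pattern as an annotated compound scores high even when it is measured at a different abundance.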

Page 29: Lecture 2 Multivariate Data Analysis and Visualization

Spectral Similarity Network

Watrous J et al. PNAS 2012;109:E1743-E1752

Page 30: Lecture 2 Multivariate Data Analysis and Visualization

Making connections based on empirical relationships

•Connect molecules based on strength of correlation or partial-correlation
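An empirical network of this kind can be sketched by computing pairwise Pearson correlations and keeping the pairs whose absolute correlation passes a threshold. The function names, the threshold of 0.8, and the measurements are assumptions for illustration; partial correlation, which removes the influence of the other variables, is not shown.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_edges(data, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges

# Hypothetical measurements for four metabolites across five samples.
data = {
    "glucose": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lactate": [2.1, 3.9, 6.2, 8.1, 9.8],
    "alanine": [5.0, 4.0, 3.0, 2.0, 1.0],
    "citrate": [2.0, 5.0, 1.0, 4.0, 3.0],
}

edges = correlation_edges(data)
for a, b, r in edges:
    print(a, b, round(r, 3))
```

The sign of r can be kept as an edge attribute so that positive and negative relationships are drawn differently in the network.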

Page 31: Lecture 2 Multivariate Data Analysis and Visualization

Treatment Effects Network

Metabolites
• Shape = increase/decrease
• Size = importance (loading)
• Color = correlation

Connections
• red = Biochemical relationships
• violet = Structural similarity

Page 32: Lecture 2 Multivariate Data Analysis and Visualization

Summary

Multivariate analysis is useful for:
• Visualization
• Exploration and overview
• Complexity reduction
• Identification of multidimensional relationships and trends
• Mapping to networks
• Generating holistic summaries of findings

Page 33: Lecture 2 Multivariate Data Analysis and Visualization

Resources

• Mapping tools (review)
  – Brief Bioinform (2012) doi: 10.1093/bib/bbs055

• Tutorials and Examples
  – http://imdevsoftware.wordpress.com/category/uncategorized/
  – https://github.com/dgrapov/TeachingDemos

• Chemical Translation Services
  – CTS: http://cts.fiehnlab.ucdavis.edu/ (R-interface: https://github.com/dgrapov/CTSgetR)
  – CIR: http://cactus.nci.nih.gov/chemical/structure (R-interface: https://github.com/dgrapov/CIRgetR)