Metabolomic data: combining wavelet representation with learning approaches
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
In collaboration with Noslen Hernández (CENATAV, Havana, Cuba) & Philippe Besse
IUT de Carcassonne (UPVD) & Institut de Mathématiques de Toulouse
Groupe de travail BioPuces, INRA de Castanet
May 19th, 2010
Overview
1 Presentation of the data
2 Wavelet preprocessing and normalization
3 Learning methods
4 Identification of relevant metabolites
Presentation of the data
Data were provided by Alain Paris (INRA): they are metabolomic spectra (¹H NMR) from mouse urine and consist of 950 variables (from 0.50 ppm to 9.99 ppm).
Peaks have been aligned and baseline has been removed.
Biological question
Study the effects of Hypochoeris radicata (HR) ingestion on the metabolism: HR flowers are responsible for a fatal disease in horses, the "Australian stringhalt" (nervous system attack, trembling...).
Experiments were carried out on 72 mice.
Description of the experiments
Mice are divided into several groups according to:
PCA of the coefficients: the day of measure for the control group is emphasized on axes 2 and 4.
Wavelet preprocessing and normalization
Normalization
Find the median and variance of the coefficients for each day of measure, based on the control group.
Use these values for the normalization of all the observations (according to the day of measure).
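The two normalization steps above can be sketched as follows. This is a minimal illustration, not the code used for the talk: the record layout (`day`, `is_control`, `coefs`), the function name and the choice to centre on the control median and divide by the control standard deviation (square root of the variance) are all assumptions.

```python
from collections import defaultdict
from statistics import median, pvariance


def normalize_by_control(rows):
    """rows: list of dicts with keys 'day', 'is_control', 'coefs' (list of floats).

    Returns one normalized coefficient list per input row, in the same order.
    Assumes every day of measure has at least one control observation.
    """
    # Collect the control-group coefficients for each day of measure.
    control = defaultdict(list)
    for r in rows:
        if r["is_control"]:
            control[r["day"]].append(r["coefs"])

    # Per day and per coefficient: median and standard deviation of the controls.
    stats = {}
    for day, obs in control.items():
        columns = list(zip(*obs))
        med = [median(c) for c in columns]
        # A zero standard deviation is replaced by 1.0 to avoid dividing by zero.
        sd = [pvariance(c) ** 0.5 or 1.0 for c in columns]
        stats[day] = (med, sd)

    # Apply the control-based statistics to every observation of the same day.
    out = []
    for r in rows:
        med, sd = stats[r["day"]]
        out.append([(x - m) / s for x, m, s in zip(r["coefs"], med, sd)])
    return out
```

Treated and control mice measured on the same day share the same centring and scaling, so the day effect estimated on the controls is removed from all observations.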
[Figure: wavelet coefficients D2.444, D.78, D.332 and D2.289 plotted against the day of measure (x-axis: Day; y-axis: Wavelet coefficients), before and after normalization.]
PCA after normalization
[Figure: pairwise scatter plots of the first four principal components after normalization (PC1 vs. PC2, PC1 vs. PC3, PC1 vs. PC4, PC2 vs. PC3, PC2 vs. PC4, PC3 vs. PC4); the legend distinguishes the days of measure (Day 0, 1, 4, 8, 11, 15, 18 and 21).]
Learning methods
Motivations
Purpose: validate the impact of HR ingestion on metabolism by predicting the total HR dose ingested from the spectra. If the prediction is accurate, the impact is not an artefact of the data and the biological dependency is validated.
Compared methods:
random forest (R package randomForest)
ridge regression (R package glmnet)
LASSO (R package glmnet)
Elasticnet (R package glmnet)
Partial Least Squares (PLS) (R package mixOmics)
sparse PLS (R package mixOmics)
Methodology
Split the data into train and test sets that are balanced according to the groups;
Preprocess (or not), scale and normalize the data with wavelets;
Learn each of the 6 methods (for each of the 7 kinds of preprocessing) on the train set with a cross-validation strategy to tune the parameters;
Calculate the mean squared error on the test set.
Repeat the previous scheme 250 times.
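The evaluation loop above can be sketched as follows. This is only an outline under stated assumptions: the function names, the `test_fraction` value and the placeholder learner `fit_mean` are hypothetical, and the wavelet preprocessing and cross-validated tuning are assumed to happen inside the user-supplied `fit` function.

```python
import random
from collections import defaultdict
from statistics import mean


def balanced_split(groups, test_fraction, rng):
    """Return (train_idx, test_idx) with the same test fraction in every group."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    train, test = [], []
    for idx in by_group.values():
        idx = idx[:]
        rng.shuffle(idx)
        n_test = max(1, round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted doses."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def repeated_evaluation(X, y, groups, fit, n_repeats=250, test_fraction=0.25, seed=0):
    """fit(X_train, y_train) must return a function predict(X_test) -> list."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = balanced_split(groups, test_fraction, rng)
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors.append(mse([y[i] for i in test], predict([X[i] for i in test])))
    return errors


def fit_mean(X_train, y_train):
    """Placeholder learner: always predicts the mean training dose."""
    m = mean(y_train)
    return lambda X_test: [m] * len(X_test)
```

Calling `repeated_evaluation(X, y, groups, fit_mean)` returns 250 test errors; replacing `fit_mean` with a real learner gives the distribution of test errors for that method and preprocessing.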
Mean performances in test
Methods | Original | Daubechies - Details | Daubechies - Full | Daubechies - Threshold | Haar - Details | Haar - Full | Haar - Threshold
Hence, due to the preprocessing step, the coefficients selected by ELN (elastic net) are not directly related to metabolites (or to locations on the spectra).
Identification of relevant metabolites
Adaptation of the importance measure
for each of the 950 variables, v, of the original data set do
    Randomize the observations of the variable v
    Compute the full Daubechies wavelet representation with the randomized observations for v
    Scale and normalize using the mean, median or variance of the true values
    for each test set, i do
        Calculate new predictions with the false values of v and the corresponding MSE: $\mathrm{mse}_{v,i}$
        Calculate the decrease in accuracy for the test set: $DA_i = 1 - \mathrm{mse}_i / \mathrm{mse}_{v,i}$
    end for
    Average $DA_i$ over i to obtain the importance of v
end for
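The adapted importance measure can be sketched as follows. This is an illustrative outline, not the implementation behind the slides: `preprocess` stands in for the full Daubechies wavelet representation followed by scaling and normalization with statistics computed on the true values, `predict` stands in for the trained model, and both names, like the rest of the signature, are assumptions.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted doses."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def variable_importance(X, y, test_sets, preprocess, predict, seed=0):
    """X: spectra (lists of floats) in the ORIGINAL variable space.

    test_sets: list of index lists; preprocess maps spectra to model inputs;
    predict maps model inputs to predicted doses.
    """
    rng = random.Random(seed)
    n_vars = len(X[0])
    # Reference error mse_i on each test set, using the true values.
    base = [mse([y[i] for i in idx], predict(preprocess([X[i] for i in idx])))
            for idx in test_sets]
    importance = []
    for v in range(n_vars):
        # Randomize the observations of variable v only.
        column = [row[v] for row in X]
        rng.shuffle(column)
        X_perm = [row[:v] + [c] + row[v + 1:] for row, c in zip(X, column)]
        decreases = []
        for idx, mse_i in zip(test_sets, base):
            pred = predict(preprocess([X_perm[i] for i in idx]))
            mse_vi = mse([y[i] for i in idx], pred)
            # DA_i = 1 - mse_i / mse_{v,i}; set to 0 when the permuted error is 0.
            decreases.append(1 - mse_i / mse_vi if mse_vi > 0 else 0.0)
        importance.append(mean(decreases))
    return importance
```

Because the permutation is done in the original variable space and only then passed through `preprocess`, the resulting importance is attached to a position on the spectrum rather than to a wavelet coefficient.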
Some have already been identified: the most important is scyllo-inositol; one of the orange ones is probably valine; one of the light yellow ones is probably trimethylamine. The others are new.
What next?
Identification of the metabolites, study of the correlation between the ones found and the ones previously emphasized.
Questions? Suggestions?