
Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data


Max Bylesjö1,†, Daniel Eriksson2,†, Miyako Kusano2,3, Thomas Moritz2 and Johan Trygg1,*

1 Research group for Chemometrics, Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden
2 Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden
3 RIKEN Plant Science Center, 1-7-22 Tsurumi, Yokohama City, Kanagawa, 230-0045, Japan

Received 7 June 2007; revised 23 July 2007; accepted 7 August 2007.
* For correspondence (fax +46 (0)90 138885; e-mail [email protected]).
† These authors contributed equally to this work.

Summary

The technological advances in the instrumentation employed in life sciences have enabled the collection of a virtually unlimited quantity of data from multiple sources. By gathering data from several analytical platforms, with the aim of parallel monitoring of, e.g. transcriptomic, metabolomic or proteomic events, one hopes to answer and understand biological questions and observations. This ‘systems biology’ approach typically involves advanced statistics to facilitate the interpretation of the data. In the present study, we demonstrate that the O2PLS multivariate regression method can be used for combining ‘omics’ types of data. With this methodology, systematic variation that overlaps across analytical platforms can be separated from platform-specific systematic variation. A study of Populus tremula × Populus tremuloides, investigating short-day-induced effects at transcript and metabolite levels, is employed to demonstrate the benefits of the methodology. We show how the models can be validated and interpreted to identify biologically relevant events, and discuss the results in relation to a pairwise univariate correlation approach and principal component analysis.

Keywords: combined profiling, O2PLS, chemometrics, Populus, multivariate regression.

Introduction

In the post-genomics era, the development of life science technologies that enable transcriptomic, proteomic and metabolomic events to be analyzed in detail in the same biological systems has revolutionized biological studies. Instead of relating biological observations to a small number of variables, it is now possible to study biological systems with a global analytical approach. This is sometimes called systems biology, and the purpose is to study organisms as integrated systems of genetic, protein, metabolic, cellular, and pathway events.

A systems biology approach based on data collected from many different analytical platforms requires advanced statistics to interpret the data. Numerous examples exist in the literature regarding the use of data from parallel sources (e.g. Carrari et al., 2006; Clish et al., 2004; Gygi et al., 1999; Hirai et al., 2004, 2005; Kleno et al., 2004; Kolbe et al., 2006; Oresic et al., 2004; Rischer et al., 2006; Tohge et al., 2005). In the majority of studies, pairwise univariate correlations between measured variables in different datasets have been utilized to elucidate joint effects and underlying mechanisms. In the plant field Professor Saito’s group have made substantial contributions (Hirai et al., 2004, 2005; Tohge et al., 2005), primarily concerning the integration of transcript and metabolite data for Arabidopsis thaliana. The general approach has been to analyze and interpret the data sources in parallel, formulate hypotheses independently for each platform and finally outline a consensus theory based on prior knowledge of existing pathways to unravel novel trends and processes.

© 2007 The Authors. Journal compilation © 2007 Blackwell Publishing Ltd, The Plant Journal (2007), doi: 10.1111/j.1365-313X.2007.03293.x

The main intricacies associated with a unified approach lie in the subsequent analysis and interpretation: i.e. to unwrap systematic and biologically relevant information from noisy, high-order data structures collected at several levels. Here we describe a general methodology that is complementary to previously outlined approaches, with the potential to provide further information and insights into the field of plant biology. The strategy builds on recent advances in multivariate regression methods, specifically the O2PLS method (Trygg, 2002; Trygg and Wold, 2003) that has the capacity to integrate data from multiple sources (e.g. data collected from different analytical platforms). This enables one to separate joint (overlapping) information across multiple analytical platforms from systematic variation that is unique to each platform. The O2PLS method identifies systematic trends across any pair of datasets: for instance transcript and metabolite data that will be utilized in the presented work. This process is schematically outlined in Figure 1. A recent study investigating metabonomic and proteomic correlations from mice samples using this approach is available in the literature (Rantalainen et al., 2006).

The fundamentals of the O2PLS method build on the work by Trygg and Wold (2002), who outlined orthogonal projections to latent structures (OPLS), which is a supervised multivariate regression method. OPLS combines the existing theory of partial least squares (PLS) regression (Wold et al., 1984, 2001) and orthogonal signal correction (OSC) (Wold et al., 1998). OPLS is primarily used when one has a single data source, e.g. transcript data from cDNA microarrays, and wants to study the relationship with some other property, for instance concentration levels of a set of compounds. One can then separate information that is related to these concentration levels from other sources of variation that are unrelated to the problem formulation, for instance overall disparities for each microarray slide caused by production inaccuracies. OPLS can also be used for discriminant analysis (OPLS-DA) to differentiate between classes (Bylesjö et al., 2006).

The potential of the OPLS and O2PLS methodologies for integrating transcript and metabolite data is also apparent when considering the following hypothetical example. Suppose we have collected a dataset consisting of peak-resolved GC/MS data in order to investigate the metabolomic differences between two plant genotypes. During the data collection, however, treatment-independent systematic noise (baseline fluctuations) has affected the intensity values of the spectra, which introduces false univariate correlation patterns. Using OPLS, the data will be characterized by multivariate latent structures that describe these properties independently from one another, which is feasible because the baseline fluctuations are independent (orthogonal in mathematical notation) of the genotype effect. It will thus be possible to interpret the covariance (predictive) structure related to the genotype effect and the baseline effect separately. If we also generate transcriptomic data from the same biological samples, the O2PLS methodology would be useful to complement the analysis by discovering trends between the metabolite (GC/MS) and transcript (microarray) datasets (Figure 1a–b). This has the potential to provide additional information in order to enhance the understanding of a possibly complex biological process.

Figure 1. Overview of the O2PLS method. Different representations of the model structures of the O2PLS method and the practical set-up of the data are depicted. In (a), the six different model structures of the O2PLS model are described for the particular example of integrating transcript with metabolite data. Using the O2PLS methodology, it is possible to separate the predictive variation (e.g. used to predict metabolite levels from transcript profiles) from the variation that is unique to each platform as well as residual variation. In (b), different plant samples, exemplified by wild-type and mutant genotypes, are measured using several analytical platforms, exemplified by cDNA microarrays and GC/MS.


Characteristics of the O2PLS model

As previously stated, O2PLS identifies joint variation between two datasets, in this case transcript and metabolite data, respectively, as well as systematic variation that is unique to each dataset (Figure 1a). Each of these sources of variation is composed of smaller entities referred to as latent variables (Kvalheim, 1992), which describe independent effects in the data. The O2PLS method finds systematic trends across two datasets, which could be any pair of datasets. The effect of applying O2PLS to a problem of this form is described schematically in a simplified form in Figure 1(b) for the special case of integrating transcript and metabolite data for a plant experiment. Although the O2PLS method typically relies on terminology from the field of linear algebra, we will aim to avoid the usage of complex mathematical notation whenever possible. When such usage is required to make a clear point we will attempt a parallel generalized explanation for clarity. A more detailed explanation of the underlying mathematics is partly available in the Supplementary material, but is given more comprehensively in the works of Trygg (Trygg, 2002; Trygg and Wold, 2003).

As seen in Figure 1(a), the O2PLS model contains six distinctive structures. Without loss of generality, we will assume that one dataset contains transcript data and the other contains metabolite data, although there is no general restriction towards usage of these particular data types. The transcriptomic dataset is decomposed into three distinct structures in the O2PLS model. One is transcript-predictive, i.e. describes gene expression profiles that are useful for predicting metabolite profiles. There is also a transcript-unique structure, which describes systematic effects in the transcript data that are not useful for predicting the metabolite profiles. As this structure describes systematic changes in the transcript levels that have no general correlation patterns with the metabolites, it could be linked, for example, to systematic array bias. Finally, the remaining variation in the transcript data ends up in the transcript residual, which captures stochastic or noise effects. The corresponding structures for the metabolite dataset will be denoted metabolite predictive, metabolite unique and metabolite residual, respectively. The term joint variation, which will be used throughout the paper, refers to the transcript-predictive structure together with the metabolite-predictive structure (Figure 1a). The residual structures will generally be excluded from illustrations and elaborations regarding the results.

All of the different model structures maintain the dimensionality of the original datasets. This implies that one can interpret the most dominating correlation and covariance trends both in the sample directions (for instance, to identify interesting tendencies, clusters among samples or highly deviating biological replicates) and in the variable directions (for instance, to identify influential transcripts and potential metabolites). This applies to all six model structures individually. This model transparency is a unique property of the O2PLS method, and will be demonstrated in numerous ways throughout the paper.
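To make the six structures more concrete, the following is a minimal numerical sketch of an O2PLS-style decomposition in Python with NumPy. It loosely follows the general outline of the method (joint loadings from a singular value decomposition of the cross-covariance matrix, followed by removal of block-unique components) and should not be read as the implementation used in this study. All names (`o2pls_sketch`, `_remove_unique`, `fraction_of_variation`, `n_joint` and so on) are hypothetical, the simulated data are purely illustrative, and preprocessing is reduced to column centering.

```python
import numpy as np


def _remove_unique(D, L, n_unique):
    """Strip n_unique block-unique components from D, given joint loadings L."""
    D = D.copy()
    unique = np.zeros_like(D)
    for _ in range(n_unique):
        T = D @ L                        # current joint scores
        E = D - T @ L.T                  # part of D outside the joint loading space
        # Direction in E that covaries most strongly with the joint scores
        u, _, _ = np.linalg.svd(E.T @ T, full_matrices=False)
        w_o = u[:, :1]
        t_o = D @ w_o                    # unique score vector
        p_o = D.T @ t_o / (t_o.T @ t_o)  # unique loading vector
        part = t_o @ p_o.T
        D -= part
        unique += part
    return D, unique


def o2pls_sketch(X, Y, n_joint, n_x_unique=0, n_y_unique=0):
    """Split X and Y into predictive (joint), unique and residual parts."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Joint loadings: leading singular vectors of the cross-covariance matrix
    Uy, _, Vxt = np.linalg.svd(Yc.T @ Xc, full_matrices=False)
    W = Vxt[:n_joint].T
    C = Uy[:, :n_joint]
    Xd, X_unique = _remove_unique(Xc, W, n_x_unique)
    Yd, Y_unique = _remove_unique(Yc, C, n_y_unique)
    T, U = Xd @ W, Yd @ C                # joint scores of each block
    X_pred, Y_pred = T @ W.T, U @ C.T
    return {
        "T": T, "U": U, "W": W, "C": C,
        "X_centered": Xc, "Y_centered": Yc,
        "X_predictive": X_pred, "X_unique": X_unique,
        "X_residual": Xc - X_pred - X_unique,
        "Y_predictive": Y_pred, "Y_unique": Y_unique,
        "Y_residual": Yc - Y_pred - Y_unique,
    }


def fraction_of_variation(part, total):
    """Sum of squares of one structure relative to the centered dataset."""
    return float((part ** 2).sum() / (total ** 2).sum())


# Illustrative simulated data: 12 samples, one joint and one X-unique trend
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(np.column_stack([np.ones(12), rng.normal(size=(12, 2))]))
t_joint, t_xonly = basis[:, 1:2], basis[:, 2:3]   # mean-zero, mutually orthogonal
w_true = rng.normal(size=(1, 30))
p_true = 0.5 * w_true + rng.normal(size=(1, 30))  # unique loadings overlap with joint ones
X = t_joint @ w_true + t_xonly @ p_true + 0.01 * rng.normal(size=(12, 30))
Y = t_joint @ rng.normal(size=(1, 8)) + 0.01 * rng.normal(size=(12, 8))

model = o2pls_sketch(X, Y, n_joint=1, n_x_unique=1)
print(fraction_of_variation(model["X_predictive"], model["X_centered"]))
print(fraction_of_variation(model["X_unique"], model["X_centered"]))
```

Each returned structure has the same shape as its centered dataset, so it can be inspected both sample-wise and variable-wise. In this sketch the three parts of a block add up to the centered data by construction, but their fractions of variation need not sum exactly to one because the unique part is not forced to be orthogonal to the other two.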

O2PLS model estimation

O2PLS modeling requires estimation of the complexity of the different structures in Figure 1(a–b). Specifically, this determines what fraction of the total variation is dispersed into the predictive, unique and residual structures, respectively. This is equivalent to finding a suitable number of latent variables (Kvalheim, 1992) for each type of structure (Figure 1a–b). In essence, there are two different categories of unfavorable outcomes that should ideally be avoided. The first case is model underfitting, where the complexity of the model is too low compared with the complexity of the dataset. Systematic structures exist in the datasets but the model fails to capture them, thus hampering both predictions of future (unknown) samples and model interpretation. The second case is model overfitting, where the complexity of the model is set too high. Systematic structures as well as noise-related (stochastic) features will be incorporated, causing the generality of the solution to weaken as the stochastic features are unlikely to be representative of the data. The two outcomes lie at opposite extremes, thus requiring a suitable intermediate solution. This is basically a dataset-specific problem that is frequently approximated by means of resampling methods. Here, Monte Carlo cross-validation (MCCV) (Shao, 1993) has been utilized for this purpose; further details are available in the Experimental procedures and in the Supplementary material.
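The resampling idea can be illustrated with a short Python sketch of class-balanced Monte Carlo cross-validation. It only shows the mechanics of drawing repeated random splits and comparing candidate model complexities; the number of rounds, the number of held-out samples per class and the `error_fn` callback are assumptions made for illustration, not the settings used in this study.

```python
import random
from collections import defaultdict


def balanced_mccv_splits(labels, n_rounds, n_test_per_class, seed=0):
    """Yield (train_idx, test_idx) with equally many test samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_rounds):
        test = []
        for members in by_class.values():
            test.extend(rng.sample(members, n_test_per_class))
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


def choose_complexity(labels, candidates, error_fn, n_rounds=100, n_test_per_class=2):
    """Return the candidate complexity with the lowest mean held-out error.

    error_fn(k, train_idx, test_idx) is a user-supplied (hypothetical) callback
    that fits a model with complexity k on the training samples and returns
    its prediction error on the test samples.
    """
    splits = list(balanced_mccv_splits(labels, n_rounds, n_test_per_class))
    mean_error = {}
    for k in candidates:
        errors = [error_fn(k, train, test) for train, test in splits]
        mean_error[k] = sum(errors) / len(errors)
    best = min(mean_error, key=mean_error.get)
    return best, mean_error


# Example layout: six replicates in each of two classes
labels = ["LD0"] * 6 + ["SD6"] * 6
train_idx, test_idx = next(balanced_mccv_splits(labels, n_rounds=1, n_test_per_class=2))
print(train_idx, test_idx)
```

A model that is too simple or too complex should show a higher average held-out error than an intermediate one, which is what `choose_complexity` exploits.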

Study summary

In the present case study, we show how the O2PLS method can be used for integration of transcriptomics data, in the form of dual-channel cDNA microarray data (Schena et al., 1995), and metabolomics data in the form of GC/MS data (de Hoffmann and Stroobant, 2001). All biological samples used in this study originate from an experiment investigating short-day-induced effects on wild-type hybrid aspen (Populus tremula × Populus tremuloides). The trees have been grown under different light conditions: long day (LD) and short day (SD). Material has subsequently been collected at various time points, rendering the LD0, LD2, SD2 and SD6 sample categories, respectively (see the Experimental procedures for further details). Finally, the utilized biological samples have been measured in parallel for estimation of metabolite and transcript abundances. We will adopt a non-stringent use of the terms ‘transcript’ and ‘metabolite’ to refer to microarray elements and resolved peaks from the GC/MS spectra, respectively.


The study is outlined as a two-step procedure. In the first step, the predictive versus unique variation will be identified using O2PLS without incorporating prior knowledge of the classes (treatments). The modeled data will be visualized and interpreted primarily based on graph methods as a proof of concept, but no comprehensive biological investigations will be conducted as this is peripheral to the aim of the presented work. Because of the simplicity of this proof-of-principle study we can only expect to identify certain transcript–metabolite correlations. These effects are bound to be temporally linked (as the time-dimension is unavailable) and related across the different datasets and treatments (LD versus SD). See Rischer et al. (2006) for a demonstration of the potential differences in temporal response for transcripts and metabolites. Subsequently, the predictive systematic variation from the two datasets (from the O2PLS model) will be used to discriminate between the classes by means of OPLS-DA (Bylesjö et al., 2006), to show that the related structures capture the implicit class information. This process is depicted in Figure 2. In the second step the two most diverse categories will be used for model training, whereas the remaining two categories of samples will be used for validation where class membership is defined a priori.

Results

Step 1: integration of the transcript and metabolite datasets using O2PLS

Variance dispersion of the six model structures in Figure 1(a) was determined using class-balanced MCCV (Shao, 1993) based on the six biological replicates from each of the LD0 and SD6 treatments (12 samples in total). Model statistics are available in the Supplementary material. In short, three latent variables were identified for the joint variation that is useful for predicting metabolite levels based on the transcript profiles, and vice versa. The joint variation can thus be considered to consist of three separate subsets with systematic properties. This roughly corresponds to independent clusters of transcripts and potential metabolites with a high within-cluster correlation, but with a minimal between-cluster correlation, which could be interpreted as identification of different unrelated but distinct biological patterns. The transcript-predictive structure accounts for 38.0% of the total variation in the transcriptomic dataset. For the metabolomic dataset, the corresponding metabolite-predictive structure accounts for 60.8% of the total variation in the dataset. No remaining transcript-unique structured variation was available for the transcript dataset, but one metabolite-unique structure was identified for the metabolomic dataset, accounting for 9.7% of the total variation in that dataset.

Figure 2. Schematic view of the general strategy for integration and discrimination. At the initial step, transcriptomic and metabolomic data from the transcript and metabolite sources are integrated using the O2PLS method without acknowledging class information. The predictive structure is subsequently subjected to discriminant analysis and is shown to capture the latent class information.

From each of the three latent variables forming the joint variation in Figures 1 and 2, significantly predictive transcripts and metabolites were subsequently identified using a permutation procedure described in the Experimental procedures. It is important to note that the resulting subset of transcripts and metabolites exhibit strong multivariate correlation patterns, and can be considered candidates when connecting regulatory expression patterns to metabolite levels. Because of the simplicity of the study, we can only expect to find temporally linked transcript–metabolite effects as the time dimension is unavailable [see e.g. Rischer et al. (2006) for discussions regarding this issue]. The total number of identified transcripts and metabolites for each latent variable (cluster) is available in Table 1. A complete list of identified transcripts and metabolites is available in the Supplementary material.

One interesting aspect of this type of analysis is the interpretation of the results. We have utilized graph theoretical methods for visualization purposes. For each of the latent variables in the joint variation, pairwise Pearson correlations were calculated between all identified transcripts and metabolites. Note that we are basing the pairwise correlations on the O2PLS model data, thus implicitly exploiting the exposed multivariate correlation patterns while ignoring the remaining patterns. For visualization clarity, the transcripts have been categorized based on the gene ontology biological process (GO-BP) annotation (http://www.geneontology.org). For further details on the construction of the graphs see the Supplementary material.

We will exemplify the visualization and discuss the relevance of the identified transcripts and potential metabolites using mainly the second latent variable (cluster) of the joint variation, which is depicted in Figure 3. All identified GO-BP groups are related to nucleotide metabolism. Specifically, the GO-BP groups all belong to pyrimidine metabolism, with the exception of two groups belonging to purine metabolism (e.g. GTP metabolism). Nucleotide metabolism is known to play an important role in plant biochemical and developmental processes (Boldt and Zrenner, 2003). Purines and pyrimidines are generated from different amino acids and other small molecules via de novo pathways, and from nucleobases and nucleosides by the salvage pathways, and are frequently involved in joint pathways. Purines and pyrimidines are regulators of products such as sucrose, polysaccharides, sugars, phospholipids and secondary products (Stasolla et al., 2003), and thereby influence plant growth and development. Many of the identified metabolites are in fact carbohydrates, including sucrose, fructose and glucose derivatives (glucaric acid and gluconic acid), as well as undetermined carbohydrates. Although the metabolomics methodology used here does not cover the analysis of purine and pyrimidine derivatives, the metabolite data still support the putative connection between transcripts and metabolites suggested here.

The recent sequencing of the Arabidopsis genome has increased the possibilities to analyze the role of purines and pyrimidines in biological functions of higher plants, which were previously assumed to have roles identical to those in microorganisms and yeast (Boldt and Zrenner, 2003; Kafer et al., 2004). The pyrimidines CTP and UTP seem to affect different areas of the metabolism (Dowhan, 1997; Ostrander et al., 1998). CTP is required for phospholipid metabolism, where one can spot the constituents of phosphoric acid and lipid compounds amongst the identified metabolites in the present study. UTP is required for sugar metabolism mainly in the form of UDP-glucose, which is a precursor for components in the cell wall, glycoproteins, glycolipids and sulfolipids (Kafer et al., 2004; Zrenner et al., 2006). We have shown that glucose derivatives as well as undetermined carbohydrates constitute a large portion of the identified metabolites (Figure 3). The purines (GTP and ATP) play an important role for the assimilation of nitrogen (Smith and Atkins, 2002), as well as in cell division and energy metabolism. Interestingly, in another latent variable (subgroup) of the joint variation, the majority of the GO-BP groups are related to mitosis effects (not shown). For instance, the anaphase in cell division is seen together with microtubule polymerization processes, which are known to be altered during mitosis. Other GO-BP groups are connected to various kinds of transport, two of which are coupled to transport of purines and nucleobases; both important parts in the mitosis process. Such effects on cell division are frequently seen in dormancy-related experiments (Devitt and Stafstrom, 1995; Horvath et al., 2003), similar to the present experimental setup.

Step 2: classification of the biological samples

OPLS-DA model training was performed using the two most diverse sample categories, LD0 and SD6, as different classes, as illustrated in Figure 2 (step 2). The remaining samples, LD2 and SD2, were subsequently predicted and assigned to either of the two classes as validation. Based on previous experiments we know that SD induces growth cessation, which suggests that (i) the LD2 class should be highly similar to the LD0 class, and (ii) the SD2 class should be an approximate intermediate of the LD0 and SD6 classes (Thomas and Vince-Prue, 1997; M. Kusano et al., unpublished data).

Table 1. The identified number of transcripts and metabolites for each latent variable (LV)

              LV 1   LV 2   LV 3   Total unique
Transcripts    104     90    206            410
Metabolites      3     39     25             64

OPLS can superficially be seen as a one-way O2PLS; out of the six structures described in Figure 1(a), only three remain in the OPLS model. These will be denoted as predictive, unique and residual for consistency with the previous notation for the O2PLS model. Properties of the predictive and unique structures were determined using class-balanced MCCV (Shao, 1993), recommending one predictive and no unique latent variable for the OPLS-DA classification model. Model statistics are available in the Supplementary material. For the training samples, perfect classification was achieved for the LD0 and SD6 classes. All of the six LD2 samples were accurately predicted to belong to the LD0 class as an external test set. Out of the six SD2 samples, two were predicted to belong to the LD0 class, and the remaining four were predicted to belong to the SD6 class, roughly as expected. This is depicted in Figure 4, where densities of the class predictions are shown.
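A discriminant model with one predictive and no unique latent variable can be sketched in a few lines of Python. With no unique (orthogonal) component to remove, such a model amounts to a single-component PLS regression on a 0/1 class vector, which is what the sketch below implements. The function names, the 0.5 cutoff and the simulated data are assumptions made for illustration and do not reproduce the models or data of this study.

```python
import numpy as np


def fit_one_component_da(X_train, y_train):
    """Fit a single predictive component to a 0/1 class vector."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)          # predictive weight vector
    t = Xc @ w                         # predictive scores of the training samples
    b = (t @ yc) / (t @ t)             # regression of the class vector on the scores
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "b": b}


def predict_class_value(model, X_new):
    """Continuous class prediction; values near 0 or 1 indicate the two classes."""
    t_new = (X_new - model["x_mean"]) @ model["w"]
    return model["y_mean"] + model["b"] * t_new


def assign_class(y_hat, names=("LD0", "SD6"), cutoff=0.5):
    """Assign each prediction to the nearest class (cutoff is an assumption)."""
    return [names[int(v > cutoff)] for v in y_hat]


# Simulated example: two training classes that differ by a fixed profile shift
rng = np.random.default_rng(1)
base = rng.normal(size=20)
shift = np.linspace(1.0, 2.0, 20)
X_ld0 = base + 0.1 * rng.normal(size=(6, 20))
X_sd6 = base + shift + 0.1 * rng.normal(size=(6, 20))
X_train = np.vstack([X_ld0, X_sd6])
y_train = np.array([0.0] * 6 + [1.0] * 6)

da_model = fit_one_component_da(X_train, y_train)
X_like_ld0 = base + 0.1 * rng.normal(size=(6, 20))     # resembles the first class
X_intermediate = base + 0.5 * shift[np.newaxis, :]     # halfway between the classes
print(assign_class(predict_class_value(da_model, X_like_ld0)))
print(predict_class_value(da_model, X_intermediate))
```

Samples that resemble one training class receive predictions close to that class value, whereas an intermediate sample falls between the two, which mirrors the kind of split one would expect for a transitional category.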

Comparison with related methods

Comparison with univariate correlation studies A permuta-

tion test was utilized to identify significant transcript–

metabolite connections based on the Pearson’s correlation

coefficient for all possible pairwise combinations. The

identified interval of interest was by permutation deter-

mined to be ()0.943, 0.948), i.e. only the transcripts and

metabolites with a pairwise correlation outside this interval

were retained (see Supplementary material for details). This

resulted in the identification of a total of 102 unique tran-

scripts and 109 metabolites. By pooling the transcripts by

means of the GO-BP annotations, using the same procedure

as described in the O2PLS example, the identified groups

Figure 3. Visualization of the identified transcripts and metabolites.

A graph of the identified elements of the second latent variable of the joint variation is illustrated. The transcripts are grouped using the gene ontology biological

process (GO-BP) annotations (boxes) together with the identified metabolites (circles). The edges (connecting lines) are colored according to the correlation, where

darker lines denote stronger correlations and vice versa. The displayed correlations span the interval 0.82–0.98. The label ‘UPSC unknown’ of potential metabolites

refers to database matches that have been seen in previous (internal) studies, but the identity of which currently remains unknown.

6 Max Bylesjo et al.

ª 2007 The AuthorsJournal compilation ª 2007 Blackwell Publishing Ltd, The Plant Journal, (2007), doi: 10.1111/j.1365-313X.2007.03293.x

Page 7: Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data

exclusively describe metal ion homeostasis. Only eight out

of 102 (7.8%) of the identified transcripts overlap between

the pairwise univariate correlation and the O2PLS approach.

As for the potential metabolites, the identified entries con-

tain a multitude of unknown hits with a slight dominance of

carbohydrates (not shown). Consistent with the transcripts,

only a small proportion of the identified variables overlap

with the O2PLS approach (11 out of 109; 10.1%), as shown in

Figure 5.
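The pairwise univariate procedure described above can be sketched as follows. This is a minimal illustration in Python/NumPy on simulated data, not the authors' implementation: the function names, the number of permutations and the quantile level `alpha` are assumptions, and the exact permutation scheme used in the study is given only in the Supplementary material.

```python
import numpy as np

def pairwise_pearson(X, Y):
    """Pearson correlation of every column of X with every column of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc, axis=0)
    Yc = Yc / np.linalg.norm(Yc, axis=0)
    return Xc.T @ Yc  # shape: (n_transcripts, n_metabolites)

def permutation_interval(X, Y, n_perm=200, alpha=0.001, seed=0):
    """Interval of correlations expected by chance (assumed two-sided quantiles)."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        shuffled = rng.permutation(Y.shape[0])  # break the sample pairing
        null.append(pairwise_pearson(X, Y[shuffled]).ravel())
    lower, upper = np.quantile(np.concatenate(null), [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Simulated stand-in data: 12 samples, 50 "transcripts", 20 "metabolites".
rng = np.random.default_rng(42)
X = rng.normal(size=(12, 50))
Y = rng.normal(size=(12, 20))
Y[:, 0] = X[:, 0] + 0.05 * rng.normal(size=12)  # one planted association

R = pairwise_pearson(X, Y)
lower, upper = permutation_interval(X, Y)
outside = (R < lower) | (R > upper)            # pairs outside the chance interval
transcripts_kept = np.unique(np.nonzero(outside)[0])
metabolites_kept = np.unique(np.nonzero(outside)[1])
```

Only variables that take part in at least one pair outside the interval are retained, which mirrors the selection rule stated in the text.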

Comparison with principal component analysis. Principal

component analysis (PCA; Jolliffe, 2002; Wold et al., 1987) is

a multivariate projection method that has been employed in

several systems biology studies for exploratory analysis of

data from multiple sources (see e.g. Clish et al., 2004;

Rischer et al., 2006, for examples of such usage). The gen-

eral approach is to concatenate multiple datasets and project

these according to the maximum variance, e.g. to find

interesting trends or for the detection of outlying samples.

For comparison, we have utilized this technique by concat-

enating the transcript and metabolite datasets, and we

subsequently calculated three principal components. This

was followed by the same permutation strategy to identify

significant transcripts and metabolites as that used for the

O2PLS method. See the Supplementary material for more

details of the conceptual similarities and differences

regarding the PCA and O2PLS methods.
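The concatenation step can be illustrated with a short sketch, assuming PCA is computed from a singular value decomposition of the column-centered, concatenated matrix. The data are simulated, the block sizes are arbitrary and the function name is hypothetical; any block scaling applied in the study is not reproduced here.

```python
import numpy as np

def pca_scores_loadings(Z, n_components=3):
    """PCA via SVD of the column-centered matrix Z."""
    Zc = Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained

# Simulated stand-in blocks measured on the same 12 samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 300))   # "transcripts"
Y = rng.normal(size=(12, 40))    # "metabolites"

Z = np.hstack([X, Y])            # concatenate variables side by side
T, P, explained = pca_scores_loadings(Z, n_components=3)

# The loadings can be split back into their original blocks for interpretation.
P_transcripts, P_metabolites = P[:300], P[300:]
```

Because both blocks share one set of scores, any strong variation in either block can steer the components, regardless of whether it is shared between the blocks.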

The permutation test resulted in the identification of a

total of 321 significant transcripts and 45 metabolites

(Figure 5). By pooling the transcripts by means of the GO-

BP annotations, using the same procedure as described in

the O2PLS example, the identified groups belong to varying

categories, such as response to wounding and cell-wall

catabolism, with no distinct consensus. As for the potential

metabolites, the identified entries contain a multitude of

unknown hits, where the metabolites overlapping with the

metabolites identified using O2PLS are mostly carbohy-

drates (not shown). Only 59 out of 321 (14.4%) of the

identified transcripts are overlapping between the PCA and

the O2PLS approach. For the metabolites, 28 out of 45

(62.2%) do overlap with the O2PLS approach, suggesting

that the methods perform similarly for this particular data-

set. The differences observed between PCA and O2PLS

models are related to the difference in how they model the

Figure 4. Predictions of class assignment by

means of OPLS-DA.

In (a), densities of the observations predicted by

the full OPLS-DA model for the SD6 (dashed line)

and LD0 (solid line) classes are presented, show-

ing distinct bimodal properties and little overlap,

as expected. A theoretical separation point (deci-

sion boundary) between the classes is shown

using a dotted vertical line. In (b), the SD2

(dashed line) and LD2 (solid line) classes are

displayed as predicted by the same OPLS-DA

model as in (a). LD2 behaves approximately as

LD0, whereas SD2 exhibits a density that is

roughly an intermediate between LD0 and SD6.

The same decision boundary as in (a) is shown

using a dotted vertical line.

Figure 5. Venn diagram of the overlap of iden-

tified transcripts and metabolites across the

methods.

An overview of the overlap between the different

methods reveals a greater overlap between the

results from O2PLS and PCA compared with the

results based on univariate correlations.


variation in the data tables. The PCA model will find the

direction in the data with the highest variability. Part of

the variability in one dataset is not necessarily related to the

other dataset, and hence in that situation the outcome will

be different from the results of the O2PLS model. Any

systematic effects in the data, caused by technical issues in

one dataset, for example, will negatively affect the estimated

principal components. Such effects can be quite consider-

able; in the presented material, approximately 10% of the

variation in the metabolite data is related to a systematic,

platform-specific effect. As the datasets are concatenated in

the PCA calculations, this effect will propagate and also

influence the identified transcripts, as seen in Figure 5. In the

O2PLS method this effect is identified and can be studied

separately, which is beneficial from an interpretational

perspective.
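The effect of a platform-specific component on PCA of concatenated data can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are synthetic, the sample size is larger than in the study, and the covariance-based score computed from the SVD of X'Y is only a simplified stand-in for the joint part of an O2PLS model, not the full algorithm (which also estimates and removes the dataset-unique components).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=n); t -= t.mean()          # joint (shared) latent variable
u = rng.normal(size=n); u -= u.mean()          # metabolite-specific latent variable
u -= t * (t @ u) / (t @ t)                     # make u orthogonal to t

p = rng.normal(size=30); p /= np.linalg.norm(p)
q = rng.normal(size=20); q /= np.linalg.norm(q)
r = rng.normal(size=20); r /= np.linalg.norm(r)

X = np.outer(t, p) + 0.1 * rng.normal(size=(n, 30))
Y = np.outer(t, q) + 3.0 * np.outer(u, r) + 0.1 * rng.normal(size=(n, 20))
X -= X.mean(axis=0); Y -= Y.mean(axis=0)

# PCA of the concatenated blocks: the first component follows the largest variance.
U, s, Vt = np.linalg.svd(np.hstack([X, Y]), full_matrices=False)
pc1 = U[:, 0] * s[0]

# Covariance-based joint direction: leading singular vector of X'Y.
W, d, Ct = np.linalg.svd(X.T @ Y, full_matrices=False)
joint_score = X @ W[:, 0]

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

pc1_vs_unique = abs_corr(pc1, u)         # PCA picks up the platform-specific effect
pc1_vs_joint = abs_corr(pc1, t)
joint_vs_shared = abs_corr(joint_score, t)
```

In this simulation the first principal component tracks the metabolite-specific variation, whereas the covariance-based score tracks the shared variation.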

Summary of comparisons. An overview of the relative

overlap between the evaluated methods is available in

Figure 5, showing a greater consensus between the

respective results from O2PLS and PCA compared with the

results based on univariate correlations. This is particularly

apparent for the metabolite dataset. Additionally, PCA has

the lowest relative number of unique transcripts and

metabolites, suggesting that it is an approximate inter-

mediate between univariate correlation analysis and O2PLS.

Another striking feature is that only one transcript (vacuolar

calcium-binding protein) and four metabolites (one iden-

tified as galactinol, whereas three remain of unknown

identity) are found using all three methods.

Discussion

In the present study, we have employed the O2PLS multi-

variate regression method to identify joint systematic vari-

ation across transcriptomic and metabolomic datasets.

One could ask what is conceptually different when

employing the presented methodology compared with

traditional studies, which are almost exclusively based on

pairwise univariate correlation studies between the different

datasets. By utilizing the O2PLS method, all results are

founded on an underlying multivariate model of the data.

This implies that one could predict the properties of new

(unknown) samples, as well as estimate the predictive

performance of the model (exemplified here using cross-

validation), which yields statistical measures of confidence

so that not merely stochastic events are characterized. The

most distinctive attribute of O2PLS is, however, the fact that one

can separate systematic predictive variation from platform-

unique systematic variation as well as non-systematic

structures captured in the residuals. This is most clearly

seen in the direct comparison with PCA, which is also a

multivariate method, where this variation is incorporated

into the model structures. The transparency of the O2PLS

model enables straightforward interpretation of both the

predictive and the unique sources of variation, both with

respect to samples as well as based on individual variables

(e.g. genes, metabolites). Furthermore, unlike pairwise

correlation studies, one can assess model outliers, which

are fairly common in biological studies because of

the technical issues involved.

As we introduce a novel methodology for the integration

of multiple datasets in the plant field, it is natural to

additionally compare the performance of the method

directly to the current ‘gold standard’, which is the employ-

ment of pairwise univariate correlations. One can clearly see

that the outcomes differ; in fact only about 8% of the transcripts

and 10% of the potential metabolites intersect between the two variable

selection strategies. The discrepancies observed between

the two approaches (multivariate versus univariate) are

related to the underlying basis of the correlation calcula-

tions. This subject is handled in more depth in the Exper-

imental procedures section and in the Supplementary

material.

The additional comparison to PCA reveals a greater

overlap between the methods, in particular for the metab-

olite dataset, which is partly explained by the more closely

related properties of PCA and O2PLS. Nonetheless, from

Figure 5 we can see that the majority of the transcripts

(82.6% on average) and metabolites (61.5% on average) are

completely unique to each method. The evaluated methods

are evidently quite dissimilar and can be seen as comple-

mentary for elucidating joint regulatory patterns, as it is not

possible to claim the superiority of one methodology over

another based solely on the evidence provided here. How-

ever, as demonstrated in the presented material, the O2PLS

method has the distinctive capability to separate dataset-

predictive from dataset-unique variation. This property is

likely to be advantageous from an interpretational perspec-

tive in the general case.

Although we only explicitly describe a situation where

one wants to study the relationship between two different

datasets, it is easy to imagine a case where three or more

datasets are available. A natural extension would, for

instance, be to complement the transcriptomic and meta-

bolomic dataset with proteomic data (for example based on

2D-DIGE or LC/MS; de Hoffmann and Stroobant, 2001). As

can be seen by the model structures in Figure 1, the O2PLS

method does not support such a situation without proper

modifications. A generalization of the O2PLS methodology

to handle multiple datasets is something we intend to study

in a future paper.

Conclusions

Outlined is a general methodology for integrating data

structures from multiple sources based on the O2PLS multi-

variate regression method. The benefits of the methodology


are demonstrated using a study investigating SD-induced

effects of hybrid aspen (P. tremula × P. tremuloides) based

on transcriptomic (dual-channel cDNA) as well as meta-

bolomic (GC/MS) data. We show how one can identify tran-

scripts and metabolites that exhibit strong multivariate

correlation patterns, and relate these to the underlying

molecular functions based on corresponding annotations.

Identified effects are mainly related to changes in energy

metabolism and mitosis-associated effects, which are bio-

logically sound given the studied system. The methodology

is compared with frequently utilized methods such as uni-

variate correlations and PCA, where results and conclusions

deviate across methods.

Experimental procedures

Plant material and sampling

Twenty-four hybrid aspen (P. tremula × P. tremuloides) trees were grown in a growth chamber under LD conditions with 12 h of photosynthetically active radiation (PAR) light (400 µEin m−2 s−1) and a 6-h daylength extension with low light (30 µEin m−2 s−1). After 3 months, leaves 15–17 (with the first leaf below the apex ≥ 1 cm long being designated as leaf 1) from six plants were sampled (LD0 samples). Two days later another six plants were sampled (LD2 samples), and the daylength was changed to SD conditions, with 12 h of PAR light. After both 2 and 6 days with short photoperiods, six plants were sampled (denoted SD2 and SD6, respectively), as described above.

Microarray preparation

The utilized POP2.3 microarray layout consists of 27 648 single-spotted cDNA clones from a previous assembly of more than 100 000 expressed sequence tags (ESTs) from the Populus genus (Sterky et al., 2004). All sequence information is available in the PopulusDB (http://www.populus.db.umu.se) online sequence database; see also Tuskan et al. (2006) for further information regarding the Populus genome. A full array layout is available for download from the UPSC-BASE (http://www.upscbase.db.umu.se) online microarray database (Sjodin et al., 2006).

All microarray slides were printed using a QARRAY arrayer (Genetix, http://www.genetix.com). The preparation, labeling and hybridization of cDNA clones and mRNA samples were carried out according to the protocol described by Smith et al. (2004). The arrays were scanned on a ScanArray 4000 (Perkin-Elmer, http://www.perkinelmer.com) at 10-µm resolution to obtain raw image files for the Cy5 and Cy3 dye channels.

Microarray pre-processing

Gridding and segmentation of all images were performed in GENEPIX PRO 5.1 (Molecular Devices, http://www.moleculardevices.com). Quantification was based on median foreground intensity values, and data were subsequently within-slide normalized using the print-tip lowess method (Yang et al., 2002). All original image files as well as raw and normalized data are available online for download at the UPSC-BASE microarray database from experiment UMA-0068.

The four samples LD0, LD2, SD2 and SD6 were hybridized in a loop design (see Supplementary material) independently for each of the six biological replicates. Expression ratios were subsequently resolved using the generalized (Moore–Penrose) inverse. The resulting ratios for all four samples are thus calculated against a weighted mean, which can be seen as being compared against a virtual reference (in silico reference).
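How a generalized inverse resolves a loop design into per-sample values can be sketched for a single microarray element. The layout below (each array comparing two consecutive samples in the loop) is a hypothetical example; the design actually used is shown in the Supplementary material and may differ. The log-ratio values are invented for illustration.

```python
import numpy as np

samples = ["LD0", "LD2", "SD2", "SD6"]

# Hypothetical loop: each row is one array, +1 for one channel, -1 for the other.
D = np.array([
    [ 1, -1,  0,  0],   # array 1: LD0 vs LD2
    [ 0,  1, -1,  0],   # array 2: LD2 vs SD2
    [ 0,  0,  1, -1],   # array 3: SD2 vs SD6
    [-1,  0,  0,  1],   # array 4: SD6 vs LD0
], dtype=float)

# Invented (log-scale) expression levels for one element, and the
# noise-free log-ratios that the four arrays would then measure.
true_levels = np.array([0.0, 0.2, 1.0, 2.0])
measured = D @ true_levels

# D is rank-deficient (absolute levels are not identifiable from ratios),
# so the Moore-Penrose inverse returns the minimum-norm solution.
estimated = np.linalg.pinv(D) @ measured
```

For this balanced loop the minimum-norm solution sums to zero, so each estimate is the level of that sample relative to the mean of all four samples, which plays the role of the virtual reference.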

Metabolomics analysis

The samples were extracted by chloroform:MeOH:H2O, and their metabolite profiles were analyzed by GC/TOFMS essentially according to the method described by Gullberg et al. (2004). All non-processed MS-files from the metabolic analysis were exported from the CHROMATOF software in NetCDF format to MATLAB version 7.0 (Mathworks, http://www.mathworks.com), in which all data pre-treatment procedures, such as baseline correction, chromatogram alignment, data compression and hierarchical multivariate curve resolution (H-MCR), were performed using in-house produced scripts according to Jonsson et al. (2005). All manual peak integrations were performed using in-house scripts.

O2PLS

Dataset pre-processing. The transcript dataset is composed of 27 648 microarray in silico expression ratios for all six biological replicates from the LD0 and SD6 treatments, rendering 12 samples in total. The metabolite dataset consists of the peak-resolved GC/MS data (Jonsson et al., 2005): a total of 453 peaks for the same 12 samples. See Figure 2 (step 1) for a depiction of the different datasets. The transcript dataset has been mean-centered per microarray element, whereas the GC/MS data matrix has been both mean-centered and scaled to unit variance for each resolved peak prior to modeling. Scaling to unit variance implies dividing each potential metabolite by its standard deviation to reduce deviations in magnitude for the different metabolite concentrations. Both datasets were subsequently scaled to an equal total sum of squares to avoid dominance of any of the matrices. Further details are available in the Supplementary material.
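The pre-processing steps can be sketched as follows, assuming samples in rows and variables in columns. The function name, the choice of a total sum of squares of one per block and the simulated input are assumptions made for illustration; the number of transcript columns is reduced from 27 648 to keep the example small.

```python
import numpy as np

def preprocess_blocks(X, Y):
    """Center X; center and unit-variance scale Y; equalize block sums of squares."""
    Xc = X - X.mean(axis=0)                      # mean-center each microarray element
    Yc = Y - Y.mean(axis=0)                      # mean-center each resolved peak
    Yc = Yc / Yc.std(axis=0, ddof=1)             # scale each peak to unit variance
    Xs = Xc / np.sqrt(np.sum(Xc ** 2))           # block sum of squares = 1
    Ys = Yc / np.sqrt(np.sum(Yc ** 2))           # block sum of squares = 1
    return Xs, Ys

# Simulated stand-in data for 12 samples.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(12, 500))                         # "expression ratios"
Y_raw = rng.lognormal(mean=2.0, sigma=1.0, size=(12, 453)) # "peak areas"

X_s, Y_s = preprocess_blocks(X_raw, Y_raw)
```

After this step neither block dominates a joint model merely because it has more variables or larger numerical values.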

Variable selection. From the joint variation, influential variables (transcripts and metabolites) were subsequently identified using a permutation test. The basis of the variable selection process is the correlation-scaled loadings (see Supplementary material). The selection procedure is similar to the one described by Johansson et al. (2003) and can be briefly described as follows. The data are reshuffled multiple times so that the original correlation structures are partially or completely destroyed. O2PLS models are generated from each rearranged dataset to estimate what degree of correlation could be expected by chance from a dataset with equal properties. This degree of correlation is later used as a threshold to select transcripts and metabolites with unusually high correlations (significantly higher than chance). The result is a small collection of transcripts and metabolites that are of particular interest when studying the correlation patterns across the two datasets. Further details of the variable selection procedure are available in the Supplementary material.
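The logic of the permutation-based selection can be sketched on simulated data. Several simplifications are assumed here and should not be read as the authors' procedure: a single covariance-based score pair from the SVD of X'Y stands in for a fitted O2PLS model, the correlation of each transcript with the metabolite-side score stands in for the correlation-scaled loadings, and the number of permutations and the quantile are arbitrary choices.

```python
import numpy as np

def joint_scores(X, Y):
    """First score pair from the SVD of X'Y (simplified stand-in for an O2PLS fit)."""
    W, _, Ct = np.linalg.svd(X.T @ Y, full_matrices=False)
    return X @ W[:, 0], Y @ Ct[0]

def corr_with_score(M, score):
    """Correlation of every column of M with a score vector."""
    Mc = M - M.mean(axis=0)
    sc = score - score.mean()
    return (Mc.T @ sc) / (np.linalg.norm(Mc, axis=0) * np.linalg.norm(sc))

def permutation_threshold(X, Y, n_perm=100, quantile=0.99, seed=0):
    """Correlation level expected by chance after reshuffling the samples of Y."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        Yp = Y[rng.permutation(Y.shape[0])]      # destroy the X-Y pairing
        _, u_perm = joint_scores(X, Yp)          # refit on the rearranged data
        null.append(np.abs(corr_with_score(X, u_perm)))
    return np.quantile(np.concatenate(null), quantile)

# Simulated data: 12 samples; 10 transcripts and 5 metabolites share one signal.
rng = np.random.default_rng(3)
n = 12
t_true = rng.normal(size=n)
X = rng.normal(size=(n, 200))
Y = rng.normal(size=(n, 30))
X[:, :10] = 3.0 * t_true[:, None] + 0.3 * rng.normal(size=(n, 10))
Y[:, :5] = 3.0 * t_true[:, None] + 0.3 * rng.normal(size=(n, 5))
X -= X.mean(axis=0); Y -= Y.mean(axis=0)

t, u = joint_scores(X, Y)
observed = np.abs(corr_with_score(X, u))
threshold = permutation_threshold(X, Y)
selected = np.nonzero(observed > threshold)[0]   # transcripts above chance level
```

The same steps would be applied to the metabolite block, using the transcript-side score, to obtain the selected metabolites.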

Both the presented O2PLS method and the univariate approaches utilize correlations of sorts to rank influential associations. In that context the discrepancies observed between the two approaches (multivariate versus univariate) are related to the selection of the correlation calculations. In the univariate case, all possible pairwise correlations are calculated between a transcript profile i and a potential metabolite profile j. In the multivariate approach we select only one representative variable from each respective dataset (the


latent variable) that is optimal for predicting the respective dataset in a least-squares sense. A latent variable corresponds to a weighted average of the original variables, and cannot generally be replaced by any single variable in the original datasets. Readers familiar with similar feature extraction strategies, such as PCA (Wold et al., 1987), should note that the latent variables in the O2PLS model are not principal components, as they are derived based on a different objective function (covariance/correlation versus variance). From these O2PLS latent variables, one can subsequently calculate the correlations between the latent variable and the original variables. This strategy reduces the number of correlation calculations, and will typically provide stability in the correlation estimations as the risk for spurious correlations is reduced. This procedure represents the outlined method to rank variable importance, but is generalized to several latent variables and incorporates the dataset-unique structures.
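The difference in the number of correlation calculations can be made concrete with a short sketch. The data are simulated with arbitrary dimensions, and the latent variables are again approximated by the leading singular vectors of X'Y, which is an assumed simplification and not the full O2PLS model.

```python
import numpy as np

def corr_columns(A, B):
    """Pearson correlation between every column of A and every column of B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    return A.T @ B

rng = np.random.default_rng(2)
n, p, q = 12, 400, 60                       # samples, "transcripts", "metabolites"
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
Y = rng.normal(size=(n, q)); Y -= Y.mean(axis=0)

# Univariate approach: one correlation per transcript-metabolite pair.
R_pairwise = corr_columns(X, Y)             # p * q values

# Multivariate approach: one latent variable per dataset, each a weighted
# average of the original variables, then one correlation per variable.
W, _, Ct = np.linalg.svd(X.T @ Y, full_matrices=False)
t = X @ W[:, 0]                             # transcript-side latent variable
u = Y @ Ct[0]                               # metabolite-side latent variable
r_x = corr_columns(X, u[:, None]).ravel()   # p values
r_y = corr_columns(Y, t[:, None]).ravel()   # q values

n_univariate = R_pairwise.size              # 400 * 60 = 24 000
n_multivariate = r_x.size + r_y.size        # 400 + 60 = 460
```

With these dimensions the pairwise approach evaluates 24 000 correlations against 460 for the latent-variable approach, which is the reduction referred to in the text.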

Acknowledgements

This work was supported by grants from The Swedish Foundation for Strategic Research (MB, TM), The Knut and Alice Wallenberg Foundation (JT), The Swedish Research Council (MB, TM, JT), The Functional Genomics Initiative at Swedish University of Agricultural Sciences (DE), FORMAS (TM) and The KEMPE Foundation (TM).

Supplementary Material

The following supplementary material is available for this article online:
Figure S1. Comparison between univariate correlations and the correlation loadings from the O2PLS model.
Figure S2. Cross-validated generalization errors for the O2PLS model.
Figure S3. Cross-validated classification accuracy for the OPLS-DA model.
Figure S4. The utilized microarray design.
Table S1. The threshold values used for the transcript dataset.
Table S2. The threshold values used for the metabolite dataset.
Table S3. The identified transcripts with gene ontology biological process (GO-BP) annotations.
Table S4. The identified metabolites.
Appendix S1. Text describing properties of the O2PLS method, and giving details of the variable selection methods, dataset pre-processing and cross-validation.
Algorithm S1. The utilized Monte Carlo cross-validation (MCCV) procedure.
This material is available as part of the online article from http://www.blackwell-synergy.com.

References

Boldt, R. and Zrenner, R. (2003) Purine and pyrimidine biosynthesis in higher plants. Physiol. Plant. 117, 297–304.

Bylesjo, M., Rantalainen, M., Cloarec, O., Nicholson, J.K., Holmes, E. and Trygg, J. (2006) OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. J. Chemometrics, 20, 341–351.

Carrari, F., Baxter, C., Usadel, B. et al. (2006) Integrated analysis of metabolite and transcript levels reveals the metabolic shifts that underlie tomato fruit development and highlight regulatory aspects of metabolic network behavior. Plant Physiol. 142, 1380–1396.

Clish, C.B., Davidov, E., Oresic, M. et al. (2004) Integrative biological analysis of the APOE*3-leiden transgenic mouse. Omics, 8, 3–13.

Devitt, M.L. and Stafstrom, J.P. (1995) Cell cycle regulation during growth-dormancy cycles in pea axillary buds. Plant Mol. Biol. 29, 255–265.

Dowhan, W. (1997) Molecular basis for membrane phospholipid diversity: why are there so many lipids? Annu. Rev. Biochem. 66, 199–232.

Gullberg, J., Jonsson, P., Nordstrom, A., Sjostrom, M. and Moritz, T. (2004) Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal. Biochem. 331, 283–295.

Gygi, S.P., Rochon, Y., Franza, B.R. and Aebersold, R. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19, 1720–1730.

Hirai, M.Y., Yano, M., Goodenowe, D.B., Kanaya, S., Kimura, T., Awazuhara, M., Arita, M., Fujiwara, T. and Saito, K. (2004) Integration of transcriptomics and metabolomics for understanding of global responses to nutritional stresses in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA, 101, 10205–10210.

Hirai, M.Y., Klein, M., Fujikawa, Y. et al. (2005) Elucidation of gene-to-gene and metabolite-to-gene networks in arabidopsis by integration of metabolomics and transcriptomics. J. Biol. Chem. 280, 25590–25595.

de Hoffmann, E. and Stroobant, V. (2001) Mass Spectrometry: Principles and Applications, 2nd edn. John Wiley & Sons, Chichester, UK.

Horvath, D.P., Anderson, J.V., Chao, W.S. and Foley, M.E. (2003) Knowing when to grow: signals regulating bud dormancy. Trends Plant Sci. 8, 534–540.

Johansson, D., Lindgren, P. and Berglund, A. (2003) A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics, 19, 467–473.

Jolliffe, I.T. (2002) Principal Component Analysis, 2nd edn. Springer, New York, USA.

Jonsson, P., Johansson, A.I., Gullberg, J., Trygg, J., A, J., Grung, B., Marklund, S., Sjostrom, M., Antti, H. and Moritz, T. (2005) High-throughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal. Chem. 77, 5635–5642.

Kafer, C., Zhou, L., Santoso, D., Guirgis, A., Weers, B., Park, S. and Thornburg, R. (2004) Regulation of pyrimidine metabolism in plants. Front. Biosci. 9, 1611–1625.

Kleno, T.G., Kiehr, B., Baunsgaard, D. and Sidelmann, U.G. (2004) Combination of ‘omics’ data to investigate the mechanism(s) of hydrazine-induced hepatotoxicity in rats and to identify potential biomarkers. Biomarkers, 9, 116–138.

Kolbe, A., Oliver, S.N., Fernie, A.R., Stitt, M., van Dongen, J.T. and Geigenberger, P. (2006) Combined transcript and metabolite profiling of Arabidopsis leaves reveals fundamental effects of the thiol-disulfide status on plant metabolism. Plant Physiol. 141, 412–422.

Kvalheim, O. (1992) The latent variable. Chemometrics Intell. Lab. Syst., 14, 1–3.

Oresic, M., Clish, C.B., Davidov, E.J. et al. (2004) Phenotype characterisation using integrated gene transcript, protein and metabolite profiling. Appl. Bioinformat. 3, 205–217.

Ostrander, D.B., O’Brien, D.J., Gorman, J.A. and Carman, G.M. (1998) Effect of CTP synthetase regulation by CTP on phospholipid synthesis in Saccharomyces cerevisiae. J. Biol. Chem. 273, 18992–19001.

Rantalainen, M., Cloarec, O., Beckonert, O. et al. (2006) Statistically integrated metabonomic-proteomic studies on a human prostate cancer xenograft model in mice. J. Proteome Res. 5, 2642–2655.


Rischer, H., Oresic, M., Seppanen-Laakso, T., Katajamaa, M., Lammertyn, F., Ardiles-Diaz, W., Van Montagu, M.C., Inze, D., Oksman-Caldentey, K.M. and Goossens, A. (2006) Gene-to-metabolite networks for terpenoid indole alkaloid biosynthesis in Catharanthus roseus cells. Proc. Natl Acad. Sci. USA, 103, 5614–5619.

Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.

Shao, J. (1993) Linear-model selection by cross-validation. J. Am. Stat. Assoc. 88, 486–494.

Sjodin, A., Bylesjo, M., Skogstrom, O., Eriksson, D., Nilsson, P., Ryden, P., Jansson, S. and Karlsson, J. (2006) UPSC-BASE-populus transcriptomics online. Plant J. 48, 806–817.

Smith, P.M.C. and Atkins, C.A. (2002) Purine biosynthesis. Big in cell division, even bigger in nitrogen assimilation. Plant Physiol. 128, 793–802.

Smith, C., Rodriguez-Buey, M., Karlsson, J. and Campbell, M. (2004) The response of the poplar transcriptome to wounding and subsequent infection by a viral pathogen. New Phytol. 164, 123–136.

Stasolla, C., Katahira, R., Thorpe, T.A. and Ashihara, H. (2003) Purine and pyrimidine nucleotide metabolism in higher plants. J. Plant Physiol. 160, 1271–1295.

Sterky, F., Bhalerao, R., Unneberg, P. et al. (2004) A Populus EST resource for plant functional genomics. Proc. Natl Acad. Sci. USA, 101, 13951–13956.

Thomas, B. and Vince-Prue, D. (1997) Photoperiodism in Plants, 2nd edn. San Diego, CA: Academic Press.

Tohge, T., Nishiyama, Y., Hirai, M.Y. et al. (2005) Functional genomics by integrated analysis of metabolome and transcriptome of Arabidopsis plants over-expressing an MYB transcription factor. Plant J., 42, 218–235.

Trygg, J. (2002) O2-PLS for qualitative and quantitative analysis in multivariate calibration. J. Chemometrics 16, 283–293.

Trygg, J. and Wold, S. (2002) Orthogonal projections to latent structures (O-PLS). J. Chemometrics 16, 119–128.

Trygg, J. and Wold, S. (2003) O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. J. Chemometrics 17, 53–64.

Tuskan, G.A., Difazio, S., Jansson, S. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313, 1596–1604.

Wold, S., Ruhe, A., Wold, H. and Dunn, W.I. (1984) The collinearity problem in linear regression. The partial least squares approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5, 735–743.

Wold, S., Esbensen, K. and Geladi, P. (1987) Principal Component Analysis. Chemometrics Intell. Lab. Syst. 2, 37–52.

Wold, S., Antti, H., Lindgren, F. and Ohman, J. (1998) Orthogonal signal correction of near-infrared spectra. Chemometrics Intell. Lab. Syst. 44, 175–185.

Wold, S., Sjostrom, M. and Eriksson, L. (2001) PLS-regression: a basic tool of chemometrics. Chemometrics Intell. Lab. Syst. 58, 109–130.

Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J. and Speed, T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, e15.

Zrenner, R., Stitt, M., Sonnewald, U. and Boldt, R. (2006) Pyrimidine and purine biosynthesis and degradation in plants. Annu. Rev. Plant Biol. 57, 805–836.
