Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data
Max Bylesjö1,†, Daniel Eriksson2,†, Miyako Kusano2,3, Thomas Moritz2 and Johan Trygg1,*
1Research group for Chemometrics, Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden,
2Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden, and
3RIKEN Plant Science Center, 1-7-22 Tsurumi, Yokohama City, Kanagawa, 230-0045, Japan
Received 7 June 2007; revised 23 July 2007; accepted 7 August 2007.
*For correspondence (fax +46 (0)90 138885; e-mail [email protected]).
†These authors contributed equally to this work.
Summary
The technological advances in the instrumentation employed in life sciences have enabled the collection of a virtually unlimited quantity of data from multiple sources. By gathering data from several analytical platforms, with the aim of parallel monitoring of, e.g. transcriptomic, metabolomic or proteomic events, one hopes to answer and understand biological questions and observations. This ‘systems biology’ approach typically involves advanced statistics to facilitate the interpretation of the data. In the present study, we demonstrate that the O2PLS multivariate regression method can be used for combining ‘omics’ types of data. With this methodology, systematic variation that overlaps across analytical platforms can be separated from platform-specific systematic variation. A study of Populus tremula × Populus tremuloides, investigating short-day-induced effects at transcript and metabolite levels, is employed to demonstrate the benefits of the methodology. We show how the models can be validated and interpreted to identify biologically relevant events, and discuss the results in relation to a pairwise univariate correlation approach and principal component analysis.
Keywords: combined profiling, O2PLS, chemometrics, Populus, multivariate regression.
Introduction
In the post-genomics era, the development of life science technologies that enable transcriptomic, proteomic and metabolomic events to be analyzed in detail in the same biological systems has revolutionized biological studies. Instead of relating biological observations to a small number of variables, it is now possible to study biological systems with a global analytical approach. This is sometimes called systems biology, and the purpose is to study organisms as integrated systems of genetic, protein, metabolic, cellular, and pathway events.
A systems biology approach based on data collected from many different analytical platforms involves advanced statistics to be able to interpret the data. Numerous examples exist in the literature regarding the use of data from parallel sources (e.g. Carrari et al., 2006; Clish et al., 2004; Gygi et al., 1999; Hirai et al., 2004, 2005; Kleno et al., 2004; Kolbe et al., 2006; Oresic et al., 2004; Rischer et al., 2006; Tohge et al., 2005). In the majority of studies, pairwise univariate correlations between measured variables in different datasets have been utilized to elucidate joint effects and underlying mechanisms. In the plant field, Professor Saito’s group have made substantial contributions (Hirai et al., 2004, 2005; Tohge et al., 2005), primarily concerning the integration of transcript and metabolite data for Arabidopsis thaliana. The general approach has been to analyze and interpret the data sources in parallel, formulate hypotheses independently for each platform and finally outline a consensus theory based on prior knowledge of existing pathways to unravel novel trends and processes.
The main intricacies associated with a unified approach lie in the subsequent analysis and interpretation: i.e. to unwrap systematic and biologically relevant information from noisy,
© 2007 The Authors. Journal compilation © 2007 Blackwell Publishing Ltd, The Plant Journal (2007), doi: 10.1111/j.1365-313X.2007.03293.x
(Thomas and Vince-Prue, 1997; M. Kusano et al., unpublished data).
OPLS can superficially be seen as a one-way O2PLS; out of
the six structures described in Figure 1(a), only three of
these remain in the OPLS model. These will be denoted as
predictive, unique and residual for consistency with the
previous notation for the O2PLS model. Properties of the
predictive and unique structures were determined using
class-balanced MCCV (Shao, 1993), recommending one
predictive and no unique latent variable for the OPLS-DA
classification model. Model statistics are available in the
Supplementary material. For the training samples, perfect
classification was achieved for the LD0 and SD6 classes. All
of the six LD2 samples were accurately predicted to belong
to the LD0 class as an external test set. Out of the six SD2
samples, two were predicted to belong to the LD0 class, and
the remaining four were predicted to belong to the SD6 class,
roughly as expected. This is depicted in Figure 4, where
densities of the class predictions are shown.
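The class-balanced Monte Carlo cross-validation mentioned above can be illustrated with a minimal sketch. The actual procedure is given in Algorithm S1 of the Supplementary material and is not reproduced here, so the function name `class_balanced_mccv`, the number of rounds and the number of held-out samples per class are all assumptions made for illustration only.

```python
import random

def class_balanced_mccv(labels, n_rounds, n_test_per_class, seed=0):
    """Yield (train_indices, test_indices) pairs with equal test counts per class.

    Hypothetical helper: each round draws the same number of test samples
    at random from every class and trains on the remaining samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_rounds):
        test = []
        for members in by_class.values():
            test.extend(rng.sample(members, n_test_per_class))
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)

# Twelve training samples as in the study: six LD0 and six SD6 replicates.
labels = ["LD0"] * 6 + ["SD6"] * 6
# Assumed settings: 100 rounds, two samples per class held out in each round.
splits = list(class_balanced_mccv(labels, n_rounds=100, n_test_per_class=2))
```

In a setting like this, a candidate number of predictive and unique latent variables would be scored on the held-out samples of every round, and the combination with the best average score would be retained.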
Comparison with related methods
Comparison with univariate correlation studies. A permutation test was utilized to identify significant transcript–metabolite connections based on Pearson’s correlation coefficient for all possible pairwise combinations. The interval of interest was determined by permutation to be (−0.943, 0.948), i.e. only the transcripts and metabolites with a pairwise correlation outside this interval were retained (see Supplementary material for details). This resulted in the identification of a total of 102 unique transcripts and 109 metabolites. By pooling the transcripts by means of the GO-BP annotations, using the same procedure as described in the O2PLS example, the identified groups
Figure 3. Visualization of the identified transcripts and metabolites.
A graph of the identified elements of the second latent variable of the joint variation is illustrated. The transcripts are grouped using the gene ontology biological
process (GO-BP) annotations (boxes) together with the identified metabolites (circles). The edges (connecting lines) are colored according to the correlation, where
darker lines denote stronger correlations and vice versa. The displayed correlations span the interval 0.82–0.98. The label ‘UPSC unknown’ of potential metabolites
refers to database matches that have been seen in previous (internal) studies, but the identity of which currently remains unknown.
correlation studies, one can assess model outliers, which
is a fairly common notion in biological studies because of
the technical issues involved.
As we introduce a novel methodology for the integration of multiple datasets in the plant field, it is natural to additionally compare the performance of the method directly to the current ‘gold standard’, which is the employment of pairwise univariate correlations. One can clearly see that the outcomes differ; in fact only 10% of the transcripts and potential metabolites intersect between the two variable selection strategies. The discrepancies observed between the two approaches (multivariate versus univariate) are related to the underlying basis of the correlation calculations. This subject is handled in more depth in the Experimental procedures section and in the Supplementary material.
The additional comparison to PCA reveals a greater overlap between the methods, in particular for the metabolite dataset, which is partly explained by the more closely related properties of PCA and O2PLS. Nonetheless, from Figure 5 we can see that the majority of the transcripts (82.6% on average) and metabolites (61.5% on average) are completely unique to each method. The evaluated methods are evidently quite dissimilar and can be seen as complementary for elucidating joint regulatory patterns, as it is not possible to claim the superiority of one methodology over another based solely on the evidence provided here. However, as demonstrated in the presented material, the O2PLS method has the distinctive capability to separate dataset-predictive from dataset-unique variation. This property is likely to be advantageous from an interpretational perspective in the general case.
Although we only explicitly describe a situation where one wants to study the relationship between two different datasets, it is easy to imagine a case where three or more datasets are available. A natural extension would, for instance, be to complement the transcriptomic and metabolomic dataset with proteomic data (for example based on 2D-DIGE or LC/MS; de Hoffmann and Stroobant, 2001). As can be seen by the model structures in Figure 1, the O2PLS method does not support such a situation without proper modifications. A generalization of the O2PLS methodology to handle multiple datasets is something we intend to study in a future paper.
Conclusions
Outlined is a general methodology for integrating data structures from multiple sources based on the O2PLS multivariate regression method. The benefits of the methodology are demonstrated using a study investigating SD-induced effects of hybrid aspen (P. tremula × P. tremuloides) based on transcriptomic (dual-channel cDNA) as well as metabolomic (GC/MS) data. We show how one can identify transcripts and metabolites that exhibit strong multivariate correlation patterns, and relate these to the underlying molecular functions based on corresponding annotations. Identified effects are mainly related to changes in energy metabolism and mitosis-associated effects, which are biologically sound given the studied system. The methodology is compared with frequently utilized methods such as univariate correlations and PCA, where results and conclusions deviate across methods.
Experimental procedures
Plant material and sampling
Twenty-four hybrid aspen (P. tremula × P. tremuloides) trees were grown in a growth chamber under LD conditions with 12 h of photosynthetically active radiation (PAR) light (400 μEin m−2 s−1) and a 6-h daylength extension with low light (30 μEin m−2 s−1). After 3 months, leaves 15–17 (with the first leaf below the apex ≥ 1 cm long being designated as leaf 1) from six plants were sampled (LD0 samples). Two days later another six plants were sampled (LD2 samples), and the daylength was changed to SD conditions, with 12 h of PAR light. After both 2 and 6 days with short photoperiods six plants were sampled (denoted SD2 and SD6, respectively), as described above.
Microarray preparation
The utilized POP2.3 microarray layout consists of 27 648 single-spotted cDNA clones from a previous assembly of more than 100 000 expressed sequence tags (ESTs) from the Populus genus (Sterky et al., 2004). All sequence information is available in the PopulusDB (http://www.populus.db.umu.se) online sequence database; see also Tuskan et al. (2006) for further information regarding the Populus genome. A full array layout is available for download from the UPSC-BASE (http://www.upscbase.db.umu.se) online microarray database (Sjodin et al., 2006).
All microarray slides were printed using a QARRAY arrayer (Genetix, http://www.genetix.com). The preparation, labeling and hybridization of cDNA clones and mRNA samples were carried out according to the protocol described by Smith et al. (2004). The arrays were scanned on a ScanArray 4000 (Perkin-Elmer, http://www.perkinelmer.com) at 10-μm resolution to obtain raw image files for the Cy5 and Cy3 dye channels.
Microarray pre-processing
Gridding and segmentation of all images were performed in GENEPIX PRO 5.1 (Molecular Devices, http://www.moleculardevices.com). Quantification was based on median foreground intensity values, and data was subsequently within-slide normalized using the print-tip lowess method (Yang et al., 2002). All original image files as well as raw and normalized data are available online for download at the UPSC-BASE microarray database from experiment UMA-0068.
The four samples LD0, LD2, SD2 and SD6 were hybridized in a loop design (see Supplementary material) independently for each of the six biological replicates. Expression ratios were subsequently resolved using the generalized (Moore–Penrose) inverse. The resulting ratios for all four samples are thus calculated against a weighted mean, which can be seen as being compared against a virtual reference (in silico reference).
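As a worked illustration of why the generalized inverse yields ratios against a virtual reference, consider a hypothetical four-array loop LD0 → LD2 → SD2 → SD6 → LD0 for a single microarray element. The actual layout is given in Figure S4 and is not reproduced here, so the array ordering, the dye orientation and the symbols m, X and r below are assumptions made only for this sketch.

```latex
% m_a : measured log-ratio on array a (a = 1..4), hypothetical loop ordering
% r_k : unknown log-expression level of sample k
\[
\mathbf{m} = \mathbf{X}\mathbf{r} + \boldsymbol{\varepsilon},
\qquad
\mathbf{X} =
\begin{pmatrix}
 1 & -1 &  0 &  0\\
 0 &  1 & -1 &  0\\
 0 &  0 &  1 & -1\\
-1 &  0 &  0 &  1
\end{pmatrix},
\qquad
\mathbf{r} =
\begin{pmatrix}
r_{\mathrm{LD0}}\\ r_{\mathrm{LD2}}\\ r_{\mathrm{SD2}}\\ r_{\mathrm{SD6}}
\end{pmatrix}.
\]
% X has rank 3 because X 1 = 0, so r is only identifiable up to a common constant.
\[
\hat{\mathbf{r}} = \mathbf{X}^{+}\mathbf{m},
\qquad
\mathbf{1}^{\mathsf{T}}\hat{\mathbf{r}} = 0 .
\]
% In the noise-free case the minimum-norm solution is therefore
\[
\hat{r}_k = r_k - \frac{1}{4}\sum_{j} r_j ,
\qquad k \in \{\mathrm{LD0}, \mathrm{LD2}, \mathrm{SD2}, \mathrm{SD6}\}.
\]
```

The Moore–Penrose solution is the minimum-norm least-squares estimate, which lies in the row space of X and is thus orthogonal to the constant vector. In this balanced illustration every sample is therefore expressed relative to the plain mean of the four samples; an unbalanced design would give unequal weights, in line with the weighted mean described above.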
Metabolomics analysis
The samples were extracted by chloroform:MeOH:H2O, and their metabolite profiles were analyzed by GC/TOFMS essentially according to the method described by Gullberg et al. (2004). All non-processed MS-files from the metabolic analysis were exported from the CHROMATOF software in NetCDF format to MATLAB version 7.0 (Mathworks, http://www.mathworks.com), in which all data pre-treatment procedures, such as baseline correction, chromatogram alignment, data compression and hierarchical multivariate curve resolution (H-MCR), were performed using in-house produced scripts according to Jonsson et al. (2005). All manual peak integrations were performed using in-house scripts.
O2PLS
Dataset pre-processing. The transcript dataset is composed of 27 648 microarray in silico expression ratios for all six biological replicates from the LD0 and SD6 treatments, rendering 12 samples in total. The metabolite dataset consists of the peak-resolved GC/MS data (Jonsson et al., 2005): a total of 453 peaks for the same 12 samples. See Figure 2 (step 1) for a depiction of the different datasets. The transcript dataset has been mean-centered per microarray element, whereas the GC/MS data matrix has been both mean-centered and scaled to unit variance for each resolved peak prior to modeling. Scaling to unit variance implies dividing each potential metabolite by its standard deviation to reduce deviations in magnitude for the different metabolite concentrations. Both datasets were subsequently scaled to an equal total sum of squares to avoid dominance of any of the matrices. Further details are available in the Supplementary material.
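The pre-processing steps described above can be sketched on toy data as follows. This is an illustrative sketch only: the helper names, the toy values and the target sum of squares of 1.0 are assumptions, the use of the sample standard deviation is a choice made here, and the authors' exact procedure is described in the Supplementary material.

```python
import math
import statistics

def mean_center(matrix):
    """Subtract each column mean (rows are samples, columns are variables)."""
    n_cols = len(matrix[0])
    means = [statistics.fmean([row[j] for row in matrix]) for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

def unit_variance(matrix):
    """Divide each column by its sample standard deviation (must be non-zero)."""
    n_cols = len(matrix[0])
    sds = [statistics.stdev([row[j] for row in matrix]) for j in range(n_cols)]
    return [[row[j] / sds[j] for j in range(n_cols)] for row in matrix]

def total_ss(matrix):
    """Total sum of squares over all matrix elements."""
    return sum(value * value for row in matrix for value in row)

def scale_to_total_ss(matrix, target=1.0):
    """Rescale the whole matrix so that its total sum of squares equals target."""
    factor = math.sqrt(target / total_ss(matrix))
    return [[value * factor for value in row] for row in matrix]

# Toy stand-ins (4 samples each); the real matrices are 12 x 27 648 and 12 x 453.
transcripts = [[1.0, 5.0, 2.0], [2.0, 3.0, 2.5], [0.5, 4.0, 1.0], [1.5, 6.0, 3.5]]
metabolites = [[10.0, 0.1], [12.0, 0.4], [9.0, 0.2], [15.0, 0.3]]

# Transcripts: mean-centering only. Metabolites: mean-centering and unit variance.
x_scaled = scale_to_total_ss(mean_center(transcripts))
y_scaled = scale_to_total_ss(unit_variance(mean_center(metabolites)))
```

After these steps both blocks carry the same total sum of squares, so neither dominates the joint model merely because it has more variables or larger numerical values.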
Variable selection. From the joint variation, influential variables (transcripts and metabolites) were subsequently identified using a permutation test. The basis of the variable selection process is the correlation-scaled loadings (see Supplementary material). The selection procedure is similar to the one described by Johansson et al. (2003) and can be briefly described as follows. The data is reshuffled multiple times so that the original correlation structures are partially or completely destroyed. O2PLS models are generated from each rearranged dataset to estimate what degree of correlation could be expected by chance from a dataset with equal properties. This degree of correlation is later used as a threshold to select transcripts and metabolites with unusually high correlations (significantly higher than chance). The result is a small collection of transcripts and metabolites that are of particular interest when studying the correlation patterns across the two datasets. Further details of the variable selection procedure are available in the Supplementary material.
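The reshuffling idea can be sketched in a few lines. Note the substitution: fitting O2PLS models is beyond a short example, so plain pairwise Pearson correlations stand in for the correlation-scaled loadings. The function names, the number of permutations, the significance level and the quantile rule are likewise assumptions for illustration, not the authors' settings.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_interval(x_vars, y_vars, n_perm=200, alpha=0.05, seed=0):
    """Estimate the range of correlations expected by chance.

    The sample order of the x block is reshuffled relative to the y block,
    which destroys the cross-block structure but keeps each block intact.
    """
    rng = random.Random(seed)
    n_samples = len(x_vars[0])
    null = []
    for _ in range(n_perm):
        order = list(range(n_samples))
        rng.shuffle(order)
        for x in x_vars:
            shuffled = [x[i] for i in order]
            null.extend(pearson(shuffled, y) for y in y_vars)
    null.sort()
    low = null[int((alpha / 2) * (len(null) - 1))]
    high = null[int((1 - alpha / 2) * (len(null) - 1))]
    return low, high

def select_pairs(x_vars, y_vars, interval):
    """Keep the (i, j) pairs whose correlation falls outside the chance interval."""
    low, high = interval
    return [(i, j) for i, x in enumerate(x_vars) for j, y in enumerate(y_vars)
            if not low <= pearson(x, y) <= high]

# Toy data, one list per variable over 12 samples (values are made up).
y_vars = [[float(k) for k in range(12)],
          [5.0, 2.0, 8.0, 1.0, 9.0, 4.0, 11.0, 0.0, 7.0, 3.0, 10.0, 6.0]]
x_vars = [[v + 0.1 * (-1) ** k for k, v in enumerate(y_vars[0])],  # tracks y_vars[0]
          [3.0, 9.0, 1.0, 7.0, 11.0, 0.0, 5.0, 10.0, 2.0, 8.0, 4.0, 6.0]]

low, high = permutation_interval(x_vars, y_vars)
selected = select_pairs(x_vars, y_vars, (low, high))
```

In the study itself the statistic compared against the permutation threshold comes from O2PLS models refitted to each reshuffled dataset, which this sketch does not attempt to reproduce.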
Both the presented O2PLS method and the univariate approaches utilize correlations of sorts to rank influential associations. In that context the discrepancies observed between the two approaches (multivariate versus univariate) are related to the selection of the correlation calculations. In the univariate case, all possible pairwise correlations are calculated between a transcript profile i and a potential metabolite profile j. In the multivariate approach we select only one representative variable from each respective dataset (the latent variable) that is optimal for predicting the respective dataset in a least-squares sense. A latent variable corresponds to a weighted average of the original variables, and cannot generally be replaced by any single variable in the original datasets. Readers familiar with similar feature extraction strategies, such as PCA (Wold et al., 1987), should note that the latent variables in the O2PLS model are not principal components, as they are derived based on a different objective function (covariance/correlation versus variance). From these O2PLS latent variables, one can subsequently calculate the correlations between the latent variable and the original variables. This strategy reduces the number of correlation calculations, and will typically provide stability in the correlation estimations as the risk for spurious correlations is reduced. This procedure represents the outlined method to rank variable importance, but is generalized to several latent variables and incorporates the dataset-unique structures.
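The ranking by correlation between a latent variable and the original variables can be sketched as follows. The weight vector here is hypothetical and hand-picked; in an O2PLS model it would be estimated from the data, which this sketch does not do. The function names and toy values are likewise assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def latent_scores(matrix, weights):
    """Latent variable as a weighted combination of the original variables."""
    return [sum(v * w for v, w in zip(row, weights)) for row in matrix]

def correlation_loadings(matrix, scores):
    """Correlation between the latent variable and every original variable."""
    n_cols = len(matrix[0])
    return [pearson([row[j] for row in matrix], scores) for j in range(n_cols)]

# Toy mean-centered block: 5 samples x 3 variables (values are made up).
block = [[-2.0, -1.9, 1.0],
         [-1.0, -1.1, -2.0],
         [0.0, 0.1, 2.0],
         [1.0, 0.9, -2.0],
         [2.0, 2.0, 1.0]]
weights = [0.7, 0.7, 0.0]  # hypothetical weights, not fitted by O2PLS

scores = latent_scores(block, weights)
loadings = correlation_loadings(block, scores)
```

Only one correlation per original variable is computed, rather than one per transcript–metabolite pair, which is the reduction in correlation calculations referred to above.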
Acknowledgements
This work was supported by grants from The Swedish Foundation for Strategic Research (MB, TM), The Knut and Alice Wallenberg Foundation (JT), The Swedish Research Council (MB, TM, JT), The Functional Genomics Initiative at Swedish University of Agricultural Sciences (DE), FORMAS (TM) and The KEMPE Foundation (TM).
Supplementary Material
The following supplementary material is available for this article online:
Figure S1. Comparison between univariate correlations and the correlation loadings from the O2PLS model.
Figure S2. Cross-validated generalization errors for the O2PLS model.
Figure S3. Cross-validated classification accuracy for the OPLS-DA model.
Figure S4. The utilized microarray design.
Table S1. The threshold values used for the transcript dataset.
Table S2. The threshold values used for the metabolite dataset.
Table S3. The identified transcripts with gene ontology biological process (GO-BP) annotations.
Table S4. The identified metabolites.
Appendix S1. Text describing properties of the O2PLS method, and giving details of the variable selection methods, dataset pre-processing and cross-validation.
Algorithm S1. The utilized Monte Carlo cross-validation (MCCV) procedure.
This material is available as part of the online article from http://www.blackwell-synergy.com.
References
Boldt, R. and Zrenner, R. (2003) Purine and pyrimidine biosynthesis in higher plants. Physiol. Plant. 117, 297–304.
Bylesjo, M., Rantalainen, M., Cloarec, O., Nicholson, J.K., Holmes, E. and Trygg, J. (2006) OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. J. Chemometrics, 20, 341–351.
Carrari, F., Baxter, C., Usadel, B. et al. (2006) Integrated analysis of metabolite and transcript levels reveals the metabolic shifts that underlie tomato fruit development and highlight regulatory aspects of metabolic network behavior. Plant Physiol. 142, 1380–1396.
Clish, C.B., Davidov, E., Oresic, M. et al. (2004) Integrative biological analysis of the APOE*3-leiden transgenic mouse. Omics, 8, 3–13.
Devitt, M.L. and Stafstrom, J.P. (1995) Cell cycle regulation during growth-dormancy cycles in pea axillary buds. Plant Mol. Biol. 29, 255–265.
Dowhan, W. (1997) Molecular basis for membrane phospholipid diversity: why are there so many lipids? Annu. Rev. Biochem. 66, 199–232.
Gullberg, J., Jonsson, P., Nordstrom, A., Sjostrom, M. and Moritz, T. (2004) Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal. Biochem. 331, 283–295.
Gygi, S.P., Rochon, Y., Franza, B.R. and Aebersold, R. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19, 1720–1730.
Hirai, M.Y., Yano, M., Goodenowe, D.B., Kanaya, S., Kimura, T., Awazuhara, M., Arita, M., Fujiwara, T. and Saito, K. (2004) Integration of transcriptomics and metabolomics for understanding of global responses to nutritional stresses in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA, 101, 10205–10210.
Hirai, M.Y., Klein, M., Fujikawa, Y. et al. (2005) Elucidation of gene-to-gene and metabolite-to-gene networks in arabidopsis by integration of metabolomics and transcriptomics. J. Biol. Chem. 280, 25590–25595.
de Hoffmann, E. and Stroobant, V. (2001) Mass Spectrometry: Principles and Applications, 2nd edn. John Wiley & Sons, Chichester, UK.
Horvath, D.P., Anderson, J.V., Chao, W.S. and Foley, M.E. (2003) Knowing when to grow: signals regulating bud dormancy. Trends Plant Sci. 8, 534–540.
Johansson, D., Lindgren, P. and Berglund, A. (2003) A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics, 19, 467–473.
Jolliffe, I.T. (2002) Principal Component Analysis, 2nd edn. Springer, New York, USA.
Marklund, S., Sjostrom, M., Antti, H. and Moritz, T. (2005) High-throughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal. Chem. 77, 5635–5642.
Kafer, C., Zhou, L., Santoso, D., Guirgis, A., Weers, B., Park, S. and Thornburg, R. (2004) Regulation of pyrimidine metabolism in plants. Front. Biosci. 9, 1611–1625.
Kleno, T.G., Kiehr, B., Baunsgaard, D. and Sidelmann, U.G. (2004) Combination of ‘omics’ data to investigate the mechanism(s) of hydrazine-induced hepatotoxicity in rats and to identify potential biomarkers. Biomarkers, 9, 116–138.
Kolbe, A., Oliver, S.N., Fernie, A.R., Stitt, M., van Dongen, J.T. and Geigenberger, P. (2006) Combined transcript and metabolite profiling of Arabidopsis leaves reveals fundamental effects of the thiol-disulfide status on plant metabolism. Plant Physiol. 141, 412–422.
Kvalheim, O. (1992) The latent variable. Chemometrics Intell. Lab. Syst., 14, 1–3.
Oresic, M., Clish, C.B., Davidov, E.J. et al. (2004) Phenotype characterisation using integrated gene transcript, protein and metabolite profiling. Appl. Bioinformat. 3, 205–217.
Ostrander, D.B., O’Brien, D.J., Gorman, J.A. and Carman, G.M. (1998) Effect of CTP synthetase regulation by CTP on phospholipid synthesis in Saccharomyces cerevisiae. J. Biol. Chem. 273, 18992–19001.
Rantalainen, M., Cloarec, O., Beckonert, O. et al. (2006) Statistically integrated metabonomic-proteomic studies on a human prostate cancer xenograft model in mice. J. Proteome Res. 5, 2642–2655.
Rischer, H., Oresic, M., Seppanen-Laakso, T., Katajamaa, M., Lammertyn, F., Ardiles-Diaz, W., Van Montagu, M.C., Inze, D., Oksman-Caldentey, K.M. and Goossens, A. (2006) Gene-to-metabolite networks for terpenoid indole alkaloid biosynthesis in Catharanthus roseus cells. Proc. Natl Acad. Sci. USA, 103, 5614–5619.
Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.
Shao, J. (1993) Linear-model selection by cross-validation. J. Am. Stat. Assoc. 88, 486–494.
Sjodin, A., Bylesjo, M., Skogstrom, O., Eriksson, D., Nilsson, P., Ryden, P., Jansson, S. and Karlsson, J. (2006) UPSC-BASE-populus transcriptomics online. Plant J. 48, 806–817.
Smith, P.M.C. and Atkins, C.A. (2002) Purine biosynthesis. Big in cell division, even bigger in nitrogen assimilation. Plant Physiol. 128, 793–802.
Smith, C., Rodriguez-Buey, M., Karlsson, J. and Campbell, M. (2004) The response of the poplar transcriptome to wounding and subsequent infection by a viral pathogen. New Phytol. 164, 123–136.
Stasolla, C., Katahira, R., Thorpe, T.A. and Ashihara, H. (2003) Purine and pyrimidine nucleotide metabolism in higher plants. J. Plant Physiol. 160, 1271–1295.
Sterky, F., Bhalerao, R., Unneberg, P. et al. (2004) A Populus EST resource for plant functional genomics. Proc. Natl Acad. Sci. USA, 101, 13951–13956.
Thomas, B. and Vince-Prue, D. (1997) Photoperiodism in Plants, 2nd edn. San Diego, CA: Academic Press.
Tohge, T., Nishiyama, Y., Hirai, M.Y. et al. (2005) Functional genomics by integrated analysis of metabolome and transcriptome of Arabidopsis plants over-expressing an MYB transcription factor. Plant J., 42, 218–235.
Trygg, J. (2002) O2-PLS for qualitative and quantitative analysis in multivariate calibration. J. Chemometrics 16, 283–293.
Trygg, J. and Wold, S. (2002) Orthogonal projections to latent structures (O-PLS). J. Chemometrics 16, 119–128.
Trygg, J. and Wold, S. (2003) O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. J. Chemometrics 17, 53–64.
Tuskan, G.A., Difazio, S., Jansson, S. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313, 1596–1604.
Wold, S., Ruhe, A., Wold, H. and Dunn, W.I. (1984) The collinearity problem in linear regression. The partial least squares approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5, 735–743.
Wold, S., Esbensen, K. and Geladi, P. (1987) Principal Component Analysis. Chemometrics Intell. Lab. Syst. 2, 37–52.
Wold, S., Antti, H., Lindgren, F. and Ohman, J. (1998) Orthogonal signal correction of near-infrared spectra. Chemometrics Intell. Lab. Syst. 44, 175–185.
Wold, S., Sjostrom, M. and Eriksson, L. (2001) PLS-regression: a basic tool of chemometrics. Chemometrics Intell. Lab. Syst. 58, 109–130.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J. and Speed, T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, e15.
Zrenner, R., Stitt, M., Sonnewald, U. and Boldt, R. (2006) Pyrimidine and purine biosynthesis and degradation in plants. Annu. Rev. Plant Biol. 57, 805–836.