Focused Identification of Germplasm Strategy (FIGS) for wild relatives of the cultivated plants Dag Endresen and Abdallah Bari PGR Secure (EU 7th Framework Program) Workshop 9-13 January 2012, Madrid, Spain

FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Dec 12, 2014


Technology

Dag Endresen

Last week (9 to 13 January 2012) we organized a workshop in Madrid (Spain) on predictive characterization using the Focused Identification of Germplasm Strategy (FIGS) for wild relatives of the cultivated plants (crop wild relatives). The workshop was part of the EU-funded PGR Secure project [1] (EU 7th Framework Programme). Its objective was to use predictive computer modeling with R [2] for data mining (trait mining) to identify genebank accessions and populations of crop wild relatives with a higher density of genetic variation for a target trait property (the response, or dependent variable), using climate data and other environmental data layers as the explanatory, or independent, multivariate variables. We have previously validated the FIGS approach for landraces of wheat and barley [3]. This study was one of the first attempts to validate the FIGS approach for other crops as well as for crop wild relatives (CWR). The crop landraces and crop wild relatives included in this study were: oats (Avena sp.), beet (Beta sp.), cabbage and mustard (Brassica sp.), and medick, including alfalfa or lucerne (Medicago sp.). We made good progress on the methodology, but also faced some major obstacles related to data availability.

Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50 (6): 2418-2430. doi: 10.2135/cropsci2010.03.0174

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, and E. De Pauw (2011). Predictive Association between Biotic Stress Traits and Eco-Geographic Data for Wheat and Barley Landraces. Crop Science 51 (5): 2036-2055. doi: 10.2135/cropsci2010.12.0717

Endresen, D.T.F. (2011). Utilization of Plant Genetic Resources: A Lifeboat to the Gene Pool [PhD Thesis]. Copenhagen University, Faculty for Life Sciences, Department of Agriculture and Ecology. Printed at Media-Tryck, Lund University Press, April 2011. Available at: http://goo.gl/pYa9x (PDF 37 MB). ISBN: 978-91-628-8268-6.

Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2012). Focused identification of germplasm strategy (FIGS) detects wheat stem rust resistance linked to environmental variables. Genetic Resources and Crop Evolution (in press). doi:10.1007/s10722-011-9775-5

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, A. Amri, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science 52, in press. doi: 10.2135/cropsci2011.08.0427
Transcript
Page 1: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Focused Identification of Germplasm Strategy (FIGS)

for wild relatives of the cultivated plants

Dag Endresen and Abdallah Bari PGR Secure (EU 7th Framework Program)

Workshop 9-13 January 2012, Madrid, Spain

Page 2: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

TOPICS:

• Trait mining with FIGS
– Predictive link between climate data and trait data
– Case studies:
  • Morphological traits, Nordic barley
  • Net blotch on barley
  • Stem rust on wheat
  • Stem rust, Ug99 on bread wheat landraces

Photo: Wheat at Alnarp, June 2010

Page 3: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Domestication and cultivated plants: Utilizing genetic potential from the wild

[Figure labels: teosinte, corn (maize); wild tomato, tomato; cultivation]

Page 4: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

A NEEDLE IN A HAY STACK

• Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.

• How does the scientist select a small subset likely to have the useful trait?


Page 5: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Challenges for utilization of plant genetic resources

• Large gene bank collections
• Limited screening capacity

Page 6: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

What is Focused Identification of Germplasm Strategy?

Origin of concept: Boron toxicity of wheat and barley, example from the late 1980s

[Map labels: South Australia, Mediterranean Sea]

Slide made by M.C. Mackay, 1995

Page 7: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY (FIGS)

[Figure: data layers sieve accessions based on latitude & longitude. Illustration by Mackay (1995)]

Origin of FIGS: Michael Mackay (1986, 1990, 1995)

Page 8: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Origin of FIGS: Michael Mackay (1986, 1990, 1995)


Page 9: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

OBJECTIVES OF FIGS

– Identification of plant germplasm with a higher likelihood of having desired genetic diversity for a target trait property.

– Using climate data to predict crop traits a priori, before the field trials.

Photo: Bread wheat at Nöbbelöv in Lund

Page 10: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

TRAIT MINING

• Focused Identification of Germplasm Strategy (FIGS).

• Identify new and useful genetic diversity for crop improvement.

• Based on eco-geographic data analysis using climate data.

Photo: European mountain ash (Sorbus aucuparia L.) at Alnarp, July 2004

Page 11: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY

Climate layers from the ICARDA ecoclimatic database (De Pauw, 2003)


Page 12: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

ASSUMPTION: The climate at the original source location, where the plant germplasm was developed, is correlated with the trait property.

AIM: To build a computer model explaining the crop trait score from the climate data.


Page 13: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

[Diagram: Focused Identification of Germplasm Strategy]

Genebank accessions (landraces & CWR)

High cost data: Trait data, from field trials (€€€)

Low cost data: Climate data, from geo-referencing of collecting places

Page 14: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

CLIMATE EFFECT DURING THE CULTIVATION PROCESS

Wild relatives are shaped by the environment

Primitive cultivated crops are shaped by local climate and humans

Traditional cultivated crops (landraces) are shaped by climate and humans

Modern cultivated crops are mostly shaped by humans (plant breeders)

Perhaps future crops are shaped in the molecular laboratory…?


Page 15: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

PREDICTIVE LINK BETWEEN ECO-GEOGRAPHY AND TRAITS

It is possible that the human mediated selection of landraces will contribute to the link between ecogeography and traits.

During traditional cultivation the farmer will select for and introduce germplasm for improved suitability of the landrace to the local conditions.


Page 16: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

CLIMATE DATA – WORLDCLIM

The climate data can be extracted from the WorldClim dataset, http://www.worldclim.org/ (Hijmans et al., 2005).

Data from weather stations worldwide are combined into a continuous surface layer.

Climate data for each landrace is extracted from this surface layer.

Precipitation: 20 590 stations

Temperature: 7 280 stations
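Extracting a climate value for a collecting site amounts to finding the grid cell that contains the site's coordinates. The following is a minimal sketch in Python, assuming a regular latitude/longitude grid held as a nested list. The grid values, the grid origin (`north`, `west`), `cell_size`, and the accession identifiers are all hypothetical and are not taken from WorldClim, which is distributed as raster files and would normally be read with a GIS library.

```python
def cell_index(lat, lon, north, west, cell_size):
    """Return (row, col) of the grid cell that contains the point (lat, lon).

    Row 0 is the northernmost row and column 0 the westernmost column.
    """
    row = int((north - lat) // cell_size)
    col = int((lon - west) // cell_size)
    return row, col


def extract_value(grid, lat, lon, north, west, cell_size):
    """Look up the climate value for one collecting site (None if outside the grid)."""
    row, col = cell_index(lat, lon, north, west, cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Hypothetical 3 x 4 grid of annual precipitation (mm) with 1 degree cells
# and its upper-left corner at 60 N, 10 E.
precip_grid = [
    [610, 640, 655, 700],
    [580, 600, 620, 660],
    [550, 570, 590, 615],
]

# Hypothetical collecting sites: (accession id, latitude, longitude).
sites = [("ACC-1", 58.5, 12.2), ("ACC-2", 59.9, 10.1), ("ACC-3", 50.0, 12.0)]

climate = {
    acc: extract_value(precip_grid, lat, lon, north=60.0, west=10.0, cell_size=1.0)
    for acc, lat, lon in sites
}
print(climate)
```

Sites that fall outside the grid return `None`, which mirrors the practical point that an accession without usable coordinates cannot be given climate data.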


Page 17: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

CLIMATE DATA

Layers used in these early FIGS studies:

• Precipitation (rainfall)
• Maximum temperatures
• Minimum temperatures

Some of the other layers available:

• Potential evapotranspiration (water-loss)
• Agro-climatic Zone (UNESCO classification)
• Soil classification (FAO Soil map)
• Aridity (dryness)

(mean values for month and year)

Eddy De Pauw (ICARDA, 2008)

Page 18: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

LIMITATIONS OF FIGS

• Landraces and wild relatives
– The link between climate data and the trait data is required for trait mining with FIGS. Modern cultivars are not expected to show this predictive link (complex pedigree).

• Georeferenced accessions
– Trait mining with FIGS is based on multivariate models using climate data from the source location of the germplasm. To extract climate data the accessions need to be accurately georeferenced.

Page 19: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MORPHOLOGICAL TRAITS IN NORDIC BARLEY LANDRACES

Field observations by Agnese Kolodinska Brantestam (2002-2003)

Multi-way N-PLS data analysis, Dag Endresen (2009-2010)

Experiment sites: Priekuli (LVA), Bjørke (NOR), Landskrona (SWE)

Page 20: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTI-WAY N-PLS RESULTS, NORDIC BARLEY LANDRACES

Experiment site | Year | Heading days | Ripening days | Length of plant | Harvest index | Volumetric weight | Thousand grain weight
LVA | 2002 (1) | n.s. | n.s. | n.s. | n.s. | *** | n.s.
LVA | 2003 | *** | n.s. | ** | ** | *** | n.s.
NOR | 2002 | - | * | ** | *** | ** | n.s.
NOR | 2003 | ** | *** | *** | * | * | n.s.
SWE | 2002 | ** | *** | n.s. | ** | * | n.s.
SWE | 2003 (2) | n.s. | ** | n.s. | n.s. | ** | n.s.

*** Significant at the 0.001 level (p-value)
** Significant at the 0.01 level
* Significant at the 0.05 level
n.s. Not significant (at the above levels)

(1) LVA 2002: Germination on spikes (very wet June)
(2) SWE 2003: Incomplete grain filling (very dry June)

Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174

Page 21: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

CLASSIFICATION PERFORMANCE

• Positive predictive value (PPV)
– PPV = True positives / (True positives + False positives)
– Classification performance for the identification of resistant samples (positives)

• Positive diagnostic likelihood ratio (LR+)
– LR+ = sensitivity / (1 – specificity)
– Less sensitive to prevalence than PPV
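The difference in prevalence sensitivity between the two measures can be illustrated with a small worked example. This is a sketch under the assumption that a classifier keeps the same sensitivity and specificity when applied to sets with different shares of resistant samples; the rates and prevalences used are hypothetical, not values from the studies cited in these slides.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV implied by sensitivity and specificity at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def lr_plus(sensitivity, specificity):
    """Positive diagnostic likelihood ratio; prevalence does not enter the formula."""
    return sensitivity / (1.0 - specificity)


# Hypothetical classifier applied to two sets with different prevalence.
sensitivity, specificity = 0.75, 0.875

ppv_low = ppv_from_rates(sensitivity, specificity, prevalence=0.10)
ppv_high = ppv_from_rates(sensitivity, specificity, prevalence=0.40)
ratio = lr_plus(sensitivity, specificity)

print(f"PPV at 10 % prevalence: {ppv_low:.2f}")
print(f"PPV at 40 % prevalence: {ppv_high:.2f}")
print(f"LR+ in both cases:      {ratio:.2f}")
```

With these inputs the PPV doubles when the prevalence rises from 10 % to 40 %, while LR+ is the same in both cases, which is why LR+ is the easier measure to compare across datasets.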


Page 22: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

NET BLOTCH ON BARLEY LANDRACES

Green dots indicate collecting sites for resistant barley landraces and red dots collecting sites for susceptible landraces.

USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041

Field experiments made in Minnesota, North Dakota and Georgia in the USA


Page 23: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTIVARIATE SIMCA RESULTS FOR NET BLOTCH ON BARLEY

Dataset (unit) | PPV | LR+ | Estimated gain
Net blotch (accession) | 0.54 (0.48-0.60) | 1.75 (1.42-2.17) | 1.35 (1.19-1.50)
Random | 0.40 (0.35-0.45) | 0.99 (0.84-1.17) | 0.99 (0.87-1.12)
(40 % resistant samples)

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717

PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio


Page 24: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

STEM RUST ON WHEAT LANDRACES

USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049

Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces.

Field experiments made in Minnesota by Don McVey


Page 25: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTIVARIATE SIMCA RESULTS FOR STEM RUST ON WHEAT

Dataset (unit) | PPV | LR+ | Estimated gain
Stem rust (accession) | 0.54 (0.50-0.59) | 3.07 (2.66-3.54) | 1.95 (1.79-2.09)
Random | 0.29 (0.26-0.33) | 1.04 (0.90-1.20) | 1.03 (0.91-1.16)
(28 % resistant samples)
Stem rust (site) | 0.50 (0.40-0.60) | 4.00 (2.85-5.66) | 2.51 (2.02-2.98)
Random | 0.19 (0.13-0.26) | 0.94 (0.63-1.39) | 0.95 (0.66-1.33)
(20 % resistant samples)

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717

PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio


Page 26: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTIVARIATE ANALYSIS, STEM RUST ON WHEAT

Classifier method | AUC | Cohen’s Kappa
Principal Component Regression (PCR) | 0.69 (0.68-0.70) | 0.40 (0.37-0.42)
Partial Least Squares (PLS) | 0.69 (0.68-0.70) | 0.41 (0.39-0.43)
Random Forest (RF) | 0.70 (0.69-0.71) | 0.42 (0.40-0.44)
Support Vector Machines (SVM) | 0.71 (0.70-0.72) | 0.44 (0.42-0.45)
Artificial Neural Networks (ANN) | 0.71 (0.70-0.72) | 0.44 (0.42-0.46)

Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat stem rust resistance linked to environment variables. Genetic Resources and Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published online 3 Dec 2011.

Abdallah Bari (ICARDA)

AUC = Area Under the ROC Curve (ROC, Receiver Operating Characteristic)

Page 27: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTIVARIATE SIMCA RESULTS, STEM RUST (UG99) ON WHEAT

Ug99 set with 4563 wheat landraces screened for Ug99 in Yemen in 2007, with 10.2 % resistant accessions. The true trait scores for 20 % of the accessions (825 samples) were revealed. We used trait mining with SIMCA to select 500 accessions more likely to be resistant from the remaining 3738 accessions (4563 minus 825), whose true scores were hidden from the person making the analysis. The FIGS set was observed to hold 25.8 % resistant samples, 2.3 times more than expected by chance.

Page 28: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTIVARIATE ANALYSIS RESULTS FOR STEM RUST (UG99) ON WHEAT

Classifier method | PPV | LR+ | Estimated gain
kNN (pre-study) | 0.29 (0.13-0.53) | 5.61 (2.21-14.28) | 4.14 (1.86-7.57)
SIMCA | 0.28 (0.14-0.48) | 5.26 (2.51-11.01) | 4.00 (2.00-6.86)
Ensemble classifier | 0.33 (0.12-0.65) | 8.09 (2.23-29.42) | 6.47 (2.05-11.06)
Random | 0.06 (0.01-0.27) | 0.95 (0.13-6.73) | 0.97 (0.16-4.35)
(pre-study, 550 + 275 accessions)
Ensemble | 0.26 (0.22-0.30) | 2.78 (2.34-3.31) | 2.32 (2.00-2.68)
Random | 0.11 (0.09-0.15) | 1.02 (0.77-1.36) | 0.95 (0.77-1.32)
(blind study, 825 + 3738 accessions)

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.

PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio


Page 29: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

A LIFEBOAT TO THE GENE POOL

• PDF available from:
– http://db.tt/lZMpwgJ

• Available from Libris (Sweden)
– ISBN: 978-91-628-8268-6

Page 30: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

SURFING THE GENEPOOL

Michael Clemet Mackay (2011)

• PDF available from:
– http://pub.epsilon.slu.se/8439/

• Available from Libris (Sweden)
– ISBN: 978-91-576-7634-4

Page 31: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Collecting seeds from some of the most important populations of the wild relatives of the cultivated plants, and maintaining them in ex situ genebanks, would improve access to these valuable plant genetic resources.

... but how to identify the most important CWR populations?

Page 32: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

GAP ANALYSIS FOR MANAGEMENT OF GENEBANK COLLECTIONS

• Advise the planning of new collecting/gathering expeditions

– Identification of relevant areas where the crop species is predicted to be present (using GBIF data and niche models).

– Focus on areas least well represented in the genebank collection (maximize diversity).

– Focus on areas with a higher likelihood for a desired target trait (FIGS).

See http://gisweb.ciat.cgiar.org/GapAnalysis/ for more information.

Page 33: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

WORMWOOD (ARTEMISIA ABSINTHIUM L.)
NORDGEN STUDY: JUNE 2010

Species distribution model (7 364 records), using the Maxent desktop software.

Photo: Wormwood (Artemisia absinthium L.)

Page 34: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

POTENTIAL OF THE GBIF TECHNOLOGY

The compatibility of data standards between PGR and biodiversity collections made it possible to integrate the worldwide germplasm collections into the biodiversity community (TDWG, GBIF).

Using GBIF/TDWG technology (and contributing to its development), the PGR community can more easily establish specific PGR networks without duplicating GBIF's work.

http://data.gbif.org/datasets/network/2

Page 35: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Data analysis methods for using the FIGS approach with wild relatives of the cultivated plants

Dag Endresen and Abdallah Bari

PGR Secure (EU 7th Framework Program) Workshop 9-13 January 2012, Madrid, Spain

Page 36: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

STEPS TO FOLLOW:

• Data collection and preparation
• Geo-referencing of collecting locations
• Initial data exploration
• Pre-processing of dataset
• Choose modeling method
• Calibration of model
• Validation of model
• Validation of prediction results

Page 38: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

EXPLORE THE DATASET (FOR OUTLIERS)

The influence plot (residuals against leverage) shows sample NGB6300 (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage - well separated from the “data cloud”.

After looking into the raw data, this observation point was removed as outlier (set to NaN).
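Leverage and residuals, the two axes of an influence plot, can be computed by hand for a simple model. The sketch below uses an ordinary least-squares line with one predictor as a simplified stand-in for the multi-way model used in the study; the data are hypothetical, and the cut-off of twice the average leverage is a common rule of thumb assumed here, not a threshold stated in the slides.

```python
def leverage_and_residuals(x, y):
    """Fit y = a + b*x by least squares and return (leverages, residuals)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    # Leverage of sample i for a straight-line fit with an intercept.
    leverages = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return leverages, residuals


# Hypothetical data: the last sample lies far from the rest of the data cloud.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 15.0]

leverages, residuals = leverage_and_residuals(x, y)

n_params = 2  # intercept and slope
threshold = 2.0 * n_params / len(x)  # assumed rule of thumb: twice the mean leverage
flagged = [i for i, h in enumerate(leverages) if h > threshold]

for i, (h, r) in enumerate(zip(leverages, residuals)):
    print(f"sample {i}: leverage={h:.3f} residual={r:+.3f}")
print("flagged for inspection:", flagged)
```

A flagged sample is only a candidate for inspection; as in the slide, the decision to remove it should follow from looking at the raw data.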


Page 39: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

PRE-PROCESSING


Here: Across mode 2 (traits)

Mean centering removes the absolute intensity, so that the model does not focus on the variables with the highest numerical values (intensity).

Scaling makes the relative distribution of values (range spread) more equal between variables. Auto-scaling is a combination of mean centering and variance scaling. After auto-scaling all variables have a mean of zero and a standard deviation of one. The objective is to help the model to separate the relevant information from the noise.
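The two pre-processing steps can be written out in a few lines. This is a minimal sketch that treats each variable as a plain list of numbers; the variable names and values are hypothetical, and the sample standard deviation (n - 1 in the denominator) is assumed for the variance scaling.

```python
from statistics import mean, stdev


def mean_center(column):
    """Subtract the column mean so the variable is centered on zero."""
    m = mean(column)
    return [v - m for v in column]


def auto_scale(column):
    """Mean center, then divide by the sample standard deviation."""
    s = stdev(column)
    return [v / s for v in mean_center(column)]


# Hypothetical climate variables on very different numerical scales.
raw = {
    "precipitation_mm": [420.0, 610.0, 550.0, 780.0, 690.0],
    "max_temp_c": [14.2, 11.8, 12.5, 9.9, 10.6],
}

scaled = {name: auto_scale(values) for name, values in raw.items()}

for name, values in scaled.items():
    print(f"{name}: mean={mean(values):.3f} sd={stdev(values):.3f}")
```

After auto-scaling, precipitation in millimetres and temperature in degrees contribute on the same footing, which is the point made in the text above.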

Page 40: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

SOME MORE PRE-PROCESSING EXAMPLES

[Figure: panels comparing no preprocessing, mean centering and auto scale for Priekuli, Bjørke and Landskrona]

Page 41: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

A MODEL OF THE REAL WORLD

• Validation of the model
– No model can ever be absolutely correct
– A simulation model can only be an approximation
– A model is always created for a specific purpose

• Apply the model
– The simulation model is applied to make predictions based on new fresh data
– Take care to avoid extrapolation problems

Page 42: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

DATA FOR THE TRAIT MINING MODEL

• Training set
– For the initial calibration or training step.

• Calibration set
– Further calibration, tuning step
– Often cross-validation on the training set is used to reduce the consumption of raw data.

• Test set
– For the model validation or goodness of fit testing.
– External data, not used in the model calibration.
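The three roles above can be sketched as a hold-out split followed by k-fold cross-validation on the training part. The accession identifiers, the 25 % test fraction, the five folds, and the random seed are hypothetical choices made for illustration only.

```python
import random


def train_test_split(sample_ids, test_fraction, seed):
    """Shuffle the samples and set aside a test set that is never used for calibration."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(sample_ids, k):
    """Yield (calibration, held_out) pairs for k-fold cross-validation."""
    folds = [sample_ids[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        calibration = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield calibration, held_out


# Hypothetical accession identifiers.
accessions = [f"ACC-{i:03d}" for i in range(1, 21)]

training, test = train_test_split(accessions, test_fraction=0.25, seed=1)
folds = list(k_fold(training, k=5))

print("training:", len(training), "test:", len(test))
for calibration, held_out in folds:
    print("calibrate on", len(calibration), "tune on", held_out)
```

Cross-validation reuses the training samples for tuning, so no separate calibration set has to be carved out of the raw data, while the test set stays external.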


Page 43: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MULTI-WAY DATA STRUCTURE (N-PLS)

[Figure: a three-way data array with 14 samples (mode 1), 12 months (mode 2: Jan, Feb, Mar, …) and 3 climate variables (mode 3: precipitation, max temp, min temp). The three levels of mode 3 (min. temperature, max. temperature, precipitation) are placed side by side to unfold the array into a two-way table of 14 samples by 36 variables.]
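The relation between the three-way array and its unfolded two-way form can be sketched with nested lists. The placeholder values below simply encode each cell's position, and the order of the three variable blocks in the unfolded table is an assumption made for this illustration.

```python
N_SAMPLES, N_MONTHS, N_VARIABLES = 14, 12, 3
VARIABLES = ["min_temp", "max_temp", "precipitation"]  # assumed block order

# Hypothetical three-way array indexed as cube[sample][month][variable].
# Each placeholder value encodes its own position so the unfolding can be checked.
cube = [
    [[1000.0 * s + 100.0 * v + m for v in range(N_VARIABLES)] for m in range(N_MONTHS)]
    for s in range(N_SAMPLES)
]


def unfold(cube):
    """Unfold samples x months x variables into samples x (variables * months).

    Each row holds the 12 monthly values of the first variable, then the
    12 values of the second variable, and so on.
    """
    rows = []
    for sample in cube:
        n_variables = len(sample[0])
        row = [month_values[v] for v in range(n_variables) for month_values in sample]
        rows.append(row)
    return rows


matrix = unfold(cube)
print(len(matrix), "samples x", len(matrix[0]), "variables")

# Column v * 12 + m of the unfolded table holds variable v for month m.
sample, month, variable = 3, 4, 1
print(VARIABLES[variable], matrix[sample][variable * N_MONTHS + month])
```

Unfolding keeps every value but discards the information that columns belong to the same month or the same variable, which is what multi-way methods such as N-PLS retain.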


Page 44: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

SPLIT-HALF MODEL VALIDATION

The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to a solution very similar to that of the model calibrated from the complete dataset.

The PARAFAC model is thus a general and stable model for the scope of Scandinavia.

Example used here is the Trait data model (mode 1) from Endresen (2010).


Page 45: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

SPLIT-HALF MODEL VALIDATION

PARAFAC climate data (3-way)

Page 46: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

RESIDUALS (VALIDATE MODEL FIT)

Be aware of over-fitting! NB! Model validation!

The residuals are the distances between the model (predictions) and the reference values (validation).

Example of a bad model calibration

Cross-validation indicates the appropriate model complexity.

Calibration step

Page 47: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

MODELING METHODS


Page 48: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

DATA MODELING METHODS

• Parallel Factor Analysis (PARAFAC) (Multi-way)
• Multi-linear Partial Least Squares (N-PLS) (Multi-way)
• Soft Independent Modeling of Class Analogy (SIMCA)
• k-Nearest Neighbor (kNN)
• Partial Least Squares Discriminant Analysis (PLS-DA)
• Linear Discriminant Analysis (LDA)
• Principal component logistic regression (PCLR)
• Generalized Partial Least Squares (GPLS)
• Random Forests (RF)
• Neural Networks (NN)
• Support Vector Machines (SVM)

The methods above were used by Endresen (2010), Endresen et al. (2011, 2012), and Bari et al. (2012).

• Boosted Regression Trees (BRT)
• Multivariate Regression Trees (MRT)
• Bayesian Regression Trees

Page 49: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

SIMCA ANALYSIS (PCA MODEL FOR EACH CLASS)

[Figure: class models with 1 PC, 2 PCs and 3 PCs drawn in the space of principal components 1, 2 and 3. Illustration modified from Wise et al., 2006:201 (PLS Toolbox software manual)]

Example from the stem rust set: resistant samples, intermediate, susceptible.

Page 50: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

VALIDATION OF RESULTS

• Pearson product-moment correlation (R) (-1 to 1)
• Coefficient of determination (R2) (0 to 1)
• Cohen’s Kappa (K) (-1 to 1)
• Proportion observed agreement (PO) (0 to 1)
• Proportion positive agreement (PA) (0 to 1)
• Positive predictive value (PPV) (0 to 1)
• Positive diagnostic likelihood ratio (LR+) (from 0)
• Sensitivity and specificity
• Area under the curve (AUC)
– Receiver operating characteristics (ROC)

• Root mean square error (RMSE)
– RMSE of calibration (RMSEC)
– RMSE of cross-validation (RMSECV)
– RMSE of prediction (RMSEP)

• Predicted residual sum of squares (PRESS)
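The error measures at the end of this list differ only in which samples the model has seen. The sketch below computes RMSEC, PRESS and RMSECV for a straight-line model with leave-one-out cross-validation. The one-predictor model stands in for the multivariate models used in the studies, the data are hypothetical, and dividing by the number of samples (rather than by degrees of freedom) is an assumed convention.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def rmse_of_calibration(x, y):
    """RMSEC: error of the model on the same samples it was fitted to."""
    intercept, slope = fit_line(x, y)
    squared = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return sqrt(sum(squared) / len(squared))


def press_leave_one_out(x, y):
    """PRESS: each sample is predicted by a model fitted without that sample."""
    total = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total


# Hypothetical climate variable and trait scores for eight accessions.
climate = [3.1, 4.0, 5.2, 6.1, 6.9, 8.3, 9.0, 10.2]
trait = [61.0, 63.5, 64.1, 67.0, 68.2, 69.9, 73.1, 74.0]

rmse_cal = rmse_of_calibration(climate, trait)
press = press_leave_one_out(climate, trait)
rmsecv = sqrt(press / len(trait))

print(f"RMSEC  = {rmse_cal:.3f}")
print(f"PRESS  = {press:.3f}")
print(f"RMSECV = {rmsecv:.3f}")
```

RMSECV is larger than RMSEC here because each held-out sample is predicted without having influenced the fit; a large gap between the two is a warning sign of over-fitting.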


Page 51: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

CONFUSION MATRIX

Observed (actual) | Predicted resistant (positive) | Predicted susceptible (negative)
Resistant (positive) | True positive (TP) | False negative (FN)
Susceptible (negative) | False positive (FP) | True negative (TN)

Proportion observed agreement (PO) = (TP + TN) / N

Proportion positive agreement (PA) = (2 * TP) / (2 * TP + FP + FN)

Positive predictive value (PPV) = TP / (TP + FP)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Positive diagnostic likelihood ratio (LR+) = Sensitivity / (1 – Specificity) = (TP / [TP + FN]) / (FP / [FP + TN])

Odds ratio (OR) = (TP * TN) / (FN * FP)

Yule’s Q = (OR - 1) / (OR + 1)

Positive predictive gain (Gain) = PPV / prevalence
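The formulas above translate directly into code. This is a minimal sketch with hypothetical counts; prevalence is taken to be the observed share of resistant samples, (TP + FN) / N, which is an assumption since the slide does not spell it out. The function does not guard against zero counts, which would cause a division by zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the measures listed on the slide from the four confusion-matrix counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    prevalence = (tp + fn) / n  # assumed definition: share of resistant samples
    odds_ratio = (tp * tn) / (fn * fp)
    return {
        "PO": (tp + tn) / n,
        "PA": (2 * tp) / (2 * tp + fp + fn),
        "PPV": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "LR+": sensitivity / (1 - specificity),
        "OR": odds_ratio,
        "Yule's Q": (odds_ratio - 1) / (odds_ratio + 1),
        "gain": ppv / prevalence,
    }


# Hypothetical counts: 200 accessions, 40 of them resistant.
metrics = confusion_metrics(tp=30, fp=20, fn=10, tn=140)

for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

A gain above 1 means the selected subset is richer in resistant samples than a random draw of the same size.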

Page 52: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

COEFFICIENT OF DETERMINATION (R2)

Predictions for the cross-validated (leave-one-out) samples for an N-PLS model, Endresen (2010), for trait 5 (volumetric weight) observations from Priekuli, 2002 (mean of the replications) and 6 principal components.

Coefficient of determination: r2 = 0.96

Root mean square error of cross-validation: RMSECV = 2.86

[Figure: predicted trait scores plotted against actual trait scores, with RMSEC and RMSECV indicated]

Page 53: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

COEFFICIENT OF DETERMINATION (R2)

Predictions for the cross-validated (leave-one-out) samples with an N-PLS model, Endresen (2010), for trait 5 (volumetric weight) observations from Bjørke, 2002 (mean of the replications) and 4 principal components.

Coefficient of determination: r2 = 0.84

Root mean square error of cross-validation: RMSECV = 3.66

[Figure: predicted trait scores plotted against actual trait scores, with RMSEC and RMSECV indicated]

Page 54: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

SIGNIFICANCE LEVELS

• The critical levels (α) for p-value significance are often set at 0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %).

• The significance level is often marked with one star (*) for the 0.05 level, two stars (**) for the 0.01 level and three stars (***) for the 0.001 level.

– 5 % (if an experiment is repeated 20 times, a purely random effect is likely to be observed one time)

– 1 % (if an experiment is repeated 100 times, a purely random effect is likely to be observed one time)

– 0.1 % (if an experiment is repeated 1000 times, a purely random effect is likely to be observed one time)
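The "one time in 20" reading can be checked with a short calculation. The sketch assumes independent repetitions of an experiment in which there is no true effect, so every significant result is a false positive; under that assumption the expected number of false positives is α times the number of repetitions.

```python
def expected_false_positives(alpha, n_experiments):
    """Expected number of significant results when there is no true effect."""
    return alpha * n_experiments


def prob_at_least_one(alpha, n_experiments):
    """Chance of seeing one or more false positives in n independent experiments."""
    return 1.0 - (1.0 - alpha) ** n_experiments


results = []
for alpha, n in [(0.05, 20), (0.01, 100), (0.001, 1000)]:
    expected = expected_false_positives(alpha, n)
    at_least_one = prob_at_least_one(alpha, n)
    results.append((alpha, n, expected, at_least_one))
    print(f"alpha={alpha}: {n} experiments, expected false positives={expected:.1f}, "
          f"P(at least one)={at_least_one:.2f}")
```

In each of the three cases one false positive is expected on average, and the chance of seeing at least one is a little under two in three.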

Page 55: FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

PGR Secure (EU 7th Framework)
Workshop FIGS approach
9-13 Jan 2012, Madrid

Dag Endresen

Abdallah Bari (ICARDA)