FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)

Focused Identification of Germplasm Strategy (FIGS) for wild relatives of the cultivated plants

Dag Endresen and Abdallah Bari PGR Secure (EU 7th Framework Program)

Workshop 9-13 January 2012, Madrid, Spain

TOPICS:

Wheat at Alnarp, June 2010

• Trait mining with FIGS

– Predictive link between climate data and trait data

– Case studies:
• Morphological traits, Nordic barley
• Net blotch on barley
• Stem rust on wheat
• Stem rust, Ug99 on bread wheat landraces

Domestication and cultivated plants: Utilizing genetic potential from the wild

[Figure labels: teosinte and corn (maize); wild tomato and tomato; cultivation]

A NEEDLE IN A HAY STACK

• Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.

• How does the scientist select a small subset likely to have the useful trait?


Challenges for utilization of plant genetic resources

• Large gene bank collections
• Limited screening capacity

Origin of concept: the boron toxicity example for wheat and barley from the late 1980s

What is Focused Identification of Germplasm Strategy?

Slide made by M.C. Mackay, 1995

[Map labels: South Australia; Mediterranean Sea]

FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY (FIGS)

Data layers sieve accessions based on latitude & longitude

Illustration by Mackay (1995)

Origin of FIGS: Michael Mackay (1986, 1990, 1995)


OBJECTIVES OF FIGS

– Identify plant germplasm with a higher likelihood of holding the desired genetic diversity for a target trait.

– Use climate data to predict crop traits a priori, BEFORE the field trials.

Bread wheat at Nöbbelöv in Lund

TRAIT MINING

• Focused Identification of Germplasm Strategy (FIGS).

• Identify new and useful genetic diversity for crop improvement.

• Based on eco-geographic data analysis using climate data.

European mountain ash (Sorbus aucuparia L.) at Alnarp, July 2004

FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY

Climate layers from the ICARDA ecoclimatic database (De Pauw, 2003)


ASSUMPTION: The climate at the original source location, where the plant germplasm was developed, is correlated with the trait property.

AIM: To build a computer model explaining the crop trait score from the climate data.
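As a minimal sketch of this aim, the trait score can be regressed on a single climate variable with ordinary least squares. The data values and the helper name `fit_line` are hypothetical and only illustrate the idea; the studies presented here use multivariate models.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical example data (not from the studies):
# annual precipitation (mm) at the collecting site and an observed trait score.
precip = [300.0, 450.0, 600.0, 750.0, 900.0]
trait = [2.0, 3.1, 3.9, 5.2, 5.8]

intercept, slope = fit_line(precip, trait)

# Predict the trait score for an accession collected at a site with 500 mm.
predicted = intercept + slope * 500.0
```

The fitted line is then applied to accessions that have climate data but no trait observations yet.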


[Diagram labels: Genebank accessions (landraces & CWR); Trait data: high cost data, field trials (€€€); Climate data: low cost data, geo-referencing of collecting places; Focused Identification of Germplasm Strategy]

CLIMATE EFFECT DURING THE CULTIVATION PROCESS

Wild relatives are shaped by the environment

Primitive cultivated crops are shaped by local climate and humans

Traditional cultivated crops (landraces) are shaped by climate and humans

Modern cultivated crops are mostly shaped by humans (plant breeders)

Perhaps future crops are shaped in the molecular laboratory…?


PREDICTIVE LINK BETWEEN ECO-GEOGRAPHY AND TRAITS

It is possible that the human mediated selection of landraces will contribute to the link between ecogeography and traits.

During traditional cultivation the farmer will select for and introduce germplasm for improved suitability of the landrace to the local conditions.


CLIMATE DATA – WORLDCLIM

The climate data can be extracted from the WorldClim dataset, http://www.worldclim.org/ (Hijmans et al., 2005).

Data from weather stations worldwide are combined to a continuous surface layer.

Climate data for each landrace is extracted from this surface layer.

Precipitation: 20 590 stations

Temperature: 7 280 stations
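Extracting the climate value for a georeferenced landrace from such a surface layer amounts to finding the grid cell that contains the collecting site. The sketch below assumes a regular latitude/longitude grid stored as nested lists; the grid values, cell size and the function name `cell_value` are hypothetical and do not reflect the WorldClim file format.

```python
import math


def cell_value(grid, lat, lon, north, west, cell_size):
    """Return the value of the grid cell containing (lat, lon).

    grid[row][col]: row 0 is the northernmost row, col 0 the westernmost column.
    north/west give the outer edge of the grid in decimal degrees.
    """
    row = math.floor((north - lat) / cell_size)
    col = math.floor((lon - west) / cell_size)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError("coordinate lies outside the grid")
    return grid[row][col]


# Hypothetical 3 x 3 precipitation surface with 1-degree cells,
# covering latitudes 55-58 N and longitudes 8-11 E.
precip_grid = [
    [40, 42, 44],
    [50, 52, 54],
    [60, 62, 64],
]

# A collecting site at 56.61 N, 9.30 E falls in the middle cell.
site_precip = cell_value(precip_grid, 56.61, 9.30, north=58.0, west=8.0, cell_size=1.0)
```

Real surface layers have far smaller cells, so the accuracy of the extracted value depends directly on the accuracy of the coordinates.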


CLIMATE DATA

Layers used in these early FIGS studies:

• Precipitation (rainfall)
• Maximum temperatures
• Minimum temperatures

Some of the other layers available:

• Potential evapotranspiration (water-loss)
• Agro-climatic Zone (UNESCO classification)
• Soil classification (FAO Soil map)
• Aridity (dryness)

(mean values for month and year)

Eddy De Pauw (ICARDA, 2008)

LIMITATIONS OF FIGS

• Landraces and wild relatives
– The link between climate data and the trait data is required for trait mining with FIGS. Modern cultivars are not expected to show this predictive link (complex pedigree).

• Georeferenced accessions
– Trait mining with FIGS is based on multivariate models using climate data from the source location of the germplasm. To extract climate data the accessions need to be accurately georeferenced.

MORPHOLOGICAL TRAITS IN NORDIC BARLEY LANDRACES

Field observations by Agnese Kolodinska Brantestam (2002-2003)

Multi-way N-PLS data analysis, Dag Endresen (2009-2010)

Priekuli (LVA), Bjørke (NOR), Landskrona (SWE)

MULTI-WAY N-PLS RESULTS: NORDIC BARLEY LANDRACES

Experiment             Heading  Ripening  Length    Harvest  Volumetric  Thousand
Site   Year            days     days      of plant  index    weight      grain weight
LVA    2002 [1]        n.s.     n.s.      n.s.      n.s.     ***         n.s.
LVA    2003            ***      n.s.      **        **       ***         n.s.
NOR    2002            -        *         **        ***      **          n.s.
NOR    2003            **       ***       ***       *        *           n.s.
SWE    2002            **       ***       n.s.      **       *           n.s.
SWE    2003 [2]        n.s.     **        n.s.      n.s.     **          n.s.

*** Significant at the 0.001 level (p-value)
** Significant at the 0.01 level
* Significant at the 0.05 level
n.s. Not significant (at the above levels)

[1] LVA 2002: Germination on spikes (very wet June)
[2] SWE 2003: Incomplete grain filling (very dry June)

Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174


CLASSIFICATION PERFORMANCE

• Positive predictive value (PPV)
– PPV = True positives / (True positives + False positives)
– Classification performance for the identification of resistant samples (positives)

• Positive diagnostic likelihood ratio (LR+)
– LR+ = sensitivity / (1 – specificity)
– Less sensitive to prevalence than PPV
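The difference in sensitivity to prevalence can be illustrated with a short sketch: for a classifier with fixed sensitivity and specificity, LR+ stays the same while PPV changes with the share of resistant samples. The function names and the example rates are hypothetical.

```python
def lr_plus(sensitivity, specificity):
    """Positive diagnostic likelihood ratio."""
    return sensitivity / (1 - specificity)


def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV expected for a classifier applied to a set with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: sensitivity 0.6, specificity 0.8.
ratio = lr_plus(0.6, 0.8)                    # does not depend on prevalence
ppv_rare = ppv_from_rates(0.6, 0.8, 0.10)    # 10 % resistant samples
ppv_common = ppv_from_rates(0.6, 0.8, 0.40)  # 40 % resistant samples
```

With these rates the same classifier reaches a clearly higher PPV in the set where resistance is more common, which is why PPV values from datasets with different prevalence are hard to compare directly.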


NET BLOTCH ON BARLEY LANDRACES

Green dots indicate collecting sites for resistant barley landraces and red dots collecting sites for susceptible landraces.

USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041

Field experiments made in Minnesota, North Dakota and Georgia in the USA


MULTIVARIATE SIMCA RESULTS FOR NET BLOTCH ON BARLEY

Dataset (unit)           PPV               LR+               Estimated gain
Net blotch (accession)   0.54 (0.48-0.60)  1.75 (1.42-2.17)  1.35 (1.19-1.50)
Random                   0.40 (0.35-0.45)  0.99 (0.84-1.17)  0.99 (0.87-1.12)
(40 % resistant samples)

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717

PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio


STEM RUST ON WHEAT LANDRACES

USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049

Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces.

Field experiments made in Minnesota by Don McVey


MULTIVARIATE SIMCA RESULTS FOR STEM RUST ON WHEAT

Dataset (unit)           PPV               LR+               Estimated gain
Stem rust (accession)    0.54 (0.50-0.59)  3.07 (2.66-3.54)  1.95 (1.79-2.09)
Random                   0.29 (0.26-0.33)  1.04 (0.90-1.20)  1.03 (0.91-1.16)
(28 % resistant samples)

Stem rust (site)         0.50 (0.40-0.60)  4.00 (2.85-5.66)  2.51 (2.02-2.98)
Random                   0.19 (0.13-0.26)  0.94 (0.63-1.39)  0.95 (0.66-1.33)
(20 % resistant samples)

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717



MULTIVARIATE ANALYSIS: STEM RUST ON WHEAT

Classifier method                       AUC               Cohen’s Kappa
Principal Component Regression (PCR)    0.69 (0.68-0.70)  0.40 (0.37-0.42)
Partial Least Squares (PLS)             0.69 (0.68-0.70)  0.41 (0.39-0.43)
Random Forest (RF)                      0.70 (0.69-0.71)  0.42 (0.40-0.44)
Support Vector Machines (SVM)           0.71 (0.70-0.72)  0.44 (0.42-0.45)
Artificial Neural Networks (ANN)        0.71 (0.70-0.72)  0.44 (0.42-0.46)

Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat stem rust resistance linked to environment variables. Genetic Resources and Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published online 3 Dec 2011.

Abdallah Bari (ICARDA)

AUC = Area Under the ROC Curve (ROC, Receiver Operating Characteristic)


MULTIVARIATE SIMCA RESULTS: STEM RUST (UG99) ON WHEAT

Ug99 set with 4563 wheat landraces screened for Ug99 in Yemen 2007, 10.2 % resistant accessions. The true trait scores for 20% of the accessions (825 samples) were revealed. We used trait mining with SIMCA to select the 500 accessions most likely to be resistant from the 3738 accessions whose true scores were hidden (to the person making the analysis). The FIGS set was observed to hold 25.8 % resistant samples, 2.3 times more than expected by chance.

MULTIVARIATE ANALYSIS RESULTS FOR STEM RUST (UG99) ON WHEAT

Classifier method     PPV               LR+                Estimated gain
kNN (pre-study)       0.29 (0.13-0.53)  5.61 (2.21-14.28)  4.14 (1.86-7.57)
SIMCA                 0.28 (0.14-0.48)  5.26 (2.51-11.01)  4.00 (2.00-6.86)
Ensemble classifier   0.33 (0.12-0.65)  8.09 (2.23-29.42)  6.47 (2.05-11.06)
Random                0.06 (0.01-0.27)  0.95 (0.13-6.73)   0.97 (0.16-4.35)
(pre-study, 550 + 275 accessions)

Ensemble              0.26 (0.22-0.30)  2.78 (2.34-3.31)   2.32 (2.00-2.68)
Random                0.11 (0.09-0.15)  1.02 (0.77-1.36)   0.95 (0.77-1.32)
(blind study, 825 + 3738 accessions)

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.



A LIFEBOAT TO THE GENE POOL

• PDF available from:
– http://db.tt/lZMpwgJ

• Available from Libris (Sweden)
– ISBN: 978-91-628-8268-6

SURFING THE GENEPOOL

Michael Clement Mackay (2011)

• PDF available from:
– http://pub.epsilon.slu.se/8439/

• Available from Libris (Sweden)
– ISBN: 978-91-576-7634-4


Collecting seeds from some of the most important populations of the wild relatives of cultivated plants, and maintaining them in ex situ genebanks, would improve access to these valuable plant genetic resources.

... but how to identify the most important CWR populations?

GAP ANALYSIS FOR MANAGEMENT OF GENEBANK COLLECTIONS

• Advise the planning of new collecting/gathering expeditions

– Identification of relevant areas where the crop species is predicted to be present (using GBIF data and niche models).

– Focus on areas least well represented in the genebank collection (maximize diversity).

– Focus on areas with a higher likelihood for a desired target trait (FIGS).

See http://gisweb.ciat.cgiar.org/GapAnalysis/ for more information.
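The three criteria listed above could be combined into a single priority score for candidate collecting sites. The sketch below is only one hypothetical way to do so: the field names, the scoring formula and the example values are all assumptions made for illustration and are not the method of the gap analysis project.

```python
def rank_sites(sites):
    """Sort candidate sites by a simple priority score, highest first."""
    def score(site):
        # Fewer accessions already held from the area gives a larger gap weight.
        gap_weight = 1.0 / (1 + site["accessions_nearby"])
        return site["presence_prob"] * gap_weight * site["trait_prob"]
    return sorted(sites, key=score, reverse=True)


# Hypothetical candidate sites:
# presence_prob     - predicted probability that the species occurs (niche model)
# accessions_nearby - accessions already held in the genebank from the area
# trait_prob        - FIGS-predicted likelihood of the target trait
candidates = [
    {"name": "A", "presence_prob": 0.9, "accessions_nearby": 20, "trait_prob": 0.5},
    {"name": "B", "presence_prob": 0.7, "accessions_nearby": 0, "trait_prob": 0.4},
    {"name": "C", "presence_prob": 0.2, "accessions_nearby": 0, "trait_prob": 0.9},
]

priority_order = [site["name"] for site in rank_sites(candidates)]
```

Site A is well represented in the collection already, so it drops to the bottom of the list even though the species is very likely to be present there.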

WORMWOOD (ARTEMISIA ABSINTHIUM L.), NORDGEN STUDY: JUNE 2010

Species distribution model (7 364 records), using the Maxent desktop software.

POTENTIAL OF THE GBIF TECHNOLOGY

The compatibility of data standards between PGR and biodiversity collections made it possible to integrate the worldwide germplasm collections into the biodiversity community (TDWG, GBIF).

http://data.gbif.org/datasets/network/2

Using GBIF/TDWG technology (and contributing to its development), the PGR community can more easily establish specific PGR networks without duplicating GBIF's work.

Data analysis methods for using the FIGS approach with wild relatives of the cultivated plants

Dag Endresen and Abdallah Bari

PGR Secure (EU 7th Framework Program), Workshop 9-13 January 2012, Madrid, Spain

STEPS TO FOLLOW:

• Data collection and preparation
• Geo-referencing of collecting locations
• Initial data exploration
• Pre-processing of dataset
• Choose modeling method
• Calibration of model
• Validation of model
• Validation of prediction results

GEO-REFERENCING FOR THE SOURCE COLLECTING SITE

Example of georeferencing for NGB9529, a barley landrace reported as originating from Lyderupgaard, using KRAK.dk and maps.google.com

http://www.krak.dk/query?mop=aq&mapstate=7;9.305588071850734;56.61105751259899;h;9.282591620463698;56.61775781407488;9.328584523237769;56.60435721112311;853;469&what=map_adr

http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=107144586665622662057.00045ff98921bd0418037&ll=56.606941,9.297695&spn=0.055554,0.150204&t=h&z=13

EXPLORE THE DATASET (FOR OUTLIERS)

The influence plot (residuals against leverage) shows sample NGB6300 (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage - well separated from the “data cloud”.

After inspecting the raw data, this observation point was removed as an outlier (set to NaN).
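A minimal sketch of the leverage part of such a screening, assuming a single variable and the common rule of thumb that a leverage above 2p/n (p = 2 model parameters) deserves a closer look. The values are hypothetical; the influence plot described above comes from a multivariate model, and flagged points should still be checked against the raw data before removal.

```python
import math


def leverages(x):
    """Leverage of each observation in a one-variable linear model."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical observations; the last value lies far from the "data cloud".
values = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 25.0]

h = leverages(values)
threshold = 2 * 2 / len(values)  # rule of thumb 2p/n with p = 2 (assumption)

# Replace high-leverage observations with NaN so later steps can skip them.
cleaned = [math.nan if hi > threshold else v for v, hi in zip(values, h)]
```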


PRE-PROCESSING


Here: Across mode 2 (traits)

Mean centering removes the absolute intensity so that the model does not focus on the variables with the highest numerical values (intensity).

Scaling makes the relative distribution of values (range spread) more equal between variables. Auto-scaling is a combination of mean centering and variance scaling. After auto-scaling all variables have a mean of zero and a standard deviation of one. The objective is to help the model to separate the relevant information from the noise.
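Both pre-processing steps are simple to express for one variable (one column of the data table). This is a minimal sketch with hypothetical values, using the sample standard deviation for the variance scaling.

```python
import statistics


def mean_center(column):
    """Subtract the column mean from every value."""
    m = statistics.mean(column)
    return [v - m for v in column]


def auto_scale(column):
    """Mean centering followed by scaling to unit standard deviation."""
    m = statistics.mean(column)
    s = statistics.stdev(column)
    return [(v - m) / s for v in column]


# Hypothetical trait observations for one variable.
column = [2.0, 4.0, 6.0, 8.0]

centered = mean_center(column)  # mean becomes 0, spread is unchanged
scaled = auto_scale(column)     # mean 0 and standard deviation 1
```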

[Figure panels: No preprocessing; Mean centering; Auto scale. Shown for Priekuli, Bjørke and Landskrona.]

SOME MORE PRE-PROCESSING EXAMPLES


A MODEL OF THE REAL WORLD

• Validation of the model
– No model can ever be absolutely correct
– A simulation model can only be an approximation
– A model is always created for a specific purpose

• Apply the model
– The simulation model is applied to make predictions based on new fresh data
– Take care to avoid extrapolation problems

DATA FOR THE TRAIT MINING MODEL

• Training set
– For the initial calibration or training step.

• Calibration set
– Further calibration, tuning step
– Often cross-validation on the training set is used to reduce the consumption of raw data.

• Test set
– For the model validation or goodness of fit testing.
– External data, not used in the model calibration.
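The division of the data described above can be sketched as follows: a test set is put aside first, and cross-validation folds are then drawn from the remaining training data. The function names, the 25 % test fraction and the fixed random seed are assumptions chosen for the example.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and put a fraction aside as an external test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(samples, k):
    """Yield (calibration, validation) pairs for k-fold cross-validation."""
    for i in range(k):
        validation = samples[i::k]
        calibration = [s for j, s in enumerate(samples) if j % k != i]
        yield calibration, validation


# Hypothetical accession identifiers 0..19.
accessions = list(range(20))
train, test = split_train_test(accessions)

# Five cross-validation folds drawn from the training set only.
folds = list(k_fold(train, 5))
```

The test set is never touched during calibration or tuning, so it gives an honest estimate of how the model performs on new data.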


MULTI-WAY DATA STRUCTURE (N-PLS)

[Figure: the climate data as a three-way array of 14 samples (mode 1) × 12 months (mode 2: Jan, Feb, Mar, …) × 3 climate variables (mode 3: precipitation, max temp, min temp). Unfolded, the three levels of mode 3 (min. temperature, max. temperature, precipitation) are placed side by side, giving a table of 14 samples × 36 variables.]
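The relation between the three-way array and the unfolded two-way table can be sketched with nested lists. The placeholder cell contents (index tuples) and the function name `unfold` are only for illustration; multi-way methods such as N-PLS work on the three-way array directly rather than on the unfolded table.

```python
N_SAMPLES, N_MONTHS, N_VARIABLES = 14, 12, 3

# Three-way array: cube[sample][month][variable].
# Each cell holds its own index tuple so the layout is easy to inspect.
cube = [
    [[(s, m, v) for v in range(N_VARIABLES)] for m in range(N_MONTHS)]
    for s in range(N_SAMPLES)
]


def unfold(cube):
    """Unfold to one row per sample, with a 12-month block per climate variable."""
    rows = []
    for sample in cube:
        row = []
        for v in range(len(sample[0])):
            row.extend(month[v] for month in sample)
        rows.append(row)
    return rows


matrix = unfold(cube)  # 14 rows, 36 columns
```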


SPLIT-HALF MODEL VALIDATION

The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to a solution very similar to that of the model calibrated from the complete dataset.

The PARAFAC model is thus a general and stable model for the scope of Scandinavia.

Example used here is the Trait data model (mode 1) from Endresen (2010).


SPLIT-HALF MODEL VALIDATION


PARAFAC climate data (3-w)

RESIDUALS (VALIDATE MODEL FIT)

Be aware of over-fitting! NB! Model validation!

The distances between the model (predictions) and the reference values (validation) are the residuals.

Example of a bad model calibration

Cross-validation indicates the appropriate model complexity.
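A small sketch of why the cross-validated error, and not the calibration error, should guide the choice of model complexity. A k-nearest-neighbour regression stands in here for the PLS-type models used in the studies (an assumption made to keep the example short), with the number of neighbours k playing the role of model complexity. The data are hypothetical.

```python
import math


def knn_predict(train, x_new, k):
    """Predict y as the mean of the k training points closest to x_new."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(pairs):
    """Root mean square error over (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))


def rmsec(data, k):
    """Calibration error: every sample is predicted by a model that has seen it."""
    return rmse([(y, knn_predict(data, x, k)) for x, y in data])


def rmsecv(data, k):
    """Leave-one-out cross-validation error: each sample is predicted without it."""
    pairs = []
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        pairs.append((y, knn_predict(rest, x, k)))
    return rmse(pairs)


# Hypothetical data: a linear trend plus fixed noise.
noise = [0.5, -0.8, 0.3, 0.9, -0.4, -0.7, 0.6, 0.2, -0.9, 0.4, -0.3, 0.8]
data = [(float(i), 2.0 * i + e) for i, e in enumerate(noise)]

# The most flexible model (k = 1) reproduces the calibration data exactly ...
overfit_calibration_error = rmsec(data, 1)
# ... but its cross-validated error shows that it does not generalize as well.
overfit_validation_error = rmsecv(data, 1)

# Choose the complexity with the lowest cross-validated error.
best_k = min(range(1, 7), key=lambda k: rmsecv(data, k))
```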

Calibration step

MODELING METHODS

DATA MODELING METHODS

• Parallel Factor Analysis (PARAFAC) (Multi-way)
• Multi-linear Partial Least Squares (N-PLS) (Multi-way)
• Soft Independent Modeling of Class Analogy (SIMCA)
• k-Nearest Neighbor (kNN)
• Partial Least Squares Discriminant Analysis (PLS-DA)
• Linear Discriminant Analysis (LDA)
• Principal component logistic regression (PCLR)
• Generalized Partial Least Squares (GPLS)
• Random Forests (RF)
• Neural Networks (NN)
• Support Vector Machines (SVM)

The methods above are the modeling methods used by Endresen (2010), Endresen et al. (2011, 2012), and Bari et al. (2012).

• Boosted Regression Trees (BRT)
• Multivariate Regression Trees (MRT)
• Bayesian Regression Trees

SIMCA ANALYSIS (PCA MODEL FOR EACH CLASS)

[Figure: samples plotted against principal components 1, 2 and 3, with a separate PCA model for each class (models with 1 PC, 2 PCs and 3 PCs). Illustration modified from Wise et al., 2006:201 (PLS Toolbox software manual).]

Example from the stem rust set: resistant samples, intermediate, and susceptible.
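The core idea of SIMCA, one PCA model per class and assignment of a new sample to the class whose model it fits best, can be sketched in a few lines. This is a simplified illustration under several assumptions: only one principal component per class, two classes instead of three, two hypothetical climate variables, and assignment by the smallest residual distance instead of the statistical limits used in full SIMCA software.

```python
import math


def fit_class_model(points, iterations=100):
    """One-component PCA model for one class: class mean and first loading vector."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration converges to the dominant eigenvector of the covariance matrix.
    vec = [1.0 / (j + 1) for j in range(dim)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return mean, vec


def residual_distance(model, point):
    """Distance from the point to the class model (the part PCA does not explain)."""
    mean, vec = model
    centered = [point[j] - mean[j] for j in range(len(point))]
    score = sum(c * v for c, v in zip(centered, vec))
    residual = [c - score * v for c, v in zip(centered, vec)]
    return math.sqrt(sum(r * r for r in residual))


def classify(models, point):
    """Assign the point to the class whose PCA model leaves the smallest residual."""
    return min(models, key=lambda label: residual_distance(models[label], point))


# Hypothetical climate values for the collecting sites of two classes.
resistant = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
susceptible = [(1.0, 5.0), (2.0, 4.1), (3.0, 2.9), (4.0, 2.0)]

models = {
    "resistant": fit_class_model(resistant),
    "susceptible": fit_class_model(susceptible),
}

predicted_class = classify(models, (5.0, 10.0))
```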


VALIDATION OF RESULTS

• Pearson product-moment correlation (R) (-1 to 1)
• Coefficient of determination (R2) (0 to 1)
• Cohen’s Kappa (K) (-1 to 1)
• Proportion observed agreement (PO) (0 to 1)
• Proportion positive agreement (PA) (0 to 1)
• Positive predictive value (PPV) (0 to 1)
• Positive diagnostic likelihood ratio (LR+) (from 0)
• Sensitivity and specificity
• Area under the curve (AUC)
– Receiver operating characteristics (ROC)

• Root mean square error (RMSE)
– RMSE of calibration (RMSEC)
– RMSE of cross-validation (RMSECV)
– RMSE of prediction (RMSEP)

• Predicted residual sum of squares (PRESS)

CONFUSION MATRIX

                                    Predicted
Observed (Actual)                   Resistant (positive)   Susceptible (negative)
Resistant (positive)                True positive (TP)     False negative (FN)
Susceptible (negative)              False positive (FP)    True negative (TN)

Proportion observed agreement (PO) = (TP + TN) / N
Proportion positive agreement (PA) = (2 * TP) / (2 * TP + FP + FN)

Positive predictive value (PPV) = TP / (TP + FP)

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Positive diagnostic likelihood ratio (LR+) = Sensitivity / (1 – Specificity) = (TP / [TP + FN]) / (FP / [FP + TN])

Odds ratio (OR) = (TP * TN) / (FN * FP)

Yule’s Q = (OR - 1) / (OR + 1)

Positive predictive gain (Gain) = PPV / prevalence
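The formulas above translate directly into code. In this sketch the counts are hypothetical, N is taken as the total number of samples, and prevalence is taken as the observed share of resistant samples, (TP + FN) / N, which is an assumption since the slide does not spell it out.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the confusion-matrix statistics listed above."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    prevalence = (tp + fn) / n  # assumed definition: share of observed positives
    odds_ratio = (tp * tn) / (fn * fp)
    return {
        "PO": (tp + tn) / n,
        "PA": (2 * tp) / (2 * tp + fp + fn),
        "PPV": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "LR+": sensitivity / (1 - specificity),
        "OR": odds_ratio,
        "Yule's Q": (odds_ratio - 1) / (odds_ratio + 1),
        "Gain": ppv / prevalence,
    }


# Hypothetical counts for 100 accessions, 40 of them resistant.
metrics = classification_metrics(tp=30, fn=10, fp=20, tn=40)
```

For these counts the selected subset holds 60 % resistant samples against 40 % in the full set, so the gain is 1.5.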

COEFFICIENT OF DETERMINATION (R2)

Predictions for the cross-validated (leave-one-out) samples with an N-PLS model.

Endresen (2010) for trait 5 (volumetric weight) observations from Priekuli, 2002 (mean of the replications) and 6 principal components.

Squared correlation coefficient: r2 = 0.96

Root mean square error of cross-validation: RMSECV = 2.86

[Figure: predicted trait scores plotted against actual trait scores; plot labels RMSEC and RMSECV.]
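Both summary statistics can be computed from paired actual and predicted trait scores. This is a minimal sketch with hypothetical values; when the predictions come from leave-one-out cross-validation, the RMSE computed this way corresponds to the RMSECV.

```python
import math


def pearson_r(observed, predicted):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sop = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    soo = sum((o - mean_o) ** 2 for o in observed)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    return sop / math.sqrt(soo * spp)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical volumetric weight scores: actual values and cross-validated predictions.
actual = [60.0, 62.0, 65.0, 68.0, 70.0]
cross_validated = [61.0, 61.0, 66.0, 67.0, 71.0]

r_squared = pearson_r(actual, cross_validated) ** 2
error = rmse(actual, cross_validated)
```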

COEFFICIENT OF DETERMINATION (R2)

Predictions for the cross-validated (leave-one-out) samples with an N-PLS model.

Endresen (2010) for trait 5 (volumetric weight) observations from Bjørke, 2002 (mean of the replications) and 4 principal components.

Squared correlation coefficient: r2 = 0.84

Root mean square error of cross-validation: RMSECV = 3.66

[Figure: predicted trait scores plotted against actual trait scores; plot labels RMSEC and RMSECV.]

SIGNIFICANCE LEVELS

• The critical levels (α) for p-value significance are often set to 0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %).

• The significance level is often marked with one star (*) for the 0.05 level, two stars (**) for the 0.01 level and three stars (***) for the 0.001 level.

– 5 % (if an experiment is repeated 20 times, a purely random effect is likely to be observed one time)

– 1 % (if an experiment is repeated 100 times, a purely random effect is likely to be observed one time)

– 0.1 % (if an experiment is repeated 1000 times, a purely random effect is likely to be observed one time)
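The star notation can be written as a small helper, together with the expected number of chance findings for a given number of repeated experiments. The function names are hypothetical and the thresholds follow the convention described above, with a p-value below the level counted as significant.

```python
def significance_stars(p_value):
    """Translate a p-value into the star notation used in the result tables."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "n.s."


def expected_chance_findings(alpha, repetitions):
    """Expected number of purely random 'significant' results."""
    return alpha * repetitions


labels = [significance_stars(p) for p in (0.0005, 0.004, 0.03, 0.2)]
chance_at_5_percent = expected_chance_findings(0.05, 20)
```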

PGR Secure (EU 7th Framework), Workshop FIGS approach, 9-13 Jan 2012, Madrid

Dag Endresen

Abdallah Bari (ICARDA)
