
Pre-breeding for sustainable plant production

January 30, 2012

NOVA PhD course, 22-27 January 2012 at Röstånga in Southern Sweden

Pre-breeding is an important element in broadening genetic diversity and introducing new and useful traits into food crops. Such traits are especially important for meeting the new challenges that agriculture will face from ongoing climate change. The genetic diversity needed is often found outside the gene pool of cultivars and elite breeding lines. Sources of novel genetic diversity, such as primitive crops and even the wild relatives of cultivated plants, are therefore expected to receive increased attention as agriculture faces these challenges.

The GBIF data portal provides information on in situ occurrences for many wild relatives of cultivated plants that are not (yet) collected and accessioned in ex situ seed genebank collections. The portal therefore provides a valuable bridge between the data sources for genebank accessions and occurrence data sources outside the genebank community. Occurrences from the GBIF data portal can help identify locations where potentially useful populations of crop wild relatives can be found. Ecological niche modeling is a widely used approach for predicting species distributions and can be used for this purpose.


Recent work on predictive modeling to identify a link between useful crop traits and eco-geographic data associated with the source locations for germplasm may have particular value for pre-breeding efforts. The Focused Identification of Germplasm Strategy (FIGS) provides an approach for efficient identification of germplasm material with new and useful genetic diversity for a target trait property. Such predictive modeling approaches are of particular interest in pre-breeding because of the high costs of working with this material. Cultivated plants are domesticated for properties and traits, such as non-shattering seed behavior and more uniform harvest time, that make agricultural experiments easier and less costly to conduct. Non-domesticated germplasm material, and also older cultivars and landraces, have many agro-botanical traits that were moderated in modern cultivars to better suit agricultural practices and efficiency. Pre-breeding is largely about removing such undesired traits from the non-cultivated and less intensively domesticated material while maintaining potentially useful traits.

NOVA PhD course home page (course code: 03-110404-412):

Plant genetic resources published to the GBIF data portal:

FIGS workshop in Madrid (PGR Secure) January 2012

January 14, 2012

Last week (9 to 13 January 2012) we organized a workshop in Madrid (Spain) on predictive characterization using the Focused Identification of Germplasm Strategy (FIGS) for wild relatives of the cultivated plants (crop wild relatives). This workshop was part of the EU-funded PGR Secure project [1] (EU 7th framework programme). The objective of the workshop was to use predictive computer modeling with R [2] for data mining (trait mining) to identify genebank accessions and populations of crop wild relatives with a higher density of genetic variation for a target trait property (the response, or dependent variable), using climate data and other environmental data layers as the explanatory (independent) multivariate variables. We have previously validated the FIGS approach for landraces of wheat and barley [3]. This study was one of the first attempts to validate the FIGS approach for other crops as well as for crop wild relatives (CWR). The crop landraces and crop wild relatives included in this study were: oats (Avena sp.), beet (Beta sp.), cabbage and mustard (Brassica sp.), and medick, including alfalfa or lucerne (Medicago sp.). We made good progress on the methodology, but also faced some major obstacles related to data availability.

The PGR Secure project team made a major effort to collect data for genebank accessions and other occurrence data for crop wild relatives from the GBIF portal [4], the EURISCO portal [5], the European Central Crop Databases (ECCDBs) [6], the USDA Germplasm Resources Information Network (GRIN) [7], the Canadian Germplasm Resources Information Network (CA-GRIN) [8], as well as other online sources. The trait characterization and evaluation data were mostly collected from the European Central Crop Databases and the USDA GRIN, but other online sources of trait data were also explored. In total, more than 33 000 occurrence records and genebank accessions were collected and georeferenced. Approximately 18 000 of these occurrences were considered to have georeferenced coordinates of acceptable quality. The availability of trait data was much more limited. The typical species had fewer than 10 trait data points, although at least some species had a few hundred. However, when the trait data points were matched to the occurrences and accessions with acceptable georeferenced coordinates, the number per species dropped dramatically, leaving fewer than 50 records per species even in the best cases.

The predictive computer models used for trait mining in this workshop were calibrated using the Random Forest algorithm. Random Forest can be used both for regression with continuous trait variables (the dependent response variable) and for classification with ordinal and categorical trait variables. Because of the low number of records in the final datasets, we were not able to calibrate any predictive models from these sets. We therefore focused our efforts on developing the method for trait mining with FIGS in R, using a dataset for stem rust on wheat made available by the USDA GRIN. This was the same dataset as explored by Endresen et al. (2011) and by Bari et al. (2012). In addition to the Random Forest algorithm, we also started to explore k-Nearest Neighbor (kNN), Boosted Regression Trees (BRT) and Parallel Factor Analysis (PARAFAC) [9].
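As a rough illustration of what calibrating a trait mining model means, here is a minimal k-Nearest Neighbor classifier in plain Python. This is only a sketch and not the code used at the workshop (which was written in R): the function name `knn_predict`, the two rescaled environment variables and all the values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_env, train_trait, query_env, k=3):
    """Predict a trait class from the k most similar collecting sites.

    train_env   -- environment vectors for accessions with known trait scores
    train_trait -- the known trait class for each of those accessions
    query_env   -- environment vector for an accession with unknown trait
    """
    # Euclidean distance in environment space; the variables are assumed
    # to be rescaled to a comparable range beforehand.
    neighbours = sorted(
        (math.dist(env, query_env), trait)
        for env, trait in zip(train_env, train_trait)
    )
    # Majority vote among the k nearest training accessions.
    votes = Counter(trait for _, trait in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (rescaled temperature, rescaled precipitation).
train_env = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
    (0.90, 0.80), (0.80, 0.90), (0.85, 0.75),
]
train_trait = ["susceptible"] * 3 + ["resistant"] * 3

print(knn_predict(train_env, train_trait, (0.82, 0.80)))
```

The point of the sketch is the data requirement: every training accession needs both a usable environment vector and a known trait score, which is exactly what was missing for the crop wild relatives datasets.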

This workshop focused on the predictive computer modeling approach of FIGS (trait mining). A previous workshop for the PGR Secure project was organized in Rome in the autumn of 2011 and focused on an alternative approach to conducting a FIGS study, based on collecting information on the environmental conditions most likely to support the adaptive development of the target trait property. It is important to note that the approach taken in this second FIGS workshop, calibrating a predictive computer model, requires trait data that are known a priori to be used as the training set. The alternative approach, based on collecting expert knowledge of the suitable environmental patterns, can be conducted without such a training set.

The workshop participants included: Dag Endresen (GBIF), Imke Thormann (Bioversity), Jacob van Etten (Bioversity), José María Iriondo (Universidad Rey Juan Carlos), Mauricio Parra Quijano (Universidad Politécnica de Madrid), María Luisa Rubio (Universidad Rey Juan Carlos), Shelagh Kell (University of Birmingham), Sónia Dias (Bioversity), Rosa García (Centro Recursos Fitogenéticos, INIA). (Abdallah Bari [ICARDA] was unfortunately not able to join the workshop because of visa problems for entry to the EU.)

[1] PGR Secure
[2] R
[3] Mendeley FIGS
[4] GBIF portal
[5] EURISCO portal
[6] ECCDB
[7] USDA GRIN
[8] Canadian GRIN

Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS)

December 24, 2011

Dag Terje Filip Endresen, Kenneth Street, Michael Mackay, Abdallah Bari, Ahmed Amri, Eddy De Pauw, Kumarse Nazari, and Amor Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science 52(2):764-773. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec. 2011.

This experiment validates the FIGS approach in a “blind” study where the person conducting the predictive computer modeling did not know the actual trait scores for the test set. We explored a new dataset with measurements of susceptibility to a new strain of stem rust (Puccinia graminis Pers. f.sp. tritici Eriks. & Henn.) typified to race TTKSK and known as Ug99. The screening experiment for Ug99 was made in Yemen in 2008. The total dataset included 4563 landraces of bread wheat (Triticum aestivum L. ssp. aestivum) and durum wheat (Triticum turgidum ssp. durum (Desf.) Husn.). A training set of 825 landraces was prepared for data modeling, while the true trait scores of the remaining 3738 landrace accessions remained unknown to the person calibrating the data models. The predictive performance using the FIGS approach was 2.3 times higher than a random sampling of accessions.
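One common way to express a gain over random sampling is the ratio between the hit rate in the model-selected subset and the hit rate in the full set. The sketch below illustrates that ratio only; the exact performance metric used in the paper may differ, and the function names and the toy numbers are hypothetical, not results from the study.

```python
def hit_rate(trait_flags):
    """Fraction of accessions that show the target trait."""
    if not trait_flags:
        raise ValueError("need at least one accession")
    return sum(trait_flags) / len(trait_flags)


def gain_over_random(selected_flags, all_flags):
    """Hit rate in the selected subset relative to the full set.

    A value of 1.0 means the selection does no better than drawing
    accessions at random from the full set.
    """
    baseline = hit_rate(all_flags)
    if baseline == 0:
        raise ValueError("no accession in the full set shows the trait")
    return hit_rate(selected_flags) / baseline


# Hypothetical toy numbers: 10 of 100 accessions carry the trait overall,
# and 5 of the 20 accessions picked by the model carry it.
all_flags = [True] * 10 + [False] * 90
selected_flags = [True] * 5 + [False] * 15

print(round(gain_over_random(selected_flags, all_flags), 2))
```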

Focused Identification of Germplasm Strategy (FIGS) was proposed in the 1990s by Michael Mackay as an approach to identify useful traits in crops and the relatives of the cultivated plants. FIGS is based on finding a link between the eco-climatic attributes for the collecting sites and source locations for germplasm resources and their useful trait properties.


Focused identification of germplasm strategy (FIGS) detects wheat stem rust resistance linked to environmental variables

December 17, 2011

Abdallah Bari, Kenneth Street, Michael Mackay, Dag Terje Filip Endresen, Eddy De Pauw and Ahmed Amri (2012). Focused identification of germplasm strategy (FIGS) detects wheat stem rust resistance linked to environmental variables. Genetic Resources and Crop Evolution (Published Online 3 December 2011), pp. 1-17. doi:10.1007/s10722-011-9775-5

This new FIGS study follows the same principles and uses the same stem rust trait dataset from USDA GRIN as our previous study published in Crop Science (Endresen et al., 2011; doi: 10.2135/cropsci2010.12.0717). The predictive computer models are, however, calibrated using other methods such as Artificial Neural Networks (ANN), Support Vector Machines (SVM), Random Forests (RF), Principal Component Logistic Regression (PCLR) and Generalized Partial Least Squares (GPLS). My colleague Abdallah Bari from ICARDA, based in Aleppo, Syria, conducted the data analysis. The non-linear methods ANN and SVM in particular seem suitable for this dataset. The results from this new study support the results from our previous study, where I conducted the data analysis.


Predictive Association between Biotic Stress Traits and Eco-Geographic Data for Wheat and Barley Landraces

August 11, 2011
Crop Science Volume 51, Issue 5 (September/October 2011), pages 2036-2055

Dag Terje Filip Endresen, Kenneth Street, Michael Mackay, Abdallah Bari, and Eddy De Pauw (2011). Predictive Association between Biotic Stress Traits and Eco-Geographic Data for Wheat and Barley Landraces. Crop Science 51 (5): 2036-2055. doi: 10.2135/cropsci2010.12.0717

This study validates the FIGS approach, which is designed to identify genebank accessions with a higher likelihood of having a useful trait of economic value for plant breeding or crop research. With this study we demonstrate how the FIGS approach can be used to more than double the likelihood of finding a target trait property compared to a random sampling of accessions. The Soft Independent Modeling of Class Analogy (SIMCA) and the k-Nearest Neighbor (kNN) data analysis methods proved superior to Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA). For this study we used a stem rust dataset for wheat landraces and a net blotch dataset for barley landraces, both kindly provided by the USDA GRIN.

DOIs for genebank collections

April 16, 2011

Digital object identifiers (DOI)

for the plant genetic resources community

To advance collaborative regional or global efforts to document genebank collections, a persistent identifier like the Life Science Identifier (LSID) or the digital object identifier (DOI) is required. This text will focus on the utility of DOIs for this purpose. The following text expresses my personal opinions, but also constitutes my contribution to recent discussions with colleagues from the Dutch genebank, Bioversity, the Nordic genebank and the Russian genebank.

Digital Object Identifier

The DOI is a name, not a location. The DOI is persistent and actionable. The DOI identifies a digital object. Thus, the DOI does not really identify the printed book, the printed journal manuscript, or a genebank seed sample. It could be seen as a name or identifier for the digital metadata objects that uniquely identify and describe the genebank accession. For genebanks, the metadata that uniquely identify the accession are often called passport data. The passport data would thus here be seen as the “data”. A digital book or manuscript can be identified directly, but even here it is often convenient to resolve by default to the metadata about the digital book or manuscript. DOI is a cross-sector, not-for-profit effort (ISO TC46/SC9). DOI was founded in 1998. Overview document provided from DOI:

To obtain a domain prefix (for example GENESYS, EURISCO or NordGen), it needs to be registered with an authorized DOI Registration Agency. DataCite could be the most appropriate DOI Registration Agency for the genebank community. I can’t find a direct quote on the cost of obtaining a domain prefix with DataCite, but I believe that it would be around USD $1000. The DOI Foundation, as the top-level body, does not specify how much a Registration Agency should charge for the prefix.

“DataCite supports data centers by providing workflows and standards for data publication. For more information on how you can register your data, contact:”


One useful example of syntax could perhaps be:
doi:10.genesys/nld37/2445 for accession number 2445 at WUR CGN
Following the same pattern, NordGen would get:
doi:10.genesys/swe54/NGB1212 for accession number NGB1212

doi: The scheme label only (not part of the DOI itself)
10. This is the DOI prefix in the Handle System (always 10 for DOIs)
genesys This is the domain prefix that costs around USD $1000
/nld37 The first part of the “local” identifier could, for example, be the WIEWS code
/2445 The next part could be the genebank catalog number
/batch1 The genebank could, for example, build further on the DOI…
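The proposed syntax can be sketched as a pair of small Python helpers that build and split such an identifier. This only follows the illustrative pattern above and is not an existing library; the names `AccessionDoi`, `build_doi` and `parse_doi` are hypothetical. Note also that DOI prefixes actually issued today are numeric (like 10.13140), so the named prefix `genesys` is an assumption carried over from the example.

```python
from typing import NamedTuple, Optional


class AccessionDoi(NamedTuple):
    prefix: str                    # domain prefix, e.g. "genesys"
    institute: str                 # WIEWS institute code, e.g. "nld37"
    accession: str                 # genebank catalog (accession) number
    suffix: Optional[str] = None   # optional extension, e.g. "batch1"


def build_doi(prefix, institute, accession, suffix=None):
    """Compose an identifier following the proposed pattern."""
    parts = [institute.lower(), accession]
    if suffix:
        parts.append(suffix)
    return "doi:10." + prefix + "/" + "/".join(parts)


def parse_doi(text):
    """Split an identifier of the proposed pattern into its parts.

    Assumes the accession number itself contains no "/" character.
    """
    lead = "doi:10."
    if not text.startswith(lead):
        raise ValueError("not a doi:10. identifier: " + text)
    pieces = text[len(lead):].split("/")
    if len(pieces) not in (3, 4):
        raise ValueError("expected prefix/institute/accession[/suffix]")
    suffix = pieces[3] if len(pieces) == 4 else None
    return AccessionDoi(pieces[0], pieces[1], pieces[2], suffix)


print(build_doi("genesys", "NLD37", "2445"))
print(parse_doi("doi:10.genesys/swe54/NGB1212"))
```

Splitting on the slash only works if accession numbers never contain a slash themselves, which is one more reason to think carefully about the separator.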

Issues to discuss:

* Buying one DOI domain prefix for genesys will save money for each genebank institute. I think that everybody will be very happy if, for example, Bioversity offers to do this! The alternative is for each genebank (or group of genebanks) to buy one DOI prefix each. This would be the preferred option from a long-term persistence perspective, as one can foresee different genebank institutes remaining operational for different periods of time into the future, possibly with the need to resolve the DOIs for longer than Bioversity will exist.

* Another alternative, which I think was proposed by the DOI Foundation at one of the TDWG persistent identifier workshops that I attended some years back, is that DOI would not charge a one-time fee for the prefix domain, but instead a fee of around 5 cents (USD $0.05) for each DOI. Even if this option is available, I think that USD $1000 is still a manageable amount to pay, and owning the prefix will give much more flexibility to design the syntax of the DOI (to include the WIEWS codes and accession numbers).

* Domain prefix: genesys, GENESYS, GeneSys, …? (EURISCO, SINGER, GRIN, NordGen, NORDGEN,…?). Even if we start using DOIs with genesys, other institutes could of course buy a new prefix and use this – if they would like to do so.

* Remember that the same accession represented in GeneSys, EURISCO and NordGen should be assigned different DOIs even if all these DOIs would be issued from the same DOI domain. These are different data objects and relations between them may very well need to be stated.

* The first part of the local identifier could be: nld37, nld037, nld0037, NLD037, …? The FAO WIEWS codes were initially the three-letter country acronym followed by a three-digit number. However, at least for the USA, FAO has almost reached 1000 institutes (see USA998) and may need to issue WIEWS institute codes with a four-digit number. If we start to pad the previous WIEWS codes to make four-digit numbers, who would guarantee that we will not eventually get more than 10 000 institutes in one country (new codes when institutes merge or split, etc.)? Perhaps the FAO WIEWS could drop the padding zeros altogether, and make the present codes with padding zeros synonyms or aliases of the codes without any padding zeros…? NLD037 = NLD37. The FAO WIEWS codes could perhaps also be made case-insensitive…? We would in any case need to decide on lower-case or upper-case for the three-letter country acronym in the DOI. Perhaps all upper-case is what people are most used to seeing…? I think that all lower-case “looks” better, but that is secondary of course.

* The next part of the local identifier could be the catalog number. We are used to working with accession numbers.

* The experts on persistent identifiers would argue against using any string that has (or could have to someone) a semantic meaning. If the semantic meaning changes, people might be tempted to change the string… As an example, when the Nordic Gene Bank (NGB) became NordGen, some were asking if the accession numbers should be updated from, for example, NGB1212 to NordGen1212…!! Well, I said absolutely not! And they have at least so far remained with the NGB prefix. Similarly, the DOIs could be vulnerable when the WIEWS codes and the accession numbers are used. However, if things change so much that the WIEWS code or the accession number changes, I think there are other, larger issues to look at. And issuing some new DOIs for the accessions would not be any problem anyway (even if best avoided, of course).

* The last part of the local identifier could be a version number. Perhaps the last part would be a number to indicate the regeneration cycle of the accession, or perhaps there are other internal genebank management operations that are useful to identify…? We could thus, for example, suggest separating off such a suffix at the end of the DOI. And there could be different DOIs with and without the suffix, to indicate the accession in a specific context and in its more general identity.

* Another issue to consider would be the separator to use inside the DOI. In my example above, I used the slash character: “/”. If we sometimes use the DOIs as part of URLs then this would make for a nice division into folders and sub-folders… If this is undesired, the simple dot character “.” or similar can be chosen.
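The idea in the list above of treating padded and unpadded WIEWS codes as aliases (NLD037 = NLD37), and of making them case-insensitive, could be sketched like this. It is only an illustration of the suggestion, not an FAO rule; the function name `normalize_wiews` and the default choice of lower-case are assumptions.

```python
import re

# Three letters for the country, followed by the institute number.
_WIEWS_PATTERN = re.compile(r"^([A-Za-z]{3})(\d+)$")


def normalize_wiews(code, upper=False):
    """Return a WIEWS institute code without padding zeros.

    Padded and unpadded spellings, in any letter case, map to the
    same normalized string.
    """
    match = _WIEWS_PATTERN.match(code.strip())
    if match is None:
        raise ValueError("not a WIEWS-style code: " + code)
    country, digits = match.groups()
    country = country.upper() if upper else country.lower()
    # int() drops any leading zeros from the institute number.
    return country + str(int(digits))


for spelling in ["nld37", "nld037", "nld0037", "NLD037"]:
    print(spelling, "->", normalize_wiews(spelling))
```

With a rule like this in place, the question of how many padding zeros to use disappears from the DOI, and only the choice between lower-case and upper-case remains.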

Michael Hausenblas of the Linked Data Research Centre in Ireland has proposed to assist the genebank community with such solutions as described here for the Svalbard Seed Vault. An efficient mechanism to share data between GENESYS and the Svalbard Seed Vault would be very useful – as only one example of use for such persistent identifiers.

Please feel free to join the discussion at the Agricultural Biodiversity Weblog

Trait Mining Toolbox for MATLAB

April 11, 2011
Trait mining with Focused Identification of Germplasm Strategy (FIGS)

Eco-climatic layers for trait mining (GIS source: De Pauw, 2008)

New code repository for the Trait Mining Toolbox in MATLAB, including the scripts I developed for the PhD thesis research (Endresen, 2011, doi:10.13140/2.1.1829.9846). The source code is made available using Google Code and can be downloaded as a ZIP archive or by using Subversion. There are plans to update the toolbox for R. Any help with the scripts is most welcome.

PhD thesis available from: doi:10.13140/2.1.1829.9846 (ResearchGate, PDF 37 MB)
