FIGS workshop in Madrid (PGR Secure) January 2012
Last week (9 to 13 January 2012) we organized a workshop in Madrid, Spain, on predictive characterization of crop wild relatives (the wild relatives of cultivated plants) using the Focused Identification of Germplasm Strategy (FIGS). The workshop was part of the EU-funded PGR Secure project (EU 7th Framework Programme). Its objective was to use predictive computer modeling with R for data mining (trait mining): identifying genebank accessions and populations of crop wild relatives with a higher density of genetic variation for a target trait property (the response, or dependent, variable), using climate data and other environmental data layers as the explanatory (independent) multivariate variables. We have previously validated the FIGS approach for landraces of wheat and barley. This study was one of the first attempts to validate the FIGS approach for other crops as well as for crop wild relatives (CWR). The crop landraces and crop wild relatives included in this study were oats (Avena sp.), beet (Beta sp.), cabbage and mustard (Brassica sp.), and medick, including alfalfa or lucerne (Medicago sp.). We made good progress on the methodology, but also faced some major obstacles related to data availability.
The PGR Secure project team made a major effort to collect data on genebank accessions and other occurrence records for crop wild relatives from the GBIF portal, the EURISCO portal, the European Central Crop Databases (ECCDBs), the USDA Germplasm Resources Information Network (GRIN), the Canadian Germplasm Resources Information Network (CA-GRIN) and other online sources. The trait characterization and evaluation data were mostly collected from the European Central Crop Databases and USDA GRIN, although other online sources of trait data were also explored. In total, more than 33 000 occurrence records and genebank accessions were collected and georeferenced. Approximately 18 000 of these were considered to have georeferenced coordinates of acceptable quality. The availability of trait data was much more limited. The typical number of trait data points per species was below 10, although some species had a few hundred. However, when the trait data points were matched to germplasm occurrences and accessions with acceptable georeferenced coordinates, the number per species dropped dramatically, leaving fewer than 50 records per species even in the best cases.
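The matching step that caused the record counts to collapse can be sketched roughly as follows. This is a minimal Python illustration (the workshop itself used R), and it makes several assumptions: the field names, the sample records and the simple range check standing in for "acceptable coordinate quality" are all hypothetical, and the real quality assessment was certainly more involved.

```python
# Illustrative sketch only: field names and records are hypothetical.
from collections import defaultdict


def has_acceptable_coordinates(record):
    """Stand-in quality check: both coordinates present and within valid ranges."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def match_traits(occurrences, traits):
    """Count, per species, the occurrences that have usable coordinates
    and at least one trait observation for the same accession."""
    trait_ids = {t["accession_id"] for t in traits}
    counts = defaultdict(int)
    for occ in occurrences:
        if has_acceptable_coordinates(occ) and occ["accession_id"] in trait_ids:
            counts[occ["species"]] += 1
    return dict(counts)


occurrences = [
    {"accession_id": "A1", "species": "Avena sp.", "lat": 40.4, "lon": -3.7},
    {"accession_id": "A2", "species": "Avena sp.", "lat": None, "lon": None},
    {"accession_id": "B1", "species": "Beta sp.", "lat": 52.5, "lon": 13.4},
    {"accession_id": "B2", "species": "Beta sp.", "lat": 95.0, "lon": 13.4},
]
traits = [
    {"accession_id": "A1", "trait": "hypothetical_score", "value": 3},
    {"accession_id": "A2", "trait": "hypothetical_score", "value": 5},
    {"accession_id": "B2", "trait": "hypothetical_score", "value": 1},
]

usable = match_traits(occurrences, traits)
print(usable)
```

Of the four hypothetical occurrences, only one survives both conditions, which mirrors on a tiny scale how a large collection can shrink to a handful of usable training records.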
The predictive computer models used for trait mining in this workshop were calibrated using the Random Forest algorithm. Random Forest can be used both for regression with continuous trait variables (the dependent, or response, variable) and for classification with ordinal and categorical trait variables. Because of the low number of records in the final datasets, we were not able to calibrate any predictive models from them. We therefore focused our efforts on developing the method for trait mining with FIGS in R, using a dataset for stem rust on wheat made available by USDA GRIN. This is the same dataset as explored by Endresen et al. (2011) and by Bari et al. (2012). In addition to the Random Forest algorithm, we also started to explore k-Nearest Neighbor (kNN), Boosted Regression Trees (BRT) and Parallel Factor Analysis (PARAFAC).
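To make the distinction between regression and classification concrete, here is a small sketch. It does not use Random Forest, which needs a third-party library; instead it uses kNN, one of the other algorithms mentioned above, because it fits in a few lines of plain Python. The climate variables, trait scores and class labels are invented for illustration, and the predictors are assumed to be already scaled to a common range.

```python
# Illustrative kNN sketch (not Random Forest); all data below are hypothetical.
import math
from collections import Counter


def nearest(train_x, query, k):
    """Indices of the k training sites closest to the query (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return order[:k]


def knn_regress(train_x, train_y, query, k=3):
    """Continuous trait: predict the mean trait value of the k nearest sites."""
    idx = nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


def knn_classify(train_x, train_y, query, k=3):
    """Categorical trait: predict the most common class among the k nearest sites."""
    idx = nearest(train_x, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]


# Each row: (scaled temperature, scaled precipitation) at a collecting site.
climate = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.95)]
scores = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]             # continuous trait score
classes = ["resistant"] * 3 + ["susceptible"] * 3   # categorical trait

new_site = (0.12, 0.18)
predicted_score = knn_regress(climate, scores, new_site)
predicted_class = knn_classify(climate, classes, new_site)
print(predicted_score, predicted_class)
```

The same environmental predictors feed both functions; only the type of the trait variable decides whether the prediction is an average or a vote. With as few usable records as described above, either kind of model has very little to learn from.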
This workshop focused on the predictive computer modeling approach to FIGS (trait mining). A previous workshop for the PGR Secure project, organized in Rome in the autumn of 2011, focused on an alternative approach to conducting a FIGS study, based on collecting information on the environmental conditions most likely to support the adaptive development of the target trait property. It is important to note that the approach taken by this second FIGS workshop, calibrating a predictive computer model, requires trait data known a priori to serve as the training set, whereas the alternative approach, based on expert knowledge of the suitable environmental patterns, can be conducted without such a training set.
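The alternative, knowledge-based approach can be pictured as a simple filter over the environmental data, with no training set involved. The sketch below is only an illustration in Python: the variable names, the ranges and the sites are invented placeholders, not actual expert criteria from the Rome workshop.

```python
# Illustrative sketch only: criteria and site values are hypothetical placeholders.

def select_by_environment(sites, criteria):
    """Return accession ids of sites where every listed variable
    falls inside its (low, high) range."""
    selected = []
    for site in sites:
        if all(low <= site[var] <= high for var, (low, high) in criteria.items()):
            selected.append(site["accession_id"])
    return selected


# Ranges an expert might propose as favouring the target trait (made up here).
criteria = {"annual_precip_mm": (600, 1200), "mean_temp_c": (10, 20)}

sites = [
    {"accession_id": "W1", "annual_precip_mm": 800, "mean_temp_c": 15},
    {"accession_id": "W2", "annual_precip_mm": 300, "mean_temp_c": 18},
    {"accession_id": "W3", "annual_precip_mm": 1000, "mean_temp_c": 25},
]

selected = select_by_environment(sites, criteria)
print(selected)
```

No trait observations appear anywhere in this selection, which is why this route remains open when trait data are as scarce as they were for the crop wild relatives in this study.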
The workshop participants included: Dag Endresen (GBIF), Imke Thormann (Bioversity), Jacob van Etten (Bioversity), José María Iriondo (Universidad Rey Juan Carlos), Mauricio Parra Quijano (Universidad Politécnica de Madrid), María Luisa Rubio (Universidad Rey Juan Carlos), Shelagh Kell (University of Birmingham), Sónia Dias (Bioversity), Rosa García (Centro Recursos Fitogenéticos, INIA). Abdallah Bari (ICARDA) was unfortunately unable to join the workshop because of problems with the entry visa to the EU.
 PGR Secure, http://www.pgrsecure.org/
 R, http://www.r-project.org/
 Mendeley FIGS, http://www.mendeley.com/groups/502321/trait-mining-figs/
 GBIF portal, http://data.gbif.org
 EURISCO portal, http://eurisco.ecpgr.org/
 ECCDB, http://www.ecpgr.cgiar.org/germplasm_databases.html
 USDA GRIN, http://www.ars-grin.gov/
 Canadian GRIN, http://pgrc3.agr.gc.ca/index_e.html
 PARAFAC, http://www.models.life.ku.dk/~rasmus/presentations/parafac_tutorial/paraf.htm