Are you interested in natural history? Help us to capture label information from images of specimens from the Norwegian natural history collections in Oslo.
New crowdsourcing portal for natural history collections! Help us to record information on museum specimens from the collections of the Natural History Museum at the University of Oslo! The new transcription portal, developed by GBIF-Norway, is launched today at the 200-year jubilee party for NHM-UiO botanical garden. The presentation of the portal will take place at 17:30 in the auditorium of Lids hus (Botanical museum) at the Tøyen campus and Botanical gardens. Christian Svindseth (GBIF-Norway, NHM-UiO) has developed the computer code for the new portal. Visit the portal at: http://gbif.no/transcribe (Figure 1).
Digitization of natural history collections
The collections at the Natural History Museum in Oslo include an estimated total of more than 6 million specimens (Mehlum et al., 2011). The collections in Oslo are estimated to hold more than 65% of the specimens held by natural history museums in Norway. Digitization of the Norwegian natural history collections has high priority, and more than 50% of the specimens have been recorded and added into an electronic database system. This is a high proportion digitized when compared to other large natural history collections worldwide, but the estimated effort to complete the appropriate registration of all remaining specimens is daunting. In 2013, the Natural History Museum in Oslo started a large-scale digitization activity in which specimens are photographed and only the bare minimum of information, the scientific name and the country where the specimen was collected, is registered.
Primary biodiversity information
Large-scale imaging of the specimens in the Norwegian natural history collections is prioritized and has started. However, only a bare minimum of the label information, such as the scientific name (sometimes only the genus) and the collecting country, will be captured in this project. Capturing additional information such as the collecting location (where), the collecting date (when) and the verified current scientific name (what) will substantially increase the scientific value of these data records. The where, when and what together define the so-called primary biodiversity information and are recognized as the minimum information required for most scientific research. Species distribution modelling, one of the important research tools for understanding the ecology of species, depends on available primary biodiversity information (where, when and what).
Why participate and contribute to citizen science transcription
* Discovery of biodiversity information: Transcription of label information and electronic registration into online databases greatly improves the discoverability of museum specimens for scientific research and other public use.
* Education: Students from high school to graduate and post-graduate level can engage with the photographs of the museum specimens and take part in a first-class learning experience interacting with this resource of primary biodiversity information.
* Scientific research: Scientists who study natural history need ready access to primary biodiversity information made available from museums and their online databases. Using the transcription portal, they can take a direct part in making the primary biodiversity information they need for their own research available, by transcribing the labels for the species groups and/or countries that they study.
* Public good, open and free online biodiversity information: The information that we gather from the transcription portal will flow into the museum specimen database and be published to open and free data portals such as the Global Biodiversity Information Facility (GBIF), the Norwegian Species Map Service (Artskart) and the Encyclopedia of Life (EOL). This valuable information for documenting historic biodiversity patterns is thus not only preserved for future generations, but also made available for ongoing research using up-to-date and modern web technologies.
Lichen herbarium, Hildur Krog collection from eastern Africa
The first specimen collection loaded to the new citizen science transcription portal is the lichens collected by the Norwegian biologist Hildur Krog and others in East Africa. This collection is part of the lichen herbarium and includes more than 2 500 specimens (Figure 2). Professor Hildur Krog was originally introduced to lichenology as a student of professor Eilif Dahl (1916-1993). Eilif and Hildur pioneered the work on chemical methods for the identification of lichen species. Hildur was appointed curator of the lichen herbarium at the Botanical Museum of the University of Oslo in 1971. Between 1972 and 1996, Hildur Krog and T.D.V. Swinscow systematically explored the lichen genera of East Africa for the development of the flora “Macrolichens of East Africa”. With this citizen science portal, we are asking for volunteers to assist us with transcribing the label information from the herbarium specimens collected during these expeditions to East Africa. The imaging of this collection was carried out in late 2013 and early 2014 by Silje Larsen Rekdal and Even Stensrud, under the coordination of lichen curator Einar Timdal and Siri Rui (NHM-UiO), and with funding from GBIF-Norway (Figure 3).
Mycological herbarium at NHM Oslo
The Mycological herbarium includes approximately 300 000 specimens, of which approximately two thirds are electronically registered in the database with the label information captured. NHM-Oslo has started a large-scale activity to photograph the specimens of the collections, under the coordination of Dr. Eirik Rindal. The Mycological herbarium is one of the first collections to be photographed. During one week in September 2013, the staff at the museum digitized around 6000 specimens from the Mycological herbarium (Figure 4). We plan to explore the new citizen science transcription portal as a tool to capture the label information for the remaining specimens not yet appropriately registered in the database. After gaining initial experience with transcription of the lichen collection, we plan to also load the first approximately 40 000 specimen images from the Mycological herbarium, and later add more collections and sets of specimen images incrementally, following the progress of the digitization activity.
When is a specimen transcription complete?
Each specimen image is transcribed by at least three volunteers, and the recorded information from each volunteer is compared. If all three transcriptions provide the same information, the specimen transcription is flagged as complete. If the three transcriptions differ, the specimen image is flagged as incomplete and presented for review by new volunteers until there is at least 50% agreement (for each information input box). Collection curators and museum staff will review the results as they come in, before the information is included in the collection database and published to the Norwegian Artskart portal and the global GBIF portal.
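The agreement rule above can be sketched as a simple per-field majority vote. This is a minimal illustration only, not the portal's actual implementation; the field names and the exact threshold handling are assumptions.

```python
from collections import Counter

def field_consensus(values, threshold=0.5):
    """Return the agreed value for one input box, or None when no value
    reaches the agreement threshold among the submitted transcriptions."""
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= threshold else None

def review_specimen(transcriptions):
    """Compare volunteer transcriptions (one dict per volunteer) field by
    field; the specimen is complete when every field reaches consensus."""
    fields = transcriptions[0].keys()
    consensus = {f: field_consensus([t[f] for t in transcriptions]) for f in fields}
    status = "complete" if all(v is not None for v in consensus.values()) else "incomplete"
    return status, consensus

# Three identical transcriptions agree on every field:
print(review_specimen([{"country": "Kenya", "year": "1974"}] * 3)[0])  # prints complete
```

In this sketch a field where two of three volunteers agree also reaches consensus, since two thirds is above the 50% threshold; three entirely different transcriptions leave the specimen incomplete.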
Label information should be transcribed verbatim
We ask our volunteer citizen scientists to transcribe the specimen label information verbatim, as close to the information printed or written on the specimen label as possible. We recommend that citizen scientists do not make their own interpretations or corrections. We do recognize that this recommendation could be a lost opportunity to collect citizen science curation and correction of the specimen database. We are working on a solution that will let citizen scientists provide such interpretations and inferred additional information from the same interface. Such a specimen annotation service could allow citizen scientists to contribute a wide variety of inferred information about the museum specimens, including e.g. georeferencing with geographic coordinates, or links to other systems such as sequence data deposited in GenBank or BoL, or traits in EOL.
Challenge: How do we approach citizen scientist interpretations of label text? How should we add annotations when volunteers find or infer more information from sources other than the specimen label?
Collaborator: Notes from Nature
The Notes from Nature portal provided the primary source of inspiration for the new crowdsourcing portal at NHM-Oslo. We are grateful for very valuable feedback and assistance from the Notes from Nature team, in particular director Michael Denslow at the National Ecological Observatory Network (NEON) and professor Robert Guralnick at the Museum of Natural History, University of Colorado at Boulder. Notes from Nature provides a citizen science platform to capture label information from photographs of specimens from natural history collections (Hill et al., 2012; Franzoni and Sauermann, 2013). The Notes from Nature software platform was developed and is maintained by Zooniverse and Vizzuality in collaboration with university museums and networks in Florida (SERNEC), California (CalBug) and Colorado (UCMNH), and the bird collection of the Natural History Museum in London (NHMUK). Notes from Nature is open source software, with the source code freely available on GitHub.
Links to some similar citizen science transcription portals
Many natural history collections are now establishing similar transcription portals. One of the first of these crowdsourcing portals was Herbaria@home from the Botanical Society of Britain & Ireland, launched around 2006. The Atlas of Living Australia (ALA) Volunteer Portal provides a crowdsourcing platform for transcription of Australian collections of natural history specimens. The National Museum of Natural History (MNHN) in Paris provides a transcription portal for the collections in France. The Smithsonian National Museum of Natural History (NMNH) provides a Transcription Center with another excellent crowdsourcing portal.
Franzoni C, and Sauermann H (2013). Crowd science: The organization of scientific research in open collaborative projects, Research Policy, Available online 14 August 2013, ISSN 0048-7333, doi:10.1016/j.respol.2013.07.005.
Hill A, Guralnick R, Smith A, Sallans A, Gillespie R, Denslow M, Gross J, Murrell Z, Conyers T, Oboyski P, Ball J, Thomer A, Prys-Jones R, de la Torre J, Kociolek P, and Fortson L (2012). The notes from nature tool for unlocking biodiversity records from museum records through citizen science. ZooKeys 209: 219-233. doi:10.3897/zookeys.209.3472
Mehlum F, Lønnve J, and Rindal E (2011). Samlingsforvaltning ved NHM – strategier og planer. Versjon 30. juni 2011. Naturhistorisk museum, Universitetet i Oslo. Rapport nr. 18, pp. 1-89. ISBN: 978-82-7970-030-2. Available at http://www.nhm.uio.no/forskning/publikasjoner/rapporter/NHM-rapport-18-samlingsplan.pdf, accessed 28 May 2014.
Pensoft Publishers (2012). No specimen left behind: Mass digitization of natural history collections [special issue]. Editors: Blagoderov, V. and Smith, V. ZooKeys 209: 1-267. ISBN: 9789546426451. Available at http://www.pensoft.net/journals/zookeys/issue/209/
When calibrating the prediction model (species distribution model) in Maxent, both types of input spatial data, the samples/localities (the dependent, or response, variable) and the environment layers (the independent, explanatory, or predictor variables), must be expressed in the same spatial reference system (SRS). For spatial data with national coverage in Norway, it is common to use the Universal Transverse Mercator (UTM) projection of zone 33N with the WGS84 datum (EPSG:32633). The environmental layers provided for the BIO4115/BIO9115 master/PhD course at the Natural History Museum of the University of Oslo are provided in the UTM33N format.
The GBIF portal provides species occurrence locality coordinates as standard WGS84 decimal degrees (EPSG:4326). The GBIF portal is a global portal and does not provide the respective national coordinate systems. Few countries other than Norway use the UTM33N SRS. If you want to use species occurrence data downloaded from the GBIF portal together with environmental raster layers in UTM33N (or another SRS), you will need to convert either the occurrence coordinates or the raster layers to a common SRS.
GDAL/OGR library software
The Geospatial Data Abstraction Library (GDAL/OGR, http://www.gdal.org/) is a translator library for geospatial raster and vector data formats. The GDAL/OGR library is available for Windows, Mac OS X, Linux and other UNIX-like operating systems. On a Mac OS X system you may install the frameworks from KyngChaos (http://www.kyngchaos.com/software/frameworks#gdal_complete). On a Windows system you may want to install FWTools, which includes GDAL/OGR (http://fwtools.maptools.org/).
Convert GBIF species occurrence data to UTM33N using GDAL
The GDAL/OGR library can be used from the command line or from a GIS. I will describe using GDAL from the command line. On a Windows system with FWTools installed, open the “FWTools shell” in a DOS window (Figure 1). On UNIX-like systems (such as Mac OS X or Linux), open a standard terminal window (Figure 2) (Applications >> Utilities >> Terminal).
Copy your coordinate tuples to a plain text file, with each coordinate tuple on a new line and the coordinates separated by a space. GDAL expects the coordinate tuples to be ordered as longitude, latitude (x, y) (easting, northing). So remember to put the x-coordinate (longitude or easting) in the first column, and the y-coordinate (latitude or northing) in the second column. Then use gdaltransform to convert your coordinate tuples.
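GBIF downloads typically list latitude before longitude, so the two columns often need to be swapped before the file is fed to gdaltransform. A minimal Python sketch of preparing the input file (the example coordinate values are made up):

```python
# (latitude, longitude) pairs as they might arrive from a GBIF download;
# the values below are illustrative examples only.
records = [
    (59.91, 10.75),
    (63.43, 10.40),
]

# gdaltransform expects "longitude latitude" (x y), one tuple per line:
with open("WGS84_DD.txt", "w") as f:
    for lat, lon in records:
        f.write(f"{lon} {lat}\n")
```

The resulting WGS84_DD.txt can then be passed to gdaltransform as shown in the commands in this section.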
Convert the coordinate tuple from decimal degrees to UTM33N
WGS84_DD.txt (longitude latitude coordinates in a text file):
$ gdaltransform -s_srs EPSG:4326 -t_srs EPSG:32633 < "WGS84_DD.txt"
173154.636743861 6539674.3145872 0
251611.627711431 6654946.66561605 0
This GDAL output can be interpreted as:
UTM 33 173154mE 6539674mN
UTM 33 251611mE 6654946mN
To save the gdaltransform output directly to a new text file, you may redirect standard output using a ">" in the command:
$ gdaltransform -s_srs EPSG:4326 -t_srs EPSG:32633 < "WGS84_DD.txt" > "UTM33N.txt"
Retrieve the UTM coordinates from the text file named UTM33N.txt (easting northing).
New line as CR and/or LF
Notice that gdaltransform originates from UNIX-like systems and expects UNIX-style line endings, LF (“line feed”, “\n”, 0x0A). Older Mac-formatted text files use CR (“carriage return”, “\r”, 0x0D) to separate lines, while Windows systems use both CR + LF (“\r\n”, 0x0D0A). If you have your coordinates in a Mac-formatted text file, you will need to replace the CR line endings with LF. Here is an example of how to do this using perl (from the command line).
perl -pi -e 's/\r/\n/g' WGS84_DD.txt # Convert CR (Mac) to LF (unix)
The “-p” flag makes perl operate on each line of the input text file; the “-i” flag makes perl edit the file in place (instead of creating a new file); “-e” allows a complete one-liner perl command to be run on the command line (instead of in a script). “s/\r/\n/g” is a regular expression substitution, where “s” stands for substitute (replace) and the trailing “g” makes the replacement global (all matches). Between the slashes: /pattern-to-be-replaced/replacement/.
perl -pi -e 's/\r\n|\r/\n/g' WGS84_DD.txt # Convert both CR (Mac) and CR+LF (Win) to LF (unix). Matching CR+LF before a lone CR avoids turning a Windows line break into two LFs.
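The same normalization can be done in Python. A small sketch (the function name is mine) that rewrites a file in place so that both CR and CR+LF become LF:

```python
def normalize_newlines(path):
    """Rewrite a text file in place so that CR (classic Mac) and CR+LF
    (Windows) line endings both become plain LF, as gdaltransform expects."""
    with open(path, "rb") as f:
        data = f.read()
    # Replace CR+LF first, so a Windows line break does not become LF+LF:
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(path, "wb") as f:
        f.write(data)
```

As with the perl one-liner, the order of the replacements matters: handling CR+LF before the lone CR avoids the double-newline problem.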
Online coordinate conversion tools
Coordinates can also be converted using an online conversion tool. Most of these provide only the conversion of one single coordinate tuple at a time.
The online coordinate conversion tool from MyGeodata (http://cs2cs.mygeodata.eu/) is the most useful online converter I could find. Keep the default input coordinate system (WGS84 (SRID=4326)) and give the output coordinate system as “EPSG:32633”. The tool accepts a list of input coordinate tuples in a wide variety of formats (including reverse latitude-longitude (y-x) ordering by ticking the bottom box: “Switch XY”).
The World Coordinate Converter (http://twcc.free.fr/) provides a nice interface for converting one coordinate tuple at a time. You may click in the map or type in the coordinate tuple to be converted. Source SRS: “GPS (WGS84) (deg)”. UTM33N is not included in the default list of target SRSs and needs to be added: click the green plus sign button at the bottom right, and type “EPSG:32633” into box number 2.
The conversion tool from EarthPoint (single point, batch conversion, http://www.earthpoint.us) provides an option for batch conversion of multiple coordinate tuples in a spreadsheet. However, you will need to register for a user account for this tool to convert more than 5 coordinate tuples at a time.
These are some notes for a student training course on species distribution modelling (BIO4115 and BIO9115) from October to December 2012 at the Natural History Museum, University of Oslo (NHM-UiO).
What is the Global Biodiversity Information Facility?
GBIF enables free and open access to biodiversity data online. We’re an international government-initiated and funded initiative focused on making biodiversity data available to all and anyone, for scientific research, conservation and sustainable development.
Darwin Core: What? Where? When?
Using the GBIF data portal:
- http://data.gbif.org/tutorial/tutorial (user manual)
- http://data.gbif.org/species/Beta+vulgaris (beet)
- http://data.gbif.org/species/Dracocephalum+ruyschiana (dragonhead)
- http://data.gbif.org/species/Saccharina+latissima (sugar kelp)
Using Artskart, Artsdatabanken:
Using the REST web-service:
- http://data.gbif.org/ws/rest/occurrence/help (user manual)
- scientific names and classification – species occurrence data – metadata on data providers – metadata on datasets – metadata on data networks
Examples for dragon head:
You may also want to use an R package to download GBIF presence data:
# get GBIF data with the gbif() function
# (reconstructed example; the dismo package and the object names
#  are assumptions, since only fragments of the original survived)
library(dismo)
betavulgaris <- gbif("Beta", "vulgaris", geo = T)
dragonhead <- gbif("Dracocephalum", "ruyschiana", geo = T)
sugarkelp <- gbif("Saccharina", "latissima", geo = T)
You may want to use R to plot a map with a preview of the point data:
# plot occurrences (reconstructed example; wrld_simpl comes with the
# maptools package, and the colour values here are assumptions)
library(maptools)
data(wrld_simpl)
plot(wrld_simpl, col = "khaki", axes = T)
points(betavulgaris$lon, betavulgaris$lat, col = "red", cex = 0.5)
# -- alternative for Dragonhead:
points(dragonhead$lon, dragonhead$lat, col = "red", cex = 0.5)
- http://macnhm19.uio.no/utm.cgi (this only provides a plain gdal transformation).
- Endresen, D.T.F., and H. Knüpffer (2012). The Darwin Core extension for genebanks opens up new opportunities for sharing germplasm data sets. Biodiversity Informatics 8:12-29.
NOVA PhD course, 22-27 January 2012 at Röstånga in Southern Sweden
Pre-breeding is an important element in broadening the genetic diversity of the food crops and in introducing new and useful traits and properties. New traits introduced in pre-breeding activities are particularly important for meeting the new challenges agriculture will face from ongoing climate change. The needed genetic diversity is often found outside the gene pool of cultivars and elite breeding lines, and sources of novel genetic diversity, such as primitive crops and even the wild relatives of the cultivated plants, are expected to receive increased attention as agriculture faces these new challenges.
The GBIF data portal provides information on in situ occurrences for many of the wild relatives to the cultivated plants that are not (yet) collected and accessioned by the ex situ seed genebank collections. The GBIF data portal will therefore provide a very valuable bridge between these data sources for genebank accessions and occurrence data sources outside of the genebank community. Occurrences from the GBIF data portal will assist in the identification of locations where potentially useful populations of crop wild relatives can be found. Ecological niche modeling provides a widely used approach for predicting species distributions and can be used for this purpose.
Recent work on predictive modeling to identify a link between useful crop traits and eco-geographic data associated with the source locations of germplasm may have particular value for pre-breeding efforts. The Focused Identification of Germplasm Strategy (FIGS) provides an approach for efficient identification of germplasm material with new and useful genetic diversity for a target trait property. Such predictive modeling approaches are of particular interest for pre-breeding because of the high costs of working with this material. Cultivated plants were domesticated for properties and traits, such as non-shattering seed behavior and more uniform harvest time, that make agricultural experiments easier and less costly to conduct. Non-domesticated germplasm material, as well as older cultivars and landraces, has many agro-botanical traits that were modified in modern cultivars to better suit agricultural practices and efficiency. Pre-breeding is largely about removing such undesired traits from the non-cultivated and less intensively domesticated material while maintaining the potentially useful traits.
NOVA PhD course home page (course code: 03-110404-412):
Plant genetic resources published to the GBIF data portal:
Last week (9 to 13 January 2012) we organized a workshop in Madrid (Spain) on predictive characterization using the Focused Identification of Germplasm Strategy (FIGS) for wild relatives of the cultivated plants (crop wild relatives). This workshop was part of the EU-funded PGR Secure project (EU 7th framework programme). The objective of the workshop was to use predictive computer modeling with R for data mining (trait mining) to identify genebank accessions and populations of crop wild relatives with a higher density of genetic variation for a target trait property (the response, or dependent, variable), using climate data and other environment data layers as the explanatory (independent) multivariate variables. We have previously validated the FIGS approach for landraces of wheat and barley. This study was one of the first attempts to validate the FIGS approach for other crops as well as for crop wild relatives (CWR). The crop landraces and crop wild relatives included in this study were: oats (Avena sp.), beet (Beta sp.), cabbage and mustard (Brassica sp.), and medick, including alfalfa/lucerne (Medicago sp.). We made good progress on the methodology, but also faced some major obstacles related to data availability.
A major effort was made by the PGR Secure project team in collecting data for genebank accessions and other occurrence data for crop wild relatives from the GBIF portal, the EURISCO portal, the European Central Crop Databases (ECCDBs), the USDA Germplasm Resources Information Network (GRIN) and the Canadian Germplasm Resources Information Network (CA-GRIN), as well as other online sources. The trait characterization and evaluation data were mostly collected from the European Central Crop Databases and the USDA GRIN, but other online sources of trait data were also explored. In total, more than 33 000 occurrence records and genebank accessions were collected and georeferenced. Approximately 18 000 of these occurrences were considered to have acceptable georeferenced coordinate quality. The availability of trait data was much more limited. The typical number of trait data points per species was below 10, although at least some species had a few hundred. However, when matching the occurrences and accessions that have trait data with the germplasm material that has acceptable georeferenced coordinates, the number per species dropped dramatically, leaving fewer than 50 records per species even in the best cases.
The predictive computer models used for trait mining in this workshop were calibrated using the Random Forest algorithm. Random Forest can be used both for regression with continuous trait variables and for classification with ordinal and categorical trait variables. Because of the low number of records in the final datasets, we did not succeed in calibrating any predictive models from these sets. We therefore focused our efforts on developing the method for trait mining with FIGS in R, using a dataset for stem rust on wheat made available by the USDA GRIN. This was the same dataset as explored by Endresen et al. (2011) and by Bari et al. (2012). In addition to the Random Forest algorithm, we also started to explore k Nearest Neighbors (kNN), Boosted Regression Trees (BRT) and Parallel Factor Analysis (PARAFAC).
This workshop focused on the predictive computer modeling approach of FIGS (trait mining). A previous workshop for the PGR Secure project was organized in Rome in the autumn of 2011, focusing on an alternative approach to conducting a FIGS study, based on collecting information on the environmental conditions most likely to support the adaptive development of the target trait property. It is important to note that the approach taken by this second FIGS workshop, calibrating a predictive computer model, requires a priori known trait data to be used as a training set, while the alternative approach, based on collecting expert knowledge of the suitable environmental patterns, can be conducted without such a training set.
The workshop participants included: Dag Endresen (GBIF), Imke Thormann (Bioversity), Jacob van Etten (Bioversity), José María Iriondo (Universidad Rey Juan Carlos), Mauricio Parra Quijano (Universidad Politécnica de Madrid), María Luisa Rubio (Universidad Rey Juan Carlos), Shelagh Kell (University of Birmingham), Sónia Dias (Bioversity), Rosa García (Centro Recursos Fitogenéticos, INIA). (Abdallah Bari [ICARDA] was unfortunately not able to join the workshop because of problems with the Visa entry to EU).
- PGR Secure, http://www.pgrsecure.org/
- R, http://www.r-project.org/
- Mendeley FIGS, http://www.mendeley.com/groups/502321/trait-mining-figs/
- GBIF portal, http://data.gbif.org
- EURISCO portal, http://eurisco.ecpgr.org/
- ECCDB, http://www.ecpgr.cgiar.org/germplasm_databases.html
- USDA GRIN, http://www.ars-grin.gov/
- Canadian GRIN, http://pgrc3.agr.gc.ca/index_e.html
- PARAFAC, http://www.models.life.ku.dk/~rasmus/presentations/parafac_tutorial/paraf.htm
Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS)
Dag Terje Filip Endresen, Kenneth Street, Michael Mackay, Abdallah Bari, Ahmed Amri, Eddy De Pauw, Kumarse Nazari, and Amor Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science 52(2):764-773. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec. 2011.
This experiment validates the FIGS approach in a “blind” study where the person conducting the predictive computer modeling did not know the actual trait scores for the test set. We explored a new dataset with measurements of susceptibility to a new strain of stem rust (Puccinia graminis Pers. f.sp. tritici Eriks. & Henn.) typified to race TTKSK and known as Ug99. The screening experiment for Ug99 was made in Yemen in 2008. The total dataset included 4563 landraces of bread wheat (Triticum aestivum L. ssp. aestivum) and durum wheat (Triticum turgidum ssp. durum (Desf.) Husn.). A training set of 825 landraces was prepared for the data modeling, while the true trait scores of the remaining 3738 landrace accessions remained unknown to the person calibrating the data models. The predictive performance using the FIGS approach was 2.3 times higher than that of a random sampling of accessions.
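A "2.3 times higher than random" score can be read as an enrichment factor: the hit rate among the accessions selected by the predictive model, divided by the base hit rate expected from a random sample of the same size. A minimal sketch with made-up numbers (not the actual study data):

```python
def enrichment(selected_hits, selected_total, population_hits, population_total):
    """Ratio of the hit rate in the model-selected subset to the base rate
    in the whole population (the rate a random sample would achieve)."""
    selected_rate = selected_hits / selected_total
    base_rate = population_hits / population_total
    return selected_rate / base_rate

# e.g. 23 resistant accessions among 100 selected, against a 10% base rate:
print(round(enrichment(23, 100, 1000, 10000), 1))  # prints 2.3
```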
Focused Identification of Germplasm Strategy (FIGS) was proposed in the 1990s by Michael Mackay as an approach to identify useful traits in crops and the relatives of the cultivated plants. FIGS is based on finding a link between the eco-climatic attributes for the collecting sites and source locations for germplasm resources and their useful trait properties.