DOIs for genebank collections
Digital object identifiers (DOI) for the plant genetic resources community
To advance collaborative regional or global efforts to document genebank collections, a persistent identifier such as the Life Science Identifier (LSID) or the digital object identifier (DOI) is required. This text focuses on the utility of DOIs for this purpose. It expresses my personal opinions, but also constitutes my contribution to recent discussions with colleagues from the Dutch genebank, Bioversity, the Nordic genebank and the Russian genebank.
The DOI is a name, not a location. The DOI is persistent and actionable. The DOI identifies a digital object. Thus, the DOI does not really identify the printed book, the printed journal manuscript, or a genebank seed sample. It could instead be seen as a name or identifier for the digital metadata object that uniquely identifies and describes the genebank accession. For genebanks, the metadata that uniquely identify the accession are often called passport data. The passport data would thus here be seen as the “data”. A digital book or manuscript can be identified directly, but even here it is often convenient to resolve by default to the metadata about the digital book or manuscript. DOI is a cross-sector, not-for-profit effort (ISO TC46/SC9). DOI was founded in 1998. An overview document is provided by DOI: http://dx.doi.org/10.1000/203
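Since the DOI is a name rather than a location, it becomes actionable only when it is handed to a resolver. The following is a minimal sketch of that step in Python, assuming the dx.doi.org proxy used in the link above. The function name `resolver_url` is hypothetical and only for illustration.

```python
from urllib.parse import quote


def resolver_url(doi: str, proxy: str = "http://dx.doi.org/") -> str:
    """Turn a DOI name into a URL at a resolver proxy.

    The proxy default is the one used in the overview link above;
    the function itself is an illustrative assumption, not an official API.
    """
    # Drop the optional "doi:" label, which is not part of the DOI itself.
    name = doi[4:] if doi.lower().startswith("doi:") else doi
    # Percent-encode unusual characters but keep "/" as the separator.
    return proxy + quote(name.strip(), safe="/")


print(resolver_url("doi:10.1000/203"))
```

The name stays the same even if the proxy or the final landing page changes, which is the point of separating the name from the location.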
To obtain a domain prefix (for example GENESYS, EURISCO or NordGen), it needs to be registered with an authorized DOI Registration Agency. DataCite (http://datacite.org/) could be the most appropriate DOI Registration Agency for the genebank community. I can’t find a direct quote on the cost of obtaining a domain prefix with DataCite, but I believe that it would be around USD $1000. DOI, as the top-level foundation, does not specify how much a Registration Agency should charge for the prefix.
“DataCite supports data centers by providing workflows and standards for data publication. For more information on how you can register your data. Contact: firstname.lastname@example.org”
One useful example of syntax could perhaps be:
doi:10.genesys/nld37/2445 for accession number 2445 at WUR CGN
Following the same pattern, NordGen would get:
doi:10.genesys/swe54/NGB1212 for accession number NGB1212
|Part|Meaning|
|---|---|
|doi:|Just the label marking the string as a DOI (not part of the DOI per se)|
|10.|This is the DOI prefix in the Handle system (always 10 for DOI)|
|genesys|This is the domain prefix that costs USD $1000|
|/nld37|The first part of the “local” identifier could eg be the WIEWS code|
|/2445|The next part could be the genebank catalog number|
|/batch1|The genebank could eg build further on the DOI…|
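The syntax above can be sketched in a few lines of Python. This is only an illustration of the proposal in this post: the `10.genesys` prefix is the hypothetical example used here, not a registered prefix, and the names `AccessionDoi`, `build_doi` and `parse_doi` are made up for the sketch. It also assumes that neither the WIEWS code nor the accession number contains the "/" separator.

```python
from typing import NamedTuple, Optional


class AccessionDoi(NamedTuple):
    domain: str                   # hypothetical domain prefix, eg "genesys"
    institute: str                # WIEWS-style institute code, eg "nld37"
    accession: str                # genebank catalog number, eg "2445"
    suffix: Optional[str] = None  # optional extension, eg "batch1"


def build_doi(domain: str, institute: str, accession: str,
              suffix: Optional[str] = None) -> str:
    """Assemble a DOI string following the syntax proposed in this post."""
    parts = [f"10.{domain}", institute, accession]
    if suffix:
        parts.append(suffix)
    return "doi:" + "/".join(parts)


def parse_doi(text: str) -> AccessionDoi:
    """Split a DOI of the proposed form back into its parts."""
    body = text[4:] if text.lower().startswith("doi:") else text
    parts = body.split("/")
    # Expect prefix, institute, accession and at most one suffix.
    if len(parts) not in (3, 4) or not parts[0].startswith("10."):
        raise ValueError(f"not a DOI of the proposed form: {text!r}")
    domain = parts[0][len("10."):]
    suffix = parts[3] if len(parts) == 4 else None
    return AccessionDoi(domain, parts[1], parts[2], suffix)


print(build_doi("genesys", "nld37", "2445"))
print(parse_doi("doi:10.genesys/swe54/NGB1212/batch1"))
```

Accession numbers that themselves contain a slash would break this simple split, which is one practical reason to settle the separator question before assigning any DOIs.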
Issues to discuss:
* Buying one DOI domain prefix for genesys will save money for each genebank institute. I think that everybody will be very happy if, for example, Bioversity offers to do this! The alternative is for each genebank (or group of genebanks) to buy one DOI prefix each. For each genebank to buy its own DOI prefix would be the preferred option from a long-term persistence perspective, as it is conceivable that different genebank institutes will remain operational for different periods of time into the future, possibly with the need to resolve the DOIs for longer than Bioversity will exist.
* Another alternative, which I think was proposed by the DOI Foundation at one of the TDWG persistent identifier workshops that I attended some years back, is that DOI would not charge a one-time fee for the prefix domain, but instead a fee for each DOI of around 5 cents (USD $0.05). Even if this option is available, I think that USD $1000 is still a manageable amount to pay, and it will give so much more flexibility to design the syntax of the DOI (to include the WIEWS codes and accession numbers).
* Domain prefix: genesys, GENESYS, GeneSys, …? (EURISCO, SINGER, GRIN, NordGen, NORDGEN,…?). Even if we start using DOIs with genesys, other institutes could of course buy a new prefix and use this – if they would like to do so.
* Remember that the same accession represented in GeneSys, EURISCO and NordGen should be assigned different DOIs even if all these DOIs would be issued from the same DOI domain. These are different data objects and relations between them may very well need to be stated.
* First part of the local identifier could be: nld37, nld037, nld0037, NLD037, …? The FAO WIEWS codes were initially the three-letter country acronym followed by a three-digit number. However, at least for the USA, FAO has almost reached 1000 institutes (see USA998) and may need to issue WIEWS institute codes with a four-digit number. If we start to pad the previous WIEWS codes to make four-digit numbers, who would guarantee that we will not eventually get more than 10 000 institutes in one country (new codes when institutes merge or split, etc.)? Perhaps the FAO WIEWS could drop the padding zeroes altogether and make the present codes with padding zeroes synonyms or aliases of the codes without any padding zeroes…? NLD037 = NLD37. The FAO WIEWS codes could perhaps also be made case-insensitive…? We would in any case need to decide on an upper-case or lower-case three-letter country acronym for the DOI. Perhaps all upper case is what people are most used to seeing…? I think that all lower case “looks” better, but that is secondary of course.
* Next part of the local identifier could be the catalog number. We are used to accession numbers for this.
* The experts on persistent identifiers would argue against using any string that has (or could have to someone) a semantic meaning. If the semantic meaning changes, people might be tempted to change the string… As an example, when the Nordic Gene Bank (NGB) became NordGen, some were asking if the accession numbers should be updated from eg NGB1212 to NordGen1212…!! Well, I said absolutely not! And they have at least so far remained with the NGB prefix. Similarly, the DOIs could be vulnerable when the WIEWS codes and the accession numbers are used. However, if things change so much that the WIEWS code or the accession number changes, I think there are other, larger issues to look at. And issuing some new DOIs for the accessions would not be any problem anyway (even if best avoided, of course).
* The last part of the local identifier could be a version number. Perhaps the last part would be a number to indicate the regeneration cycle of the accession; perhaps there are other internal genebank management operations that are useful to identify…? We could thus, for example, suggest separating such a suffix off at the end of the DOI. And there could be different DOIs with and without the suffix, to indicate the accession in a specific context and in its more general identity.
* Another issue to consider would be the separator to use inside the DOI. In my example above, I used the slash character: “/”. If we sometimes use the DOIs as part of URLs, then this would make for a nice division into folders and sub-folders… If this is undesired, the simple dot character “.” or similar can be chosen.
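The idea raised in the list above, of treating NLD037, nld0037 and NLD37 as aliases of one code, could be sketched as follows. This is only an illustration under my own assumptions: a code is taken to be three letters followed by digits, and the function name `normalize_wiews` is hypothetical, not part of any FAO WIEWS service.

```python
import re

# Assumed shape of a WIEWS code: three letters, optional padding zeroes, digits.
_WIEWS_PATTERN = re.compile(r"^([A-Za-z]{3})0*(\d+)$")


def normalize_wiews(code: str, upper: bool = False) -> str:
    """Drop padding zeroes and fix the letter case of an institute code."""
    match = _WIEWS_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"unexpected institute code: {code!r}")
    country, number = match.groups()
    country = country.upper() if upper else country.lower()
    return f"{country}{number}"


# All spellings of the same institute collapse to one form.
for spelling in ("NLD037", "nld0037", "NLD37", "nld37"):
    print(spelling, "->", normalize_wiews(spelling))
```

Lower case is the default here only because it matches the examples earlier in this post; passing `upper=True` gives the all-upper-case variant instead.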
Michael Hausenblas of the Linked Data Research Centre in Ireland has proposed to assist the genebank community with such solutions as described here for the Svalbard Seed Vault. An efficient mechanism to share data between GENESYS and the Svalbard Seed Vault would be very useful – as only one example of use for such persistent identifiers.
Please feel free to join the discussion at the Agricultural Biodiversity Weblog