Persistent identifiers at GBIF.no
The Norwegian GBIF node (GBIF.no) has chosen UUIDs (universally unique identifiers) prefixed by a PURL (persistent URL) as the preferred format for the Darwin Core occurrenceID. The PURL prefix makes the persistent identifier (PID) resolvable over HTTP, the protocol used by linked open data. Resolvable means that a data user (or a machine) who comes across the PID can look up useful information about the thing it identifies. The PURL redirects to a resolver service located at data.gbif.no/resolver/.
As part of the data publishing process, the GBIF node monitors and scans all datasets published in GBIF by Norwegian institutions and sets up the resolver service for each herbarium specimen or observation. The data publishing institutes can easily generate the UUIDs locally, without any central coordination, and the Norwegian GBIF node establishes the resolver service during data publication.
By default, our resolver service provides an HTML information page, and it accepts HTTP content negotiation to deliver other, machine-readable formats such as JSON-LD, RDF, N3/Turtle, comma-separated values, or tab-delimited text. (You may also simply append an extension such as “.json” to the PID string to preview the result you would get from content negotiation.)
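A minimal sketch of what content negotiation against the resolver might look like from a client, using only the Python standard library. The PID is a made-up example with a placeholder PURL prefix, and the mapping from format names to media types is an assumption: these are the standard registered media types, but which of them the resolver actually honours is not stated here. The request is only constructed, not sent.

```python
import urllib.request

# Standard media types for the formats mentioned above. Which of these the
# resolver actually honours is an assumption, not taken from its documentation.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "rdf": "application/rdf+xml",
    "turtle": "text/turtle",
    "n3": "text/n3",
    "csv": "text/csv",
    "tsv": "text/tab-separated-values",
}

# Hypothetical PID: placeholder PURL prefix plus an arbitrary example UUID.
EXAMPLE_PID = "http://purl.example.org/id/123e4567-e89b-12d3-a456-426614174000"


def negotiated_request(pid: str, fmt: str) -> urllib.request.Request:
    """Build (but do not send) a GET request asking for the given format."""
    return urllib.request.Request(pid, headers={"Accept": MEDIA_TYPES[fmt]})


def preview_url(pid: str, extension: str) -> str:
    """Return the PID with a file extension appended, e.g. '.json'."""
    return pid + "." + extension.lstrip(".")


request = negotiated_request(EXAMPLE_PID, "json-ld")
print(request.get_header("Accept"))      # the Accept header that would be sent
print(preview_url(EXAMPLE_PID, "json"))  # the extension-based shortcut
```

Passing the request to `urllib.request.urlopen` would perform the lookup and follow the PURL redirect.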
Because of this indirection, the PURL configuration could, for example, be updated to redirect to gbif.org instead of gbif.no if GBIF chooses to establish a resolver service.
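The redirection can be pictured as a single configurable mapping from the PURL prefix to whichever resolver currently serves the identifiers. This is a sketch under assumptions: the PURL prefix, the `https` scheme, the gbif.org target URL, and the idea that the UUID is simply appended to the target are all illustrative, not the actual configuration.

```python
# Hypothetical PURL prefix; the real prefix is not given in this text.
PURL_PREFIX = "http://purl.example.org/id/"


def redirect_target(pid: str, resolver_base: str) -> str:
    """Return the URL a PURL server would redirect this PID to."""
    if not pid.startswith(PURL_PREFIX):
        raise ValueError(f"not a PID under {PURL_PREFIX}: {pid}")
    local_part = pid[len(PURL_PREFIX):]
    return resolver_base + local_part


pid = PURL_PREFIX + "123e4567-e89b-12d3-a456-426614174000"

# Current situation: redirect to the resolver at data.gbif.no/resolver/.
print(redirect_target(pid, "https://data.gbif.no/resolver/"))

# Possible future situation: a resolver at gbif.org (this URL is made up).
print(redirect_target(pid, "https://www.gbif.org/resolver/"))
```

The published identifier `pid` is the same in both cases; only the redirect target changes.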
Why did we choose UUIDs?
UUIDs can easily be generated locally by the herbarium or the researcher producing or managing the datasets. Robust UUIDs can even be generated offline.
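As an illustration, a random (version 4) UUID can be minted with the Python standard library alone, with no network access or central registry. The PURL prefix below is a placeholder, and the choice of version 4 is an assumption; the text does not say which UUID version is recommended.

```python
import uuid

# Placeholder PURL prefix, used for illustration only.
PURL_PREFIX = "http://purl.example.org/id/"


def mint_occurrence_id() -> str:
    """Mint a new occurrenceID: PURL prefix + random (version 4) UUID."""
    return PURL_PREFIX + str(uuid.uuid4())


new_id = mint_occurrence_id()
print(new_id)
```

An identifier like this should be minted once and stored with the record; generating a fresh UUID at every export would defeat its persistence.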
The UUIDs themselves provide a globally unique identifier that is not dependent on the PURL prefix. This will allow us to more easily migrate to other resolver solutions if this is required or requested in the future. In fact we expose the UUID in URN format as the “pure” form of the identifier.
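The relationship between the two forms can be shown in a few lines. This sketch assumes that the UUID is the last path segment of the PID, which holds for the placeholder prefix used here but is an assumption about the real identifiers.

```python
import uuid


def pure_uuid_urn(pid: str) -> str:
    """Strip any prefix from a PID and return its UUID in URN form."""
    last_segment = pid.rstrip("/").rsplit("/", 1)[-1]
    # uuid.UUID raises ValueError if the segment is not a valid UUID.
    return uuid.UUID(last_segment).urn


pid = "http://purl.example.org/id/123e4567-e89b-12d3-a456-426614174000"
print(pure_uuid_urn(pid))  # urn:uuid:123e4567-e89b-12d3-a456-426614174000
```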
We encourage other data aggregation services to build resolver services (and other services) that reuse the “pure” UUID form (without the PURL prefix).
Other Darwin Core identifier terms
We also use the very same PURL + UUID format for some of the other identifier terms in Darwin Core, most notably eventID and taxonID. Some of the first data publishing institutes in Norway have used the PURL + UUID format for eventID, and we are now expanding the resolver service to handle these PIDs as well. The same format can be used, as it is, for any of the Darwin Core identifier terms. The main challenge is that the Darwin Core archive format is highly denormalized, and Darwin Core terms are not always unambiguously assigned to a particular class (type of thing, e.g. occurrence, organism, taxon, event, location). It is therefore not always easy to know exactly which attributes in a Darwin Core archive describe the identified thing.
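As a rough illustration of a flat record carrying several identifier terms, the sketch below checks which of them follow the PURL + UUID pattern. The record values, the placeholder PURL prefix, and the helper name are all made up for the example.

```python
import uuid

PURL_PREFIX = "http://purl.example.org/id/"  # placeholder prefix

# Darwin Core identifier terms discussed in this section.
IDENTIFIER_TERMS = ("occurrenceID", "eventID", "taxonID")


def is_purl_uuid(value: str) -> bool:
    """True if value is the PURL prefix followed by a string uuid.UUID accepts."""
    if not value.startswith(PURL_PREFIX):
        return False
    try:
        uuid.UUID(value[len(PURL_PREFIX):])
    except ValueError:
        return False
    return True


# One flat (denormalized) row: occurrence, event and taxon share a record.
record = {
    "occurrenceID": PURL_PREFIX + "123e4567-e89b-12d3-a456-426614174000",
    "eventID": PURL_PREFIX + "123e4567-e89b-12d3-a456-426614174001",
    "taxonID": "local-taxon-42",  # an invented legacy local identifier
    "scientificName": "Betula nana",
}

for term in IDENTIFIER_TERMS:
    print(term, is_purl_uuid(record.get(term, "")))
```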
Whenever possible, we encourage the use of external systems such as Geonames or GRBio to describe things. We always recommend that the data publishing institute report the Geonames PID as dwc:locationID and the GRBio PID as dwc:institutionID or dwc:collectionID, and not generate a separate new UUID for these!
Endresen D and Svindseth C (2014). Persistent identifiers for museum specimens in Norway. [plenary] Proceedings of TDWG 2014, Jönköping, Sweden. doi:10.13140/2.1.4516.9606
Hagedorn G, Catapano T, Güntsch A, Mietchen D, Endresen D, Sierra S, Groom Q, Biserkov J, Glöckler F, and Morris R (2013). ‘Best practices for stable URIs’, http://wiki.pro-ibiosphere.eu/wiki/Best_practices_for_stable_URIs
Obreza M and Endresen D (2015). ‘Persistent identifiers for germplasm’. White paper, February 2015. https://goo.gl/3voD5K
FAO (2014) ‘Technical options to facilitate the establishment of data links in the field of plant genetic resources for food and agriculture: Permanent unique identifiers, IT/COGIS-1/15/3, November 2014’, International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA), Food and Agriculture Organization of the United Nations (FAO), Rome, Italy, www.planttreaty.org/sites/default/files/cogis1w3.pdf
Guralnick RP, Cellinese N, Deck J, Pyle RL, Kunze J, Penev L, Walls R, Hagedorn G, Agosti D, Wieczorek J, Catapano T, and Page RDM (2015). Community next steps for making globally unique identifiers work for biocollections data, ZooKeys 494:133–154, doi:10.3897/zookeys.494.9352
- Many of the original source datasets indexed by GBIF are regularly updated and re-indexed by the GBIF portal. Without stable and persistent identifiers, information on the same herbarium specimen (or species observation) is sometimes included more than once, leading to duplicated information: more than one (unlinked) data record for the same Real World entity.
- Without stable and persistent identifiers for herbarium specimens (and species observations), it is difficult to link the same data record across different re-indexing cycles of the GBIF portal. When a previously indexed data record is not re-identified in a new version of a given dataset, the record is deleted from the portal, and the link to previous versions of this data record is lost.
- A composite identifier (called the Darwin Core triplet) based on a combination of the metadata attributes for institution code (dwc:institutionCode), collection code (dwc:collectionCode), and the local specimen identifier (dwc:catalogNumber) is often used as the specimen identifier in GBIF. However, all three metadata attributes can (and do) sometimes change.
- What could be a best practice guideline for identifier resolution? Is it useful to define and agree on one (or a set of) common and well-defined response formats? Is it useful to provide recommendations for a set of metadata profiles, each with a clearly defined set of metadata attributes? Or would more general principles and more open recommendations be more likely to stand the test of time and remain relevant as new information infrastructure technologies emerge?
- The challenges, pros, and cons of reusing object identifiers and metadata attribute terms declared by others, without full control over how these objects and terms are maintained. Objects and concepts declared for one purpose will often not exactly match the needs of another. How can we optimally reuse each other's OWL ontologies, metadata vocabularies, and data object models?
- Should identifiers identify the Real World physical objects, the entities that collection curators and users of the information care about, or should they be assigned to database records? Real World entities do not have a signature byte sequence, so deciding when an object is considered to be the same thing relies on interpretation.
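To make the re-indexing problem from the first three points concrete, the following sketch matches two versions of the same dataset, once by Darwin Core triplet and once by occurrenceID. All record values are invented, and the matching logic is a simplification for illustration, not a description of how the GBIF portal indexes data.

```python
def triplet(record: dict) -> str:
    """Build the Darwin Core triplet used as a composite identifier."""
    return ":".join(
        (record["institutionCode"], record["collectionCode"], record["catalogNumber"])
    )


def rematch(old: list, new: list, key) -> tuple:
    """Return (kept, lost, added) identifier sets between two dataset versions."""
    old_keys = {key(r) for r in old}
    new_keys = {key(r) for r in new}
    return old_keys & new_keys, old_keys - new_keys, new_keys - old_keys


UUID_A = "urn:uuid:123e4567-e89b-12d3-a456-426614174000"

version_1 = [{"occurrenceID": UUID_A, "institutionCode": "INST",
              "collectionCode": "COLL-A", "catalogNumber": "12345"}]
# Same specimen after the collection code was renamed (invented change).
version_2 = [{"occurrenceID": UUID_A, "institutionCode": "INST",
              "collectionCode": "COLL-B", "catalogNumber": "12345"}]

# Triplet matching: the old record looks deleted and a "new" one appears.
print(rematch(version_1, version_2, triplet))
# occurrenceID matching: the record is recognised as the same specimen.
print(rematch(version_1, version_2, lambda r: r["occurrenceID"]))
```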