Downloading occurrence data using the GBIF REST API

September 11, 2015

 

Several R packages (such as rgbif and dismo) facilitate the download of GBIF-mediated occurrence records. However, these packages download GBIF data by paging through the results (maximum 200 000 records at a time [link1]), which can take a very long time for species with many occurrence records.

[link1] http://lists.gbif.org/pipermail/api-users/2015-April/000149.html

The asynchronous download described on the GBIF portal API documentation page provides a much faster and more reliable option for large sets of records. Here, Markus describes an approach where you write your filter condition in a JSON file and issue a curl request to post the download request to the GBIF servers.

http://www.gbif.org/developer/summary#authentication
http://www.gbif.org/developer/occurrence
http://lists.gbif.org/pipermail/api-users/
http://en.wikipedia.org/wiki/CURL
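
The file-based approach could look roughly like the sketch below. It assumes a filter file named filter.json (a name chosen here for illustration) and placeholder values for the GBIF user name, password and email. The helper names write_filter and post_filter are hypothetical and not part of any GBIF tooling.

```shell
# Write the download filter to a JSON file (the file name is an assumption).
write_filter() {
  local taxon_key="$1" outfile="${2:-filter.json}"
  cat > "$outfile" <<EOF
{
  "creator": "_YOUR_GBIF_USER_NAME_",
  "notification_address": ["_YOUR_EMAIL_"],
  "predicate": {
    "type": "equals",
    "key": "TAXON_KEY",
    "value": "$taxon_key"
  }
}
EOF
}

# Post the filter file to the GBIF download API; curl reads the body from the file.
post_filter() {
  curl -i --user _YOUR_GBIF_USER_NAME_:_YOUR_GBIF_PASSWORD_ \
    -H "Content-Type: application/json" -H "Accept: application/json" \
    -X POST -d @"${1:-filter.json}" \
    http://api.gbif.org/v1/occurrence/download/request
}

# Usage (sends a real request, so it is left commented out):
# write_filter 5289784 && post_filter
```

Because the filter lives in a file, each new species means rewriting the file before posting again.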

I followed these instructions using bash, the Unix command-line shell, on my MacBook. I assume it is possible to do something similar in MS-DOS, but I would rather suggest installing a bash prompt or using another programming environment such as Python, PHP, Perl, Ruby, … if you are stuck on a Windows computer 😉

http://en.wikipedia.org/wiki/Bash_%28Unix_shell%29

I noticed that keeping the query filter condition in a separate JSON file (as described in the GBIF API documentation) did not let me easily loop through a long list of species names to issue separate asynchronous download requests. So instead, I wrote a bash function that takes only the species (actually the speciesKey, passed as the taxonKey) as its input parameter. I used R to find the speciesKey values.

ftp://norbif.uio.no/pub/outgoing/GBIF_pooideae/script/gbifapi.sh

I believe that you can give a familyKey, genusKey, speciesKey, etc. as the input taxonKey and that the script will work just as well, but I did not actually test this.

http://www.gbif.org/developer/occurrence#parameters
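
I looked up the keys in R, but a similar lookup should be possible from bash. The sketch below assumes the GBIF species match service (/v1/species/match), whose JSON response carries a usageKey field that can serve as the taxonKey. The helper names are hypothetical, and the grep-based parsing is a shortcut; a proper JSON parser would be more robust.

```shell
# Extract the numeric usageKey from a species-match JSON response on stdin.
extract_usage_key() {
  grep -o '"usageKey"[[:space:]]*:[[:space:]]*[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Look up a name; -G with --data-urlencode lets curl URL-encode the name.
gbif_match() {
  curl -s -G --data-urlencode "name=$1" http://api.gbif.org/v1/species/match | extract_usage_key
}

# Usage (needs network access):
# gbif_match "Aegilops comosa"
```

It is worth checking the returned key by eye for a few names, since a fuzzy match may resolve to a different taxon than intended.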

---
--- Provide your GBIF user name, your password and your email
--- (replace in the code below)
--- Copy the function to memory and paste into a bash command line prompt.
---
function gbifapi {
  # $1 = taxonKey, $2 = species name (used only as a label in the log)
  curl -i --user _YOUR_GBIF_USER_NAME_:_YOUR_GBIF_PASSWORD_ \
    -H "Content-Type: application/json" -H "Accept: application/json" \
    -X POST -d "{\"creator\":\"_YOUR_GBIF_USER_NAME_\", \"notification_address\": [\"_YOUR_EMAIL_\"], \"predicate\": {\"type\":\"and\", \"predicates\": [{\"type\":\"equals\",\"key\":\"HAS_COORDINATE\",\"value\":\"true\"}, {\"type\":\"equals\", \"key\":\"TAXON_KEY\", \"value\":\"$1\"}] }}" \
    http://api.gbif.org/v1/occurrence/download/request >> log_gbifapi.txt
  # Append the key and label, then a separator line, after each API response
  echo -e "\r\n$1 $2\r\n\r\n----------------\r\n\r\n" >> log_gbifapi.txt
}
---
---

You will notice from the code that I log all the download requests in a log-file (log_gbifapi.txt) in the current directory of the bash command line prompt.

To call the bash function, I created a list in a spreadsheet: the first column holds the name of the function (gbifapi), the second column the respective speciesKey values (which I found using R), and the third column the species name (for no other reason than to give myself a human-readable label).

---
--- Copy to memory and paste into bash command line prompt
--- (or run as script)
---
gbifapi 4140730 "Aciachne acicularis"
gbifapi 4140704 "Aciachne flagellifera"
gbifapi 5289784 "Aegilops comosa"
gbifapi 4138203 "Aegilops mutica"
--- ... etc
---

I noticed that pasting the full list of 300 species API calls into my bash command line prompt caused some kind of time-out error. So I split the list into segments of about 30 species and let each segment's download requests be placed at the GBIF server before pasting the next. Once this was done, all my species download requests were queued at the GBIF servers for asynchronous download. Some species completed in a few minutes, while those with the most occurrences could take up to an hour. After one day, all 300 species download requests were completed.
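
Rather than pasting segments by hand, the same batching could be scripted. The sketch below assumes a tab-separated file (hypothetically species.tsv) with the taxonKey in the first column and the species name in the second, and that the gbifapi function above is already defined in the shell. The batch size of 30 mirrors the segments I used; the pause length is an arbitrary assumption.

```shell
# Call gbifapi for every line of a key<TAB>name file, pausing between batches.
# Arguments: file, batch size (default 30), pause in seconds (default 60).
submit_in_batches() {
  local file="$1" batch="${2:-30}" pause="${3:-60}" n=0 key name
  local tab="$(printf '\t')"
  while IFS="$tab" read -r key name || [ -n "$key" ]; do
    [ -z "$key" ] && continue            # skip blank lines
    gbifapi "$key" "$name" < /dev/null   # keep gbifapi from consuming the list on stdin
    n=$((n + 1))
    if [ $((n % batch)) -eq 0 ]; then
      sleep "$pause"                     # give the server time before the next batch
    fi
  done < "$file"
  echo "$n requests submitted"
}

# Usage:
# submit_in_batches species.tsv 30 60
```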

All completed download files are listed in your user profile at the GBIF portal, and you can simply pick them up there:

http://www.gbif.org/user/download

The log file log_gbifapi.txt captures the response from the GBIF API for each species download request. Here, the downloadKey for each data file is provided, though not as a clean attribute, so some regular-expression text cleaning is needed. I have not yet finished turning this step into a script…

ftp://norbif.uio.no/pub/outgoing/GBIF_pooideae/script/gbifkeys.sh
http://lists.gbif.org/pipermail/api-users/2015-April/000157.html
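
One possible way to do the cleaning is sketched below. It assumes that download keys have the form of 7 digits, a dash, and 15 digits (for example 0001234-150811131857512, a made-up key), which is worth checking against your own log. The function name is hypothetical. Since the same key may appear more than once per response (for instance in a header and in the body), duplicates are dropped.

```shell
# Print each distinct download key found in the log, in order of first appearance.
# Assumes keys look like 0001234-150811131857512 (7 digits, dash, 15 digits).
extract_download_keys() {
  grep -Eo '[0-9]{7}-[0-9]{15}' "${1:-log_gbifapi.txt}" | awk '!seen[$0]++'
}

# Usage:
# extract_download_keys log_gbifapi.txt > download_keys.txt
```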

With the cleaned list of download keys, you could issue a set of e.g. wget commands to collect your download files from the GBIF server.

ftp://norbif.uio.no/pub/outgoing/GBIF_pooideae/script/gbifwget.sh
http://en.wikipedia.org/wiki/Wget
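
A minimal sketch of that last step follows. It assumes a file of cleaned keys, one per line (hypothetically download_keys.txt), and that the file for a given key is served at /v1/occurrence/download/request/<key>.zip; that URL pattern is my reading of the GBIF API documentation and should be verified before use. The function names are hypothetical.

```shell
# Build the download URL for one key (the URL pattern is an assumption).
download_url() {
  echo "http://api.gbif.org/v1/occurrence/download/request/$1.zip"
}

# Fetch every key listed in a file, one key per line, saving each as <key>.zip.
fetch_downloads() {
  local key
  while read -r key || [ -n "$key" ]; do
    [ -z "$key" ] && continue   # skip blank lines
    wget -O "$key.zip" "$(download_url "$key")"
  done < "${1:-download_keys.txt}"
}

# Usage (needs network access):
# fetch_downloads download_keys.txt
```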

References

Scott Chamberlain, Karthik Ram, Vijay Barve, Dan Mcglinn (2015). rgbif: Interface to the Global Biodiversity Information Facility ‘API’. Available at http://cran.r-project.org/web/packages/rgbif/index.html and https://github.com/ropensci/rgbif

Robert J. Hijmans, Steven Phillips, John Leathwick and Jane Elith (2015). dismo: Species Distribution Modeling. Available at http://cran.r-project.org/web/packages/dismo/index.html

REcology (2012) GBIF biodiversity data from R, more functions. Available at http://recology.info/2012/10/rgbif-newfxns/
