Linking occurrences :: GBIF Training Courses

Reminder: GBIF occurrences

According to the Darwin Core documentation an occurrence is:

An existence of a dwc:Organism at a particular place at a particular time.

— http://rs.tdwg.org/dwc/terms/Occurrence

In the context of natural history collections, an occurrence is often a digitized specimen record. This specimen can be a preserved specimen, a fossil specimen or a living specimen. But these aren’t the only types of GBIF occurrence records that can be found on GRSciColl. Some occurrence records come from environmental samples, and others have been generated by processing the information in books and research publications (see for example the work of PLAZI on GBIF).

Occurrences shared on GBIF.org can come from a wide range of data providers: museums, herbaria, journals, etc.

The breadth of information associated with a given occurrence can vary. Sometimes all that is available is a catalogue number, a family, a region or a date range, while other occurrences come with precise geolocation and images.

As mentioned before, GBIF’s system attempts, when possible, to link specimen-related occurrences published on GBIF to GRSciColl entries. This allows us to create metrics and dashboards for institutions and collections regardless of the way the data were provided to GBIF.

GRSciColl and GBIF have overlapping content: many of the collections listed in GRSciColl are also datasets on GBIF. However, the scope of GRSciColl goes beyond biodiversity data. For example, GRSciColl contains mineralogy and archaeology collections, but it doesn’t include things like citizen science observation data. See this blog post explaining the type of data that can be shared on GBIF.

This training material doesn’t cover how data should be shared on GBIF. If you are interested, consider reviewing the course on Biodiversity data mobilization. This Data Use Club recorded webinar is also a good introduction to how the data are shared and processed on GBIF.

How are GBIF occurrences linked to GRSciColl?

Only occurrences with one of the following basis of record values are shown on the GRSciColl website and linked to GRSciColl entries:

Preserved Specimen
Fossil Specimen
Living Specimen
Material Sample
Material Citation
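The selection above can be sketched as a simple membership check. This is only an illustration: the upper-case, underscore-separated spellings are assumed to follow the enum style of the GBIF API, and `is_linkable` is a hypothetical helper, not part of any GBIF library.

```python
# Basis-of-record values that are candidates for GRSciColl linking.
# Assumption: upper-case enum spellings as used by the GBIF API.
LINKABLE_BASES = {
    "PRESERVED_SPECIMEN",
    "FOSSIL_SPECIMEN",
    "LIVING_SPECIMEN",
    "MATERIAL_SAMPLE",
    "MATERIAL_CITATION",
}


def is_linkable(basis_of_record: str) -> bool:
    """Return True if an occurrence with this basis of record is a linking candidate."""
    # Accept both "Preserved Specimen" and "PRESERVED_SPECIMEN" spellings.
    normalized = basis_of_record.strip().upper().replace(" ", "_")
    return normalized in LINKABLE_BASES
```

For example, `is_linkable("Preserved Specimen")` is true, while an observation-type value such as `"HUMAN_OBSERVATION"` is not selected.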

Every time one of these occurrences is processed by GBIF (when it is published or re-interpreted), GBIF attempts to match it to GRSciColl entries by using the GRSciColl lookup service and a cached version of GRSciColl.

Using a cached version of GRSciColl means that the data used to interpret occurrences may not be the most up to date. There is often a lag of a few days between an update in GRSciColl and the version that is used to interpret occurrences on GBIF. Sometimes, a few days after a GRSciColl update, you or a GBIF administrator will need to trigger the republishing or re-interpretation of the occurrences to ensure that they are linked to the correct entries.

The GRSciColl lookup service will attempt to match occurrences based on the following:

Information associated with the occurrence	Information associated with GRSciColl entries
collectionCode	Collection code and alternative codes
institutionCode	Institution code and alternative codes
ownerInstitutionCode	Institution code and alternative codes
collectionID	Collection identifiers, UUID and URL(s)
institutionID	Institution identifiers, UUID and URL(s)
Dataset key	Collection and institution occurrence mappings
Publishing country	Collection and institution country in addresses
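To make the table concrete, here is a minimal sketch of how the occurrence-side fields could be turned into a query for the lookup service. The endpoint URL and the parameter names (`institutionId`, `collectionId`, `datasetKey`, `country`) are assumptions that should be checked against the GBIF API documentation, and `build_lookup_url` is a hypothetical helper. The function only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumption: location of the GRSciColl lookup service in the GBIF API.
LOOKUP_URL = "https://api.gbif.org/v1/grscicoll/lookup"

# Assumed query parameter names, one per occurrence field in the table.
LOOKUP_PARAMS = [
    "institutionCode",
    "ownerInstitutionCode",
    "institutionId",
    "collectionCode",
    "collectionId",
    "datasetKey",
    "country",
]


def build_lookup_url(occurrence: dict) -> str:
    """Build a lookup URL from the occurrence fields that are actually filled in."""
    params = [(name, occurrence[name]) for name in LOOKUP_PARAMS if occurrence.get(name)]
    return f"{LOOKUP_URL}?{urlencode(params)}"


url = build_lookup_url(
    {"institutionCode": "ABC", "collectionCode": "ENT", "country": "DK", "datasetKey": None}
)
```

Empty or missing fields are left out, so the service only receives the information the occurrence really carries.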

Detailed illustration of the GRSciColl lookup

There are two ways to ensure that occurrences get matched to the correct entries:

Using the same codes and identifiers in occurrences and GRSciColl
Adding explicit mappings in GRSciColl

Matching codes and identifiers

Occurrences are linked to GRSciColl if the values in the collectionCode, institutionCode, ownerInstitutionCode, collectionID and institutionID fields match the GRSciColl collection and institution codes and identifiers.

Any of the codes or alternative codes of a collection or institution can be used in the collectionCode and institutionCode fields. However, several collections or institutions can share the same code, so matches based on codes alone are not always reliable. When two GRSciColl entries have the same code, the algorithm compares the publishing country of the occurrence with the countries of the GRSciColl entries to try to disambiguate the match. This is why all the occurrences that are matched based on codes are flagged as “fuzzy” on GBIF.
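The country-based disambiguation can be illustrated with a simplified sketch. This is not GBIF's actual algorithm: `GrscicollEntry`, `match_by_code` and the returned status labels are hypothetical names chosen for the illustration, and each entry is reduced to a single country.

```python
from dataclasses import dataclass


@dataclass
class GrscicollEntry:
    key: str
    codes: set      # main code plus alternative codes
    country: str    # simplified: one country per entry


def match_by_code(code, publishing_country, entries):
    """Return (entry, status) for a code-only match; status is illustrative."""
    candidates = [entry for entry in entries if code in entry.codes]
    if len(candidates) > 1:
        # Several entries share the code: try to narrow down by country.
        same_country = [e for e in candidates if e.country == publishing_country]
        if len(same_country) == 1:
            candidates = same_country
    if len(candidates) == 1:
        # Code-based matches are always treated as fuzzy.
        return candidates[0], "fuzzy"
    if candidates:
        return None, "ambiguous"
    return None, "none"


entries = [
    GrscicollEntry("a", {"ENT"}, "DK"),
    GrscicollEntry("b", {"ENT", "INS"}, "FR"),
    GrscicollEntry("c", {"BOT"}, "DK"),
]
```

With these entries, the code "ENT" published from France resolves to entry "b", whereas the same code published from a country that matches neither entry stays ambiguous.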

In order to ensure unambiguous matches, we recommend using identifiers to link occurrences to GRSciColl entries. Any identifier associated with a GRSciColl collection or institution can be used in the collectionID and institutionID fields.

In addition to that, the UUID or the URL of a GRSciColl collection or institution can also be used.

Currently, the recommended and most declarative way to link data to institution or collection entities in GRSciColl is to use the full URL of the entity. For example, when using Darwin Core:

`dwc:institutionID`: `https://scientific-collections.gbif.org/institution/e3d4dcc4-81e2-444c-8a5c-41d1044b5381`

`dwc:collectionID`: `https://scientific-collections.gbif.org/collection/772f9e37-4643-452b-82b4-a06550283096`
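When the GRSciColl UUIDs are already known, the full URLs can be generated rather than typed by hand. The sketch below assumes the URL pattern visible in the two examples above; `grscicoll_url` is a hypothetical helper.

```python
import uuid

# URL pattern taken from the two examples above.
BASE_URL = "https://scientific-collections.gbif.org"


def grscicoll_url(entity_type: str, key: str) -> str:
    """Build the full GRSciColl URL for an institution or a collection."""
    if entity_type not in ("institution", "collection"):
        raise ValueError("entity_type must be 'institution' or 'collection'")
    # uuid.UUID raises ValueError on malformed keys and normalizes the format.
    normalized_key = str(uuid.UUID(key))
    return f"{BASE_URL}/{entity_type}/{normalized_key}"


institution_id = grscicoll_url("institution", "e3d4dcc4-81e2-444c-8a5c-41d1044b5381")
collection_id = grscicoll_url("collection", "772f9e37-4643-452b-82b4-a06550283096")
```

Validating the key before building the URL helps catch truncated or mistyped UUIDs before the data are published.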

Explicit mapping with the GRSciColl API

GRSciColl editors can use the GRSciColl API to create explicit mappings between GRSciColl collections and occurrences. Each mapping is attached to one GRSciColl entry and is defined by a dataset key and a code used in the occurrences. For example, a mapping can be added to an entomology collection to specify that all the occurrence records with the collection code “ENT” in a given dataset should be linked to that collection.

Here is an example in Python of adding an occurrence mapping to a GRSciColl entry (see also this tutorial).
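A minimal sketch of such a request, using only the standard library, could look like the following. The endpoint path (`occurrenceMapping`), the payload field names (`datasetKey`, `code`) and the use of HTTP Basic authentication are assumptions to verify against the GBIF API documentation. The dataset key and credentials are placeholders, and `build_mapping_request` is a hypothetical helper that prepares the request without sending it.

```python
import base64
import json
import urllib.request

# Assumption: base URL of the GRSciColl part of the GBIF API.
API_URL = "https://api.gbif.org/v1/grscicoll"


def build_mapping_request(entity_type, entity_key, dataset_key, code, username, password):
    """Prepare (but do not send) a POST request that adds an occurrence mapping."""
    url = f"{API_URL}/{entity_type}/{entity_key}/occurrenceMapping"  # assumed path
    payload = {"datasetKey": dataset_key, "code": code}              # assumed field names
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_mapping_request(
    "collection",
    "772f9e37-4643-452b-82b4-a06550283096",
    "00000000-0000-0000-0000-000000000000",  # placeholder dataset key
    "ENT",
    "editor_username",                       # placeholder credentials
    "editor_password",
)
```

Passing the prepared request to `urllib.request.urlopen(request)` would send it. Doing so requires a GBIF account with editing rights on the GRSciColl entry.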

Explicit occurrence mappings can be particularly useful when adjusting the links of datasets published by third-party systems like the European Nucleotide Archive or PLAZI.
