Linking occurrences

Video

In this video (07:14), GBIF Data Administrator Marie Grosjean describes how occurrence records are linked within GRSciColl. If you are unable to watch the embedded Vimeo video, you can download it from the Files for download page.

Video transcript


module2 section3 Slide1

Reminder: GBIF occurrences

module2 section3 Slide2

According to the Darwin Core documentation an occurrence is:

An existence of a dwc:Organism at a particular place at a particular time.
— http://rs.tdwg.org/dwc/terms/Occurrence

In the context of natural history collections, occurrences will often be digitized specimen records. The specimen can be a preserved specimen, a fossil specimen or a living specimen. But these aren’t the only types of GBIF occurrence records that can be found on GRSciColl. Some occurrence records come from environmental samples or have been generated by processing the information in books and research publications (see for example the work of PLAZI on GBIF).

Occurrences shared on GBIF.org can come from a wide range of data providers: museums, herbaria, journals, etc.

The breadth of information associated with a given occurrence can vary. Sometimes, all that is available is a catalogue number, a family, a region, a date range, etc., while other occurrences may be associated with precise geolocation and images.

module2 section3 Slide3

As mentioned before, GBIF’s system attempts, when possible, to link specimen-related occurrences published on GBIF to GRSciColl entries. This allows us to create metrics and dashboards for institutions and collections regardless of the way the data were provided to GBIF.

module2 section3 Slide4

GRSciColl and GBIF have overlapping content: many of the collections listed in GRSciColl are also datasets on GBIF. However, the scope of GRSciColl goes beyond biodiversity data. For example, GRSciColl contains mineralogy and archaeology collections but doesn’t include things like citizen science observation data. See this blog post explaining the type of data that can be shared on GBIF.

module2 section3 Slide6

This training material doesn’t cover how data should be shared on GBIF. If you are interested, consider reviewing the course on Biodiversity data mobilization. This Data Use Club recorded webinar is also a good introduction to how data are shared and processed on GBIF.

How are GBIF occurrences linked to GRSciColl?

module2 section3 Slide8

Only occurrences with one of the following basis of record values are shown on the GRSciColl website and linked to GRSciColl entries:

  • Preserved Specimen

  • Fossil Specimen

  • Living Specimen

  • Material Sample

  • Material Citation
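
As a rough illustration, the selection above can be written as a simple filter on the basis of record. This sketch assumes the upper-case enumeration names used by the GBIF API (for example `PRESERVED_SPECIMEN`); the function name `is_linkable` is hypothetical and not part of any GBIF library.

```python
# Basis-of-record values eligible for GRSciColl linking, written in the
# upper-case form used by the GBIF API (an assumption for this sketch).
LINKABLE_BASIS_OF_RECORD = frozenset({
    "PRESERVED_SPECIMEN",
    "FOSSIL_SPECIMEN",
    "LIVING_SPECIMEN",
    "MATERIAL_SAMPLE",
    "MATERIAL_CITATION",
})


def is_linkable(basis_of_record: str) -> bool:
    """Return True if this basis of record makes an occurrence a linking candidate."""
    # Accept both "Preserved Specimen" and "PRESERVED_SPECIMEN" spellings.
    normalized = basis_of_record.strip().upper().replace(" ", "_")
    return normalized in LINKABLE_BASIS_OF_RECORD
```

With this helper, `is_linkable("Preserved Specimen")` is true, while an observation-based value such as `HUMAN_OBSERVATION` is rejected.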

Every time one of these occurrences is processed by GBIF (when it is published or re-interpreted), GBIF attempts to match it to GRSciColl entries by using the GRSciColl lookup service and a cached version of GRSciColl.

Because a cached version of GRSciColl is used, the data used to interpret occurrences isn’t always the most up to date. There is often a lag of a few days between an update in GRSciColl and the version that is used to interpret occurrences on GBIF. Sometimes, you or a GBIF administrator will need to trigger the republishing or re-interpretation of occurrences a few days after a GRSciColl update to ensure that they are linked to the correct entries.

The GRSciColl lookup service will attempt to match occurrences based on the following:

| Information associated with the occurrence | Information associated with GRSciColl entries |
|---|---|
| collectionCode | Collection code and alternative codes |
| institutionCode | Institution code and alternative codes |
| ownerInstitutionCode | Institution code and alternative codes |
| collectionID | Collection identifiers, UUID and URL(s) |
| institutionID | Institution identifiers, UUID and URL(s) |
| Dataset key | Collection and institution occurrence mappings |
| Publishing country | Collection and institution country in addresses |

Figure: Detailed illustration of the GRSciColl lookup
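
To make the inputs of the lookup concrete, the sketch below builds a query URL from occurrence fields. It assumes that the lookup service is exposed at `https://api.gbif.org/v1/grscicoll/lookup` and accepts query parameters named after the occurrence fields (`institutionCode`, `collectionCode`, `datasetKey`, `country`, and so on); check both assumptions against the GBIF API documentation. The helper `build_lookup_url` and the country value are hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of the GRSciColl lookup service on the public GBIF API.
LOOKUP_URL = "https://api.gbif.org/v1/grscicoll/lookup"


def build_lookup_url(**fields: str) -> str:
    """Build a lookup URL from occurrence fields, skipping empty values."""
    params = {name: value for name, value in fields.items() if value}
    return f"{LOOKUP_URL}?{urlencode(params)}"


url = build_lookup_url(
    institutionCode="MMFG",  # code found in the occurrence record
    collectionCode="ENT",
    institutionId="",        # empty fields are left out of the query
    country="DK",            # publishing country, used to break ties
)
```

The resulting URL can be opened in a browser or fetched with any HTTP client to inspect which entries the service proposes.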

There are two ways to ensure that occurrences get matched to the correct entries:

  • Using the same codes and identifiers in occurrences and GRSciColl

  • Adding explicit mappings in GRSciColl

Matching codes and identifiers

module2 section3 Slide10

Occurrences are linked to GRSciColl if the values in the collectionCode, institutionCode, ownerInstitutionCode, collectionID and institutionID fields match the GRSciColl collection and institution codes and identifiers.

module2 section3 Slide11

Any of the codes or alternative codes of a collection or institution can be used in the collectionCode and institutionCode fields. However, several collections or institutions can have the same code, which means that matches based on codes alone are not always reliable. When two GRSciColl entries have the same code, the algorithm compares the occurrence publishing country with the countries of the GRSciColl entries to try to disambiguate the match. This is why all the occurrences that are matched based on codes are flagged as “fuzzy” on GBIF.
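
The disambiguation step can be pictured with a deliberately simplified sketch. This is not the actual GBIF matching algorithm: the `Institution` class, the `match_by_code` function and the sample entries are hypothetical, and the choice to return no match when the country does not settle a tie is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Institution:
    name: str
    codes: frozenset  # main code plus alternative codes
    country: str      # country code taken from the entry's address


def match_by_code(code, publishing_country, institutions):
    """Return (institution, is_fuzzy), or (None, False) if no single match remains."""
    candidates = [inst for inst in institutions if code in inst.codes]
    if len(candidates) > 1:
        # Several entries share the code: compare with the publishing country.
        candidates = [inst for inst in candidates if inst.country == publishing_country]
    if len(candidates) == 1:
        return candidates[0], True  # code-based matches are flagged as fuzzy
    return None, False


institutions = [
    Institution("Museum A", frozenset({"MMFG"}), "DK"),
    Institution("Museum B", frozenset({"MMFG", "MB"}), "FR"),
]
match, fuzzy = match_by_code("MMFG", "FR", institutions)
```

Here both entries carry the code “MMFG”, and the publishing country “FR” selects Museum B. The match is still marked as fuzzy because it rests on a code rather than an identifier.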

module2 section3 Slide12

In order to ensure unambiguous matches, we recommend using identifiers to link occurrences to GRSciColl entries. Any identifier associated with a GRSciColl collection or institution can be used in the collectionID and institutionID fields.

module2 section3 Slide13

In addition to that, the UUID or the URL of a GRSciColl collection or institution can also be used.

Currently, the recommended and most declarative way to link data to institutions or collection entities in GRSciColl is to use the full URL for the entity. For example, when using Darwin Core:

`dwc:institutionID`: `https://scientific-collections.gbif.org/institution/e3d4dcc4-81e2-444c-8a5c-41d1044b5381`
`dwc:collectionID`: `https://scientific-collections.gbif.org/collection/772f9e37-4643-452b-82b4-a06550283096`
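
When preparing a dataset, these URLs can be generated from the GRSciColl keys. The following is a minimal sketch that follows the URL pattern shown above; the helper name `grscicoll_url` is hypothetical.

```python
import uuid

BASE_URL = "https://scientific-collections.gbif.org"


def grscicoll_url(entity: str, key: str) -> str:
    """Return the full GRSciColl URL for an institution or collection key."""
    if entity not in ("institution", "collection"):
        raise ValueError(f"unknown entity type: {entity!r}")
    # uuid.UUID raises ValueError if the key is not a well-formed UUID.
    return f"{BASE_URL}/{entity}/{uuid.UUID(key)}"


# Darwin Core values for one occurrence record, using the keys from the example above.
record = {
    "institutionID": grscicoll_url("institution", "e3d4dcc4-81e2-444c-8a5c-41d1044b5381"),
    "collectionID": grscicoll_url("collection", "772f9e37-4643-452b-82b4-a06550283096"),
}
```

Validating the key before building the URL catches copy-and-paste errors early, since a malformed identifier would otherwise simply fail to match.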

Explicit mapping with the GRSciColl API

module2 section3 Slide14

GRSciColl editors can use the GRSciColl API to create explicit mappings between GRSciColl collections or institutions and occurrences. Each mapping belongs to a single GRSciColl entry and is defined by a dataset key and a code used in the occurrences. For example, a mapping can be added to an entomology collection to specify that all the occurrence records with the collection code “ENT” in a given dataset should be linked to that collection.

module2 section3 Slide15

Here is an example in Python of adding an occurrence mapping to a GRSciColl entry (see also this tutorial).
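
The sketch below is not the example from the original slides. It is a minimal illustration that assumes the mapping is created with a `POST` request to an `occurrenceMapping` endpoint under the collection or institution, with a JSON body containing `datasetKey` and `code`, and HTTP basic authentication with a GBIF account that has editing rights. The endpoint path, field names, dataset key and credentials are assumptions or placeholders to check against the GBIF API documentation. The request is only prepared, not sent.

```python
import base64
import json
import urllib.request

API = "https://api.gbif.org/v1/grscicoll"  # assumed API base URL


def build_mapping_request(entity, entry_key, dataset_key, code, user, password):
    """Prepare (without sending) a POST request that adds an occurrence mapping."""
    body = json.dumps({"datasetKey": dataset_key, "code": code}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{API}/{entity}/{entry_key}/occurrenceMapping",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_mapping_request(
    entity="collection",
    entry_key="772f9e37-4643-452b-82b4-a06550283096",
    dataset_key="00000000-0000-0000-0000-000000000000",  # placeholder dataset key
    code="ENT",
    user="your_gbif_username",      # placeholder credentials
    password="your_gbif_password",
)
# To actually create the mapping: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it easy to review the URL and body before anything is written to GRSciColl.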

Explicit occurrence mappings can be particularly useful when adjusting the links of datasets published by third-party systems like the European Nucleotide Archive or PLAZI.

Review

Quiz yourself on the concepts covered in this module.

True or False?

  1. Some collection content cannot be shared as occurrences on GBIF, for example meteorites.

    • True

    • False


  2. Only occurrences with the basis of record “Preserved Specimen” can be linked to GRSciColl entries.

    • True

    • False


  3. Occurrences are matched to a live version of GRSciColl data.

    • True

    • False


  4. Occurrences shared on GBIF can be aggregated in dashboards on the GRSciColl website.

    • True

    • False


  5. If two institutions based in the same country have the same code, “MMFG”, an occurrence with the value “MMFG” in the institutionCode field and no value in the institutionID field won’t be matched to any institution.

    • True

    • False


  6. If you use the URL of a GRSciColl collection entry in the collectionID field of an occurrence, the occurrence will be matched to the relevant collection in GRSciColl.

    • True

    • False


  7. If a GRSciColl institution is associated with the code “MMFG” and its collection is associated with the code “ENT”, you can use the code “MMFG” in the collectionCode field of an occurrence to match it to the collection.

    • True

    • False
