Data capture

In this module, you will learn about the concept of standards, in particular, the Darwin Core Standard and its components.

You will also learn the types of primary biodiversity data and how to best share that information within GBIF.

Lastly, you will review principles of data quality in the context of data capture, with a focus on coherence in areas such as georeferencing, dates, names and taxon cross-checking.

Standards and Darwin Core

In this presentation, you will learn how you interact with standards every day. Then you will be introduced to Biodiversity Information Standards, including the Darwin Core Standard, which you will continue to use throughout this course.

 

 

Presentation transcript

 

Slide 1 - Standards and Darwin Core

In this presentation, we will introduce you to data standards as they relate to biodiversity data. In particular, we will focus on the Darwin Core standard.

Slide 2 - Standards: Let’s agree to agree

The engineer and industrialist W. Edwards Deming said:

“Standardisation does not mean that we all wear the same colour and weave of cloth, eat standard sandwiches, or live in standard rooms with standard furnishing. Homes of infinite variety of design are built with a few types of bricks, and with lumber of standard sizes, and with water and heating pipes and fitting of standard dimensions.”

What he was trying to say is that standardization does not prevent us from being creative. He was also showing us that we already live with standards all around us.

As we move forward, we will define the term “Standard” and look at how we interact with standards every day. We will then introduce you to biodiversity information standards, including the Darwin Core Standard, which you will continue to use throughout this course.

Slide 3 - What is a standard?

So what is a Standard?

At its simplest, it is:

“An agreed way of doing something.”

Standards are combinations of norms, conventions, specifications, requirements, restrictions, and rules.

Slide 4 - Everyday standards

The main purpose of standards is to create a framework of “mutual understanding”. They should provide clarity and help communication.

Everyday examples that use standards to aid the communication of information include:

  • Units of Measurement

  • Numeral Systems

  • Alphabets

  • Languages

  • Emojis

  • Postal Addressing

  • Morse Code

  • Barcoding

Slide 5 - Everyday standard - An example

Let’s take a very specific example and break it down. Communicating a position on the Earth accurately and repeatably, as a latitude and longitude, actually relies on a combination of at least eight standards:

  • measurement - geographic coordinates

  • format - degrees, minutes, seconds

  • numeric system - sexagesimal

  • numbers - Indo-Arabic

  • language - English

  • alphabet - Latin

  • symbols - typography

  • font - Roboto
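
To make the “format” and “numeric system” layers concrete, here is a minimal Python sketch that converts a coordinate written in degrees, minutes and seconds into decimal degrees. The function name dms_to_decimal and the sample coordinate are illustrative assumptions, not something shown on the slide.

```python
def dms_to_decimal(degrees: int, minutes: int, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    # Sexagesimal system: 60 minutes per degree, 60 seconds per minute.
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # By convention, south and west are written as negative decimal degrees.
    return -value if hemisphere in ("S", "W") else value


# Hypothetical sample: 27 degrees, 30 minutes, 0 seconds north
print(dms_to_decimal(27, 30, 0, "N"))
```

The same position can therefore be written in two formats; agreeing on which one is used is exactly what a standard is for.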

Slide 6 - Rules and restrictions

Standards provide ways of constraining the array of possibilities. In the earlier Foundations presentations, you learned about types of data, schemas, formats and character encodings. Each of these can be used to constrain the array of possibilities within the terms of a Standard.

Types of data can restrict the values of a field: for example, alphanumeric text in a text field, decimals in a float, and yes/no in a Boolean.

An encoding schema can restrict the range of values in a field. For example, latitude values must fall within a range: between -90 and 90.

A format can restrict the representation of data in a field. For example, whether a date appears as year month day, or day month year, or month day year.

And then, character encoding provides the rules for interpreting bytes of data. For our purposes, we’ll use UTF-8.
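
The four kinds of restriction above can be sketched as small checks. This is only an illustration under our own assumptions: the function names are hypothetical, and year-month-day is chosen here as the date format simply to have one agreed order.

```python
from datetime import date, datetime


def check_latitude(text: str) -> float:
    """Data type (a float) plus allowed range (-90 to 90)."""
    value = float(text)  # raises ValueError if the text is not a number
    if not -90 <= value <= 90:
        raise ValueError(f"latitude out of range: {value}")
    return value


def check_date(text: str) -> date:
    """Format: the value must be written in year-month-day order."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def decode_utf8(raw: bytes) -> str:
    """Character encoding: interpret the bytes strictly as UTF-8."""
    return raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes


print(check_latitude("27.5"))
print(check_date("2021-03-14"))
print(decode_utf8("Tkalčić".encode("utf-8")))
```

Each check either returns a clean value or raises an error, which is one simple way of “constraining the array of possibilities”.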

Photo: Eurema blanda (Boisduval, 1836) Observed in Nepal by Bird Explorers (http://creativecommons.org/licenses/by-nc/4.0/)

Slide 7 - Standards for data transfer

During this course, you will be learning how to share your data. As such, you will encounter standards for transferring data.

An application schema allows for the combination of data standards for a specific purpose, for example, using Darwin Core terms within Darwin Core Archives. We’ll delve more into both in just a few moments.

Again, you will make use of a format, but this time the format restricts dataset structures. A dataset may make use of CSV, XML, JSON or RDF.

And lastly, you will have a transfer protocol, which provides information on how and where to send information. These might include HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), and SMTP (Simple Mail Transfer Protocol or, if you like, “send mail to people”).
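
As a small illustration of how a format restricts dataset structure, the sketch below writes the same record once as CSV and once as JSON using only the Python standard library. The record values and the helper names to_csv and to_json are assumptions made for this example.

```python
import csv
import io
import json

# One illustrative record; the keys are Darwin Core term names.
record = {"scientificName": "Eurema blanda", "country": "Nepal", "countryCode": "NP"}


def to_csv(rows):
    """Rows and columns: one header line, then one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Nested key/value structure: a list of objects."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


print(to_csv([record]))
print(to_json([record]))
```

The content is identical in both cases; only the structure imposed by the format changes.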

Slide 8 - Biodiversity Information Standards

In the domain of biodiversity informatics there are many standards already available that can help you work with your data. The USGS has a very specific definition of these:

“Data standards are the rules by which data are described and recorded. In order to share, exchange, and understand data, we must standardize the format as well as the meaning.”

The result of using these standards, where appropriate, is that you will increase data integrity, accuracy and consistency by clarifying ambiguous meaning and minimizing redundant data.

Those that you are likely to encounter or use on a regular basis may include:

  • Ecological Metadata Language Standard (EML)

  • Humboldt Ecological Inventory (Humboldt extension)

  • Global Genome Biodiversity Network (GGBN)

  • Ocean Data Standards and Best Practices Project (ODSBP)

And last, but not least, we’ll spend the rest of this discussion on Darwin Core. The Darwin Core standard will allow you to share your occurrence, taxonomic, and event datasets.

Slide 9 - What is Darwin Core

Darwin Core is a biodiversity standard developed by the biodiversity informatics community. It was originally developed under the Taxonomic Databases Working Group, or TDWG (pronounced “tadwig”). In recent years, the group has been renamed Biodiversity Information Standards, but the acronym persists, as the community is quite fond of the name TDWG.

“The standard includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.”

So in short, Darwin Core is a

“List of fields and their definitions, as they relate to biodiversity data.”

Slide 10 - Simple Darwin Core

As we dive deeper into Darwin Core, abbreviated DwC, you’ll learn that it is more than just a list of fields. We will use Simple Darwin Core, which is a predefined subset of the terms that have common use across a wide variety of biodiversity applications.

This subset contains more than 150 fields, grouped into the following classes:

  • Record & Dataset

  • Occurrence

  • Organism

  • Material Entity

  • Material Sample

  • Event

  • Location

  • Geological Context

  • Identification

  • Taxon

Additionally, there are two auxiliary classes called:

  • ResourceRelationship

  • MeasurementOrFact

From the Simple Darwin Core User Guide, “Simple Darwin Core is simple in that it assumes (and allows) no structure beyond the concept of rows and columns, which might be thought of as attributes and their values, or fields and records.”
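
Because Simple Darwin Core is just rows and columns, a first sanity check on a source table is to see which column names are not Darwin Core terms. The sketch below assumes a small, hand-picked subset of term names (the real list is much longer) and a made-up sample table.

```python
import csv
import io

# A small, hand-picked subset of Darwin Core term names (NOT the full list).
KNOWN_TERMS = {
    "occurrenceID", "basisOfRecord", "scientificName", "eventDate",
    "decimalLatitude", "decimalLongitude", "country", "countryCode",
}

# Made-up source data: the last column name is not a Darwin Core term.
SAMPLE = """occurrenceID,scientificName,eventDate,collector
occ-1,Macaca mulatta,2020-01-15,A. Person
"""


def unknown_columns(csv_text, known=KNOWN_TERMS):
    """Return the header names that are not in the known set of terms."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [name for name in reader.fieldnames if name not in known]


print(unknown_columns(SAMPLE))
```

A column flagged this way, such as collector, would need to be mapped to a Darwin Core term (for example recordedBy) before sharing.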

Photo: Macaca mulatta (Zimmermann, 1780) Observed in Nepal by Vladimir Tkalčić (http://creativecommons.org/licenses/by-nc/4.0/)

Slide 11 - Darwin Core Quick Reference Guide

The DwC Quick Reference Guide will soon become your go-to resource. This page provides a list of all currently recommended terms of the Darwin Core standard. Categories such as Occurrence or Event correspond to Darwin Core classes, which group other terms.

Slide 12 - DwC terms country and countryCode

We will now look at some examples of Darwin Core terms. The Quick Reference Guide consistently displays each term with its identifier, name, definition, comments and examples. The first terms we’ll review are country and countryCode within the Location category.

Generally, data holders have a field for country within their source data. But often this data can be quite messy, with misspellings, abbreviations, and historical names. However, it is one of the most easily standardized pieces of data. As noted in the comments, the recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. A controlled vocabulary places restrictions on the values that should be used for that term.

countryCode is a term that is generally not present in data holders’ source data. But again, it is a field that can easily be populated, thanks to the recommendation in the comments to use an ISO 3166-1-alpha-2 country code.

GBIF strongly recommends the sharing of countryCode within occurrence datasets. Sharing of country is also encouraged.

You will learn more about the GBIF requirements and recommendations in later sessions.
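
One simple way to clean a messy country field is a lookup from verbatim values to ISO 3166-1 alpha-2 codes. The sketch below is a toy example: the lookup table is hypothetical and deliberately tiny, and a real project would use a complete, maintained vocabulary.

```python
# Hypothetical lookup from messy verbatim values to ISO 3166-1 alpha-2 codes.
COUNTRY_CODES = {
    "nepal": "NP",
    "népal": "NP",
    "france": "FR",
    "laos": "LA",
    "lao pdr": "LA",
    "malaysia": "MY",
}


def to_country_code(verbatim):
    """Return the two-letter code for a verbatim country value, or None."""
    # Trim, lowercase and collapse repeated spaces before looking up.
    key = " ".join(verbatim.strip().lower().split())
    return COUNTRY_CODES.get(key)


for value in ["  NEPAL ", "Lao  PDR", "Atlantis"]:
    print(repr(value), "->", to_country_code(value))
```

Values that return None are the ones to review by hand, rather than guess.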

Slide 13 - DwC terms basisOfRecord

The next term is basisOfRecord, which defines the nature of each record in a dataset. It follows a controlled vocabulary: you can choose from PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, or Occurrence. GBIF requires basisOfRecord in published occurrence datasets.
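
A controlled vocabulary like this one is easy to enforce in code. The following sketch uses exactly the values listed above; the function name and its tolerance for spaces, underscores and letter case are our own assumptions about what source data might look like.

```python
# The controlled vocabulary as listed in this presentation.
BASIS_OF_RECORD = (
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen", "MaterialSample",
    "Event", "HumanObservation", "MachineObservation", "Taxon", "Occurrence",
)
_BY_LOWERCASE = {value.lower(): value for value in BASIS_OF_RECORD}


def normalize_basis_of_record(value):
    """Map a loosely written value onto the vocabulary, or raise ValueError."""
    key = value.replace(" ", "").replace("_", "").lower()
    if key not in _BY_LOWERCASE:
        raise ValueError(f"not a recognised basisOfRecord: {value!r}")
    return _BY_LOWERCASE[key]


print(normalize_basis_of_record("preserved specimen"))
print(normalize_basis_of_record("HUMAN_OBSERVATION"))
```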

Slide 14 - DwC terms occurrenceID

The last term we will review is occurrenceID. When publishing occurrence records, GBIF requires an occurrenceID. An occurrenceID is an identifier for the occurrence itself, not for the digital record of the occurrence. Recommended best practice is to use a globally unique identifier, otherwise known as a GUID. In the absence of a GUID, a unique identifier can be composed of other identifiers within the dataset. There are tools on the internet that can help you generate GUIDs for your records. If you use this method, these GUIDs should become a permanent field within your source data, identifying each record. For the practice in this course, you will create an occurrenceID with a format similar to the third example shown in the blue box on the slide.
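
Both approaches can be sketched in a few lines of Python. The first function generates a random GUID with the standard uuid module; the second builds an identifier from other identifiers in the dataset. The “urn:catalog:” prefix and the three-part pattern are one convention we assume here for illustration, and the sample codes are made up.

```python
import uuid


def new_guid():
    """Generate a random GUID (store it permanently in the source data)."""
    return str(uuid.uuid4())


def composite_occurrence_id(institution_code, collection_code, catalog_number):
    """Build an identifier from three existing identifiers (assumed pattern)."""
    parts = [str(p).strip() for p in (institution_code, collection_code, catalog_number)]
    if any(not part for part in parts):
        raise ValueError("all three parts are needed to keep the identifier unique")
    return "urn:catalog:" + ":".join(parts)


print(new_guid())
# Hypothetical institution, collection and catalogue number:
print(composite_occurrence_id("XYZ", "Bird", "12345"))
```

Whichever approach you choose, generate the identifier once and keep it; regenerating it on every export would defeat its purpose.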

Slide 15 - Darwin Core extensions

In using Simple Darwin Core, you may discover that you have more data to share but cannot find corresponding terms in DwC. This data could be image or sound files. Perhaps you are responsible for a collection of vertebrates and have compiled extensive data on the weights and sizes of the specimens. Or you may hold detailed historical information on the identification of a taxon. When this occurs, you will look to Darwin Core extensions, which let you extend the base data by providing additional files that correspond to it. Extensions that would meet your needs in these three examples are:

  • Simple Multimedia

  • Measurements or Facts

  • Identification History

There are many more extensions. GBIF maintains a list of all approved and draft extensions on its tools subsite.

Photo: Aleuria aurantia (Pers.) Fuckel Observed in Nepal by Elizabeth Byers http://creativecommons.org/licenses/by-nc/4.0/

Slide 16 - Community and standards relationships

There are many layers in our biodiversity informatics community. The image here shows the relationships between these layers, where they intersect with Darwin Core, and where extensions may be necessary in order to fully share data.

Slide 17 - Darwin Core Archives (DwC-A)

Data shared to GBIF is currently submitted via a Darwin Core Archive or DwCA.

A DwCA is an expression of the Darwin Core text guide. It is a compressed file containing a minimum of three files. It is encoded as UTF-8.

In this example, these three files are:

  • A data file (occurrence.txt) conforming to Simple Darwin Core, in CSV format, where the first row contains Darwin Core standard term names.

  • A meta file (meta.xml) in XML format containing technical details that instruct a computer how to use the data file.

  • A metadata file (eml.xml) in XML format containing explanatory details about the records in the data file, to help a user decide whether the data will be fit for their use.

A more complex structure can be obtained by sharing multiple related CSV files to extend the data. They are related to the core file by way of a unique ID. In an occurrence dataset, these related CSV files are linked by the occurrenceID.
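
To show what “a compressed file containing a minimum of three files” looks like in practice, here is a rough Python sketch that zips the three files in memory. It is not how you will normally produce an archive, and it rests on several assumptions: the sample row is made up, the meta.xml attributes are written from our reading of the Darwin Core text guide and should be checked against it, and the EML content is only a placeholder, not valid EML.

```python
import io
import zipfile

# Data file: the first row holds Darwin Core term names (sample values are made up).
OCCURRENCE_TXT = (
    "occurrenceID,basisOfRecord,scientificName,countryCode\n"
    "example-occ-0001,HumanObservation,Eurema blanda,NP\n"
)

# Descriptor: maps each column index of occurrence.txt to a Darwin Core term.
META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        fieldsEnclosedBy='"' ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
  </core>
</archive>
"""

# Placeholder only: a real archive needs a complete EML metadata document.
EML_PLACEHOLDER = "<!-- replace with a real EML document -->\n"


def build_archive(occurrence_txt, meta_xml, eml_xml):
    """Return the bytes of a zip file holding the three files, encoded as UTF-8."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", occurrence_txt.encode("utf-8"))
        archive.writestr("meta.xml", meta_xml.encode("utf-8"))
        archive.writestr("eml.xml", eml_xml.encode("utf-8"))
    return buffer.getvalue()


archive_bytes = build_archive(OCCURRENCE_TXT, META_XML, EML_PLACEHOLDER)
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(archive.namelist())
```

In this course, a publishing tool builds these files for you; the sketch is only meant to demystify what is inside the archive.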

Slide 18 - Updates to DwC and moving to the Darwin Core Data Package (DwC-DP)

So while Darwin Core Archives have been our preferred method of publishing since 2012, we are approaching a new model called a Darwin Core Data Package.

The new model is intended to “broaden the range of scientific questions that GBIF can address.”

It will allow us to expand our data scope, engage with new data communities, and build tools to enable data flows based on the updated standard.

Slide 19 - Recapping the process to date

The community has been working on this model since 2022, and the corresponding standards development has been under way since 2023.

We are finally in the community review and ratification phase for the new terms and the new data package, which was released in October of 2025. As you can see from this timeline, it takes a long time to implement change to an established standard, and it takes a dedicated community to see it through.

Slide 20 - Changes proposed in Darwin Core

The changes under public review include:

  • 65 new terms

  • 75 proposed changes to existing terms

  • A documented conceptual model for DwC

  • Guidelines for using the Frictionless Data Format

Slide 21 - A DwC-A short story OR a DwC-DP epic novel?

The Darwin Core Data Package greatly expands how much data can be shared, taking a Darwin Core Archive short story to a Darwin Core Data Package epic novel.

For comparison, stories can be told with either DwC-A or DwC-DP, but it should be evident that the range of scientific endeavor that can be faithfully captured by DwC-DP is much more vast.

Bee, flower: Park Jisun; DNA: Luvdat; Book: zero_wing; Net: shin_icons; Person: Maxim Basinski Premium; Test tube: Freepik; Ruler: Freepik; Camera: Freepik; Identification: Freepik

Slide 22 - Do I have to do all that?

All of this might be overwhelming when you are just learning about data mobilization, but we want you to know that changes are coming which, depending on your data, might be very exciting for you.

We also recognize that not everyone needs to do all of that, especially within one dataset, and publishers will be able to use only what they need.

For the time being, this course focuses on training using the event core, which you will learn more about in the coming sessions.

When the ratification process has completed and we are ready for publishers to begin using the Darwin Core Data Package, we will organize virtual community events to supplement training.

Slide 23 - Why use Darwin Core?

So as we conclude, we’ve covered what Darwin Core is, and hopefully you’ve started to develop a sense of why you should use it.

It is a standard, and standards are good! Standards provide us with the rules and protocols we need to share our data with others.

Darwin Core also provides us with a common language. As we saw in the Foundations – Documentation presentation, source data can be tricky when trying to compare datasets. The fields in your source data might be different from the fields in another institution’s data source. When we all use Darwin Core to share our data, we understand that data has been shared with a common language.

Slide 24 - Why use Darwin Core?

And it’s not only the data holders who understand this common language; the data users do as well. And after all, what could be better than a user finding a dataset that is fit for their use, shared in a common language, that allows them to do better science?

Slide 25 - Conclusion

This is part of a series of presentations used in the GBIF Biodiversity Data Mobilization course. The biodiversity data mobilization curriculum was originally developed as part of the Biodiversity Information Development Programme funded by the European Union.

This presentation was originally created by Paula Zermoglio and John Wieczorek with additional contributions by Sharon Grant, Sophie Pamerlon, Laura Anne Russell, Cecilie Svenningsen and Dag Endresen, BID and BIFA Trainers, Mentors and students.

This presentation has been narrated by Laura Anne Russell.

Exercise 1a

For this activity, you will examine verbatim field names and match them to Darwin Core terms.

  1. Find the Darwin Core term at https://dwc.tdwg.org/terms/ that best matches the field names.

  2. Download UC-Practice-exercise-sheet_EN.docx to provide your answers.

GBIF dataset types for primary biodiversity data

In this presentation, you will review primary biodiversity data that can be shared within GBIF.

 

 

Presentation transcript

 

Slide 1 - GBIF dataset types for primary biodiversity data

In this presentation, we’ll have a look at the different types of data that can be called ‘primary biodiversity data’ and shared within GBIF. These data can be complex and have different origins; we’ll see how they can be structured into one of the three data classes accepted by GBIF, based on the Darwin Core data sharing standard that is used within the GBIF community.

Slide 2 - Data richness levels supported by GBIF

GBIF currently supports four types of dataset:

The first is dataset metadata – this is a dataset that allows you to provide descriptive information about a dataset. You may use this type when you have not yet digitized your collection.

The second type is the species checklist. This allows you to share information on species, including the countries and areas where they are found.

The third type is for occurrence-only data. This is data for species that includes names, dates and coordinates – the what, when and where of your data.

The last type is Sampling-event data. This allows you to share even more data. You can share species with dates, coordinates, methods, abundance and even absence.

Slide 3 - Checklists and taxonomical resources

The first type of dataset that can be published within GBIF is the checklist, aka a list of species or higher taxa. In a checklist, the main unit is the taxon, not the occurrence (individual).

Slide 4 - Checklists and other taxonomical resources

Checklists and, more generally, taxonomical datasets can vary in scope:

  • A list of species present in a protected area is a checklist.

  • A national or thematic list of taxa (e.g. Butterflies of Laos, or the Flora of Malaysia), which sometimes provides more information on the distribution, taxonomical hierarchy and synonyms of the taxa, is a checklist.

  • IUCN red lists (at the international, national or regional levels) are also checklists and can give more information about the vulnerability status of the listed taxa.

Slide 5 - GBIF template for taxon data

GBIF provides data publishers with templates for each class of dataset. The GBIF template for checklists (taxon data) allows you to share information linked to each species or taxon: its ID, full name and authorship, relationship to other taxa, geographical details and so on. In this template, one line represents one taxon, and a taxon can appear only once in the dataset (otherwise it will create a duplicate).
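
The “one line, one taxon” rule can be checked before publishing. The sketch below looks for taxonID values that appear more than once; the sample rows and identifiers are made up for the example.

```python
from collections import Counter


def duplicate_taxon_ids(rows):
    """Return the taxonID values that appear on more than one row."""
    counts = Counter(row["taxonID"] for row in rows)
    return sorted(taxon_id for taxon_id, n in counts.items() if n > 1)


# Made-up checklist rows; the first taxon is listed twice by mistake.
checklist = [
    {"taxonID": "t-001", "scientificName": "Eurema blanda"},
    {"taxonID": "t-002", "scientificName": "Macaca mulatta"},
    {"taxonID": "t-001", "scientificName": "Eurema blanda"},
]

print(duplicate_taxon_ids(checklist))
```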

Slide 6 - Specimens and natural history collections materials

The second type of data that can be published within GBIF is data coming from natural history or scientific collections, also known as specimens. Unlike checklists, the main unit in a specimen collection is the specimen (a single organism or group of organisms), and you can have multiple specimens from the same species or taxon group.

Slide 7 - Specimens and collections materials

Collections and specimens differ vastly in number, size and preservation method. However, any kind of collection that holds preserved specimens of living organisms can be captured and shared within GBIF, for example herbarium or zoological collections (including fossils). Geology and human artifact collections do not fall under the GBIF scope.

Slide 8 - Literature

Biodiversity data can also be found in technical and scientific literature. It is possible to compile and share this kind of data within GBIF, but you should be extra cautious not to share duplicate data (for example, a specimen described in a scientific article might already be present on GBIF.org in the dataset of its collection).

Slide 9 - Examples of literature documents

In the absence of digitized datasets, biodiversity data can be extracted and compiled from scientific articles, PhD or master’s theses, reports and other documents. ALWAYS contact the data owner first when compiling literature data to ask for permission to publish them to GBIF.

Slide 10 - Fieldwork records and notes

A large majority of the data available on GBIF.org now comes from fieldwork records. These data can be observed and/or collected using different protocols in the field, by scientists, naturalists or amateur citizen-scientists via programmes and apps.

Slide 11 - Fieldwork records and notes

These field data can be compiled in survey reports, environmental impact assessment studies, field notes and other logs. They can be linked to other types of data, such as checklists, which are complementary to occurrence data observed or collected in the wilderness.

Slide 12 - GBIF template for occurrence data

Occurrence data represents the GBIF class of datasets with the largest number of records on GBIF.org. Collection (aka specimens), literature and fieldwork data can be shared within GBIF using this template, which focuses on the observed individual or collected specimen. In this kind of dataset, multiple individuals or specimens can be recorded for a single taxon, as long as they each have a unique identifier. Other fields for occurrence data include where, when, how and by whom each occurrence was observed and/or collected in the field.

Slide 13 - Event data

Event data is a class of datasets that was added on GBIF.org in 2015. It allows data publishers to share more information about the context of the observation or collection of specimens, and to group occurrences linked to the same event.

Slide 14 - GBIF template for event data

The GBIF template for Event data allows data publishers to share more information about the context of a biodiversity data collecting/recording event, such as camera traps, insect traps, botanical relevés, birding sites and so on. Its structure is a bit more complex than that of taxonomical and occurrence datasets, as it involves two files (e.g., two tabs in a spreadsheet): one describing the ‘events’ (e.g. each trap) and the other describing the specimens or occurrences linked to each event.
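
The link between the two files is the event identifier. As a rough illustration, the sketch below groups occurrences under their event via eventID and reports any occurrence that points to an event missing from the events table. All sample identifiers and values are made up.

```python
from collections import defaultdict

# Made-up "events" table: one row per trap.
events = [
    {"eventID": "trap-1", "samplingProtocol": "insect trap", "eventDate": "2021-06-01"},
    {"eventID": "trap-2", "samplingProtocol": "insect trap", "eventDate": "2021-06-01"},
]

# Made-up "occurrences" table: each row points to an event through eventID.
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "trap-1"},
    {"occurrenceID": "occ-2", "eventID": "trap-1"},
    {"occurrenceID": "occ-3", "eventID": "trap-9"},  # no such event
]


def occurrences_by_event(events, occurrences):
    """Group occurrenceIDs by eventID; also return IDs whose event is unknown."""
    known = {event["eventID"] for event in events}
    grouped = defaultdict(list)
    orphans = []
    for occ in occurrences:
        if occ["eventID"] in known:
            grouped[occ["eventID"]].append(occ["occurrenceID"])
        else:
            orphans.append(occ["occurrenceID"])
    return dict(grouped), orphans


grouped, orphans = occurrences_by_event(events, occurrences)
print(grouped)
print(orphans)
```

An event with no occurrences, such as trap-2 here, is still worth keeping in the events table, since an empty trap is information too.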

Slide 15 - Other origins of data

Sometimes, additional information or data can be found using other source materials than those previously mentioned in this video.

Slide 16 - Other origins of data

These other sources include remote sensing tools, maps and satellite data, and other media such as pictures, audio and video recordings of wildlife. Most of the time, the information retrieved this way is supplemental to existing data but can provide valuable information about individuals and their location, habitat, migratory routes, etc.

Slide 17 - Origins of data and Darwin Core concepts

In the following slides, we will see how the Darwin Core standard works with different types of data, and how to choose the dataset class that best fits a given dataset.

Slide 18 - Darwin Core standard: cores and extensions

As you learned in the Darwin Core presentation, Darwin Core is a data sharing standard developed by the Biodiversity Information Standards (TDWG) organization, and is widely used to publish data within the extended GBIF community. Depending on the type of dataset with which you work, you’ll have to choose one of its three Cores (Occurrence, Taxon or Event Core), which can be linked to one or more Extensions to share more information (such as multimedia links, variables, identification history and so on). In some cases, a core can be used as an extension linked to another core: that is the case when publishing Event data, where one can use the Event Core to describe the sampled areas and then use the Occurrence Core as an Extension to share the specimens or individuals recorded.

Slide 19 - Darwin Core standard: cores and extensions

Once the data has been standardized using the Darwin Core standard, each class of dataset (metadata-only, sampling event, taxonomic or occurrence datasets) can be published on GBIF.org using one of the tools shown in this figure. The most frequently used tool within the GBIF community is the GBIF-developed Integrated Publishing Toolkit (IPT), which transforms the source data into Darwin Core Archives that are then harvested and indexed by GBIF. The data are then searchable and downloadable via GBIF.org.

Slide 20 - Data quality requirements associated with each class

GBIF’s data quality requirements describe what you should provide for each dataset class. It doesn’t mean that the data won’t be indexed if some values are missing, but these requirements summarize what can be considered meaningful information for each class. You can of course share more.

Some fields are required (such as occurrenceID, taxonID or eventID, depending on the type of dataset); others are optional but strongly recommended (such as coordinates in decimal degrees). You can find these requirements on GBIF.org.
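
As a simple illustration of the identifier requirement only, the sketch below reports which rows lack the identifier that matches their dataset class. The class labels used as dictionary keys are our own shorthand, and the full list of required and recommended fields should be taken from GBIF.org rather than from this example.

```python
# Identifier term per dataset class, as named in this presentation.
# The class labels (keys) are shorthand chosen for this example.
ID_TERM = {
    "occurrence": "occurrenceID",
    "checklist": "taxonID",
    "sampling-event": "eventID",
}


def rows_missing_identifier(dataset_class, rows):
    """Return the positions (0-based) of rows with an empty or absent identifier."""
    term = ID_TERM[dataset_class]
    return [i for i, row in enumerate(rows) if not (row.get(term) or "").strip()]


# Made-up rows: the second has a blank identifier, the third has none at all.
sample = [
    {"occurrenceID": "occ-1", "scientificName": "Eurema blanda"},
    {"occurrenceID": "  ", "scientificName": "Macaca mulatta"},
    {"scientificName": "Aleuria aurantia"},
]

print(rows_missing_identifier("occurrence", sample))
```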

Slide 21 - How to choose a dataset type?

Choosing a dataset class can take some time, especially if you’re new to data publishing within GBIF. The GBIF helpdesk team created this useful flowchart which can be found on the GBIF Data Blog if you need further support to decide.

Slide 22 - Discussion

In order to be able to publish data to GBIF, it is important to understand the differences between the dataset classes at the early stages of the data capture and data management process. You can reflect on this topic with the following questions:

What type of data do you work with?

Is your data type different from what you initially thought?

How would you publish them to GBIF? (using which Core and/or extension?)

Slide 23 - Conclusion

This is part of a series of presentations used in the GBIF Biodiversity Data Mobilization course. The biodiversity data mobilization curriculum was originally developed as part of the Biodiversity Information Development Programme funded by the European Union.

This presentation was originally created and narrated by Sophie Pamerlon with additional contributions by BID and BIFA Trainers, Mentors and Students.

Exercise 1b

Quiz yourself on the concepts covered in the dataset types presentation. There may be multiple correct answers for some questions.

You can learn more about the answers in the Solutions appendix.

Following the quiz, you will consider the data type of your own data.

  1. Which dataset type(s) would you choose for an ichthyological collection?

    • occurrence

    • checklist

    • sampling event

    Eutrigla gurnardus (Linnaeus, 1758) | Muséum d’histoire naturelle de Nice

  2. Which dataset type(s) would you choose for a list of invasive species?

    • occurrence

    • checklist

    • sampling event

    Water hyacinth (Eichhornia crassipes) observed in Bourail, New Caledonia, where it is listed as an introduced and invasive species by GRIIS. Photo by Gérard (2016), licensed under CC BY-SA 2.0

  3. Which dataset type(s) would you choose for the flora and fauna of an environmental impact assessment?

    • occurrence

    • checklist

    • sampling event

    Environmental impact assessments are carried out by experts to evaluate the biodiversity and biotopes of a given area before, during and after it is affected by human activities (road works, wind turbine installations, mining, building construction, etc.).

    An entomologist chasing butterflies by Matthieu Gauvain (CC-BY-SA)

  4. Which dataset type(s) would you choose for bird tracking data?

    • occurrence

    • checklist

    • sampling event

    Bird tracking data are recorded using specific devices, such as GPS trackers attached to live birds, allowing scientists to follow their migratory routes or breeding sites.

    Griffon vulture observed at the Gamla Nature Reserve by מינוזיג - MinoZig (CC0)

  5. Which dataset type(s) would you choose for insect trap data?

    • occurrence

    • checklist

    • sampling event

    Insect trap by miheco (CC-BY-SA)

  6. Which dataset type(s) would you choose for national park management data?

    • occurrence

    • checklist

    • sampling event

    Data acquired through the management of protected areas (such as national parks, but also smaller nature reserves) can be diverse and come from different origins: botanical surveys, tracking of tagged animals, observations made by managers and rangers, and even “citizen science” data or data inferred from images shared on social media.

    Sri Lankan elephants observed by pen_ash.

  7. Which dataset type(s) would you choose for a bioblitz carried out as part of a citizen science programme?

    • occurrence

    • checklist

    • sampling event

    Citizen science data are often collected through themed fieldwork days known as “bioblitzes”. Volunteers typically gather in a given area and spend the day trying to observe and identify as many species as they can in that area.

    Each participant’s data are entered and aggregated in the citizen science programme’s data entry or data management tool.

    QDataTypes citizen
    Looking for birds with park staff by US National Park Service (authorized reuse on google image search)
    • occurrence

    • checklist

    • sampling event

  8. Which dataset type(s) would you choose for a regional species list?

    • occurrence

    • checklist

    • sampling event

    QDataTypes threatened
    Black rhino observed at the Magdeburg Zoo in Germany by Mani300
    • occurrence

    • checklist

    • sampling event

For this activity, consider the data that you plan to share with GBIF.

Discussion

  • What type of data do you work with?

  • Is your dataset type different from what you originally thought?

  • How would you publish your data to GBIF (using which core and/or extension)?

  • Use the exercise worksheet you downloaded earlier to record your answers.

Data capture, processing and quality

In this presentation, you will explore the principles of data quality applied to data capture, specifically when capturing data from collection labels, fieldwork notebooks, spreadsheets, etc.

 

 

Presentation transcript

 

capture quality Slide1

Slide 1 - Data capture, processing and quality

This presentation is based on “Principles of Data Quality” by Arthur Chapman.

capture quality Slide2

Slide 2 - Structure

During this presentation, we will explore the principles of data quality applied to data capture, specifically when capturing data from collection labels, fieldwork notebooks, spreadsheets, etc.

capture quality Slide3

Slide 3 - Develop a data processing and quality workflow

Data quality is essential at every step of the data mobilization process, especially in the data capture steps.

Each person involved in the data capture has a share of responsibility regarding the data quality, but most decisions on this topic have to be taken at the institutional level.

The keywords here are: planning and documentation!

As mentioned in the Foundations documentation presentation, use existing standards and plan your workflows to match your goals; document everything you can, at every step, and share or re-use documents, data, tools and standards as much as possible.

capture quality Slide4

Slide 4 - Data processing and quality workflow

This is an example of a data quality workflow.

This workflow begins with the collection of specimens, moves to data capture, then quality control, publishing and finally use.

Data quality isn’t the sole responsibility of the first person in the process (here, the collector); it is shared at every step, and every person in the process bears some responsibility for quality.

A functioning feedback loop needs to be in place in order to check, complete, update or correct data.

This is where documentation is essential: you need to know who was responsible for each step in the process in order to validate changes that have been made to the data (or need to be made to the data).

capture quality Slide5

Slide 5 - Data processing and quality responsibilities

In this simplified view of the data flow, you can see some of the data quality responsibilities of each group of people involved.

In this example, the team in charge of the mobilization can be split into the ‘transcribers’ and ‘curator’ roles.

The transcriber team needs to ensure data is captured and saved as well as possible, while the curator has the ultimate responsibility for ensuring that each team fulfils its role in the process.

User picture: www.gbif.org (WSC Cambodia camera trap dataset search results)

capture quality Slide6

Slide 6 - Structure

Once the data workflow is in place, the data capture itself can begin.

In the following slides, we will explore the different types of information that can be captured from specimens or field observations, and look at the most common mistakes to avoid with each kind of information.

The main topics will be as follows: taxonomic information, spatial information, collection information and descriptive information.

Please note that each occurrence (each row in your database or spreadsheet) should have information linked to these four main topics so that it can be shared and reused properly.

capture quality Slide7

Slide 7 - Taxonomic information: vocabularies and concepts

Taxonomic information is an essential part of the data capture process.

Without it, a digitized specimen is useless and cannot be properly interpreted or reused.

Note that the species name is not the only type of taxonomic information that can be exploited in the data capture process: sometimes the specimen hasn’t been identified to the species, and higher taxonomic levels such as the genus or family are still useful for data managers and users.

capture quality Slide8

Slide 8 - Taxonomic information: be careful with names!

Most of the time, the scientific name is the main way of retrieving data in a database, portal, website, browser, etc.

Any error in the spelling or authority can lead to wrong or empty query results, thus impeding the management and potential reuse of the data.

This is why it’s very important to check all categories of scientific names to fix errors and/or omissions.

capture quality Slide9

Slide 9 - Taxonomic information: common mistakes to avoid

The most common issues occurring with taxon information are missing or inconsistent information, incorrect or non-atomic values, duplicates and uncertainty.

Always check the definitions and examples of taxonomic Darwin Core terms to avoid nomenclatural mistakes: http://rs.tdwg.org/dwc/terms/index.htm
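
As a rough illustration of the issues listed above, the following Python sketch flags a few of them in a list of records keyed by the Darwin Core terms scientificName and genus. The function name find_taxon_issues and the specific checks are assumptions made for this example, not a GBIF or TDWG tool, and they cover only missing names, stray whitespace, non-atomic genus values and inconsistent spellings of the same name.

```python
from collections import defaultdict


def find_taxon_issues(records):
    """Return a list of (row_index, message) for simple taxonomic problems.

    Hypothetical helper for illustration; row_index is None for issues
    that concern several rows at once.
    """
    issues = []
    variants = defaultdict(set)  # normalised name -> spellings seen
    for i, rec in enumerate(records):
        name = rec.get("scientificName") or ""
        genus = rec.get("genus") or ""
        if not name.strip():
            issues.append((i, "missing scientificName"))
            continue
        collapsed = " ".join(name.split())
        if name != collapsed:
            issues.append((i, "extra whitespace in scientificName"))
        # A genus is a single word; anything more is a non-atomic value.
        if len(genus.split()) > 1:
            issues.append((i, "non-atomic genus"))
        variants[collapsed.lower()].add(name)
    # The same name written in several ways is an inconsistency
    # (and a likely source of hidden duplicates).
    for spellings in variants.values():
        if len(spellings) > 1:
            issues.append(
                (None, "inconsistent spellings: " + ", ".join(sorted(spellings)))
            )
    return issues
```

Checks like these only catch formatting problems; whether a name is nomenclaturally correct still has to be verified against a taxonomic reference.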

capture quality Slide10

Slide 10 - Spatial information: vocabulary and concepts

Geographic information proves to be valuable in a lot of data re-use contexts, such as niche modelling or studies about species distribution.

While ‘old’ collections or specimens can be understandably difficult, if not impossible, to geolocate precisely, it is recommended to share precise coordinates or textual information when possible.

Coordinates should be recorded directly in the field when possible, along with the uncertainty and the geodetic datum used. Otherwise, use relevant and verified sources to geolocate your data.

You should note that coordinates or other geographic information can be generalized or not even shared at all in some contexts, such as with the conservation of sensitive species.

capture quality Slide11

Slide 11 - Spatial information: what are we talking about?

Spatial information can be found in numerous formats, not only geographic coordinates: examples include (but are not limited to) grid data, point+radius, or polygons.

Each of them is useful to share in order to check the consistency of the geographical elements (for example, coordinates versus country code, or whether a given locality is consistent with a collector’s travels).

capture quality Slide12

Slide 12 - Spatial information: a few more definitions

Within GBIF, it is recommended to share the geodetic datum that was used to derive the shared coordinates (decimal latitude and longitude).

In the absence of a specific geodetic datum, GBIF will assume WGS84 by default.
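
Before publishing, it can help to list the records that have coordinates but no datum, so that the correct value can be filled in from the field notes. The Python sketch below is one possible way to do that; the function name records_missing_datum is hypothetical, and the records are assumed to be dictionaries keyed by the Darwin Core terms decimalLatitude, decimalLongitude and geodeticDatum.

```python
def records_missing_datum(records):
    """Return the indexes of records that have coordinates but no geodeticDatum.

    Hypothetical helper for illustration only.
    """
    missing = []
    for i, rec in enumerate(records):
        has_coords = (
            rec.get("decimalLatitude") is not None
            and rec.get("decimalLongitude") is not None
        )
        datum = (rec.get("geodeticDatum") or "").strip()
        if has_coords and not datum:
            missing.append(i)
    return missing
```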

capture quality Slide13

Slide 13 - Spatial information: common mistakes to avoid

This slide shows an old GBIF map with different types of geographical issues: the most obvious one is a mirror effect between the USA and China (reversed coordinates).

You can also notice an artificial line along the Greenwich meridian where ‘0’ values were put in the ‘decimalLongitude’ field, as well as another one on the Equator where ‘0’ values were put in the ‘decimalLatitude’ field.

GBIF indexing now includes automatic geographical checks between the coordinates and countryCode shared within the dataset. Coordinates can be automatically reversed to match the country.
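
The mistakes described on this slide can also be caught before publication. The Python sketch below is a minimal example of such checks on a single pair of decimalLatitude and decimalLongitude values; the function name coordinate_flags and the wording of the flags are assumptions made for this example and do not reproduce the checks GBIF runs during indexing.

```python
def coordinate_flags(lat, lon):
    """Return a list of warning strings for one latitude/longitude pair.

    Hypothetical helper for illustration; values are decimal degrees.
    """
    if lat is None or lon is None:
        return ["missing coordinate"]
    flags = []
    if lat == 0 or lon == 0:
        # Exact zeros often stand in for "unknown" and create artificial
        # lines along the Equator or the Greenwich meridian.
        flags.append("zero coordinate")
    if not -180 <= lon <= 180:
        flags.append("longitude out of range")
    if not -90 <= lat <= 90:
        flags.append("latitude out of range")
        # If the pair would be valid the other way round, the two
        # columns may have been swapped during data capture.
        if -90 <= lon <= 90 and -180 <= lat <= 180:
            flags.append("latitude and longitude possibly swapped")
    return flags
```

A swap can only be detected this way when the latitude falls outside its valid range; comparing the coordinates with the countryCode, as described above, catches more cases.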

capture quality Slide14

Slide 14 - Collection information: vocabularies and concepts

Information about the context of the data collection or observation is very useful to share in order to give as much detail as possible about each occurrence.

Information such as the collector name, collection or observation protocol, habitat and other factors can prove important when reusing data, for example in ecological niche modelling.

Depending on the dataset type, other information might also become relevant.

capture quality Slide15

Slide 15 - Collection information: things to keep in mind

Data quality factors for collection information mainly concern accuracy (for example, the correct collector name), consistency (for example, using the same vocabulary to describe soils and habitats) and completeness (providing all existing information about the description of a given species, including the flowering period, colour of the leaves and medicinal uses).

Within the Darwin Core and within the IPT you can find recommended controlled vocabularies for some fields such as ‘lifeStage’. The TDWG vocabularies task group works to promote vocabularies and make them easier to use.
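
To show how a controlled vocabulary supports consistency, the Python sketch below normalises ‘lifeStage’ values and separates those that do not match an allowed list. The list ALLOWED_LIFE_STAGES and the function name check_life_stage are illustrative assumptions, not an official vocabulary; in practice the allowed values would come from the vocabulary your institution has agreed to use.

```python
# Illustrative values only; replace with the vocabulary you actually use.
ALLOWED_LIFE_STAGES = {"egg", "larva", "juvenile", "adult"}


def check_life_stage(values, allowed=ALLOWED_LIFE_STAGES):
    """Split values into (normalised, rejected) against a controlled vocabulary.

    Hypothetical helper for illustration only.
    """
    normalised, rejected = [], []
    for value in values:
        cleaned = value.strip().lower()
        if cleaned in allowed:
            normalised.append(cleaned)
        else:
            # Keep the original text so a person can review and correct it.
            rejected.append(value)
    return normalised, rejected
```

Rejected values are returned unchanged rather than guessed at, so that a transcriber or curator decides how each one should be corrected.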

capture quality Slide16

Slide 16 - Descriptive information: vocabularies and concepts

Keep in mind that descriptive information is often incomplete due to a whole array of factors.

Depending on the state of the collection, some labels can be incomplete or lack crucial information. Completeness (for example, of a species description) is often impossible to achieve with a single individual. You should also always check for consistency in your database or spreadsheet, for example in the terms used to describe colours, in order to avoid redundant information.

Image credit: Dataset Taiwan Moth Occurrence Data Collected From Social Network

capture quality Slide17

Slide 17 - Summary

This presentation has focused on data quality applied to data capture: these are the steps where it is crucial to ensure that all information related to each record is captured correctly and completely, so that the data are as clear and understandable as possible for future users.

This can only be done if consistent decisions are made at the institutional level in order to create a solid workflow for data capture and data management.

The chain of responsibility regarding data quality is then split between the persons involved at each step of the process, but keep in mind that data can always be improved and fixed if errors or omissions are detected at later stages.

capture quality Slide18

Slide 18 - Conclusion

This is part of a series of presentations used in the GBIF Biodiversity Data Mobilization course. The biodiversity data mobilization curriculum was originally developed as part of the Biodiversity Information Development Programme funded by the European Union.

This presentation was originally created and narrated by Sophie Pamerlon with additional contributions by BID and BIFA Trainers, Mentors and Students.

Exercise 1c

For this activity, you will complete an exercise simulating data capture from analogue to digital.

Use the Darwin Core terms to help decide which of the supplied data is needed for the project and which could be shared later at publication.

Read the [Practice use case]. Imagine that you are the person assigned to transcribe the data supplied by network volunteers.
  1. Download UC-Practice-1-ForCapture-logs.zip (1.1 MB). The compressed file contains three log files.

  2. Download the spreadsheet template: UC-Practice-1-ForCapture-template.xlsx (37 KB) to transcribe the observations recorded.

  3. Use the exercise worksheet you downloaded earlier to record your answers.

You can add fields to the template if you think you may be able to capture more information than was planned for in the template.