Solutions
This appendix contains the answers and additional information for the review quizzes. It also contains a suggested solution to the Practice Use Case.
Data management review solutions
Why is it best to clean your data?
- to make them as fit for use as possible
- to achieve your data quality goals
You should always aim to manage and publish data with the highest possible quality. This will improve your day-to-day work (it is easier to work with organized and clean data), as well as the work of potential re-users of your data, who need to understand them and trust their source before using them.
How should you organize your data cleaning workflow?
- ask your colleagues for expertise
- work at an institutional level to harmonize data quality workflows
Nobody is expected to know everything about biodiversity data; you should seek help and advice from your colleagues or other knowledgeable people, and ensure that you’re applying the good practices recommended by your institution as you clean your data.
Which is best:
- prevent errors from occurring
- correct errors as soon as you find them in your database or spreadsheet
The best way to avoid spreading errors in your data is to prevent them from occurring at the start of the data collecting/recording process.
Of course, mistakes are unavoidable so you should also clean them as soon as you find them, and document the cleaning process.
If you don’t have the time or resources to properly clean your data, it is best to wait until you can do so instead of publishing erroneous data that might confuse people.
Whose responsibility is data quality?
- Everyone involved in the management of data
Every person involved in your data management workflow is at least partly responsible for their quality, from the field technicians to the database manager(s).
People who might later use your data can inform you of any remaining errors, and should use the data responsibly for their own research, but the initial data quality is not their responsibility.
GBIF can perform automatic checks on your data (e.g. detection of missing values, geographic outliers, unknown scientific names) but should not be held responsible for errors that occurred earlier in the data management process.
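The kinds of automatic checks mentioned above can be illustrated with a minimal Python sketch. This is not GBIF's actual implementation: the `check_record` function and the `KNOWN_NAMES` list are hypothetical, the coordinate range test is only a crude stand-in for real geographic outlier detection, and the field names are assumed to follow Darwin Core terms.

```python
# Hypothetical reference list; a real check would query a taxonomic backbone.
KNOWN_NAMES = {"Panthera leo", "Loxodonta africana"}

# Assumed Darwin Core terms that every record should fill in.
REQUIRED_FIELDS = ("scientificName", "decimalLatitude", "decimalLongitude")


def check_record(record):
    """Return a list of human-readable issues found in one occurrence record."""
    issues = []

    # 1. Missing values
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing value: {field}")

    # 2. Coordinates outside the valid range (a very rough outlier test)
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        issues.append("latitude out of range")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        issues.append("longitude out of range")

    # 3. Scientific names that the reference list does not recognize
    name = record.get("scientificName")
    if name and name not in KNOWN_NAMES:
        issues.append(f"unknown scientific name: {name}")

    return issues
```

Running a check like this on your own data before publishing helps you catch such issues early, rather than relying on flags raised after publication.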
Which tools can be used to clean your data?
- Excel and other spreadsheet management tools
- OpenRefine
- Your database software
- Online tools such as Scientific Names Resolver or Google Maps
All kinds of tools can be used to clean your data, but you should identify which ones will answer your needs in terms of taxonomic resolving, georeferencing, deleting duplicates, and so on. You can find helpful tools listed in the data management section.
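As one illustration of the duplicate-deletion task mentioned above, here is a minimal Python sketch. It is only an assumption about how such a step might look outside a spreadsheet or OpenRefine: the `normalise` and `drop_duplicates` helpers are hypothetical, and which fields identify a duplicate depends on your dataset.

```python
def normalise(value):
    """Trim, collapse internal whitespace, and lower-case a text value."""
    return " ".join(str(value).split()).lower()


def drop_duplicates(records, key_fields):
    """Keep the first record for each normalised combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        # Build a comparison key that ignores case and stray whitespace
        key = tuple(normalise(record.get(field, "")) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For example, calling `drop_duplicates(records, ["scientificName", "eventDate"])` treats `"Panthera leo"` and `" panthera  leo "` recorded on the same date as the same record and keeps only the first one. Review what is dropped and document the step as part of your cleaning process.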
Data publishing review solutions
What does data publishing mean in the context of GBIF?
- Making your biodiversity dataset(s) publicly accessible and discoverable in a standardized format
Data publishing within GBIF means making your biodiversity dataset(s) publicly accessible in a standardized format (most of the time, Darwin Core), so that other people can discover and reuse them.
What is an IPT?
- a tool that helps you publish your data to GBIF
- a tool that helps you produce a Data paper
The IPT (Integrated Publishing Toolkit) is a Java-based software application that allows you to upload and publish data to GBIF. It is not intended to be used as a data management or data cleaning tool.
The IPT can also help you with the process of writing and submitting a data paper, thanks to the EML file it generates automatically when you fill in the metadata for your data resource.
Which Creative Commons licences and waivers are recommended by GBIF for data publication?
- CC0, CC-BY and CC-BY-NC
The Creative Commons licences and waivers recommended for publishing your dataset(s) to GBIF are CC0, CC-BY and CC-BY-NC. They are widely recognized licences and/or waivers that align with international open-data requirements for data sharing and re-use.
Please note that you should only choose the CC0 waiver or the CC-BY licence for your BID-related dataset(s).
What are the three Cores from which you can choose for an IPT resource?
- Occurrence Core, Taxon Core, Event Core
You can choose one of the following three Cores for each of your IPT resources: Occurrence, Taxon or Event Core.
The Darwin Core standard also allows you to link extensions to your chosen Core, such as SimpleMultimedia or MeasurementOrFact.
The metadata are entered in a separate section of the IPT and are shared using the EML standard, not Darwin Core (which is used for data only).
How many extension files can a dataset have?
- as many as needed
Once you have chosen a Core for your IPT resource, you can add Darwin Core extensions to it. You can add one or several extensions, depending on the type of Core you chose and which extensions are compatible with it.
Extensions are not mandatory (you can publish a dataset without any extension) but can be useful if you want to share additional information that you could not map with your chosen Core.
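The relationship between a Core and its extensions can be sketched in Python as follows. This is an illustrative assumption rather than how the IPT works internally: the `attach_extension` helper is hypothetical, and it assumes an Occurrence Core whose rows are identified by `occurrenceID`, with each extension row (here, MeasurementOrFact-style rows) repeating the identifier of the Core row it describes.

```python
from collections import defaultdict


def attach_extension(core_rows, extension_rows, id_field="occurrenceID"):
    """Group extension rows under the Core row they point to.

    Returns (linked, orphans): `linked` maps each Core identifier to its
    extension rows (possibly none); `orphans` lists extension rows whose
    identifier matches no Core row.
    """
    by_core_id = defaultdict(list)
    for row in extension_rows:
        by_core_id[row[id_field]].append(row)

    core_ids = {row[id_field] for row in core_rows}
    linked = {core_id: by_core_id.get(core_id, []) for core_id in core_ids}
    orphans = [row for row in extension_rows if row[id_field] not in core_ids]
    return linked, orphans
```

A Core row may have zero, one or many extension rows attached, which is why extensions remain optional. Extension rows that match no Core row are worth fixing before publication.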
Practice Use Case suggested solution
Suggested solution (4 MB)
Exercise 1c capture example spreadsheet (4 MB)