Terminology

This module provides an introduction to the language, terminology and definitions for some of the basic concepts, functions and processes related to biodiversity data mobilization.

Definitions

In this video (12:02), you will review terminology used in this course.

If you are unable to watch the embedded video, you can download it locally. (MP4 - 45.8 MB)

Presentation transcript


Slide 1 - Foundations - Definitions terminology

This session is foundation work and part of the Terminology section. We will be introducing you to the language, terminology and definitions for some of the basic concepts, functions and processes that you will be implementing during the remainder of the course.

Image by Carsten Witzel, obtained via http://www.freeimages.com/photo/root-of-a-tree-1371557.

Slide 2 - Key concepts

Let's start by reviewing the structure of this session.

In the section on DEFINITIONS we will first try to clarify the use of the word "Data".

We will then examine the terminology surrounding Fields, Labels and Lists.

Finally, we will clarify some concepts relating to characters and their conversions, which will help you avoid unwanted data errors.

Slide 3 - Definitions

First, definitions.

In this section, we will examine some of the many terms and expressions that exist in the world of data and information management.

We will use them throughout the course, and it is important that you are familiar with them and know when and how to use them.

Image by Sharon Grant: https://www.flickr.com/photos/rondls_pics/3629532537/in/album-72157619834323354/

Slide 4 - Definitions - a rose by any other name

Let's focus for a moment on some common expressions that contain the word data.

Slide 5 - What’s the difference - Database

All of the phrases here include the word data, but are they all the same, and if not, what is the difference?

Slide 6 - Database

So, what is a “database”?

There are many definitions, but what they all have in common is that a database is a structured and organized collection of data and/or information held on a computer.
Everything else on the list is just a means of helping you achieve this data nirvana!
So let's dig a little deeper into these other terms.

Slide 7 - Database language

Let's look at the phrase "database language".

According to the Oxford English dictionary “language” is the method of human communication…consisting of the use of words (or symbols) in a structured and conventional way.

For our purposes, database languages and their close allies, table languages, are the way in which a human communicates with a computer to manipulate and move data.

Examples that you may encounter are:

Java, SPARQL, SQL and Python
These are simply methods used to define and create tables, views and queries, and to allow users to access, update, store and move data.
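
To make this concrete, here is a minimal sketch of a database language in use: a few SQL statements sent from Python to an in-memory SQLite database. The table name `specimen`, its columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define and create a table (hypothetical name and columns).
conn.execute(
    "CREATE TABLE specimen (id INTEGER PRIMARY KEY, species TEXT, country TEXT)"
)

# Store some data.
conn.executemany(
    "INSERT INTO specimen (species, country) VALUES (?, ?)",
    [("Panthera leo", "Kenya"), ("Panthera pardus", "Kenya"), ("Lynx lynx", "Norway")],
)

# Update a record.
conn.execute("UPDATE specimen SET country = 'Sweden' WHERE species = 'Lynx lynx'")

# Query the data.
kenyan_species = [
    row[0]
    for row in conn.execute(
        "SELECT species FROM specimen WHERE country = 'Kenya' ORDER BY species"
    )
]
print(kenyan_species)
```

The same four activities (define, store, update, query) appear in most database languages, even though the syntax differs.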

Slide 8 - Database program/software

Now let's look at the phrase "database program or software".

Once again, according to the Oxford English Dictionary, a "program" is a set of related measures or activities with a particular long-term aim.

In this context, a "database program or software" combines and presents functions and features for manipulating data together in a unified interface.

Examples that you may encounter are:

Microsoft Access, FileMaker Pro and Oracle

So, database programs or software are bundled services that allow a user, via a database language, to store, manage, query and report on the data held in the system.

Slide 9 - Data cleaning tool

The last term is "data cleaning tool".

Data cleaning tools are systems whose utilities focus on the manipulation of data with the aim of increasing its fitness for use.

Examples that you may already know, or that you will learn about during the course, are:

OpenRefine, MATLAB and spreadsheets

A data cleaning tool should help the user to discover and correct messy data in order to improve its quality.

Slide 10 - Definitions - a rose by any other name

Moving on to other terms with which you should be familiar, we will look at the differences between the terms Field, Label and List.

Slide 11 - Field vs label

You will often hear reference made to field or column names and field labels or headings. Often they will be used interchangeably, however they are different things.

Field or column names refer to the actual column in the database table. They are often short, not always in plain language, sometimes computer generated; but in a properly formed database they are ALWAYS unique. Field names are almost always used by database languages, because they avoid ambiguity.

Labels and headers on the other hand, refer to the human readable versions of the field names. Most often you will see these in the graphical user interface of your software and on exports and reports. In some software you can in fact have different labels for the same field on a user by user or group basis!
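
The distinction can be sketched in a few lines of Python. The field names, labels and record values below are hypothetical; the point is only that the data is stored under the unique field names, while the labels are applied when the data is shown to a person.

```python
# Field names: short, unique, used by database languages (hypothetical examples).
FIELD_LABELS = {
    "cat_no": "Catalogue number",
    "sci_name": "Scientific name",
    "coll_dt": "Date collected",
}

# The record itself is keyed by field name, not by label.
record = {"cat_no": "B-0042", "sci_name": "Myotis myotis", "coll_dt": "1971-03-01"}


def labelled(record, labels):
    """Return the record keyed by human-readable labels instead of field names."""
    return {labels[name]: value for name, value in record.items()}


report_row = labelled(record, FIELD_LABELS)
print(report_row)
```

Swapping in a different `FIELD_LABELS` mapping changes what a user sees without touching the stored field names.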

Slide 12 - Lists

As you work with your data you will come across the phrases “drop-down list” and “controlled vocabulary” and on the surface they can appear to be the same. Indeed they are both powerful tools to help you keep your data clean, but they are different.

A drop-down list, also sometimes called a lookup table, is a data entry element that is created in your database software. It allows you to select a value from a predefined set of options. They can also be generated from the data itself on the fly.

A controlled vocabulary on the other hand, is a very strictly defined, domain specific set of values that should be applied to a particular field. They are an integral part of a well defined standard. Such values can be used in a drop-down list as mentioned previously.
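
As a rough sketch of how a controlled vocabulary helps keep data clean, the snippet below checks entered values against a fixed set of allowed terms. The three terms are a small subset chosen for illustration; a real vocabulary would be taken from the relevant standard, and the helper name `invalid_values` is hypothetical.

```python
# A small illustrative vocabulary; a real one would come from the relevant standard.
BASIS_OF_RECORD = {"PreservedSpecimen", "HumanObservation", "MachineObservation"}


def invalid_values(values, vocabulary):
    """Return the values that are not in the controlled vocabulary."""
    return [v for v in values if v not in vocabulary]


entered = ["PreservedSpecimen", "preserved specimen", "HumanObservation", "Observed"]
problems = invalid_values(entered, BASIS_OF_RECORD)
print(problems)  # ['preserved specimen', 'Observed']
```

A drop-down list built from the same set would prevent the two problem values from being typed in the first place.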

Slide 13 - Definitions - a rose by any other name

In this section, we will look at what the term character means in relation to data and computers.

Slide 14 - Characters

As a reminder, computers "see" in binary. Everything you type is held in memory as a series of ones and zeros.

The computer does not “see” the repertoire of letters and symbols on your keyboard. It needs a map to translate the character set that you would write or type, into its binary language and then back again to show it to you on the screen or a printer.

This map is made up of two parts:

The “coded character set”

The “character encoding”

Each symbol or glyph is assigned a numeric value and that number is then converted to binary.

Slide 15 - Characters example

In this example, we see how character encoding is used to map the capital letter A of the Western alphabet.

The ASCII value for the capital letter A is the decimal number 65

which in HEXADECIMAL is 41

and in BINARY 01000001

Try mapping your own name to binary using the link at the bottom of the screen.
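
The same mapping can be reproduced with Python's built-in `ord` and `format` functions. This is only a sketch: the helper name `char_codes` is hypothetical, and the 8-digit binary form assumes plain ASCII characters (accented letters have larger code values).

```python
def char_codes(text):
    """Map each character to its (decimal, hexadecimal, binary) code value."""
    return [(ch, ord(ch), format(ord(ch), "X"), format(ord(ch), "08b")) for ch in text]


for ch, dec, hexa, binary in char_codes("A"):
    print(ch, dec, hexa, binary)  # A 65 41 01000001
```

Calling `char_codes` with your own name prints one line per letter.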

Slide 16 - Characters comparison

In effect, what happens is that when you type on a keyboard and save a file, the software you are using selects an encoding to save it with.

When you open that file again, the software needs to know which encoding (map) to use to redisplay the information correctly.

Slide 17 - Characters

So why does this matter to you as a data custodian?

Let's say that you translate the beginning of Hamlet's soliloquy

“To be or not to be, that is the question”

into French and then save the file with a text editor that uses the Latin-1 encoding.

Then you send it to a colleague to check the translation. Voila!

Slide 18 - Characters

When they open the file in Microsoft Word this is what they see.

Mon dieu!

There are unexpected characters.

You forgot to tell your colleague that you were using Latin-1 as the encoding.

Microsoft Word at their end was configured to use the Unicode UTF-8 encoding. The maps do not match.

Depending on the software, the data may be irretrievably corrupted.
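
The mismatch can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the French line is an illustrative translation, and `errors="replace"` stands in for whatever a particular program does when it meets bytes that do not fit its map.

```python
text = "Être ou ne pas être"  # illustrative French translation

# Saved by an editor that uses Latin-1: one byte per character.
raw = text.encode("latin-1")

# Opened by software that assumes UTF-8: bytes that are not valid UTF-8
# are replaced with the U+FFFD replacement character.
garbled = raw.decode("utf-8", errors="replace")

# Opened with the correct map: the text comes back intact.
restored = raw.decode("latin-1")

print(garbled)
print(restored)
```

If the garbled version is saved over the original file, the accented letters are gone for good, which is how a display problem turns into permanent data loss.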

Slide 19 - Review

In review, we have looked at the differences between the phrases:

Database; database language; database program, data-cleaning tool
Field names & field labels; drop-down lists & controlled vocabularies; and between
Repertoires, Coded sets & Encodings

And shown that, although the terms may seem similar, they in fact have very different definitions.

Slide 20 - Wrap-up

Remember that:

“…The first step in wisdom is to know the things themselves; this notion consists in having a true idea of the objects.”

We often take for granted that there is a whole language associated with biodiversity informatics that is not commonplace and has to be learned. There is no shame in going back to basics and getting to grips with the vocabulary, grammar and syntax just like with any new language.

Slide 21 - Conclusion

This video is part of a series of presentations used in the GBIF Biodiversity Data Mobilization course. The biodiversity data mobilization curriculum was originally developed as part of the Biodiversity Information Development Programme funded by the European Union.

This presentation was originally created and narrated by Sharon Grant with additional contributions by BID and BIFA Trainers, Mentors and Students.

Software

In this video (05:58), you will review examples of the different types of applications and software available in the world of biodiversity mobilization informatics.

If you are unable to watch the embedded video, you can download it locally. (MP4 - 23.5 MB)

Presentation transcript


Slide 1 - Foundations - Software terminology

This session is foundation work and part of the Terminology section. We will be introducing you to types of software used in data mobilization.

Image by Carsten Witzel, obtained via http://www.freeimages.com/photo/root-of-a-tree-1371557.

Slide 2 - Software for mobilization

The purpose of this short presentation is to give you some examples of the different types of applications and software available in the world of biodiversity mobilization informatics.

Hopefully it will help you to start to think about what uses they may be put to and what types of things might be practical for you.

Slide 3 - Software definition

So very briefly, a definition.

We are making a distinction here between all software and those applications which have been created for an end-user to carry out a task.

We will take a look at software that can be used for:

Data Capture
Data Management
Data Cleaning and
Data Publishing

Slide 4 - Software for data capture

In the realm of data capture software, there is often no need to go “big” when capturing information initially, especially in the field.

You can capture information from collection specimens or live observations in a simple spreadsheet such as:

Excel, OpenOffice, LibreOffice or Google Sheets.

There may not even be a need to create your own system at all. Apps such as iNaturalist or eBird are stable, robust and have simple intuitive interfaces.

Whichever the choice, it is important to think about the long term preservation of the data that you collect.

Slide 5 - Software for data management

So, depending on your needs, a more complex data management software might be the right solution for your data. Especially if long term retention and preservation is a priority.

More stable management platforms also better support sharing of data to multiple audiences.

There are many choices which vary in terms of cost and support and a few are shown here. From left to right:

Symbiota, Specify, EMu, Elysia and Brahms.

Slide 6 - Software for data capture and management

If a home-grown solution is preferred and can be supported, relational database software can provide cheaper alternatives.

Examples include MySQL, SQL Server, Access or FileMaker.

Slide 7 - Software for data cleaning

Increasingly the lines between languages, programs and tools have blurred, each being capable of many complex and overlapping functions. The common “Find and Replace” function is a good example of a data cleaning tool that is found in almost all database programs and is a command in database languages.

You are probably already familiar with spreadsheets such as Excel and indeed these are probably the most common data cleaning tools available. However there are others.

OpenRefine, originally known as Google Refine, is a powerful, free, open source data cleaning tool with a large support community. You will see more of this program in upcoming sessions.

R is another increasingly used, sophisticated programming language and environment. With a little practice it can also provide powerful data manipulations.

Slide 8 - Software for data publishing

Planning further ahead to the sharing of your data through GBIF with the international scientific and decision-making community, you will need to choose a publishing tool.

The choice will depend largely on the level of infrastructure support at your institution. If this is high and stable, installing an IPT or BioCASe application may be the right option. But there are others if this is not the case.

You can consider partnering with an existing IPT provider or selecting a database management system that includes a publishing pathway of its own.

Whichever you choose, make sure that your system allows you to export data as csv or txt files and you will ultimately be able to publish your data.

Slide 9 - Other software to handle data

This is a summary of the information collected in the course survey of software in use by students, mentors and trainers in the GBIF community.

The link and reference to more details of each section are at the bottom of the slide.

Reference: https://docs.google.com/spreadsheets/d/1IWBsL4r-LsJRdYYPy9q4eLOAa1nzZYBwTdwk3Kd8cV8/

Slide 10 - Wrap-up

Remember that:

“The purpose of software is to help people” and not the other way around.

Your choice of solution should make your life easier but it should also not sacrifice the long term viability of the data that it is meant to work with.

Slide 11 - Conclusion

This presentation was narrated by Sharon Grant with additional contributions by BID and BIFA Trainers, Mentors and Students.

Structures

In this video (13:10), you will review the field and data types that hold data, the structures that help to organize and protect that data and what these mean for the integrity and security of your data.

If you are unable to watch the embedded video, you can download it locally. (MP4 - 44.4 MB)

Presentation transcript


Slide 1 - Foundations - Structures terminology

Image by Carsten Witzel, obtained via http://www.freeimages.com/photo/root-of-a-tree-1371557.

Slide 2 - Key concepts

Let's start by reviewing the structure of this session.

In the section on STRUCTURES we will be working on understanding the field and data types that hold data, the structures that help to organize and protect that data and lastly look at what these mean for the integrity and security of your data.

Slide 3 - Structures

In this section on basic database Structures we will focus on the different elements that databases, software tools and languages use to handle the data that you enter.

Slide 4 - Structures - Planes, trains and automobiles

First, let's look at field types, also known as data types.

Slide 5 - Containers – field/data types

Data Type lets the computer know what kind of data is expected in a particular field. This then dictates what functions can be performed on it.

They can be simple and encompass a single concept or combine concepts depending on the software or language you are using. Either way the essential building blocks are the same.

So let’s look at six that you will come across most frequently.

Slide 6 - Containers – field/data types

First Numeric fields – This data type holds numbers, and there are 2 kinds:

Integers and long integers, hold whole numbers. For example 1, 2, 3, 99, 2000.

The size of the number depends on the number of bits assigned to the field (more about that later). For now just remember that a long integer field will hold a bigger number than an integer.

Floats, also referred to as doubles, hold fractions or parts of whole numbers. For example 4.5, 19.47

The important thing to remember here is that a float always has decimal places so the number 10 will be stored as 10.0. The number of decimal places being an integral element of the type. This has implications when it comes to precision and accuracy.
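
A short Python sketch shows why the choice between integer and float matters for precision. Python is used here only for illustration; database software may store and round numbers differently.

```python
whole = 10               # an integer: no decimal places
fraction = float(whole)  # the same value stored as a float
print(whole, fraction)   # 10 10.0

# Floats are binary approximations, so decimal arithmetic is not always exact.
total = 0.1 + 0.2
print(total == 0.3)            # False
print(round(total, 2) == 0.3)  # True

# Converting a float back to an integer silently drops the fractional part.
print(int(19.47))  # 19
```

This is why a value such as a measurement or a coordinate should not be moved between integer and float fields without checking what happens to the decimal places.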

Slide 7 - Containers – field/data types

Alphanumeric fields - are also referred to as text, character or string fields.

They can hold any letter, number or symbol in the coded character set and usually have a defined character limit. Entering more characters than the character limit can often result in truncation and loss of data on save.

Slide 8 - Containers – field/data types

Another type of text field that you will come across is the Unstructured text field, also known as memo, long char or blob fields.

These types of field are used to hold large amounts of unformatted text and unlike alphanumeric fields may not have an obvious character limit.

In simple systems, they are generally less flexible in terms of processing; for example sorting and indexing, than their smaller alphanumeric cousins.

With decreased storage costs and the rise of NoSQL document-oriented database programs, such as MongoDB however, this is changing.

Slide 9 - Containers – field/data types

System Value fields have a double nature and what you see on the screen is not usually the actual value that the software is storing in memory. Date and time fields are the most common example of this type of field.

In this example you typed “1 March 1971” into a cell in Excel and hit return.

Now it says “1-Mar-71” in the cell AND “3/1/1971” in the formula bar.

Excel is applying a display format to the cell, which you can change at will.

HOWEVER, if you remove the format you see that what is actually being stored in the field is the number 25993.00!

So you can see that the field is actually a float. You should beware because that system value may not be the same from software to software. Exports should be checked to make sure the actual date is exported and not the system value.
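
The stored number can be checked with a small calculation. This sketch assumes Excel's default 1900 date system, in which day 0 is effectively 30 December 1899 for any date on or after 1 March 1900; the helper name `excel_serial` is hypothetical.

```python
from datetime import date

# Assumption: Excel's default 1900 date system, where day 0 is effectively
# 30 December 1899 for any date on or after 1 March 1900.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial(d):
    """Return the serial day number Excel stores for date d."""
    return (d - EXCEL_EPOCH).days


serial = excel_serial(date(1971, 3, 1))
print(serial)  # 25993
```

Other software counts from a different starting day, so the same number can mean a different date elsewhere. Exporting dates as text in an unambiguous form such as 1971-03-01 avoids the problem.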

Slide 10 - Containers – field/data types

Boolean fields – also known as Binary fields only contain one of two values that represent a one or a zero.

They are a special case of a System Value field as mentioned previously in that whilst the software actually saves a number it often displays values such as:

Yes/No; Yes/no; Y/N; True/False.

You should take care to find out which way around the system you are using translates.

Is a 1 a yes or no?

Slide 11 - Containers – field/data types

Structured Text fields are less common but mentioned here for completeness.

They are fields which can be either made up of combinations of the simple data types or even be defined by the data itself.

Most often these are used in programming languages to allow flexibility and handling of complex concepts.

An example is the factor field used in the R programming language, which is defined as a type vector containing a set of numeric codes with character-valued levels.

Ref: https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf
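
The idea behind a factor, numeric codes paired with character-valued levels, can be imitated in plain Python. This is a rough sketch, not R's implementation; the function name `make_factor` and the sample values are hypothetical.

```python
def make_factor(values):
    """Encode values as integer codes plus a sorted list of levels."""
    levels = sorted(set(values))
    code_of = {level: i + 1 for i, level in enumerate(levels)}  # 1-based, as in R
    codes = [code_of[v] for v in values]
    return codes, levels


codes, levels = make_factor(["adult", "juvenile", "adult", "egg"])
print(codes)   # [1, 3, 1, 2]
print(levels)  # ['adult', 'egg', 'juvenile']
```

Here the levels are defined by the data itself, which is what sets this kind of field apart from the simple data types.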

Slide 12 - Structures - planes, trains and automobiles

Now that you have an understanding of the field types used to represent your data elements, let’s talk about the containers or structures that many database programs and software use to collate and display your data.

Reference: https://www.geeksforgeeks.org/data-structures/

Slide 13 - Containers - Data structures

First is the cell.

This is the smallest data structure, it holds a single value and has a single data type associated with it.

Slide 14 - Data structures

There is one concept that you should beware of when it comes to cells and that is the “empty cell”. “An empty cell doesn’t necessarily mean an empty field.”

Some software “fills” an empty field with a System value based on the data type.

Examples are “NULL”, “N/A” or even zero!
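
The difference between "empty" and "zero" is easy to lose when data is read into a program. The sketch below uses Python's `csv` module on a made-up export and keeps an empty cell as a missing value instead of turning it into a number; the column names and the helper `to_weight` are hypothetical.

```python
import csv
import io

# Hypothetical export: the second row has no value for weight_g.
export = "id,weight_g\n1,12.5\n2,\n3,0\n"

rows = list(csv.DictReader(io.StringIO(export)))


def to_weight(cell):
    """Treat an empty cell as missing (None), not as zero."""
    return float(cell) if cell != "" else None


weights = [to_weight(row["weight_g"]) for row in rows]
print(weights)  # [12.5, None, 0.0]
```

A specimen that was never weighed and a specimen recorded as weighing zero are different facts, so it is worth checking which one your software writes for an empty cell.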

Slide 15 - Containers - Data structures

As you start to organize cells you begin to build a grid. Grids have implied structure in the form of

Rows

Slide 16 - Containers - Data structures

And columns.

Slide 17 - Containers - Data structures

When you organize cells into rows and columns, obviously, what you get is a grid!

This looks very much like a spreadsheet, right?

Well, yes. The difference however between a spreadsheet grid and a table is the strength of the ties that hold the rows and columns together.

Slide 18 - Structures - planes, trains and automobiles

So let’s now talk about those connections.

What are the implications for the integrity and security of your data of loose, unenforced relationships between rows and columns?

Slide 19 - Data structures

What actually makes a table useful are the connections that we create to give the data in the structure meaning.

We intrinsically make patterns of attributes that turn data into information.

Rows become records and Columns become attributes

In a spreadsheet however, the connections between cells are idiomatic, loose and un-enforced.

It is easy for the attributes between the rows to get jumbled up or for the values to get changed and then the meaning is lost.
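
A classic way for this to happen is sorting one column without the others. The sketch below, with made-up values, contrasts that mistake with a sort that keeps each record together.

```python
# Two columns of a small grid. The values are made up for illustration.
species = ["bat A", "bat B", "bat C"]
weight_g = [30.0, 8.5, 15.2]

# Spreadsheet-style mistake: sorting one column on its own.
weight_g_sorted = sorted(weight_g)
jumbled = list(zip(species, weight_g_sorted))

# Record-style sort: the values in each row travel together.
records = list(zip(species, weight_g))
by_weight = sorted(records, key=lambda r: r[1])

print(jumbled)    # [('bat A', 8.5), ('bat B', 15.2), ('bat C', 30.0)]
print(by_weight)  # [('bat B', 8.5), ('bat C', 15.2), ('bat A', 30.0)]
```

Nothing in the jumbled result looks wrong at a glance, which is exactly why this kind of error is hard to catch.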

Slide 20 - Integrity and security

This lack of enforced connections makes a spreadsheet easy to use but these same qualities also cause problems.

Errors are easy to make and hard to catch.

Connections and relationships are not documentable and so subject to interpretation and mutation.

Spreadsheets are not built to scale either in terms of size or numbers of users.

So they can be slow, crashy and easily corruptible

Spreadsheets are easy to copy and email

So they are hard to version and keep track of

The question is what is the solution?

Reference: https://www.mrc-productivity.com/blog/2014/09/7-dangers-of-spreadsheets-part-1/

Slide 21 - Containers - Integrity and security

Let’s look back, at the connections in a bit more detail.

We generally understand that the data in each row of a table means something.

For example each row can represent a specimen or a taxon that was found at a particular place, at a particular time, by a particular person.

It would be odd to have one row in a table that represents a specimen followed by a row that represents a collection site.

We also often want to analyse, sort and compare values of particular attributes between rows.

For example you may want to find the mean weight of all the Kenyan bats.

So ideally we want to use a system to store our data that:

always holds the rows together
does not mix data types
and segregates data in each table to a single concept or theme

These are functions of a database!

Slide 22 - Integrity and security

Functions in a true database inherently bind rows together and columns may only have a single data type.

Tables in a database are held together by virtual links called keys.

These keys are identifiers that match between tables and allow data to be connected efficiently.

It is these 2 key concepts that distinguish a spreadsheet from a database.

Reference: https://www.mrc-productivity.com/blog/2014/09/7-dangers-of-spreadsheets-part-1/
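
Both ideas, rows bound to a single theme per table and keys linking tables, can be sketched with Python's built-in `sqlite3` module. The table names, columns and weights are hypothetical, and the query answers the earlier "mean weight of Kenyan bats" question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces keys only when asked

# One table per concept: collection sites and specimens.
conn.execute("CREATE TABLE site (site_id INTEGER PRIMARY KEY, country TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE specimen (
           specimen_id INTEGER PRIMARY KEY,
           site_id INTEGER NOT NULL REFERENCES site(site_id),
           weight_g REAL NOT NULL
       )"""
)

conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "Kenya"), (2, "Uganda")])
conn.executemany(
    "INSERT INTO specimen (site_id, weight_g) VALUES (?, ?)",
    [(1, 10.0), (1, 14.0), (2, 30.0)],
)

# The key site_id connects each specimen to its collection site.
mean_kenya = conn.execute(
    """SELECT AVG(s.weight_g)
       FROM specimen AS s JOIN site ON site.site_id = s.site_id
       WHERE site.country = 'Kenya'"""
).fetchone()[0]
print(mean_kenya)  # 12.0

# A specimen pointing at a site that does not exist is rejected.
try:
    conn.execute("INSERT INTO specimen (site_id, weight_g) VALUES (99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The rejected insert is the enforcement that a spreadsheet lacks: the database refuses a row whose key does not match anything in the linked table.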

Slide 23 - Review

In review then, we have looked at the 6 most common data types, the basic building blocks of data structure and distinguished between a spreadsheet grid and database table.

Understanding these elements and how they interact with one another will allow you to better determine the best ways to manipulate and document the data that falls under your custodianship.

Slide 24 - Wrap-up

Remember that:

“When you realize the difference between the container and the content, you will have knowledge.”

Using the correct terms to describe both your data and the structures that you create and use to store it is the most important way to ensure that it remains relevant and useable.

Never underestimate how much damage can be done to priceless information by careless descriptions and storage.

Slide 25 - Conclusion

This video is part of a series of presentations used in the GBIF Biodiversity Data Mobilization course. The biodiversity data mobilization curriculum was originally developed as part of the Biodiversity Information Development Programme funded by the European Union.

This presentation was originally created and narrated by Sharon Grant with additional contributions by BID and BIFA Trainers, Mentors and Students.