Historical marine data

Familiarize yourself with the use case scenario.

Scenario

UC MS Calliostoma canaliculatum
Calliostoma canaliculatum (Lightfoot) 1786 observed in United States of America by Sheldon Logan (licensed under CC-BY-NC 4.0)

This narrative was developed as a basis for practical exercises in the biodiversity data mobilization course and the exercise concept and content was developed initially by David Bloom and Amanda Whitmire for use in CESP2022-009 (Expanding the Network: Bringing marine research stations and library collections into the data-sharing community).

This is a partially fictionalized scenario based on a real project and datasets. It is intended for instructional purposes only. Use case data may not be reproduced for any other purpose without the permission of the Harold A. Miller Library at Hopkins Marine Station (HMS-Library@lists.stanford.edu).

Description

Hopkins Marine Station (HMS) is the marine laboratory for Stanford University. The station is located two-hours south of the main Stanford campus on the western coast of California. With one of the oldest marine laboratories in the United States, the library at HMS houses an extensive collection of historical marine data in the form of ship manifests, logs, fieldbooks, academic papers (including student papers), and other analog materials.

While contributing to several large-scale digitization and online repository projects, the Head Librarian met a local GBIF Node Manager who encouraged HMS to begin the process of digitizing and publishing the library’s extensive collection of biodiversity data to GBIF and other data aggregators. Recently, the Librarian has made it a priority to begin the process to transfer the data from field books, note cards, and into a new collection of image files (.jpg), pdf documents and spreadsheets containing event and occurrence data. The new files, and the data contained within them, will be organized into the library’s new content management system.publications, and images in the library’s archive into a content management system.

The process of digitizing and publishing HMS data will take time. The library budget is allocated annually and does not include funds for digitization or to hire new staff for the project. Human resources are also limited to the Head Librarian (who has a PhD in Oceanography), a Library Assistant with an MS in Library and Information Science, an undergraduate work-study student who is paid for 10 hours per week, and two local retirees who volunteer at the library twice weekly in four-hour blocks.

The Librarian and Assistant already possess a large-format scanner and the necessary software to scan the library’s wide variety of documents, including materials with bindings. Together, they have determined that digitization should begin with the work of one of HMS’ best known collectors, Willis G. Hewatt. They have also developed a preliminary workflow for the job:

Identify a specific locality or transect within the Hewatt archive and then identify any associated historic or modern faculty or student papers, journal articles, books, and associated resources (e.g., index cards, maps, field books).

The Library Assistant will pull all of the analogue resources from the HMS archives and then scan them into appropriate digital formats, using optical character recognition (OCR) whenever possible. Data formatted as tables will be converted and imported into a common spreadsheet.

The volunteers will review the OCR’d files for conversion errors and to confirm that the tables match the analogue source materials. Any living authors will be consulted when questions arise in regard to their work.

Under the supervision of the Library Assistant and Head Librarian (indirectly), the work-study student will map the tables to GBIF Event Core and Occurrence Core fields, adding information from associated papers and other written resources (e.g., journal articles, manuscripts). The Head Librarian will provide a final review to confirm the accuracy and completeness of the data.

All tables will be sent to a local GBIF Node Manager who will provide access to the Node IPT and who will register the HMS with GBIF and ensure that the datasets are published correctly to both GBIF and the Ocean Biodiversity Information System (OBIS).

Original data collection

The first collection of data selected by the Head Librarian is the “Hewatt Transect”, a qualitative observational record of the intertidal zone directly adjacent to the HMS facility. These observations were recorded continuously from 1931 to 1933, with additional observations added in 19341. A student paper in 1993 (Gilman & Sagarin2) sought to resurvey the same transect, although only part of the original transect was observed. Both projects entered into the survey with a list of species expected to be present within the transect. Hewatt’s dissertation and Gilman & Sagarin’s student paper were preserved by the library, along with related published peer-reviewed papers. The data necessary to publish to GBIF and other data aggregators is spread across all of the resources in the library archive.

MS Figure1
Figure 1. Table X from Willis G. Hewatt’s 1934 thesis showing the abundance of the 90 most common taxa across a 105-yard intertidal transect at Hopkins Marine Station.
MS Figure2a
Figure 2. Data tables from Gilman and Sagarin (1993) showing counts of 97 species across 19 quadrats that were sampled along Hewatt’s transect. The tables show Hewatt’s data (white) and their data (shaded), with totals at the right.
MS Figure2b

Data conversion and aggregation

Both Hewatt and Gilman/Sagarin recorded data in a table format for species for each transect. These “lists” were converted into a format aligned to an occurrence dataset.

MS Figure3
Figure 3. Hewatt’s data transcribed into a spreadsheet. This is a critical first step, but this is not a data layout that will work for upload to taxonomic databases.
MS Figure4
Figure 4. Hewatt’s data converted to a format aligned to an occurrence dataset. Some information has been pulled from the thesis to start to build out each occurrence record.

Data Assessment

As a part of the Head Librarian’s review the converted and aggregated tables included special attention to specific data which describe:

  • Observation depth

  • Date and time

  • Geographic coordinates

  • Locality descriptions

  • Research methods

  • Species Identifiers

MS Figure5
Figure 5. Hewatt’s data converted to a format aligned to an occurrence dataset and Darwin Core standard metadata. After more metadata terms are added, each quadrat observation has become a row with over 65 columns of information.

References

  1. Hewatt, Willis Gilliland. “Ecological Studies on Selected Marine Intertidal Communities of Monterey Bay.” Ph.D., Stanford University, 1934. https://purl.stanford.edu/wr792fj0469.

  2. Gilman, Sarah E., and Raphael D. Sagarin. “Temperature Correlated Long-Term Faunal Changes in the Rocky Intertidal.” Hopkins Marine Station Student Papers. Stanford (Calif.), 1993. Stanford Digital Repository. https://purl.stanford.edu/jd254tq7230.

Exercises

Download the exercise sheet. (MS Word, 2.8 MB)

Exercise 1

Data capture

  1. Download UC-MS-2-ForCapture.pdf. (?? KB)

  2. Download UC-MS-2-template.xlsx. (82 KB)

  3. What data structure would you use to reflect the data?

  4. Create a spreadsheet using this structure and the data.

  5. Use the exercise sheet to provide your answers and submit the spreadsheet created in the previous step.

Exercise 2

Data management

  1. Download UC-MS-2-ForCleaning.csv. (82 KB)

  2. Evaluate the dataset and identify which types of errors are present.

  3. Identify possible ways to correct those issues, and perform those corrections for as many of the errors present as you can.

  4. Use the exercise sheet to provide your answers and submit the spreadsheet.

Exercise 3

Data publishing

  1. Download UC-MS-3-ForPublication.csv. (402 KB)

  2. Use the previously provided IPT installation to publish the given dataset.

  3. Use the exercise sheet to provide your answers and link to the published dataset.