Term catalogue

From PANGAEA Wiki

The PANGAEA term catalogue

The PANGAEA term catalogue is an integral part of the internal relational data management system. It functions as a thesaurus-like "construction kit" that enables the use of controlled vocabularies (including thesauri, taxonomies, terminologies, and ontologies) for enriching and standardizing metadata across datasets. It consists of terms (concepts) and relationships that can be linked to various components of the PANGAEA data model (e.g., parameters, methods/devices).

To support consistent metadata use across disciplines, the PANGAEA term catalogue includes both internally maintained vocabularies such as "Microbiochemistry, PANGAEA" or "Keywords" and externally curated vocabularies such as the World Register of Marine Species (WoRMS) and Chemical Entities of Biological Interest (ChEBI). PANGAEA maintains bidirectional workflows with some external vocabularies: new species names, for instance, can be submitted to WoRMS, while updated records at WoRMS are regularly imported into PANGAEA.

The term catalogue in PANGAEA is designed to support structured and semantically rich metadata annotation. Each term in the catalogue can be linked to other terms through qualified relationships. Current PANGAEA relationships include "has broader term", "has attribute", "is synonym of", "is same as", and "is related to", among others. These semantic relations form an open, hierarchical, and context-sensitive structure that enables flexible classification from multiple perspectives and supports the semantic alignment of equivalent concepts across different vocabularies. Wherever possible, terms are linked to persistent identifiers (URIs) to ensure compatibility with semantic web technologies and to minimize redundancy.
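The qualified relationships described above can be pictured as a small graph with typed edges. The following is a minimal sketch, not PANGAEA's actual data model: the class `TermGraph`, its methods, the example terms, and the `urn:example:` identifier are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TermGraph:
    """Hypothetical in-memory store of terms linked by qualified relationships."""

    def __init__(self):
        # relation name -> source term -> set of target terms
        self._edges = defaultdict(lambda: defaultdict(set))

    def relate(self, source, relation, target):
        """Record a qualified relationship such as 'has broader term'."""
        self._edges[relation][source].add(target)

    def targets(self, term, relation):
        """Return the terms directly linked to `term` via `relation`."""
        return set(self._edges[relation].get(term, set()))

    def broader_closure(self, term):
        """Return every term reachable by following 'has broader term' edges."""
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            for parent in self.targets(current, "has broader term"):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = TermGraph()
graph.relate("Calanus finmarchicus", "has broader term", "Calanus")
graph.relate("Calanus", "has broader term", "Copepoda")
# "is same as" aligns a term with an equivalent concept in another vocabulary.
graph.relate("Calanus finmarchicus", "is same as", "urn:example:other-vocab:12345")

print(sorted(graph.broader_closure("Calanus finmarchicus")))
# → ['Calanus', 'Copepoda']
```

Keeping the relation name on every edge is what allows the same term to be classified from several perspectives at once, since each relation type can be traversed independently.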

Purpose and benefits

By relying on shared, well-documented vocabularies, the term catalogue ensures metadata is described in a standardized, unambiguous, and machine-readable format. This improves semantic consistency within PANGAEA, enables the integration of heterogeneous datasets, and facilitates alignment with external systems.

In practice, the catalogue improves:

  • Metadata quality, by referring to unambiguous concepts and a unifying terminology, and by preventing duplication.
  • Search functionality, by supporting hierarchical filtering (faceting), synonym recognition, thematic grouping, and broader-term expansion. For example, temperature parameters such as “temperature, water” and “temperature, ice” are each automatically annotated with two terms: the first, “temperature”, represents the measured quantity (quantity kind in PANGAEA), and the second, given here after the comma (“water” or “ice”), defines the measurement environment (feature in PANGAEA; for more information see https://doi.org/10.1016/j.jbiotec.2017.07.016). This allows such parameters to be grouped differently depending on the domain context (e.g., measured in water vs. in ice). For further details on the PANGAEA search, see also PANGAEA search.
  • Metadata validation. For example, a relation like "is unit of" or "is method of" could be used to ensure that only appropriate units or methods are assigned to a specific parameter.
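The validation idea in the last bullet can be sketched as a simple lookup. This is an illustration under stated assumptions only: the `IS_UNIT_OF` table and the function `unit_is_valid` are hypothetical and do not reflect how PANGAEA stores or checks "is unit of" relations.

```python
# Hypothetical "is unit of" relations: unit -> quantities it may be assigned to.
IS_UNIT_OF = {
    "K": {"Temperature"},
    "km/h": {"Velocity"},
    "m/s": {"Velocity"},
}


def unit_is_valid(quantity: str, unit: str) -> bool:
    """Return True if an 'is unit of' relation links `unit` to `quantity`."""
    return quantity in IS_UNIT_OF.get(unit, set())


print(unit_is_valid("Temperature", "K"))     # → True
print(unit_is_valid("Temperature", "km/h"))  # → False
```

Unknown units are rejected rather than accepted by default, so a missing relation shows up as a validation failure instead of passing silently.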

How is the metadata annotated with terms from the term catalogue?

Many metadata fields in PANGAEA are initially populated using free-text descriptions. These entries are subsequently annotated with appropriate controlled terms from either externally maintained vocabularies or PANGAEA’s own internal terms. In most cases, particularly for parameters, this annotation process is semi-automated: a script analyzes segments of the input text. The script parses the string based on predefined rules, such as character position or specific separators (e.g., commas), to identify distinct sections and match them with corresponding terms in the catalogue. For example, characters in positions 1–10 may be linked to one term, while a comma signals the beginning of a new annotation segment. A single parameter section can be directly associated with only one controlled term. The vocabularies are ranked by priority, so that it is always clear which vocabulary supplies the term. After this automatic step, data editors can manually review and approve the suggested annotations to ensure correctness and consistency.
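The separator-based part of this process can be sketched as follows. This is a minimal illustration, not the parameter-annotator script itself: the vocabulary contents, the term identifiers, and the function `annotate` are hypothetical, and only the comma rule and the vocabulary ranking are taken from the description above.

```python
# Hypothetical vocabularies in priority order (highest first).
# Term identifiers are placeholders, not real catalogue entries.
RANKED_VOCABULARIES = [
    ("WoRMS", {"calanus finmarchicus": "worms:example-1"}),
    ("ITIS", {"calanus finmarchicus": "itis:example-2"}),
    ("Quantities, PANGAEA", {"temperature": "pan-quantity:example-3"}),
    ("ENVO", {"water": "envo:example-4"}),
]


def annotate(parameter_name):
    """Split a parameter name at commas and suggest one term per segment."""
    suggestions = []
    for segment in (part.strip() for part in parameter_name.split(",")):
        vocabulary, term = None, None
        # The first vocabulary that knows the segment wins (ranking).
        for name, terms in RANKED_VOCABULARIES:
            if segment.lower() in terms:
                vocabulary, term = name, terms[segment.lower()]
                break
        suggestions.append({
            "segment": segment,
            "vocabulary": vocabulary,
            "term": term,
            "approved": False,  # a data editor still has to review it
        })
    return suggestions


for suggestion in annotate("Calanus finmarchicus, female"):
    print(suggestion)
```

Segments without a match are kept with an empty suggestion, which leaves them visible for the manual review step instead of dropping them.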

In other metadata fields, such as methods or devices, the annotation is currently performed manually by data editors. In certain fields, especially those involving keywords or geographic locations, metadata is described directly using controlled terms from the catalogue, without any free-text input.

Annotated terms are publicly visible and are displayed on the PANGAEA website. When users hover over an annotated element, the linked term and its source are revealed in a tooltip-like overlay. For technical details or contributions, see the parameter-annotator GitHub repository.

Viewing the terms used to annotate metadata

The terms used to enrich PANGAEA datasets can be both inspected by users through the web interface and harvested by machines.

Figure: Example of a popup window showing the terms from controlled vocabularies used to annotate the given PANGAEA parameter.

Seeing terms in the user interface

In the PANGAEA web interface, it is not always visible whether metadata entries (e.g., locations or keywords) were taken directly from a controlled vocabulary or entered as free text. However, when free-text metadata has subsequently been annotated with terms, the entry contains a link. Clicking on it opens a popup window showing which controlled terms were used for annotation.

Retrieving terms through dataset harvesting

Terms are exposed both in the schema.org format (see example dataset; more details on the schema are available at schema.org and here) and in the PANGAEA metadata schema format (see example dataset; more information on standard PANGAEA metadata interfaces is available here). Terms appear in fields such as parameters, methods, Events, locations, and dataset keywords. Each term is provided with a semantic identifier (semantic URI) that can be harvested by external systems. Neither the terms alone nor the full relations between terms and the underlying terminologies can currently be obtained from PANGAEA as linked RDF data. For some terminologies, however, term lists are available for download.
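As an illustration of harvesting, the sketch below collects term identifiers from a schema.org JSON-LD document. The embedded sample is hypothetical: its nesting (`variableMeasured`, `subjectOf`, `hasDefinedTerm`), the use of `identifier` for the semantic URI, and the `urn:example:` values are assumptions made for this example and may differ from what PANGAEA actually delivers. The walker therefore searches the whole document for `DefinedTerm` nodes instead of relying on a fixed path.

```python
import json

# Hypothetical JSON-LD fragment; not copied from a real PANGAEA dataset.
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Hypothetical example dataset",
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "Temperature, water",
      "subjectOf": {
        "@type": "DefinedTermSet",
        "hasDefinedTerm": [
          {"@type": "DefinedTerm", "name": "Temperature",
           "identifier": "urn:example:term:1"},
          {"@type": "DefinedTerm", "name": "water",
           "identifier": "urn:example:term:2"}
        ]
      }
    }
  ]
}
"""


def collect_defined_terms(node):
    """Recursively gather (name, identifier) pairs of all DefinedTerm nodes."""
    found = []
    if isinstance(node, dict):
        if node.get("@type") == "DefinedTerm":
            found.append((node.get("name"), node.get("identifier")))
        for value in node.values():
            found.extend(collect_defined_terms(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_defined_terms(item))
    return found


terms = collect_defined_terms(json.loads(SAMPLE_JSONLD))
for name, identifier in terms:
    print(name, identifier)
```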

List of implemented terminologies

External vocabularies

Each entry gives the name of the taxonomy/dictionary/terminology/ontology, the PANGAEA abbreviation, a description, the URI, technical details, and the metadata it is used to annotate.
The World Register of Marine Species, Aphia 1.0

  • PANGAEA abbreviation: WoRMS
  • Description: The World Register of Marine Species (WoRMS) provides an authoritative and comprehensive list of names of marine organisms, including information on synonymy. The Aphia platform is an infrastructure designed to capture taxonomic and related data and information, and includes an online editing environment. WoRMS includes information on algae from AlgaeBase, which is redistributed by WoRMS with permission. WoRMS does not include prokaryotes.
  • URI: https://www.marinespecies.org/
  • Technical details: PANGAEA imports taxonomic terms and relations from WoRMS. These terms are updated monthly via data dumps provided by WoRMS. Data editors may manually create terms that WoRMS has not yet accepted (status “PANGAEA accepted”), provided they are submitted to WoRMS; such terms are deleted if WoRMS rejects them. Accepted terms are updated automatically. Example of a term entry: https://ws.pangaea.de/es/pangaea-terms/term/1047579
  • Used to annotate: Parameters (text segments of parameters that are taxonomic features, e.g. "Ammodiscus planus" or "Calanus finmarchicus, female, biomass as dry weight"), and taxonomic contents of data series (if those are associated with term-related taxonomic parameters). WoRMS terms are annotated semi-automatically with first priority. An ITIS term is annotated, where possible, only if no suitable WoRMS term is available.

Integrated Taxonomic Information System

  • PANGAEA abbreviation: ITIS
  • Description: ITIS provides authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world.
  • URI: http://www.itis.gov/
  • Technical details: PANGAEA imports taxonomic terms and relations from ITIS. These terms are updated on a monthly basis. Importer available on GitHub: https://github.com/pangaea-data-publisher/pg_itis_importer
  • Used to annotate: Parameters (taxonomic features), and contents of data series (if those are associated with term-related taxonomic parameters), only if no WoRMS term is available. Semi-automatic annotation.

Environment ontology

  • PANGAEA abbreviation: ENVO
  • Description: ENVO is a community ontology for the concise, controlled description of environments.
  • URI: http://obofoundry.org/ontology/envo.html
  • Technical details: One-time import from ENVO into PANGAEA; no importer.
  • Used to annotate: Parameters (text segments of parameters that are environmental features, e.g. "Deuterium excess, water vapour"). Semi-automatic annotation.

Phenotype And Trait Ontology

  • PANGAEA abbreviation: PATO
  • Description: PATO provides terms for phenotypic qualities (properties). This ontology can be used in conjunction with other ontologies such as the Gene Ontology (GO) or anatomical ontologies to refer to phenotypes. Examples of qualities are red, ectopic, high temperature, fused, small, edematous, and arrested.
  • URI: http://obofoundry.org/ontology/pato.html
  • Technical details: The availability of new PATO versions is checked on a monthly basis. If a new version is available, the OWL file provided on GitHub is processed and imported into PANGAEA. During this import, only selected ontology elements (selected terms, not all relations) are imported.
  • Used to annotate: Parameters (text segments of parameters that are phenotypic features, e.g. "Depth, water"). Semi-automatic annotation.

Quantities, Units, Dimensions and DataTypes Ontology

  • PANGAEA abbreviation: qudt
  • Description: QUDT.org is a public charity nonprofit organization founded to provide semantic specifications for units of measure, quantity kinds, dimensions, and data types. QUDT version 1.1 is a structured vocabulary and ontology that provides a standardized representation of units of measurement and their corresponding quantities. It includes definitions for physical quantities (e.g., length, mass, temperature), units (e.g., meter, kilogram, kelvin), and the relationships between them (e.g., conversion factors, symbols, dimensional analysis).
  • URI: https://www.qudt.org/
  • Technical details: One-time import of QUDT version 1.1 as part of the "Quantities, PANGAEA" terminology. Later versions of QUDT diverged from the PANGAEA requirements, as they incorporate broader and more complex structures beyond the scope of units and their associated measurement quantities, but lack some of the necessary mappings PANGAEA relies on.
  • Used to annotate: QUDT forms the core of the custom "Quantities, PANGAEA" terminology, powering the background mapping of units to UCUM and linking parameters to quantities. For details, see the description of "Quantities, PANGAEA" below.

Note: PANGAEA plans to change all units to UCUM.

Chemical Entities of Biological Interest

  • PANGAEA abbreviation: ChEBI
  • Description: ChEBI is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
  • URI: https://www.ebi.ac.uk/chebi/
  • Technical details: ChEBI terms are imported into PANGAEA on a monthly basis by processing the OWL file located on GitHub.
  • Used to annotate: Parameters (chemical features). Semi-automatic annotation. Challenge: the ChEBI systematics is more detailed and sophisticated than PANGAEA can make use of, and the assignment does not work particularly well because it is often not clear which term is the right one. Based on InChI keys.

NERC device categories

  • PANGAEA abbreviation: NERC-L05-L22
  • Description: NERC SeaDataNet device categories (L05) is a terminology providing standardized categories for marine instruments (“device types”). The NERC SeaVoX Device Catalogue (L22) lists specific device models with detailed metadata.
  • URI: https://vocab.nerc.ac.uk/collection/L05/current/ and https://vocab.nerc.ac.uk/collection/L22/current/
  • Technical details: Importer available on GitHub.
  • Used to annotate: Many existing methods have been manually annotated with terms from the terminology. Newly created methods can be annotated manually by data editors. Automated annotation is planned for the future. Annotation can be done with terms of any granularity level (Level 1, Level 2, Level 3, Level 4), depending on how generic the method is and how many details the authors provided.

PANGAEA-own vocabularies

Each entry gives the name of the terminology, the PANGAEA abbreviation, a description, the URI, technical details, and the metadata instances it is used to annotate.
Quantities, PANGAEA

  • PANGAEA abbreviation: PAN-Quantity
  • Description: The "Quantities, PANGAEA" terminology comprises units and quantities from the harvested QUDT 1.1 terminology as well as other complementary quantities defined by PANGAEA. Later versions of QUDT diverged from the PANGAEA requirements, as they incorporate broader and more complex structures beyond the scope of units and their associated measurement quantities, but lack some of the necessary mappings PANGAEA relies on.
  • URI: Unpublished; only annotated terms are visible to the public.
  • Technical details: One-time import of the quantities of QUDT version 1.1.
  • Used to annotate: Parameters (quantities), in two ways:
    • Direct annotation, e.g. "Velocity, horizontal"
    • Indirect annotation by the unit, e.g. the term "Velocity" is added if the unit is "km/h"

Units archived in PANGAEA are automatically processed by a script from the GitLab repository PUCUM, which converts them into the standardized UCUM format (Unified Code for Units of Measure). In addition to the unit itself, this format also specifies the underlying dimension (e.g., {Length}, {Mass}). A JSON file in the repository defines how these UCUM dimensions are mapped to the corresponding quantities in PANGAEA. This ensures that units are interpreted consistently and linked reliably to the relevant quantities in the system.
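The indirect annotation by unit can be sketched as a two-step lookup: unit to dimension, then dimension to quantity. The sketch below does not call PUCUM. The table `UNIT_TO_UCUM`, the composite dimension string "{Length}/{Time}", the contents of `DIMENSION_MAPPING_JSON`, and the function `quantity_for_unit` are hypothetical stand-ins for the conversion result and for the JSON mapping file mentioned above.

```python
import json

# Hypothetical result of the unit conversion step:
# archived unit -> (UCUM code, underlying dimension).
UNIT_TO_UCUM = {
    "km/h": ("km/h", "{Length}/{Time}"),
    "m/s": ("m/s", "{Length}/{Time}"),
    "kg": ("kg", "{Mass}"),
}

# Hypothetical mapping file content: UCUM dimension -> quantity term.
DIMENSION_MAPPING_JSON = """
{
  "{Length}/{Time}": "Velocity",
  "{Mass}": "Mass",
  "{Length}": "Length"
}
"""


def quantity_for_unit(unit, mapping):
    """Return the quantity implied by a unit, or None if it cannot be derived."""
    entry = UNIT_TO_UCUM.get(unit)
    if entry is None:
        return None
    _ucum_code, dimension = entry
    return mapping.get(dimension)


mapping = json.loads(DIMENSION_MAPPING_JSON)
print(quantity_for_unit("km/h", mapping))  # → Velocity
```

Routing the lookup through the dimension rather than the unit string means that every unit sharing a dimension resolves to the same quantity without needing its own mapping entry.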

Classifying terms, PANGAEA

  • PANGAEA abbreviation: PAN-CT
  • Description: The "Classifying terms, PANGAEA" terminology is used to define the thematic faceting displayed under “Topics” on the PANGAEA start page. It contains mappings to other terms, enabling terms used within datasets to be correctly classified into the appropriate topic (for further information, see Topic).
  • URI: Visible via the faceting in the PANGAEA search (“Topic” facets).
  • Technical details: Weekly dump.
  • Used to annotate: Generates the search index.

Microbiochemistry, PANGAEA

  • PANGAEA abbreviation: PAN-MicroBio
  • Description: Manually created feature ontology meant to build a framework that embeds and connects other vocabularies in the term catalogue, e.g. ChEBI and "Classifying terms, PANGAEA". Note: The terminology should at some point be reviewed and remodeled; part of the trees should be kept and transferred elsewhere or merged with the classifying terms.
  • URI: Unpublished; only the annotated terms are visible to the public.
  • Technical details: Created manually, not completed.
  • Used to annotate: Parameters (features).
Methods and Devices, PANGAEA

  • PANGAEA abbreviation: PAN-M&D
  • Description: The PANGAEA Methods and Devices terminology was developed in 2021 with the aim of structuring and harmonizing the methods/devices metadata. The terminology combines PANGAEA's own terms for broad categories of methods and devices with integrated terms from NERC L05 and L22 (see above), which represent more detailed information on device types and device models (for further information, see Intern:Method and Devices Terminology).
  • URI: Unpublished; only the annotated terms are visible to the public.
  • Technical details: Created manually. Structure of the terminology:
    • Devices
      • Level 1: Broad device categories (e.g., Measuring devices)
      • Level 2: Specific functional groups (e.g., Current meters)
      • Level 3: NERC L05 – Device types
      • Level 4: NERC L22 – Device models
    • Other Methods (under construction)
      • Level 1: Broader categories for non-device-based methods
  • Used to annotate: Many existing methods/devices are already annotated manually with terms from “Methods and Devices, PANGAEA”. New methods can be annotated manually by the data editors. Note: For methods/devices used in Events, more generic terms might apply than for methods/devices associated with parameters. In the long term, a process for automated annotation is planned.

Locations, PANGAEA

  • PANGAEA abbreviation: PAN-Loc
  • Description: PANGAEA-own terminology for geographic locations, with terms created and edited by data editors in the course of the curation workflow (for further information, see Intern:Locations). Should be reviewed and cleaned up when time allows (e.g. upper/lower case, removal of duplicates).
  • URI: Unpublished; only the annotated terms are visible to the public.
  • Technical details: PANGAEA locations are created as needed and maintained by the PANGAEA data editors. Naming is guided by established standards like ISO-3166, marineregions.org, gebco.net, and geonames.org. For more details, see Intern:Locations.
  • Used to annotate: Secondary (optional) Event metadata, supplementing coordinates or geocodes (manual addition by data editors).

In addition to the locations manually assigned to Events by curators, further locations are automatically added to datasets in the background to enhance search structuring. Based on latitude and longitude, certain locations (e.g. continents) are derived from the IHO list (available at https://www.marineregions.org/files/S23_1953.pdf).

Note: Location information is also added as start/end attributes to “Campaigns”, but so far only as city names with country, e.g. "Bremerhaven, Germany" (added as text string), from the C38 SeaDataNet list of port cities.
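The background derivation of locations from latitude and longitude, described above, can be illustrated with a simple containment test. This is a sketch under strong assumptions: real sea areas and continents are polygons, whereas the example uses invented rectangular regions, and the names `Region`, `REGIONS`, and `derive_locations` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular region given in decimal degrees."""
    name: str
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        # Simplification: ignores regions that cross the antimeridian.
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Invented example regions; they do not correspond to real boundaries.
REGIONS = [
    Region("Example region A", south=50.0, north=60.0, west=-5.0, east=10.0),
    Region("Example region B", south=55.0, north=70.0, west=0.0, east=30.0),
]


def derive_locations(lat: float, lon: float) -> list:
    """Return the names of all regions that contain the given position."""
    return [region.name for region in REGIONS if region.contains(lat, lon)]


print(derive_locations(57.0, 5.0))
# → ['Example region A', 'Example region B']
```

A position may fall into several regions at once, so the function returns a list; this matches the idea of adding more than one derived location to a dataset.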

Keywords, PANGAEA

  • PANGAEA abbreviation: PAN-Key
  • Description: PANGAEA keyword terminology, with terms created and edited by data editors in the course of the curation workflow. Should be reviewed and cleaned up when time allows (e.g. upper/lower case, removal of duplicates).
  • URI: ??
  • Technical details: PANGAEA keywords are created and added to metadata by PANGAEA data editors as needed (often based on suggestions by the data authors).
  • Used to annotate: Manual tagging of datasets, staff, institutions, Events, parameters, and references; extendable to other tables.
Technical keywords, PANGAEA

  • PANGAEA abbreviation: PAN-TechKey
  • Description: A vocabulary used for the manual tagging of metadata, primarily for technical classification and organizational purposes (e.g., creation of selections). The vocabulary comprises part of the keywords from the former, no longer existing “PANGAEA thesaurus”; the other part moved into “Keywords, PANGAEA”.
  • URI: PANGAEA-own list, partially downloadable (only keywords that are in use): https://ws.pangaea.de/oai/provider?verb=ListSets
  • Technical details: PANGAEA technical keywords are created and maintained by the technical PANGAEA staff. Formatting is strictly regulated: only letters and numbers, no special characters (ASCII only).
  • Used to annotate: Manual tagging of personnel entries, institutions, Events, parameters, records, and references (and possibly other tables).
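The formatting rule for technical keywords (letters and numbers only, ASCII only) can be expressed as a single pattern check. This is a minimal sketch; the function name `is_valid_technical_keyword` and the sample keywords are hypothetical, and treating the empty string as invalid is an assumption.

```python
import re

# ASCII letters and digits only; at least one character (assumption).
TECH_KEYWORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_technical_keyword(keyword: str) -> bool:
    """Return True if the keyword consists solely of ASCII letters and digits."""
    return TECH_KEYWORD_PATTERN.fullmatch(keyword) is not None


print(is_valid_technical_keyword("Project2021"))  # → True
print(is_valid_technical_keyword("sea-ice"))      # → False
```

The explicit character class is used instead of `\w` or `str.isalnum()`, because both of those also accept non-ASCII letters.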
Data Model Extensions, PANGAEA

  • PANGAEA abbreviation: PAN-ModExt
  • Description: PANGAEA terminology collecting all PANGAEA attributes and their relationship to scientific disciplines.
  • URI: Unpublished; only the annotated terms are visible to the public.
  • Technical details: For more details, see Intern:Attributes.
  • Used to annotate: In selected cases, attributes can be added to Events, campaigns, datasets, and, in the future, data series.

PANGAEA is considering including more recognized terminologies, such as Uberon and SWEET, in the catalogue over time.

References

Diepenbroek, M et al. (2017): Terminology supported archiving and publication of environmental science data in PANGAEA. Journal of Biotechnology, 261, 177-186, https://doi.org/10.1016/j.jbiotec.2017.07.016

WoRMS Editorial Board (2023) World Register of Marine Species. Available from https://www.marinespecies.org at VLIZ. https://doi.org/10.14284/170

Integrated Taxonomic Information System (ITIS) (2023) www.itis.gov, CC0, https://doi.org/10.5066/F7KH0KBK

Guiry MD & Guiry GM (2023) AlgaeBase. World-wide electronic publication, National University of Ireland, Galway. https://www.algaebase.org

Buttigieg, P.L., Morrison, N., Smith, B. et al. (2013) The environment ontology: contextualising biological and biomedical entities. Journal of Biomedical Semantics;4:43. https://doi.org/10.1186/2041-1480-4-43

FAIRsharing.org (2023) QUDT; Quantities, Units, Dimensions and Types, https://doi.org/10.25504/FAIRsharing.d3pqw7

Hastings J, Owen G, Dekker A, et al. (2016) ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research;44(D1):D1214-D1219. https://doi.org/10.1093/nar/gkv1031

British Oceanographic Data Centre (2023) The NERC Vocabulary Server, Natural Environment Research Council. https://vocab.nerc.ac.uk

Kim S, Chen J, Cheng T, et al. (2023) PubChem 2023 update. Nucleic Acids Research;51(D1):D1373–D1380. https://doi.org/10.1093/nar/gkac956