Data Access and Reuse


This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.

Hands-on training materials covering many of the topics below, including Jupyter notebooks, R scripts, and slide decks, are available from the PANGAEA Community Workshop Series in the PANGAEA community workshop GitHub repository.

Overview

Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.

Web-Based Search and Discovery

PANGAEA Search Interface

The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at PANGAEA search.

External Portals and Registries

PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.

When a PANGAEA-specific search interface is not required, the DataCite Search API (https://support.datacite.org/docs/api) is an alternative: it searches a large inventory of research data from many repositories and returns summary metadata including DOI names. Data access for any PANGAEA result can then proceed via content negotiation as described below.
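
As a minimal sketch of this route, the following Python snippet builds a DataCite query and extracts DOI names and titles from a response. The endpoint https://api.datacite.org/dois, the query and page[size] parameters, and the prefix filter are assumptions based on the public DataCite REST API and are not specified in this article; 10.1594 is simply the prefix of the example DOIs used here, and all helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the DataCite REST API (not given in this article).
DATACITE_DOIS = "https://api.datacite.org/dois"


def build_search_url(query, prefix="10.1594", page_size=10):
    """Build a search URL restricted to one DOI prefix (assumed filter name)."""
    params = {"query": query, "prefix": prefix, "page[size]": page_size}
    return DATACITE_DOIS + "?" + urllib.parse.urlencode(params)


def extract_hits(payload):
    """Return (doi, title) pairs from a decoded JSON response body."""
    hits = []
    for record in payload.get("data", []):
        attrs = record.get("attributes", {})
        titles = attrs.get("titles") or [{}]
        hits.append((attrs.get("doi"), titles[0].get("title")))
    return hits


def search(query, **kwargs):
    """Run the search over the network and return (doi, title) pairs."""
    url = build_search_url(query, **kwargs)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_hits(json.load(resp))


# Example (requires network access):
# for doi, title in search("sea ice thickness"):
#     print(f"https://doi.org/{doi}", title)
```

Each DOI returned this way can be passed unchanged to the content negotiation requests described later in this article.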

DOI Landing Pages and Link Discovery

The Landing Page as Access Hub

Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:

https://doi.pangaea.de/10.1594/PANGAEA.841672

The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML <link> elements. This implementation follows the Signposting standard (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.

Discovering Available Representations

A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using curl:

curl -LI https://doi.org/10.1594/PANGAEA.841672

The response Link: header will contain relations of the following types:

  • cite-as — the canonical DOI citation URL
  • describedby — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)
  • item — links to the data itself (tab-delimited text, HTML view)
  • author — ORCID iDs of the dataset authors

This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.
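
The discovery step can be sketched in Python as follows. This is an illustrative parser for the Link header syntax, not PANGAEA code: the function names are hypothetical, and the parser handles only the common case of quoted or simple unquoted parameter values.

```python
import re
import urllib.request


def parse_link_header(value):
    """Group Link header entries by relation type: {rel: [{"href", "type"}, ...]}."""
    links = {}
    # Split only on commas that start a new "<target>" entry.
    for part in re.split(r",\s*(?=<)", value.strip()):
        match = re.match(r"<([^>]*)>(.*)", part, re.S)
        if not match:
            continue
        target, params = match.groups()
        attrs = {
            key.lower(): val.strip().strip('"')
            for key, val in re.findall(r';\s*([\w-]+)\s*=\s*("[^"]*"|[^;]*)', params)
        }
        # A single entry may carry several space-separated relation types.
        for rel in attrs.get("rel", "").split():
            links.setdefault(rel, []).append({"href": target, "type": attrs.get("type")})
    return links


def fetch_links(url):
    """Request the landing page headers and parse all Link headers (network access)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Only the headers are used; the response body is never read.
        return parse_link_header(", ".join(resp.headers.get_all("Link") or []))


# Example (requires network access):
# links = fetch_links("https://doi.org/10.1594/PANGAEA.841672")
# for entry in links.get("describedby", []):
#     print(entry["type"], entry["href"])
```

Grouping by relation type lets a script pick, for example, the first item link with the desired media type without hard-coding any URL pattern.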

HTTP Content Negotiation

HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the Accept header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.

Downloading the Data File

The tabular data file for a dataset can be downloaded as tab-delimited text:

curl -OJLf 'https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile'

Equivalently, using content negotiation directly against the DOI:

curl -OJLf -H 'Accept: text/tab-separated-values' https://doi.org/10.1594/PANGAEA.841672

The -OJLf flags instruct curl to save the file under the server-provided, meaningful filename (-O, -J), to follow redirects (-L, needed to resolve the DOI), and to exit with a non-zero status on HTTP errors (-f), which is easier to handle in scripts than an HTML error page.
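
The same download can be sketched in Python with the standard library. The parsing step assumes that the tab-delimited file starts with a descriptive comment block enclosed in /* and */ followed by the table with one heading row; treat that layout, and the helper names, as assumptions to verify against an actual download.

```python
import csv
import io
import urllib.request


def tsv_request(doi):
    """Build a request asking the DOI resolver for the tab-delimited representation."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "text/tab-separated-values"},
    )


def split_header(text):
    """Separate a leading /* ... */ comment block (assumed layout) from the table."""
    if text.startswith("/*"):
        end = text.find("*/")
        if end != -1:
            return text[2:end].strip(), text[end + 2:].lstrip("\r\n")
    return "", text


def parse_table(text):
    """Return the table rows as dicts keyed by the column headings."""
    _, body = split_header(text)
    return list(csv.DictReader(io.StringIO(body), delimiter="\t"))


def download_table(doi):
    """Fetch and parse one dataset (network access; HTTP errors raise HTTPError)."""
    with urllib.request.urlopen(tsv_request(doi), timeout=60) as resp:
        return parse_table(resp.read().decode("utf-8"))


# Example (requires network access):
# rows = download_table("10.1594/PANGAEA.841672")
# print(len(rows), "rows; columns:", list(rows[0]) if rows else [])
```

urllib follows the resolver's redirects and raises an exception on HTTP error statuses, which roughly corresponds to the -L and -f flags of the curl command.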

Retrieving Metadata in a Specific Format

The same mechanism applies to metadata. To retrieve a dataset's metadata in ISO 19139/19115 format:

curl -L -H 'Accept: application/vnd.iso19139.metadata+xml' https://doi.org/10.1594/PANGAEA.841672

The following MIME types are currently supported via content negotiation; all but the last row return metadata or citations, and the last row returns the data itself:

Format                  MIME type
PANGAEA internal XML    application/vnd.pangaea.metadata+xml
DataCite XML (v4)       application/vnd.datacite.datacite+xml
ISO 19139 / 19115       application/vnd.iso19139.metadata+xml
NASA DIF                application/vnd.nasa.dif-metadata+xml
JSON-LD (Schema.org)    application/ld+json
BibTeX                  application/x-bibtex
RIS                     application/x-research-info-systems
Plain text citation     text/x-bibliography
Tab-separated data      text/tab-separated-values

Alternatively, the same formats can be requested using explicit URL parameters:

https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex
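
For scripted use, the MIME types listed in this section can be wrapped in a small lookup. The sketch below is illustrative only: the short format keys and function names are hypothetical, while the MIME types are copied from the list above.

```python
import json
import urllib.request

# MIME types as listed in this section; the short keys are hypothetical labels.
METADATA_TYPES = {
    "pangaea": "application/vnd.pangaea.metadata+xml",
    "datacite": "application/vnd.datacite.datacite+xml",
    "iso19139": "application/vnd.iso19139.metadata+xml",
    "dif": "application/vnd.nasa.dif-metadata+xml",
    "jsonld": "application/ld+json",
    "bibtex": "application/x-bibtex",
    "ris": "application/x-research-info-systems",
    "citation": "text/x-bibliography",
}


def metadata_request(doi, fmt="jsonld"):
    """Build a content negotiation request for one metadata format."""
    if fmt not in METADATA_TYPES:
        raise ValueError(f"unknown format {fmt!r}; choose from {sorted(METADATA_TYPES)}")
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": METADATA_TYPES[fmt]},
    )


def fetch_metadata(doi, fmt="jsonld"):
    """Return parsed JSON for JSON-LD, otherwise the raw text (network access)."""
    with urllib.request.urlopen(metadata_request(doi, fmt), timeout=60) as resp:
        text = resp.read().decode("utf-8")
    return json.loads(text) if fmt == "jsonld" else text


# Example (requires network access):
# record = fetch_metadata("10.1594/PANGAEA.841672")
# print(record.get("name"))
# print(fetch_metadata("10.1594/PANGAEA.841672", "bibtex"))
```

Keeping the mapping in one place means a script changes format by changing a key rather than by constructing a different URL.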

Access to Restricted Datasets

A small fraction of PANGAEA datasets are under an active moratorium: access to the data is temporarily restricted while the metadata remains publicly visible. Accessing your own moratorium datasets requires authentication with a bearer token, the standard mechanism for API authorization (RFC 6750). A token does not grant access to protected datasets of other authors.

To obtain your bearer token, log in to PANGAEA. Login is supported both with a PANGAEA username and password, and via ORCID iD. After logging in, the user profile page displays the current session token under "Your temporary login token".

Please note: The bearer token is private and should be treated like a password — it must not be shared with others or included in publicly accessible scripts or repositories. If a token has been accidentally disclosed, the user profile page at https://pangaea.de/user/ provides a "Log out from all devices" option, which immediately invalidates all active tokens. When using content negotiation via the DOI resolver at https://doi.org/, it is safe to include the Authorization header: PANGAEA trusts the DOI Foundation's infrastructure, and the request is redirected to PANGAEA's own servers before any protected content is served.

The token can be passed as an Authorization: Bearer header in any HTTP request:

curl -OJLf -H 'Authorization: Bearer <your-token>' -H 'Accept: text/tab-separated-values' https://doi.org/10.1594/PANGAEA.841672

If a request to a restricted dataset is made without a token, the server responds with HTTP status 401 and a Bearer realm challenge (WWW-Authenticate header), signaling that authentication is required. Recent versions of curl can also supply the token through a dedicated option (--oauth2-bearer). The equivalent command is:

curl -OJLf --oauth2-bearer '<your-token>' -H 'Accept: text/tab-separated-values' https://doi.org/10.1594/PANGAEA.841672
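
In scripts, the token is best kept out of the source code. The following Python sketch reads it from an environment variable and turns a 401 response into a clear error; the variable name PANGAEA_TOKEN and the function names are hypothetical choices for illustration, not a PANGAEA convention.

```python
import os
import urllib.error
import urllib.request


def authorized_request(doi, token=None, env_var="PANGAEA_TOKEN"):
    """Build a data request; the token falls back to an environment variable."""
    token = token or os.environ.get(env_var)
    headers = {"Accept": "text/tab-separated-values"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"https://doi.org/{doi}", headers=headers)


def download(doi, token=None):
    """Return the tab-delimited text, or raise PermissionError on HTTP 401."""
    try:
        with urllib.request.urlopen(authorized_request(doi, token), timeout=60) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 401:
            raise PermissionError(
                f"{doi}: authentication required or token no longer valid"
            ) from err
        raise


# Example (requires network access and a valid token in the environment):
# text = download("10.1594/PANGAEA.841672")
```

Because the token is temporary, a 401 from a previously working script usually means that a fresh token has to be copied from the user profile page.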

OAI-PMH Metadata Harvesting

For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:

https://ws.pangaea.de/oai/

OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.
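
Incremental harvesting follows the generic OAI-PMH pattern of requesting ListRecords and following resumption tokens until none is returned. The sketch below implements that pattern with the standard library. It takes the base URL as a parameter because the exact request URL below the endpoint above should be checked against the service itself, and oai_dc is used as the metadata prefix because the protocol requires every repository to support it. Function names are illustrative.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # e.g. "2024-01-01" for incremental harvests
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token.strip() or None)


def harvest(base_url, **kwargs):
    """Yield record identifiers page by page (network access)."""
    url = list_records_url(base_url, **kwargs)
    while url:
        with urllib.request.urlopen(url, timeout=60) as resp:
            ids, token = parse_page(resp.read())
        yield from ids
        url = list_records_url(base_url, token=token) if token else None


# Example (requires network access; the base URL is an assumption to verify):
# for identifier in harvest("https://ws.pangaea.de/oai/", from_date="2024-01-01"):
#     print(identifier)
```

A production harvester would additionally store the date of the last successful run and handle deleted records and transient errors, which this sketch leaves out.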

Programmatic Access with Python: pangaeapy

pangaeapy is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.

Installation

pip install pangaeapy

Loading a Dataset

The central object in pangaeapy is PanDataSet, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:

from pangaeapy import PanDataSet

ds = PanDataSet('10.1594/PANGAEA.841672')

# Access the data as a pandas DataFrame
print(ds.data.head())

# Access dataset metadata
print(ds.title)
print(ds.authors)
print(ds.params)  # dict of measured parameters with units and methods

The data attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata, including units, standard names, and method descriptions, is accessible through the params attribute.

Searching PANGAEA from Python

pangaeapy also provides a PanQuery class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:

from pangaeapy import PanQuery

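q = PanQuery('temperature salinity Atlantic', limit=10)
print(q.totalcount)  # total number of matches for the query
for result in q.result:
    # each hit is a dict; 'URI' holds the dataset DOI
    print(result['URI'], result['score'])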

Access to Restricted Datasets

A bearer token obtained from the PANGAEA user profile can be passed to PanDataSet to enable access to your own datasets that are subject to a moratorium (not those of other authors):

ds = PanDataSet('10.1594/PANGAEA.841672', auth_token='<your-bearer-token>')

Further Training Materials

Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the Python/ directory).

Programmatic Access with R: pangaear

pangaear is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.

Installation

install.packages("pangaear")
# or the development version:
# devtools::install_github("ropensci/pangaear")

Loading a Dataset

library(pangaear)

# Download a dataset by DOI; pg_data() returns a list of dataset objects
ds <- pg_data(doi = '10.1594/PANGAEA.841672')
ds1 <- ds[[1]]

# Access the data as a data frame
head(ds1$data)

# Metadata is accessible from the same object
ds1$metadata

Searching PANGAEA from R

res <- pg_search(query = 'temperature salinity Atlantic', count = 10)
print(res$doi)

R scripts and worked examples are available in the R/ directory of the PANGAEA Community Workshop GitHub repository.

The PANGAEA Data Warehouse

The PANGAEA data warehouse (based on Clickhouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.

The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:

Through the PANGAEA website: The data warehouse interface is integrated into the search results page. After performing a search, users can select "Data Warehouse" to configure and download a parameter-level aggregation from all datasets in the result set.

Documentation: https://wiki.pangaea.de/wiki/Data_warehouse

Programmatically via SOAP API: The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.

Note that data warehouse exports represent compiled data products and do not replace the need to consult individual dataset landing pages to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.

Summary of Access Methods

Use case                          Method                       Entry point
Interactive discovery             PANGAEA search               https://www.pangaea.de/
Single dataset access             Browser / landing page       https://doi.pangaea.de/10.1594/PANGAEA.xxxxx
Programmatic data download        HTTP content negotiation     curl -L -H 'Accept: text/tab-separated-values' https://doi.org/...
Programmatic metadata download    HTTP content negotiation     curl -L -H 'Accept: application/ld+json' https://doi.org/...
Link and format discovery         HTTP HEAD / Signposting      curl -I https://doi.pangaea.de/...
Restricted access (own datasets)  Bearer token authentication  token retrievable via https://pangaea.de/user/
Bulk metadata harvesting          OAI-PMH                      https://ws.pangaea.de/oai/
Scripted data access (Python)     pangaeapy                    https://pypi.org/project/pangaeapy/
Scripted data access (R)          pangaear                     https://cran.r-project.org/package=pangaear
Cross-dataset aggregation         Data warehouse               https://wiki.pangaea.de/wiki/Data_warehouse
Discovery across repositories     DataCite Search API          https://support.datacite.org/docs/api

Further Resources