Data Access and Reuse
This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.
Hands-on training materials covering many of the topics below are available in the PANGAEA Community Workshop Series, including Jupyter notebooks, R scripts, and slide decks in the PANGAEA community workshop GitHub repository.
Overview
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.
Web-Based Search and Discovery
PANGAEA Search Interface
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at PANGAEA search.
External Portals and Registries
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.
An alternative for search when a PANGAEA-specific search interface is not required is the DataCite Search API (https://support.datacite.org/docs/api), which searches across a large inventory of research data from many repositories. It returns summary metadata including DOI names, and subsequent data access from any PANGAEA result can then proceed via content negotiation as described below.
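As a rough sketch of such a cross-repository query, the helper below builds a request URL for the DataCite REST API. The endpoint https://api.datacite.org/dois and the query and page[size] parameters come from DataCite's public API documentation rather than from this article, so they should be verified there; the function name is purely illustrative.

```python
from urllib.parse import urlencode

# Assumption: DataCite REST API endpoint for DOI records (check the DataCite docs).
DATACITE_DOIS = "https://api.datacite.org/dois"


def datacite_search_url(query, page_size=10):
    """Build a DataCite search URL for a free-text query."""
    params = {"query": query, "page[size]": page_size}
    return f"{DATACITE_DOIS}?{urlencode(params)}"


# Only builds the URL; no network request is made here.
print(datacite_search_url("temperature salinity Atlantic", page_size=5))
```

Each DOI found in the JSON response that belongs to PANGAEA can then be retrieved through content negotiation.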
DOI Landing Pages and Link Discovery
The Landing Page as Access Hub
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:
https://doi.pangaea.de/10.1594/PANGAEA.841672
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML <link> elements. This implementation follows the Signposting standard (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.
Discovering Available Representations
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using curl:
curl -LI https://doi.org/10.1594/PANGAEA.841672
The response Link: header will contain relations of the following types:
- cite-as: the canonical DOI citation URL
- describedby: links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)
- item: links to the data itself (tab-delimited text, HTML view)
- author: ORCID iDs of the dataset authors
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.
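A script would typically group the link targets by relation type before choosing one. The following is a minimal sketch of that step using only the Python standard library. The parser is deliberately simple (it does not handle every corner case of the Link header grammar), and the sample header is hand-written for illustration, not an actual PANGAEA response.

```python
import re
from collections import defaultdict

# One "<target>; param=value; ..." entry of a Link header.
LINK_RE = re.compile(r'<([^>]+)>([^<]*)')
# A single parameter, with a quoted or an unquoted value.
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Group link targets of a Link header by relation type."""
    relations = defaultdict(list)
    for target, raw_params in LINK_RE.findall(value):
        params = {k.lower(): (quoted or bare)
                  for k, quoted, bare in PARAM_RE.findall(raw_params)}
        # A single link may carry several space-separated relation types.
        for rel in params.get("rel", "").split():
            relations[rel].append({"href": target, "type": params.get("type")})
    return dict(relations)


# Hand-written sample header (illustrative only).
sample = (
    '<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
    '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld>; '
    'rel="describedby"; type="application/ld+json", '
    '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile>; '
    'rel="item"; type="text/tab-separated-values"'
)
links = parse_link_header(sample)
for rel, targets in links.items():
    print(rel, [t["href"] for t in targets])
```

In a real workflow, the header value would come from the HEAD response, for example via the headers of a urllib or requests response object.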
HTTP Content Negotiation
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the Accept header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.
Downloading the Data File
The tabular data file for a dataset can be downloaded as tab-delimited text:
curl -OJLf 'https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile'
Equivalently, using content negotiation directly against the DOI:
curl -OJLf -H 'Accept: text/tab-separated-values' https://doi.org/10.1594/PANGAEA.841672
The -OJLf flags instruct curl to write the response to a file (-O) named after the server-provided, and therefore meaningful, filename (-J), to follow redirects (-L, needed to resolve the DOI), and to fail with a non-zero exit code on HTTP errors (-f), which is easier to handle in scripts than an HTML error page.
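The same request can be issued from Python without any PANGAEA-specific library. The sketch below uses only the standard library; the function names are illustrative, and the actual download call is left commented out so that the snippet runs without network access.

```python
import urllib.request


def negotiated_request(doi, mime_type):
    """Prepare a GET request for a DOI with an explicit Accept header."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": mime_type},
    )


def fetch(doi, mime_type, timeout=30):
    """Resolve the DOI and return the response body as bytes.

    urllib follows redirects automatically and raises
    urllib.error.HTTPError on 4xx/5xx responses, roughly like curl -L -f.
    """
    request = negotiated_request(doi, mime_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


req = negotiated_request("10.1594/PANGAEA.841672", "text/tab-separated-values")
print(req.full_url, req.get_header("Accept"))

# Network call, left commented out:
# data = fetch("10.1594/PANGAEA.841672", "text/tab-separated-values")
```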
Retrieving Metadata in a Specific Format
The same mechanism applies to metadata. To retrieve a dataset's metadata in ISO 19139/19115 format:
curl -L -H 'Accept: application/vnd.iso19139.metadata+xml' https://doi.org/10.1594/PANGAEA.841672
The following MIME types are currently supported via content negotiation; all return metadata except the last, which returns the data itself:
| Format | MIME type |
|---|---|
| PANGAEA internal XML | application/vnd.pangaea.metadata+xml |
| DataCite XML (v4) | application/vnd.datacite.datacite+xml |
| ISO 19139 / 19115 | application/vnd.iso19139.metadata+xml |
| NASA DIF | application/vnd.nasa.dif-metadata+xml |
| JSON-LD (Schema.org) | application/ld+json |
| BibTeX | application/x-bibtex |
| RIS | application/x-research-info-systems |
| Plain text citation | text/x-bibliography |
| Tab-separated data | text/tab-separated-values |
Alternatively, the same formats can be requested using explicit URL parameters:
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld
https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex
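When many such URLs are needed, a small helper keeps them consistent. The sketch below covers only the format names that appear in this article; other values may exist but are not assumed here, and the helper name is illustrative.

```python
from urllib.parse import urlencode

LANDING_BASE = "https://doi.pangaea.de/"

# Format names exactly as used in the examples of this article.
KNOWN_FORMATS = {
    "textfile",
    "metadata_datacite4",
    "metadata_iso19139",
    "metadata_jsonld",
    "citation_bibtex",
}


def format_url(doi, fmt):
    """Return the landing page URL that requests one explicit format."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"format not covered by this sketch: {fmt!r}")
    return f"{LANDING_BASE}{doi}?{urlencode({'format': fmt})}"


print(format_url("10.1594/PANGAEA.841672", "metadata_jsonld"))
```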
Access to Restricted Datasets
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication using a bearer token, which is the standard mechanism for API authorization (RFC 6750). Note that access to protected datasets of other authors is, of course, still not possible.
To obtain your bearer token, log in to PANGAEA. Login is supported both with a PANGAEA username and password, and via ORCID iD. After logging in, the user profile page displays the current session token under "Your temporary login token".
Please note: The bearer token is private and should be treated like a password — it must not be shared with others or included in publicly accessible scripts or repositories. If a token has been accidentally disclosed, the user profile page at https://pangaea.de/user/ provides a "Log out from all devices" option, which immediately invalidates all active tokens. When using content negotiation via the DOI resolver at https://doi.org/, it is safe to include the Authorization header: PANGAEA trusts the DOI Foundation's infrastructure, and the request is redirected to PANGAEA's own servers before any protected content is served.
The token can be passed as an Authorization: Bearer header in any HTTP request:
curl -OJLf -H 'Authorization: Bearer <your-token>' -H 'Accept: text/tab-separated-values' https://doi.org/10.1594/PANGAEA.841672
If a request is made without a token to a restricted dataset, the server responds with HTTP status 401 (Unauthorized) and a Bearer authentication challenge, signaling that authentication is required. Recent versions of curl can also supply the token through a dedicated option (--oauth2-bearer). The corresponding command is:
curl -OJLf --oauth2-bearer '<your-token>' -H 'Accept: text/tab-separated-values' https://doi.org/10.1594/PANGAEA.841672
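Because the token must not end up in shared scripts, one common pattern is to read it from an environment variable at run time. The sketch below illustrates this with the standard library. PANGAEA_TOKEN is a hypothetical variable name chosen for this example, and the request targets the landing page host directly, an assumption made here so that the token is only ever sent to PANGAEA's own servers.

```python
import os
import urllib.request


def authorized_request(doi, mime_type, env_var="PANGAEA_TOKEN"):
    """Build a request carrying the bearer token from the environment.

    PANGAEA_TOKEN is a hypothetical variable name used in this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"set {env_var} to your temporary login token")
    return urllib.request.Request(
        f"https://doi.pangaea.de/{doi}",
        headers={
            "Accept": mime_type,
            "Authorization": f"Bearer {token}",
        },
    )


# Usage (requires the environment variable and network access):
# req = authorized_request("10.1594/PANGAEA.841672", "text/tab-separated-values")
# with urllib.request.urlopen(req) as response:
#     data = response.read()
```

A missing or invalid token would surface as urllib.error.HTTPError with code 401 when the request is opened.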
OAI-PMH Metadata Harvesting
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:
https://ws.pangaea.de/oai/
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.
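Incremental harvesting in OAI-PMH works by issuing ListRecords requests and following the resumptionToken returned with each page. The sketch below shows the two building blocks under the assumptions of the generic OAI-PMH 2.0 protocol (verb, metadataPrefix, from, and resumptionToken arguments; oai_dc as the Dublin Core prefix). The exact request URL under the endpoint above and the prefixes PANGAEA offers should be confirmed with an Identify or ListMetadataFormats request; the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords URL; a resumption token excludes other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # lower date bound for incremental runs
    return f"{base_url}?{urlencode(params)}"


def next_resumption_token(xml_text):
    """Return the token for the next page, or None when the list is complete."""
    token = ET.fromstring(xml_text).find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would loop: request a page, process its records, and call list_records_url again with the returned token until next_resumption_token yields None.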
Programmatic Access with Python: pangaeapy
pangaeapy is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.
- PyPI package: https://pypi.org/project/pangaeapy/
- Source code: https://github.com/pangaea-data-publisher/pangaeapy
- Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf
Installation
pip install pangaeapy
Loading a Dataset
The central object in pangaeapy is PanDataSet, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:
from pangaeapy import PanDataSet
ds = PanDataSet('10.1594/PANGAEA.841672')
# Access the data as a pandas DataFrame
print(ds.data.head())
# Access dataset metadata
print(ds.title)
print(ds.authors)
print(ds.params)  # dictionary of measured parameters with units and methods
The data attribute is a pandas DataFrame in which each column corresponds to a measured parameter and each row to a measurement. Parameter metadata, including units, standard names, and method descriptions, is accessible through the params attribute.
Searching PANGAEA from Python
pangaeapy also provides a PanQuery class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:
from pangaeapy import PanQuery
q = PanQuery('temperature salinity Atlantic', limit=10)
for hit in q.result:
    print(hit['URI'], hit['score'])
Access to Restricted Datasets
A bearer token obtained from the PANGAEA user profile can be passed to PanDataSet to access your own datasets that are under moratorium; it does not grant access to restricted datasets of other authors:
ds = PanDataSet('10.1594/PANGAEA.841672', auth_token='<your-bearer-token>')
Further Training Materials
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the Python/ directory).
Programmatic Access with R: pangaear
pangaear is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It offers R users functionality comparable to that of pangaeapy.
- CRAN: https://cran.r-project.org/package=pangaear
- GitHub: https://github.com/ropensci/pangaear
- Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf
Installation
install.packages("pangaear")
# or the development version:
# devtools::install_github("ropensci/pangaear")
Loading a Dataset
library(pangaear)
# Download and load a dataset by DOI; pg_data() returns a list of dataset objects
ds <- pg_data(doi = '10.1594/PANGAEA.841672')
# Access the data of the first (here: only) dataset as a data frame
head(ds[[1]]$data)
# Metadata is accessible from the same list element
ds[[1]]$metadata
Searching PANGAEA from R
res <- pg_search(query = 'temperature salinity Atlantic', count = 10)
print(res$doi)
R scripts and worked examples are available in the R/ directory of the PANGAEA Community Workshop GitHub repository.
The PANGAEA Data Warehouse
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets, potentially hundreds of thousands of individual publications, without having to download and merge files manually.
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:
Through the PANGAEA website: The data warehouse interface is integrated into the search results page. After performing a search, users can select "Data Warehouse" to configure and download a parameter-level aggregation from all datasets in the result set.
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse
Programmatically via SOAP API: The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.
Note that data warehouse exports represent compiled data products and do not replace the need to consult individual dataset landing pages to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.
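Since each data point carries its source DOI, a compiled export can be reduced to the list of contributing datasets that need to be cited. The sketch below assumes a tab-delimited export with a header row and a column named DOI; both the column name and the sample rows are hypothetical and must be adapted to the actual export layout.

```python
import csv
import io
from collections import Counter


def count_points_per_doi(tsv_text, doi_column="DOI"):
    """Count data points per contributing dataset in a tab-delimited export.

    The column name "DOI" is an assumption made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[doi_column] for row in reader if row.get(doi_column))


# Synthetic rows with placeholder identifiers (not real datasets).
sample = (
    "DOI\tTemperature\n"
    "doi-A\t4.2\n"
    "doi-A\t4.5\n"
    "doi-B\t3.9\n"
)
counts = count_points_per_doi(sample)
for doi, n in counts.most_common():
    print(doi, n)
```

The resulting keys are the datasets to cite, and the counts give a quick impression of how much each one contributes to the aggregation.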
Summary of Access Methods
| Use case | Method | Entry point |
|---|---|---|
| Interactive discovery | PANGAEA search | https://www.pangaea.de/ |
| Single dataset access | Browser / landing page | https://doi.pangaea.de/10.1594/PANGAEA.xxxxx |
| Programmatic data download | HTTP content negotiation | curl -H 'Accept: text/tab-separated-values' https://doi.org/... |
| Programmatic metadata download | HTTP content negotiation | curl -H 'Accept: application/ld+json' https://doi.org/... |
| Link and format discovery | HTTP HEAD / Signposting | curl -I https://doi.pangaea.de/... |
| Restricted dataset access (to your own publications) | Bearer token authentication | retrievable via https://pangaea.de/user/ |
| Bulk metadata harvesting | OAI-PMH | https://ws.pangaea.de/oai/ |
| Scripted data access (Python) | pangaeapy | https://pypi.org/project/pangaeapy/ |
| Scripted data access (R) | pangaear | https://cran.r-project.org/package=pangaear |
| Cross-dataset aggregation | Data warehouse | https://wiki.pangaea.de/wiki/Data_warehouse |
| Discovery across repositories | DataCite Search API | https://support.datacite.org/docs/api |
Further Resources
- PANGAEA search — documentation on search syntax and faceted filters
- Data warehouse — data warehouse documentation
- PANGAEA Community Workshops — hands-on training workshops on finding and using PANGAEA data
- PANGAEA Community Workshop GitHub — Jupyter notebooks, R scripts, and slide decks
- pangaeapy on PyPI — Python client
- pangaear on CRAN — R client
- PANGAEA web services — REST and OAI-PMH endpoints
- Signposting standard — link discovery for FAIR data