<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.pangaea.de/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Lmoeller</id>
	<title>PANGAEA Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.pangaea.de/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Lmoeller"/>
	<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/wiki/Special:Contributions/Lmoeller"/>
	<updated>2026-04-28T17:33:53Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.6</generator>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16843</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16843"/>
		<updated>2026-04-22T10:46:41Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Access to Restricted Datasets */ changed formatting an position of the disclaimer&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When a PANGAEA-specific search interface is not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches a large inventory of research data from many repositories. It returns summary metadata including DOI names, and data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
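As an illustration, such a cross-repository query can be assembled with Python&#039;s standard library alone. The snippet below only builds a request URL for DataCite&#039;s documented &amp;lt;code&amp;gt;/dois&amp;lt;/code&amp;gt; endpoint and its &amp;lt;code&amp;gt;query&amp;lt;/code&amp;gt; parameter; the helper name is illustrative and no request is sent:&lt;br /&gt;

```python
from urllib.parse import urlencode

# Illustrative helper: build a DataCite REST API full-text search URL.
# Only the URL is constructed here; fetch it with any HTTP client.
def datacite_search_url(terms, page_size=10):
    params = {"query": terms, "page[size]": page_size}
    return "https://api.datacite.org/dois?" + urlencode(params)

url = datacite_search_url("temperature salinity Atlantic")
print(url)
```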
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
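As a sketch of such a machine client, the Python snippet below parses the comma-separated &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header form shown above into typed relations. The header value here is an abbreviated, illustrative example; a production harvester should use a complete RFC 8288 parser:&lt;br /&gt;

```python
import re

# Minimal Link header parser (sketch): extracts href, rel and the
# optional type attribute from each comma-separated link-value.
def parse_link_header(header):
    links = []
    for part in header.split(","):
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*type="([^"]+)")?', part)
        if m:
            links.append({"href": m.group(1), "rel": m.group(2), "type": m.group(3)})
    return links

# Abbreviated, illustrative example of a Signposting-style Link header:
header = ('<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile>; '
          'rel="item"; type="text/tab-separated-values"')
for link in parse_link_header(header):
    print(link["rel"], link["href"])
```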
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided (i.e. meaningful) filename, follow redirects (resolving the DOI), and fail with a nonzero exit code on HTTP errors (easier to handle in scripts than the corresponding HTML error pages).&lt;br /&gt;
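The same content negotiation can be expressed with Python&#039;s standard library. The request below is only constructed, not sent; passing it to &amp;lt;code&amp;gt;urllib.request.urlopen()&amp;lt;/code&amp;gt; would perform the download and follow the DOI redirect chain, mirroring &amp;lt;code&amp;gt;curl -L&amp;lt;/code&amp;gt;:&lt;br /&gt;

```python
import urllib.request

# Build a GET request for the tab-delimited representation of a dataset.
# urlopen(req) would perform the download and follow the DOI redirects.
req = urllib.request.Request(
    "https://doi.org/10.1594/PANGAEA.841672",
    headers={"Accept": "text/tab-separated-values"},
)
print(req.full_url, req.get_header("Accept"))
```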
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
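These URL patterns are regular, so a small helper can generate them. The function below is purely illustrative (the format names are the ones documented on this page):&lt;br /&gt;

```python
# Illustrative helper: build an explicit format URL for a PANGAEA dataset
# from its numeric ID and one of the documented format names.
def pangaea_format_url(pangaea_id, fmt):
    return f"https://doi.pangaea.de/10.1594/PANGAEA.{pangaea_id}?format={fmt}"

url = pangaea_format_url(841672, "metadata_iso19139")
print(url)  # -> https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139
```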
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication using a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for API authorization (RFC 6750). Note that protected datasets of other authors remain inaccessible regardless of authentication.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported both with a PANGAEA username and password, and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; &lt;br /&gt;
 &#039;&#039;&#039;Please note:&#039;&#039;&#039; The bearer token is private and should be treated like a password — it must not be shared with others or included in publicly accessible scripts or repositories. If a token has been accidentally disclosed, the user profile page at https://pangaea.de/user/ provides a &amp;quot;Log out from all devices&amp;quot; option, which immediately invalidates all active tokens. When using content negotiation via the DOI resolver at https://doi.org/, it is safe to include the Authorization header: PANGAEA trusts the DOI Foundation&#039;s infrastructure, and the request is redirected to PANGAEA&#039;s own servers before any protected content is served.&lt;br /&gt;
&lt;br /&gt;
The token can be passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request is made without a token to a restricted dataset, the server responds with &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; challenge header, signaling that authentication is required. In recent versions, curl can provide the token with an explicit flag (&amp;lt;code&amp;gt;--oauth2-bearer&amp;lt;/code&amp;gt;). The respective command example would then be:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf --oauth2-bearer xxx -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
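In Python, the same authenticated request can be sketched with the standard library. The snippet reads the token from an environment variable rather than hard-coding it; the &amp;lt;code&amp;gt;PANGAEA_TOKEN&amp;lt;/code&amp;gt; variable name and the fallback value are placeholders, and the request is only constructed, not sent:&lt;br /&gt;

```python
import os
import urllib.request

# Read the bearer token from the environment instead of embedding it in
# the script; "example-token" is a placeholder, not a real credential.
token = os.environ.get("PANGAEA_TOKEN", "example-token")

req = urllib.request.Request(
    "https://doi.org/10.1594/PANGAEA.841672",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "text/tab-separated-values",
    },
)
print(req.get_header("Authorization").startswith("Bearer "))
```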
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
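As an illustration, a harvesting request can be assembled from the standard verbs and parameters defined by the OAI-PMH 2.0 specification. The helper below only builds the URL, using the Dublin Core prefix &amp;lt;code&amp;gt;oai_dc&amp;lt;/code&amp;gt; and an optional &amp;lt;code&amp;gt;from&amp;lt;/code&amp;gt; date for incremental harvesting; the function name is illustrative:&lt;br /&gt;

```python
from urllib.parse import urlencode

# Illustrative helper: build an OAI-PMH ListRecords request URL against
# the PANGAEA endpoint. "from" restricts the harvest to records changed
# since the given date (incremental harvesting).
def oai_list_records(metadata_prefix, from_date=None):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return "https://ws.pangaea.de/oai/?" + urlencode(params)

url = oai_list_records("oai_dc", from_date="2025-01-01")
print(url)
```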
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to your own datasets that are under moratorium (not to those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports represent compiled data products and do not replace the need to consult individual dataset landing pages to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16842</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16842"/>
		<updated>2026-04-22T10:40:39Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Access to Restricted Datasets */ Added disclaimer notice on the security of tokens&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When a PANGAEA-specific search interface is not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches a large inventory of research data from many repositories. It returns summary metadata including DOI names, and data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided (i.e. meaningful) filename, follow redirects (resolving the DOI), and fail with a nonzero exit code on HTTP errors (easier to handle in scripts than the corresponding HTML error pages).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication using a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for API authorization (RFC 6750). Note that protected datasets of other authors remain inaccessible regardless of authentication.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported both with a PANGAEA username and password and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token can be renewed by logging out and logging in again, and is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 curl -OJLf -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&lt;br /&gt;
If a restricted dataset is requested without a token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; challenge header, signaling that authentication is required.&amp;lt;blockquote&amp;gt;&#039;&#039;&#039;Please note:&#039;&#039;&#039; The bearer token is private and should be treated like a password — it must not be shared with others or included in publicly accessible scripts or repositories. If a token has been accidentally disclosed, the user profile page at &amp;lt;nowiki&amp;gt;https://pangaea.de/user/&amp;lt;/nowiki&amp;gt; provides a &amp;quot;Log out from all devices&amp;quot; option, which immediately invalidates all active tokens. When using content negotiation via the DOI resolver at &amp;lt;nowiki&amp;gt;https://doi.org/&amp;lt;/nowiki&amp;gt;, it is safe to include the Authorization header: PANGAEA trusts the DOI Foundation&#039;s infrastructure, and the request is redirected to PANGAEA&#039;s own servers before any protected content is served.&amp;lt;/blockquote&amp;gt;Recent versions of curl can also supply the token via a dedicated flag (&amp;lt;code&amp;gt;--oauth2-bearer&amp;lt;/code&amp;gt;):&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf --oauth2-bearer xxx -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
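The same authenticated request can be issued from Python with the standard library. This is a sketch; the token value is a placeholder, and the actual download line is commented out to avoid an accidental request:

```python
import urllib.request

TOKEN = "your-token-here"  # placeholder -- copy yours from https://pangaea.de/user/

req = urllib.request.Request(
    "https://doi.org/10.1594/PANGAEA.841672",
    headers={
        "Authorization": f"Bearer {TOKEN}",     # RFC 6750 bearer scheme
        "Accept": "text/tab-separated-values",  # negotiate the tab-delimited data
    },
)
# urllib follows the DOI redirect to PANGAEA automatically:
# data = urllib.request.urlopen(req).read().decode("utf-8")
print(req.get_header("Authorization"))
```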
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
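An incremental harvest is just a sequence of HTTP GET requests with OAI-PMH query parameters. A minimal sketch of building such requests follows; &amp;lt;code&amp;gt;oai_dc&amp;lt;/code&amp;gt; is the Dublin Core prefix mandated by the OAI-PMH specification, while the prefixes for the other formats can be discovered with the &amp;lt;code&amp;gt;ListMetadataFormats&amp;lt;/code&amp;gt; verb:

```python
from urllib.parse import urlencode

ENDPOINT = "https://ws.pangaea.de/oai/"

def oai_request(verb, **params):
    """Build an OAI-PMH request URL against the PANGAEA endpoint."""
    return f"{ENDPOINT}?{urlencode({'verb': verb, **params})}"

# Dublin Core records changed since a given date (incremental harvest);
# responses are paged via resumptionToken elements in the returned XML.
print(oai_request("ListRecords", metadataPrefix="oai_dc",
                  **{"from": "2024-01-01"}))
print(oai_request("ListMetadataFormats"))
```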
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface that loads PANGAEA datasets directly into native Python data structures for analysis, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to access your own datasets that are under moratorium (not those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the individual dataset landing pages to assess each study&#039;s fitness for a specific scientific application. Every export includes the DOI name for each data point, so citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16841</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16841"/>
		<updated>2026-04-22T10:35:04Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* HTTP Content Negotiation */ Describe functions behind the curl flags&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When a PANGAEA-specific search interface is not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches a large inventory of research data from many repositories. It returns summary metadata including DOI names, and subsequent data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
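The typed links can be consumed with a small parser. The following sketch parses an illustrative &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header value in the style described above (an excerpt for demonstration, not a verbatim server response):

```python
import re

# Illustrative Link header value as a HEAD request to a PANGAEA
# landing page might return it (abridged example).
link_header = (
    '<https://doi.org/10.1594/PANGAEA.841672> ; rel="cite-as", '
    '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld> ; '
    'rel="describedby" ; type="application/ld+json", '
    '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile> ; '
    'rel="item" ; type="text/tab-separated-values"'
)

def parse_links(value):
    """Return (url, rel) pairs from a Link header value."""
    pairs = []
    # Split on commas that separate links (each link starts with '<')
    for part in re.split(r",\s*(?=<)", value):
        url = re.search(r"<([^>]+)>", part).group(1)
        rel = re.search(r'rel="([^"]+)"', part).group(1)
        pairs.append((url, rel))
    return pairs

for url, rel in parse_links(link_header):
    print(rel, url)
```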
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided (i.e. meaningful) filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects so the DOI resolves (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail with a non-zero exit code on HTTP errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;), which is easier to handle in scripts than an HTML error page.&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — the data is temporarily restricted while the metadata remains publicly visible. Accessing your own moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for HTTP API authorization (RFC 6750). Protected datasets of other authors remain inaccessible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported both with a PANGAEA username and password and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token can be renewed by logging out and logging in again, and is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a restricted dataset is requested without a token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; challenge header, signaling that authentication is required.&lt;br /&gt;
&lt;br /&gt;
Recent versions of curl can also supply the token via a dedicated flag (&amp;lt;code&amp;gt;--oauth2-bearer&amp;lt;/code&amp;gt;):&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf --oauth2-bearer xxx -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface that loads PANGAEA datasets directly into native Python data structures for analysis, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to access your own datasets that are under moratorium (not those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the individual dataset landing pages to assess each study&#039;s fitness for a specific scientific application. Every export includes the DOI name for each data point, so citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16840</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16840"/>
		<updated>2026-04-22T10:31:20Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: removed \ for line breaks in curl code examples&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When PANGAEA-specific search features are not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches a large inventory of research data from many repositories. It returns summary metadata including DOI names, and data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
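As a sketch of this cross-repository route, the following builds a DataCite REST API query URL with the Python standard library; the &amp;lt;code&amp;gt;/dois&amp;lt;/code&amp;gt; endpoint and its &amp;lt;code&amp;gt;query&amp;lt;/code&amp;gt; parameter are documented by DataCite, but verify the current field syntax against the linked API documentation before relying on it:&lt;br /&gt;

```python
# Sketch: build a DataCite REST API query URL for PANGAEA datasets.
# The /dois endpoint and "query" parameter come from the DataCite API
# docs; the field name "publisher" is an assumption to check there.
from urllib.parse import urlencode

base = "https://api.datacite.org/dois"
params = {"query": "temperature AND publisher:PANGAEA", "page[size]": 5}
url = base + "?" + urlencode(params)
print(url)
# The JSON response lists matching DOIs, whose PANGAEA landing pages
# can then be accessed via content negotiation as described below.
```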
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
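These typed links can be extracted with a short standard-library script. The sketch below parses a &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header value into (URL, relation) pairs; the &amp;lt;code&amp;gt;sample&amp;lt;/code&amp;gt; value is an illustrative fragment, not a verbatim PANGAEA response:&lt;br /&gt;

```python
# Parse an HTTP Link header into (target URL, rel) pairs, as used by
# Signposting. The sample value is an illustrative fragment of what a
# landing page might return, not a verbatim server response.
import re

def parse_link_header(value):
    """Return a list of (url, rel) tuples from a Link header value."""
    links = []
    # Each entry looks like: <https://...>; rel="item"; type="..."
    for match in re.finditer(r'<([^>]+)>((?:\s*;[^,<]*)*)', value):
        url, params = match.group(1), match.group(2)
        rel = re.search(r'rel="([^"]+)"', params)
        links.append((url, rel.group(1) if rel else None))
    return links

sample = ('<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile>'
          '; rel="item"; type="text/tab-separated-values"')

for url, rel in parse_link_header(sample):
    print(rel, url)
```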
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail with a non-zero exit code on server errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
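The same content negotiation can be performed from Python with only the standard library. This minimal sketch builds the request equivalent to the curl call above; uncomment the final lines to actually perform the download:&lt;br /&gt;

```python
# Minimal content-negotiation sketch: request the tab-separated
# representation of a dataset directly from its DOI.
from urllib.request import Request, urlopen

doi_url = "https://doi.org/10.1594/PANGAEA.841672"
req = Request(doi_url, headers={"Accept": "text/tab-separated-values"})

# Uncomment to perform the download (redirects are followed automatically):
# with urlopen(req) as resp:
#     text = resp.read().decode("utf-8")
#     print(text.splitlines()[0])  # header line of the tab-delimited file
```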
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
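For scripted use, these URL patterns can be wrapped in a tiny helper; the format keys are the ones listed above:&lt;br /&gt;

```python
# Build the explicit format URL for a PANGAEA dataset by numeric ID.
# Format keys are those listed above (e.g. metadata_jsonld, citation_bibtex).
def format_url(dataset_id, fmt):
    return f"https://doi.pangaea.de/10.1594/PANGAEA.{dataset_id}?format={fmt}"

print(format_url(841672, "metadata_jsonld"))
# → https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld
```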
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium: access to the data is temporarily restricted while the metadata remains publicly visible. Access to your own moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for API authorization (RFC 6750). The token only grants access to your own datasets; protected datasets of other authors remain inaccessible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported both with a PANGAEA username and password and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; A fresh token is issued with each login, and it can be passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request to a restricted dataset is made without a token, the server responds with &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; challenge header, signaling that authentication is required.&lt;br /&gt;
&lt;br /&gt;
Recent curl versions can also supply the token via the dedicated &amp;lt;code&amp;gt;--oauth2-bearer&amp;lt;/code&amp;gt; flag:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf --oauth2-bearer &#039;&amp;lt;your-token&amp;gt;&#039; -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
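For scripted access the token is set the same way in any HTTP client; this standard-library Python sketch mirrors the curl calls above, with &amp;lt;your-token&amp;gt; as a placeholder:&lt;br /&gt;

```python
# Sketch: authenticated download of one of your own moratorium datasets.
# "<your-token>" is a placeholder for the token shown at
# https://pangaea.de/user/ after login.
from urllib.request import Request, urlopen

token = "<your-token>"  # placeholder, not a real token
req = Request(
    "https://doi.org/10.1594/PANGAEA.841672",
    headers={
        "Authorization": "Bearer " + token,
        "Accept": "text/tab-separated-values",
    },
)

# Uncomment to perform the request; for a restricted dataset without a
# valid token the server answers 401:
# with urlopen(req) as resp:
#     data = resp.read()
```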
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
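A harvesting request is assembled from the standard OAI-PMH parameters. This sketch builds a &amp;lt;code&amp;gt;ListRecords&amp;lt;/code&amp;gt; URL using the mandatory Dublin Core prefix; the exact prefix strings for the other formats can be discovered with a &amp;lt;code&amp;gt;ListMetadataFormats&amp;lt;/code&amp;gt; request against the endpoint:&lt;br /&gt;

```python
# Build an OAI-PMH ListRecords request for incremental harvesting.
# Verb and parameter names are defined by the OAI-PMH standard; the
# metadata prefix "oai_dc" (Dublin Core) must be supported by every
# OAI-PMH endpoint.
from urllib.parse import urlencode

endpoint = "https://ws.pangaea.de/oai/"
params = {
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
    "from": "2026-01-01",  # only records added/changed since this date
}
request_url = endpoint + "?" + urlencode(params)
print(request_url)
# Large result sets are paged: follow the resumptionToken element of
# each response with subsequent verb=ListRecords requests.
```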
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to your own datasets that are under moratorium (not those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports represent compiled data products and do not replace the need to consult individual dataset landing pages to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16839</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16839"/>
		<updated>2026-04-22T10:26:12Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Access to Restricted Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When PANGAEA-specific search features are not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches a large inventory of research data from many repositories. It returns summary metadata including DOI names, and data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail with a non-zero exit code on server errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
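The same negotiation can be scripted in any HTTP client. As a minimal sketch in Python (standard library only; the MIME types are those from the table above, while the helper name is ours):

```python
import urllib.request

# MIME types from the table above, keyed by a short label
ACCEPT_TYPES = {
    "datacite": "application/vnd.datacite.datacite+xml",
    "iso19139": "application/vnd.iso19139.metadata+xml",
    "jsonld": "application/ld+json",
    "bibtex": "application/x-bibtex",
}

def metadata_request(doi_url: str, fmt: str) -> urllib.request.Request:
    """Build a GET request whose Accept header negotiates the chosen metadata format."""
    return urllib.request.Request(doi_url, headers={"Accept": ACCEPT_TYPES[fmt]})

# Performing the request (network access):
# req = metadata_request("https://doi.org/10.1594/PANGAEA.841672", "jsonld")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```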
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium: access to the data is temporarily restricted while the metadata remains publicly visible. Access to your own moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard HTTP mechanism for API authorization (RFC 6750). Restricted datasets of other authors remain inaccessible regardless of authentication.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php, either with a PANGAEA username and password or via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token is renewed each time you log out and log in again; pass it as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a restricted dataset is requested without a token, the server responds with &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
&lt;br /&gt;
Recent versions of curl can also supply the token via the dedicated &amp;lt;code&amp;gt;--oauth2-bearer&amp;lt;/code&amp;gt; flag. The equivalent command is:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf --oauth2-bearer &#039;&amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
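The bearer token works the same way from any HTTP client. A minimal Python sketch using only the standard library (the helper name and the YOUR_TOKEN placeholder are ours; substitute the token from your profile page):

```python
import urllib.request

def authorized_request(doi_url: str, token: str,
                       accept: str = "text/tab-separated-values") -> urllib.request.Request:
    """Build a request carrying the bearer token (RFC 6750) and an Accept header."""
    return urllib.request.Request(doi_url, headers={
        "Authorization": "Bearer " + token,
        "Accept": accept,
    })

# Performing the request (network access; an invalid token yields HTTP 401):
# req = authorized_request("https://doi.org/10.1594/PANGAEA.841672", "YOUR_TOKEN")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```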
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
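Incremental harvesting follows the standard OAI-PMH request pattern: an initial ListRecords request names a metadataPrefix, and each follow-up request carries only the resumptionToken from the previous response. A sketch of the URL construction (oai_dc is OAI-PMH&#039;s mandatory Dublin Core prefix; the endpoint&#039;s ListMetadataFormats verb lists the prefixes for the other standards):

```python
import urllib.parse

OAI_ENDPOINT = "https://ws.pangaea.de/oai/"

def oai_url(verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL. Per the protocol, follow-up requests
    may carry only the verb and the resumptionToken."""
    return OAI_ENDPOINT + "?" + urllib.parse.urlencode({"verb": verb, **params})

# First page of Dublin Core records:
first = oai_url("ListRecords", metadataPrefix="oai_dc")
# Follow-up page (the token value comes from the previous XML response):
next_page = oai_url("ListRecords", resumptionToken="TOKEN_FROM_RESPONSE")
```

Each response is XML; a harvest loop repeats the follow-up request until a response no longer contains a resumptionToken element.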
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to access your own datasets that are under moratorium (datasets of other authors remain inaccessible):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. For R users it provides functionality equivalent to pangaeapy.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a complementary access path for users who need to compile and aggregate data across large numbers of datasets, potentially hundreds of thousands of individual publications, without downloading and merging files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the individual dataset landing pages to assess each study&#039;s fitness for a specific scientific application. Every export includes the DOI for each data point, so citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16838</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16838"/>
		<updated>2026-04-22T10:25:43Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Access to Restricted Datasets */ Added alternative way to provide the bearer token&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When a PANGAEA-specific search interface is not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches a large inventory of research data across many repositories. It returns summary metadata including DOI names; data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
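From Python, the third-party requests library already parses the HTTP Link header: the links attribute of a response maps each rel value to its target. A sketch under that assumption (note that requests keeps only one link per rel value, so multiple describedby links collapse to one; the example dictionary below is illustrative, not real server output):

```python
def link_for(links: dict, rel: str) -> str:
    """Pick the target URL for a Signposting relation (cite-as, describedby, item, author)."""
    return links[rel]["url"]

def typed_links(url: str) -> dict:
    """HEAD request following redirects; returns the parsed Link relations."""
    import requests  # imported lazily so link_for() works without requests installed
    return requests.head(url, allow_redirects=True).links

# Shape of the parsed structure (illustrative values):
example_links = {
    "cite-as": {"url": "https://doi.org/10.1594/PANGAEA.841672", "rel": "cite-as"},
    "describedby": {
        "url": "https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld",
        "rel": "describedby",
    },
}

# Live usage (network access):
# print(link_for(typed_links("https://doi.org/10.1594/PANGAEA.841672"), "cite-as"))
```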
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to write the response to a file (&amp;lt;code&amp;gt;-O&amp;lt;/code&amp;gt;) under the server-provided filename (&amp;lt;code&amp;gt;-J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and exit with an error code on HTTP failures (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
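The same negotiation can be scripted in any HTTP client. As a minimal sketch in Python (standard library only; the MIME types are those from the table above, while the helper name is ours):

```python
import urllib.request

# MIME types from the table above, keyed by a short label
ACCEPT_TYPES = {
    "datacite": "application/vnd.datacite.datacite+xml",
    "iso19139": "application/vnd.iso19139.metadata+xml",
    "jsonld": "application/ld+json",
    "bibtex": "application/x-bibtex",
}

def metadata_request(doi_url: str, fmt: str) -> urllib.request.Request:
    """Build a GET request whose Accept header negotiates the chosen metadata format."""
    return urllib.request.Request(doi_url, headers={"Accept": ACCEPT_TYPES[fmt]})

# Performing the request (network access):
# req = metadata_request("https://doi.org/10.1594/PANGAEA.841672", "jsonld")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```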
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium: access to the data is temporarily restricted while the metadata remains publicly visible. Access to your own moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard HTTP mechanism for API authorization (RFC 6750). Restricted datasets of other authors remain inaccessible regardless of authentication.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php, either with a PANGAEA username and password or via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token is renewed each time you log out and log in again; pass it as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a restricted dataset is requested without a token, the server responds with &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
&lt;br /&gt;
Recent versions of curl can also supply the token via the dedicated &amp;lt;code&amp;gt;--oauth2-bearer&amp;lt;/code&amp;gt; flag. The equivalent command is:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf --oauth2-bearer &#039;&amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
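The bearer token works the same way from any HTTP client. A minimal Python sketch using only the standard library (the helper name and the YOUR_TOKEN placeholder are ours; substitute the token from your profile page):

```python
import urllib.request

def authorized_request(doi_url: str, token: str,
                       accept: str = "text/tab-separated-values") -> urllib.request.Request:
    """Build a request carrying the bearer token (RFC 6750) and an Accept header."""
    return urllib.request.Request(doi_url, headers={
        "Authorization": "Bearer " + token,
        "Accept": accept,
    })

# Performing the request (network access; an invalid token yields HTTP 401):
# req = authorized_request("https://doi.org/10.1594/PANGAEA.841672", "YOUR_TOKEN")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```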
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
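Incremental harvesting follows the standard OAI-PMH request pattern: an initial ListRecords request names a metadataPrefix, and each follow-up request carries only the resumptionToken from the previous response. A sketch of the URL construction (oai_dc is OAI-PMH&#039;s mandatory Dublin Core prefix; the endpoint&#039;s ListMetadataFormats verb lists the prefixes for the other standards):

```python
import urllib.parse

OAI_ENDPOINT = "https://ws.pangaea.de/oai/"

def oai_url(verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL. Per the protocol, follow-up requests
    may carry only the verb and the resumptionToken."""
    return OAI_ENDPOINT + "?" + urllib.parse.urlencode({"verb": verb, **params})

# First page of Dublin Core records:
first = oai_url("ListRecords", metadataPrefix="oai_dc")
# Follow-up page (the token value comes from the previous XML response):
next_page = oai_url("ListRecords", resumptionToken="TOKEN_FROM_RESPONSE")
```

Each response is XML; a harvest loop repeats the follow-up request until a response no longer contains a resumptionToken element.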
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to access your own datasets that are under moratorium (datasets of other authors remain inaccessible):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. For R users it provides functionality equivalent to pangaeapy.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a complementary access path for users who need to compile and aggregate data across large numbers of datasets, potentially hundreds of thousands of individual publications, without downloading and merging files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the individual dataset landing pages to assess each study&#039;s fitness for a specific scientific application. Every export includes the DOI for each data point, so citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16837</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16837"/>
		<updated>2026-04-22T10:10:25Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Downloading the Data File */ adjusted curl flag description&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When PANGAEA-specific search features are not required, the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api) offers an alternative that searches across research data from many repositories. It returns summary metadata including DOI names; data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
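These link relations can be consumed from any HTTP client. The following Python sketch parses an RFC 8288 &amp;lt;code&amp;gt;Link&amp;lt;/code&amp;gt; header into a relation-to-URL mapping; the parser and the sample header value are illustrative, not part of any PANGAEA tooling, and a real header would be obtained with an HTTP HEAD request (e.g. via the third-party requests package).&lt;br /&gt;

```python
# Minimal Signposting client sketch: parse an RFC 8288 Link header into
# a {relation: [target URLs]} mapping. The sample header below is
# illustrative; a real one is obtained with an HTTP HEAD request, e.g.
#   import requests
#   hdr = requests.head("https://doi.org/10.1594/PANGAEA.841672",
#                       allow_redirects=True).headers.get("Link", "")
import re

def parse_link_header(header: str) -> dict:
    links = {}
    # Each entry looks like: <URL>; rel="relation"; type="mime/type"
    for target, params in re.findall(r'<([^>]+)>([^,]*)', header):
        rel = re.search(r'rel="?([^";]+)"?', params)
        if rel:
            links.setdefault(rel.group(1), []).append(target)
    return links

sample = ('<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld>; '
          'rel="describedby"; type="application/ld+json", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile>; '
          'rel="item"; type="text/tab-separated-values"')
rels = parse_link_header(sample)
```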
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail with a non-zero exit code on HTTP errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
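For script authors who prefer Python over curl, the same download can be sketched as follows (an illustrative translation using the third-party requests package, not an official PANGAEA client; the network call is left commented out):&lt;br /&gt;

```python
# Python counterpart of the curl -OJLf call above: fetch the TSV
# representation via content negotiation and save it under the filename
# the server announces in its Content-Disposition header.
import re

def filename_from_disposition(value: str, fallback: str) -> str:
    # e.g. 'attachment; filename="data.tab"' -> 'data.tab'
    m = re.search(r'filename="?([^";]+)"?', value)
    return m.group(1) if m else fallback

def download_tsv(doi_url: str) -> str:
    import requests  # pip install requests
    resp = requests.get(doi_url,
                        headers={"Accept": "text/tab-separated-values"},
                        allow_redirects=True)       # curl -L
    resp.raise_for_status()                         # curl -f
    name = filename_from_disposition(
        resp.headers.get("Content-Disposition", ""),
        "dataset.tab")                              # curl -O -J
    with open(name, "wb") as fh:
        fh.write(resp.content)
    return name

# download_tsv("https://doi.org/10.1594/PANGAEA.841672")
```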
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
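As an illustration, the MIME types from the table can be collected into a small lookup so that any representation is one function call away. The short format keys and the helper name are our own choices; only the MIME strings come from the table above. The sketch uses the third-party requests package:&lt;br /&gt;

```python
# Sketch: MIME types from the table above, keyed by illustrative short
# names, plus a small fetch helper for content negotiation.
ACCEPT_TYPES = {
    "pangaea-xml": "application/vnd.pangaea.metadata+xml",
    "datacite4":   "application/vnd.datacite.datacite+xml",
    "iso19139":    "application/vnd.iso19139.metadata+xml",
    "dif":         "application/vnd.nasa.dif-metadata+xml",
    "jsonld":      "application/ld+json",
    "bibtex":      "application/x-bibtex",
    "ris":         "application/x-research-info-systems",
    "citation":    "text/x-bibliography",
    "tsv":         "text/tab-separated-values",
}

def fetch_representation(doi_url: str, fmt: str) -> str:
    """GET one representation of a dataset via content negotiation."""
    import requests  # pip install requests
    resp = requests.get(doi_url, headers={"Accept": ACCEPT_TYPES[fmt]},
                        allow_redirects=True)
    resp.raise_for_status()
    return resp.text

# fetch_representation("https://doi.org/10.1594/PANGAEA.841672", "jsonld")
```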
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for API authorization (RFC 6750). The token grants access only to your own moratorium datasets; restricted datasets of other authors remain inaccessible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. You can log in either with a PANGAEA username and password or via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token is tied to your login session and is renewed each time you log out and back in; pass it as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a restricted dataset is requested without a token, the server responds with &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; challenge header, signaling that authentication is required.&lt;br /&gt;
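The authenticated curl call above translates directly to Python. This sketch (third-party requests package, placeholder token) raises a clear error on a 401 response:&lt;br /&gt;

```python
# Sketch of the authenticated request in Python; replace the token
# placeholder with the value shown on https://pangaea.de/user/ after login.
def auth_headers(token: str) -> dict:
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "text/tab-separated-values",
    }

def fetch_restricted(doi_url: str, token: str) -> bytes:
    import requests  # pip install requests
    resp = requests.get(doi_url, headers=auth_headers(token),
                        allow_redirects=True)
    if resp.status_code == 401:
        raise PermissionError("token missing/expired, or not your dataset")
    resp.raise_for_status()
    return resp.content

# fetch_restricted("https://doi.org/10.1594/PANGAEA.841672", "<your-token>")
```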
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
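Because OAI-PMH is a plain HTTP protocol, request URLs can be assembled with nothing but the standard library. The sketch below builds standard OAI-PMH 2.0 requests against the endpoint; confirm the metadataPrefix values actually offered with &amp;lt;code&amp;gt;ListMetadataFormats&amp;lt;/code&amp;gt; (only &amp;lt;code&amp;gt;oai_dc&amp;lt;/code&amp;gt; is guaranteed by the standard):&lt;br /&gt;

```python
# Sketch: build OAI-PMH 2.0 request URLs against the PANGAEA endpoint.
# Verb and argument names are defined by the OAI-PMH standard.
from urllib.parse import urlencode

OAI = "https://ws.pangaea.de/oai/"

def oai_url(verb: str, **args) -> str:
    return OAI + "?" + urlencode({"verb": verb, **args})

# Discover available metadata formats:
formats_url = oai_url("ListMetadataFormats")
# Incremental Dublin Core harvest of records changed since a date:
harvest_url = oai_url("ListRecords", metadataPrefix="oai_dc",
                      **{"from": "2026-01-01"})
```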
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to access your own datasets that are under moratorium (datasets of other authors remain inaccessible):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially hundreds of thousands of individual publications — without downloading and merging files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the individual dataset landing pages to assess whether each study is fit for a specific scientific application. Every export includes the DOI for each data point, so citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16836</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16836"/>
		<updated>2026-04-22T09:44:01Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Access to Restricted Datasets */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When PANGAEA-specific search features are not required, the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api) offers an alternative that searches across research data from many repositories. It returns summary metadata including DOI names; data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
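These link relations can be consumed from any HTTP client. The following Python sketch parses an RFC 8288 &amp;lt;code&amp;gt;Link&amp;lt;/code&amp;gt; header into a relation-to-URL mapping; the parser and the sample header value are illustrative, not part of any PANGAEA tooling, and a real header would be obtained with an HTTP HEAD request (e.g. via the third-party requests package).&lt;br /&gt;

```python
# Minimal Signposting client sketch: parse an RFC 8288 Link header into
# a {relation: [target URLs]} mapping. The sample header below is
# illustrative; a real one is obtained with an HTTP HEAD request, e.g.
#   import requests
#   hdr = requests.head("https://doi.org/10.1594/PANGAEA.841672",
#                       allow_redirects=True).headers.get("Link", "")
import re

def parse_link_header(header: str) -> dict:
    links = {}
    # Each entry looks like: <URL>; rel="relation"; type="mime/type"
    for target, params in re.findall(r'<([^>]+)>([^,]*)', header):
        rel = re.search(r'rel="?([^";]+)"?', params)
        if rel:
            links.setdefault(rel.group(1), []).append(target)
    return links

sample = ('<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld>; '
          'rel="describedby"; type="application/ld+json", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile>; '
          'rel="item"; type="text/tab-separated-values"')
rels = parse_link_header(sample)
```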
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail with a non-zero exit code on HTTP errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
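For script authors who prefer Python over curl, the same download can be sketched as follows (an illustrative translation using the third-party requests package, not an official PANGAEA client; the network call is left commented out):&lt;br /&gt;

```python
# Python counterpart of the curl -OJLf call above: fetch the TSV
# representation via content negotiation and save it under the filename
# the server announces in its Content-Disposition header.
import re

def filename_from_disposition(value: str, fallback: str) -> str:
    # e.g. 'attachment; filename="data.tab"' -> 'data.tab'
    m = re.search(r'filename="?([^";]+)"?', value)
    return m.group(1) if m else fallback

def download_tsv(doi_url: str) -> str:
    import requests  # pip install requests
    resp = requests.get(doi_url,
                        headers={"Accept": "text/tab-separated-values"},
                        allow_redirects=True)       # curl -L
    resp.raise_for_status()                         # curl -f
    name = filename_from_disposition(
        resp.headers.get("Content-Disposition", ""),
        "dataset.tab")                              # curl -O -J
    with open(name, "wb") as fh:
        fh.write(resp.content)
    return name

# download_tsv("https://doi.org/10.1594/PANGAEA.841672")
```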
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
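&lt;br /&gt;
The URL-parameter form lends itself to scripting. The following sketch builds such URLs for any dataset DOI; the format keys are taken from the examples above, while the short mapping names are illustrative:&lt;br /&gt;

```python
# Build PANGAEA landing-page URLs that request a specific
# metadata or citation format via the ?format= URL parameter.

# Short names (left) are illustrative; values (right) are the
# format keys shown in the examples above.
FORMATS = {
    "datacite4": "metadata_datacite4",
    "iso19139": "metadata_iso19139",
    "jsonld": "metadata_jsonld",
    "bibtex": "citation_bibtex",
}

def format_url(doi, fmt):
    """Return the doi.pangaea.de URL for `doi` in format `fmt`."""
    return f"https://doi.pangaea.de/{doi}?format={FORMATS[fmt]}"

print(format_url("10.1594/PANGAEA.841672", "jsonld"))
```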
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium: access to the data is temporarily restricted while the metadata remains publicly visible. Access to your own moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for HTTP API authorization (RFC 6750). A token grants access only to your own datasets; restricted datasets of other authors remain inaccessible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported with a PANGAEA username and password or via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token is renewed each time you log out and log back in, and is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request for a restricted dataset is made without a token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
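&lt;br /&gt;
The same authenticated download can be sketched in plain Python using only the standard library; the token value is a placeholder obtained from your user profile, and the helper names are illustrative, not part of any PANGAEA client:&lt;br /&gt;

```python
from urllib.request import Request, urlopen

def pangaea_headers(token=None, accept="text/tab-separated-values"):
    """Build HTTP headers; the bearer token is added only when given."""
    headers = {"Accept": accept}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

def fetch_dataset(doi, token=None):
    """Download a dataset's tab-separated data by resolving its DOI.
    An HTTPError with code 401 means the token is missing or expired."""
    req = Request(f"https://doi.org/{doi}", headers=pangaea_headers(token))
    with urlopen(req, timeout=60) as resp:  # urlopen follows redirects
        return resp.read().decode("utf-8")
```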
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
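&lt;br /&gt;
As an illustration, an OAI-PMH ListRecords request is just a URL with query parameters, so a harvest loop needs nothing beyond the standard library. In the sketch below, the Dublin Core prefix oai_dc is mandatory for every OAI-PMH provider; the exact names of the other metadata prefixes should be confirmed with a ListMetadataFormats request against the endpoint:&lt;br /&gt;

```python
from urllib.parse import urlencode

OAI_BASE = "https://ws.pangaea.de/oai/"  # endpoint given above

def list_records_url(prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL.
    A `from` date enables incremental harvesting of records
    changed since that date."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if from_date:
        params["from"] = from_date
    return OAI_BASE + "?" + urlencode(params)

print(list_records_url(from_date="2026-01-01"))
```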
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading PANGAEA datasets directly into native Python data structures for analysis, without manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
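&lt;br /&gt;
To illustrate working with such a frame, the snippet below builds a stand-in DataFrame (the column names and values are hypothetical, not taken from the example dataset) and applies typical first steps:&lt;br /&gt;

```python
import pandas as pd

# Stand-in for ds.data: one column per parameter, one row per measurement.
data = pd.DataFrame({
    "Depth [m]": [0, 10, 20, 30],
    "Temp [°C]": [18.2, 17.9, 16.5, None],
    "Sal": [35.1, 35.2, 35.3, 35.3],
})

# Typical first steps: drop incomplete rows, then summarize a parameter.
complete = data.dropna()
print(complete["Temp [°C]"].mean())
```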
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to your own datasets that are under moratorium; restricted datasets of other authors remain inaccessible:&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a complementary access path for users who need to compile and aggregate data across large numbers of datasets, potentially hundreds of thousands of individual publications, without downloading and merging files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the landing pages of the contributing datasets to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI of each data point&#039;s source dataset, so citation and provenance are preserved.&lt;br /&gt;
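&lt;br /&gt;
As an illustration of this provenance handling, the rows below mimic a warehouse export (the column names and the second DOI are placeholders); collecting the distinct DOIs yields the list of datasets to cite:&lt;br /&gt;

```python
# Hypothetical rows from a data-warehouse export: each data point
# carries the DOI of the dataset it came from.
rows = [
    {"doi": "10.1594/PANGAEA.841672", "param": "Temp", "value": 18.2},
    {"doi": "10.1594/PANGAEA.841672", "param": "Temp", "value": 17.9},
    {"doi": "10.1594/PANGAEA.900000", "param": "Temp", "value": 16.5},
]

# Collect the distinct DOIs so every contributing dataset can be cited.
cited = sorted({r["doi"] for r in rows})
print(cited)
```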
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16835</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16835"/>
		<updated>2026-04-22T09:37:32Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Summary of Access Methods */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
An alternative for search when a PANGAEA-specific search interface is not required is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches across a large inventory of research data from many repositories. It returns summary metadata including DOI names, and subsequent data access from any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
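&lt;br /&gt;
For illustration, a DataCite search is a plain HTTPS request. The sketch below only builds the request URL against DataCite&#039;s documented &amp;lt;code&amp;gt;/dois&amp;lt;/code&amp;gt; endpoint; the query string and page size are examples:&lt;br /&gt;

```python
from urllib.parse import urlencode

def datacite_search_url(query, page_size=10):
    """Build a DataCite REST API query URL. The /dois endpoint and its
    `query` and `page[size]` parameters are documented by DataCite."""
    params = {"query": query, "page[size]": page_size}
    return "https://api.datacite.org/dois?" + urlencode(params)

print(datacite_search_url("temperature salinity Atlantic"))
```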
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
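&lt;br /&gt;
To sketch how a client can consume these typed links, the snippet below parses a Link header of the shape Signposting prescribes. The header string is illustrative, not a recorded server response, and the parser is deliberately minimal; a production implementation should follow RFC 8288:&lt;br /&gt;

```python
import re

def parse_link_header(value):
    """Parse an HTTP Link header into (url, rel) pairs.
    Minimal parser for illustration only."""
    links = []
    for part in value.split(","):
        m = re.search(r'<([^>]+)>\s*;\s*rel="?([^";]+)"?', part)
        if m:
            links.append((m.group(1), m.group(2)))
    return links

# Illustrative header with two typed links (cite-as and describedby).
header = ('<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
          '<https://doi.pangaea.de/10.1594/PANGAEA.841672'
          '?format=metadata_jsonld>; rel="describedby"')
for url, rel in parse_link_header(header):
    print(rel, url)
```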
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail on server errors instead of saving an error page (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium: access to the data is temporarily restricted while the metadata remains publicly visible. Access to your own moratorium datasets requires authentication with a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for HTTP API authorization (RFC 6750). A token grants access only to your own datasets; restricted datasets of other authors remain inaccessible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported with a PANGAEA username and password or via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token is renewed each time you log out and log back in, and is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request for a restricted dataset is made without a token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading PANGAEA datasets directly into native Python data structures for analysis, without manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to moratorium datasets; this applies only to your own datasets, not to restricted data of other authors:&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a complementary access path for users who need to compile and aggregate data across large numbers of datasets, potentially hundreds of thousands of individual publications, without downloading and merging files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products; they do not replace consulting the landing pages of the contributing datasets to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI of each data point&#039;s source dataset, so citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access (to your own publications)&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|retrievable via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16834</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16834"/>
		<updated>2026-04-22T09:20:07Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: repaired dysfunctional links&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When a PANGAEA-specific search interface is not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches across a large inventory of research data from many repositories. It returns summary metadata including DOI names; subsequent data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
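&lt;br /&gt;
For example, PANGAEA datasets can be located through the DataCite REST API by combining a free-text query with the PANGAEA DOI prefix &amp;lt;code&amp;gt;10.1594&amp;lt;/code&amp;gt; (an illustrative sketch; consult the DataCite API documentation for the full set of supported parameters):&lt;br /&gt;
 &amp;lt;code&amp;gt;curl &#039;https://api.datacite.org/dois?query=temperature+salinity&amp;amp;prefix=10.1594&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;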
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
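&lt;br /&gt;
In Python, the same discovery step can be scripted with the &#039;&#039;requests&#039;&#039; library, which parses the &amp;lt;code&amp;gt;Link&amp;lt;/code&amp;gt; header into a dictionary keyed by relation type (a minimal sketch; note that &amp;lt;code&amp;gt;requests&amp;lt;/code&amp;gt; keeps only one link per relation, so a dedicated Link-header parser is needed to enumerate all &amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt; links):&lt;br /&gt;
 &amp;lt;code&amp;gt;import requests&lt;br /&gt;
 &lt;br /&gt;
 r = requests.head(&#039;https://doi.pangaea.de/10.1594/PANGAEA.841672&#039;, allow_redirects=True)&lt;br /&gt;
 # requests exposes the parsed Link header as a dict keyed by relation type&lt;br /&gt;
 for rel, link in r.links.items():&lt;br /&gt;
     print(rel, link[&#039;url&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;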
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail without writing output on server errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication using a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, which is the standard mechanism for API authorization (RFC 6750). Note that access to protected datasets of other authors is, of course, still not possible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php; login is supported both with a PANGAEA username and password and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token can be renewed by logging out and back in, and is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a restricted dataset is requested without a token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that bearer-token authentication is required.&lt;br /&gt;
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
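&lt;br /&gt;
A typical harvesting session combines the standard OAI-PMH verbs with the endpoint URL. The following requests are illustrative (the exact base path and supported &amp;lt;code&amp;gt;metadataPrefix&amp;lt;/code&amp;gt; values can be confirmed via the endpoint&#039;s &amp;lt;code&amp;gt;Identify&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;ListMetadataFormats&amp;lt;/code&amp;gt; responses):&lt;br /&gt;
 &amp;lt;code&amp;gt;# Describe the repository&lt;br /&gt;
 curl &#039;https://ws.pangaea.de/oai/?verb=Identify&#039;&lt;br /&gt;
 # Incrementally harvest Dublin Core records modified since a given date&lt;br /&gt;
 curl &#039;https://ws.pangaea.de/oai/?verb=ListRecords&amp;amp;metadataPrefix=oai_dc&amp;amp;from=2026-01-01&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;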
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
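&lt;br /&gt;
As a brief sketch of working with this parameter metadata (attribute names such as &amp;lt;code&amp;gt;name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;unit&amp;lt;/code&amp;gt; follow the pangaeapy documentation and should be verified against the installed version):&lt;br /&gt;
 &amp;lt;code&amp;gt;# List each data column together with its full parameter name and unit&lt;br /&gt;
 for short_name, param in ds.parameters.items():&lt;br /&gt;
     print(short_name, param.name, param.unit)&amp;lt;/code&amp;gt;&lt;br /&gt;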
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to moratorium datasets (note that this grants access only to your own restricted datasets, not to those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports represent compiled data products and do not replace the need to consult individual dataset landing pages to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|Via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA Community Workshop GitHub] — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* [https://pypi.org/project/pangaeapy/ pangaeapy on PyPI] — Python client&lt;br /&gt;
* [https://cran.r-project.org/package=pangaear pangaear on CRAN] — R client&lt;br /&gt;
* [https://ws.pangaea.de/oai/ PANGAEA web services] — REST and OAI-PMH endpoints&lt;br /&gt;
* [https://signposting.org/ Signposting standard] — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16833</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16833"/>
		<updated>2026-04-22T09:14:21Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
An alternative for search when a PANGAEA-specific search interface is not required is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches across a large inventory of research data from many repositories. It returns summary metadata including DOI names, and subsequent data access from any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail without writing output on server errors (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication using a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, which is the standard mechanism for API authorization (RFC 6750). Note that access to protected datasets of other authors is, of course, not possible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php; login is supported both with a PANGAEA username and password and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token can be renewed by logging out and back in, and is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request for a restricted dataset is made without a valid token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
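The curl call above maps one-to-one onto a standard-library request. A sketch with a placeholder token; the helper name `authorized_request` is ours, not an official API:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

DOI = "https://doi.org/10.1594/PANGAEA.841672"

def authorized_request(url, token, mime="text/tab-separated-values"):
    """Attach the bearer token and requested format to a dataset request."""
    return Request(url, headers={
        "Authorization": "Bearer " + token,
        "Accept": mime,
    })

req = authorized_request(DOI, "<your-token>")  # replace with your real token
# try:
#     data = urlopen(req).read().decode("utf-8")
# except HTTPError as err:
#     if err.code == 401:  # token missing, expired, or dataset not yours
#         print("authentication required")
```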
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
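Incremental harvesting is a loop over the protocol's &lt;code&gt;resumptionToken&lt;/code&gt; mechanism. A minimal sketch against the endpoint above, assuming the Dublin Core prefix `oai_dc` from the list of supported formats; a production harvester should add error handling and rate limiting:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://ws.pangaea.de/oai/"
OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH XML namespace

def list_records_url(endpoint, prefix="oai_dc", token=None):
    """Build a ListRecords request; a resumption token replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": prefix}
    return endpoint + "?" + urlencode(params)

def harvest(endpoint, prefix="oai_dc"):
    """Yield every record element, following resumption tokens until exhausted."""
    token = None
    while True:
        tree = ET.parse(urlopen(list_records_url(endpoint, prefix, token)))
        yield from tree.iter(OAI + "record")
        node = tree.find(".//" + OAI + "resumptionToken")
        token = node.text if node is not None else None
        if not token:
            break

# for record in harvest(ENDPOINT):  # uncomment to run a full harvest
#     ...
```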
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to moratorium datasets (note that the token grants access only to your own restricted datasets, not to those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products and do not replace consulting the individual dataset landing pages to assess the fitness of each study for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|Via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* PANGAEA Community Workshop GitHub — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* pangaeapy on PyPI — Python client&lt;br /&gt;
* pangaear on CRAN — R client&lt;br /&gt;
* PANGAEA web services — REST and OAI-PMH endpoints&lt;br /&gt;
* Signposting standard — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16832</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16832"/>
		<updated>2026-04-22T09:12:29Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Activate external links&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at https://www.pangaea.de/, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at [[PANGAEA search]].&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
An alternative for search when a PANGAEA-specific search interface is not required is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (https://support.datacite.org/docs/api), which searches across a large inventory of research data from many repositories. It returns summary metadata including DOI names, and subsequent data access from any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
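Such a cross-repository query can be scripted against the DataCite REST API. A sketch assuming the current &lt;code&gt;/dois&lt;/code&gt; endpoint at api.datacite.org with a &lt;code&gt;query&lt;/code&gt; parameter; consult the linked documentation for the authoritative parameter names:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def datacite_query_url(text, page_size=10):
    """Build a DataCite REST API search URL for the given full-text query."""
    params = {"query": text, "page[size]": page_size}
    return "https://api.datacite.org/dois?" + urlencode(params)

url = datacite_query_url("temperature salinity Atlantic")
# response = json.load(urlopen(url))
# for item in response["data"]:       # JSON:API envelope (assumed shape)
#     attrs = item["attributes"]
#     print(attrs["doi"], attrs.get("titles"))
```

Each result's DOI can then be resolved and downloaded via content negotiation as described below.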
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (https://signposting.org/), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
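Extracting these typed relations takes only a few lines of parsing. A minimal sketch over an illustrative (not verbatim) &lt;code&gt;Link:&lt;/code&gt; header value; a production client should use a full RFC 8288 parser:

```python
def parse_link_header(value):
    """Parse an HTTP Link header into (url, rel, type) triples."""
    links = []
    for part in value.split(","):
        fields = [f.strip() for f in part.split(";")]
        url = fields[0].strip("<> ")
        attrs = {}
        for field in fields[1:]:
            if "=" in field:
                key, _, val = field.partition("=")
                attrs[key.strip()] = val.strip().strip('"')
        links.append((url, attrs.get("rel"), attrs.get("type")))
    return links

# Illustrative header value, shortened from what a landing page returns.
sample = (
    '<https://doi.org/10.1594/PANGAEA.841672>; rel="cite-as", '
    '<https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld>; '
    'rel="describedby"; type="application/ld+json"'
)
for url, rel, mime in parse_link_header(sample):
    print(rel, mime, url)
```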
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail on HTTP errors instead of saving an error page (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&lt;br /&gt;
 https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&lt;br /&gt;
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your personal moratorium datasets requires authentication using a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, which is the standard mechanism for API authorization (RFC 6750). Note that access to protected datasets of other authors is, of course, not possible.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at https://pangaea.de/user/login.php. Login is supported both with a PANGAEA username and password, and via ORCID iD. After logging in, the user profile page at https://pangaea.de/user/ displays the current session token under &amp;quot;Your temporary login token.&amp;quot; A new token is issued each time you log out and log in again; pass it as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   https://doi.org/10.1594/PANGAEA.841672&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request for a restricted dataset is made without a valid token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: https://pypi.org/project/pangaeapy/&lt;br /&gt;
* Source code: https://github.com/pangaea-data-publisher/pangaeapy&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to moratorium datasets (note that the token grants access only to your own restricted datasets, not to those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at https://github.com/pangaea-data-publisher/community-workshop-material (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: https://cran.r-project.org/package=pangaear&lt;br /&gt;
* GitHub: https://github.com/ropensci/pangaear&lt;br /&gt;
* Introduction slides: https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds[[1]]$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds[[1]]$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at https://ws.pangaea.de/. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports are compiled data products and do not replace consulting the individual dataset landing pages to assess the fitness of each study for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.&lt;br /&gt;
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|https://www.pangaea.de/&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; https://doi.org/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I https://doi.pangaea.de/...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|Via https://pangaea.de/user/&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|https://ws.pangaea.de/oai/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|https://pypi.org/project/pangaeapy/&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|https://cran.r-project.org/package=pangaear&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|https://wiki.pangaea.de/wiki/Data_warehouse&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|https://support.datacite.org/docs/api&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[PANGAEA search]] — documentation on search syntax and faceted filters&lt;br /&gt;
* [[Data warehouse]] — data warehouse documentation&lt;br /&gt;
* [[PANGAEA Community Workshops]] — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* PANGAEA Community Workshop GitHub — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* pangaeapy on PyPI — Python client&lt;br /&gt;
* pangaear on CRAN — R client&lt;br /&gt;
* PANGAEA web services — REST and OAI-PMH endpoints&lt;br /&gt;
* Signposting standard — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16831</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16831"/>
		<updated>2026-04-22T09:05:13Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Every dataset published in PANGAEA is assigned a globally unique and persistent Digital Object Identifier (DOI). The DOI is the single entry point for all forms of programmatic access: it resolves to a dataset landing page that, in addition to its human-readable representation, exposes all available metadata formats and data download options through standard HTTP mechanisms. There is no separate PANGAEA data API — all data and metadata access is built on top of the DOI and its landing page using standard web protocols. This design makes PANGAEA data directly accessible without any vendor-specific API key or proprietary client, while remaining fully compatible with FAIR principles.&lt;br /&gt;
&lt;br /&gt;
== Web-Based Search and Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== PANGAEA Search Interface ===&lt;br /&gt;
The primary discovery tool for PANGAEA data is the PANGAEA search engine at &amp;lt;nowiki&amp;gt;https://www.pangaea.de/&amp;lt;/nowiki&amp;gt;, based on Elasticsearch. It supports full-text search and faceted filtering across all published metadata. Faceted navigation allows users to constrain results by topic, device type, geographic region, and temporal coverage. Documentation on search functionality and syntax is available in the Wiki at &amp;lt;nowiki&amp;gt;[[PANGAEA search]]&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== External Portals and Registries ===&lt;br /&gt;
PANGAEA metadata is harvested by a large number of disciplinary and generic portals via OAI-PMH, making datasets discoverable beyond the PANGAEA website through services such as Google Dataset Search, OpenAIRE, DataCite Commons, DataONE, GBIF, EMODnet, GFBio, and others. PANGAEA is registered in re3data.org, FAIRsharing.org, RIsources, and the EOSC Marketplace.&lt;br /&gt;
&lt;br /&gt;
When a PANGAEA-specific search interface is not required, an alternative is the &#039;&#039;&#039;DataCite Search API&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://support.datacite.org/docs/api&amp;lt;/nowiki&amp;gt;), which searches across a large inventory of research data from many repositories. It returns summary metadata including DOI names; subsequent data access for any PANGAEA result can then proceed via content negotiation as described below.&lt;br /&gt;
&lt;br /&gt;
== DOI Landing Pages and Link Discovery ==&lt;br /&gt;
&lt;br /&gt;
=== The Landing Page as Access Hub ===&lt;br /&gt;
Each PANGAEA dataset is represented by a landing page accessible at its DOI. For example:&lt;br /&gt;
 &amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
The landing page serves a dual function: it presents dataset metadata and a data preview in human-readable form, and it simultaneously exposes all available machine-readable representations of the same resource through standard HTTP link headers and HTML &amp;lt;code&amp;gt;&amp;lt;link&amp;gt;&amp;lt;/code&amp;gt; elements. This implementation follows the &#039;&#039;&#039;Signposting standard&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), which allows any HTTP client to discover all alternate representations of a dataset — metadata in various formats, the data file itself, and the ORCID iDs of the authors — without any prior knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;
&lt;br /&gt;
=== Discovering Available Representations ===&lt;br /&gt;
A simple HTTP HEAD request to the landing page returns the full set of typed link relations in the response header. The following example illustrates this using &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -LI &amp;lt;nowiki&amp;gt;https://doi.org/10.1594/PANGAEA.841672&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
The response &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; header will contain relations of the following types:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;cite-as&amp;lt;/code&amp;gt;&#039;&#039;&#039; — the canonical DOI citation URL&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;describedby&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to metadata in various formats (ISO 19139, DataCite XML, PANGAEA XML, BibTeX, RIS, JSON-LD)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;item&amp;lt;/code&amp;gt;&#039;&#039;&#039; — links to the data itself (tab-delimited text, HTML view)&lt;br /&gt;
* &#039;&#039;&#039;&amp;lt;code&amp;gt;author&amp;lt;/code&amp;gt;&#039;&#039;&#039; — ORCID iDs of the dataset authors&lt;br /&gt;
&lt;br /&gt;
This mechanism allows scripts, harvesters, and other machine clients to discover and retrieve any representation of a dataset from its DOI alone.&lt;br /&gt;
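Once the typed links have been retrieved (for example with &amp;lt;code&amp;gt;curl -LI&amp;lt;/code&amp;gt; as above), choosing a representation is a simple filtering step over the advertised relations. The following Python sketch illustrates this; the link set shown is illustrative only, and a real client would populate it by parsing the &amp;lt;code&amp;gt;Link:&amp;lt;/code&amp;gt; response header:&lt;br /&gt;

```python
def pick_representation(links, rel, mime=None):
    """Return the URL of the first Signposting link matching a relation
    (and, optionally, a MIME type). Each link is a dict with the keys
    'rel', 'url', and optionally 'type'."""
    for link in links:
        if link["rel"] == rel and (mime is None or link.get("type") == mime):
            return link["url"]
    return None

# Illustrative link set as a landing page might advertise it
# (in practice these are discovered from the HTTP Link header):
links = [
    {"rel": "cite-as", "url": "https://doi.org/10.1594/PANGAEA.841672"},
    {"rel": "describedby", "type": "application/ld+json",
     "url": "https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld"},
    {"rel": "item", "type": "text/tab-separated-values",
     "url": "https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile"},
]
```

With this, &amp;lt;code&amp;gt;pick_representation(links, "describedby", "application/ld+json")&amp;lt;/code&amp;gt; yields the JSON-LD metadata URL without any knowledge of PANGAEA-specific URL patterns.&lt;br /&gt;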
&lt;br /&gt;
== HTTP Content Negotiation ==&lt;br /&gt;
HTTP content negotiation allows a client to request a specific representation of a resource by specifying a MIME type in the &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header. All PANGAEA dataset landing pages support this mechanism, so both data and metadata can be downloaded programmatically by querying the DOI directly with the appropriate content type — no PANGAEA-specific URL construction is required.&lt;br /&gt;
&lt;br /&gt;
=== Downloading the Data File ===&lt;br /&gt;
The tabular data file for a dataset can be downloaded as tab-delimited text:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf &#039;&amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=textfile&amp;lt;/nowiki&amp;gt;&#039;&amp;lt;/code&amp;gt;&lt;br /&gt;
Equivalently, using content negotiation directly against the DOI:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   &amp;lt;nowiki&amp;gt;https://doi.org/10.1594/PANGAEA.841672&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;-OJLf&amp;lt;/code&amp;gt; flags instruct curl to save the file under the server-provided filename (&amp;lt;code&amp;gt;-O -J&amp;lt;/code&amp;gt;), follow redirects (&amp;lt;code&amp;gt;-L&amp;lt;/code&amp;gt;), and fail on HTTP errors instead of saving an error page (&amp;lt;code&amp;gt;-f&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Retrieving Metadata in a Specific Format ===&lt;br /&gt;
The same mechanism applies to metadata. To retrieve a dataset&#039;s metadata in ISO 19139/19115 format:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -L -H &#039;Accept: application/vnd.iso19139.metadata+xml&#039; \&lt;br /&gt;
   &amp;lt;nowiki&amp;gt;https://doi.org/10.1594/PANGAEA.841672&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
The following MIME types are currently supported for metadata retrieval via content negotiation:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Format&lt;br /&gt;
!MIME type&lt;br /&gt;
|-&lt;br /&gt;
|PANGAEA internal XML&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.pangaea.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|DataCite XML (v4)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.datacite.datacite+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|ISO 19139 / 19115&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.iso19139.metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|NASA DIF&lt;br /&gt;
|&amp;lt;code&amp;gt;application/vnd.nasa.dif-metadata+xml&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|JSON-LD (Schema.org)&lt;br /&gt;
|&amp;lt;code&amp;gt;application/ld+json&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|BibTeX&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-bibtex&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|RIS&lt;br /&gt;
|&amp;lt;code&amp;gt;application/x-research-info-systems&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Plain text citation&lt;br /&gt;
|&amp;lt;code&amp;gt;text/x-bibliography&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Tab-separated data&lt;br /&gt;
|&amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
Alternatively, the same formats can be requested using explicit URL parameters:&lt;br /&gt;
 &amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_datacite4&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_iso19139&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=metadata_jsonld&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.841672?format=citation_bibtex&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
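In a script, the content-negotiation requests shown above can be prepared with the standard library alone. The following minimal Python sketch builds a request against the DOI with the appropriate &amp;lt;code&amp;gt;Accept&amp;lt;/code&amp;gt; header; the short format keys are a convenience of this example, while the MIME types are those from the table above:&lt;br /&gt;

```python
import urllib.request

# Short keys (example-specific) mapped to MIME types from the table above
MIME = {
    "datacite": "application/vnd.datacite.datacite+xml",
    "iso19139": "application/vnd.iso19139.metadata+xml",
    "jsonld": "application/ld+json",
    "bibtex": "application/x-bibtex",
}

def metadata_request(doi, fmt):
    """Build a GET request for a DOI with the Accept header set for fmt."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": MIME[fmt]},
    )

# Sending the request requires network access:
# with urllib.request.urlopen(metadata_request("10.1594/PANGAEA.841672", "jsonld")) as r:
#     print(r.read().decode())
```

The same pattern works for the data file itself by using the &amp;lt;code&amp;gt;text/tab-separated-values&amp;lt;/code&amp;gt; MIME type.&lt;br /&gt;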
&lt;br /&gt;
== Access to Restricted Datasets ==&lt;br /&gt;
A small fraction of PANGAEA datasets are under an active moratorium — access to the data is temporarily restricted while the metadata remains publicly visible. Access to your own moratorium datasets requires authentication using a &#039;&#039;&#039;bearer token&#039;&#039;&#039;, the standard mechanism for API authorization (RFC 6750). Note that restricted datasets of other authors cannot be accessed this way.&lt;br /&gt;
&lt;br /&gt;
To obtain your bearer token, log in to PANGAEA at &amp;lt;nowiki&amp;gt;https://pangaea.de/user/login.php&amp;lt;/nowiki&amp;gt;. Login is supported both with a PANGAEA username and password and via ORCID iD. After logging in, the user profile page at &amp;lt;nowiki&amp;gt;https://pangaea.de/user/&amp;lt;/nowiki&amp;gt; displays the current session token under &amp;quot;Your temporary login token.&amp;quot; The token is renewed each time you log out and log in again; it is passed as an &amp;lt;code&amp;gt;Authorization: Bearer&amp;lt;/code&amp;gt; header in any HTTP request:&lt;br /&gt;
 &amp;lt;code&amp;gt;curl -OJLf \&lt;br /&gt;
   -H &#039;Authorization: Bearer &amp;lt;your-token&amp;gt;&#039; \&lt;br /&gt;
   -H &#039;Accept: text/tab-separated-values&#039; \&lt;br /&gt;
   &amp;lt;nowiki&amp;gt;https://doi.org/10.1594/PANGAEA.841672&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
If a request for a restricted dataset is made without a token, the server responds with HTTP &amp;lt;code&amp;gt;401 Unauthorized&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;WWW-Authenticate: Bearer&amp;lt;/code&amp;gt; header, signaling that authentication is required.&lt;br /&gt;
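For scripted access, the header construction can be factored into a small helper so that the same code path serves both public and moratorium datasets. This is a minimal sketch; the helper name is specific to this example:&lt;br /&gt;

```python
def auth_headers(token=None, accept="text/tab-separated-values"):
    """Build HTTP headers for a PANGAEA data request.

    A bearer token is only needed for your own moratorium datasets;
    public datasets are served without any authentication."""
    headers = {"Accept": accept}
    if token:
        headers["Authorization"] = "Bearer " + token
    return headers
```

Passing the result to any HTTP client reproduces the curl invocation above; omitting the token simply yields an unauthenticated request.&lt;br /&gt;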
&lt;br /&gt;
== OAI-PMH Metadata Harvesting ==&lt;br /&gt;
For bulk metadata harvesting, PANGAEA provides an OAI-PMH endpoint at:&lt;br /&gt;
 &amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
OAI-PMH allows systematic, incremental harvesting of all PANGAEA metadata in supported formats. The following metadata standards are available via OAI-PMH: Dublin Core, DataCite v3 and v4, ISO 19139, and DIF (Directory Interchange Format). This endpoint is used by the portals and registries listed in the discovery section above, and is equally available to any user who wishes to build their own index or integrate PANGAEA metadata into an institutional system.&lt;br /&gt;
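An OAI-PMH harvesting request is an ordinary HTTP GET with protocol-defined query parameters. The sketch below builds a &amp;lt;code&amp;gt;ListRecords&amp;lt;/code&amp;gt; URL for incremental harvesting; &amp;lt;code&amp;gt;oai_dc&amp;lt;/code&amp;gt; is the Dublin Core prefix mandated by the OAI-PMH specification, while the identifiers for the other formats should be taken from a &amp;lt;code&amp;gt;ListMetadataFormats&amp;lt;/code&amp;gt; response rather than assumed:&lt;br /&gt;

```python
from urllib.parse import urlencode

OAI_ENDPOINT = "https://ws.pangaea.de/oai/"

def list_records_url(metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for incremental harvesting."""
    if resumption_token:
        # Per the OAI-PMH spec, a resumption token must be the only
        # argument besides the verb
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # e.g. "2026-01-01" for new records only
    return OAI_ENDPOINT + "?" + urlencode(params)
```

A harvester repeats the request with the &amp;lt;code&amp;gt;resumptionToken&amp;lt;/code&amp;gt; returned in each response until no further token is issued.&lt;br /&gt;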
&lt;br /&gt;
== Programmatic Access with Python: pangaeapy ==&lt;br /&gt;
&#039;&#039;&#039;pangaeapy&#039;&#039;&#039; is the official Python client library for PANGAEA, developed and maintained by the PANGAEA team. It provides a high-level interface for loading and analyzing PANGAEA datasets directly into native Python data structures, without requiring manual HTTP requests or format parsing.&lt;br /&gt;
&lt;br /&gt;
* PyPI package: &amp;lt;nowiki&amp;gt;https://pypi.org/project/pangaeapy/&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Source code: &amp;lt;nowiki&amp;gt;https://github.com/pangaea-data-publisher/pangaeapy&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Introduction slides: &amp;lt;nowiki&amp;gt;https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Introduction_to_pangaeapy.pdf&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;pip install pangaeapy&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
The central object in pangaeapy is &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt;, which takes a DOI or PANGAEA dataset ID and retrieves both the data and the associated metadata:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanDataSet&lt;br /&gt;
 &lt;br /&gt;
 ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a pandas DataFrame&lt;br /&gt;
 print(ds.data.head())&lt;br /&gt;
 &lt;br /&gt;
 # Access dataset metadata&lt;br /&gt;
 print(ds.title)&lt;br /&gt;
 print(ds.authors)&lt;br /&gt;
 print(ds.parameters)  # list of measured parameters with units and methods&amp;lt;/code&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;data&amp;lt;/code&amp;gt; attribute is a pandas DataFrame in which each column corresponds to a measured parameter, and each row to a measurement. Parameter metadata — including units, standard names, and method descriptions — is accessible through the &amp;lt;code&amp;gt;parameters&amp;lt;/code&amp;gt; attribute.&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from Python ===&lt;br /&gt;
pangaeapy also provides a &amp;lt;code&amp;gt;PanQuery&amp;lt;/code&amp;gt; class for querying the PANGAEA search engine programmatically. Note that the underlying search API used by pangaeapy is currently internal and undocumented; an official, publicly documented Search REST API is under development. Until that is available, pangaeapy offers the most straightforward path to programmatic search for Python users:&lt;br /&gt;
 &amp;lt;code&amp;gt;from pangaeapy import PanQuery&lt;br /&gt;
 &lt;br /&gt;
 q = PanQuery(&#039;temperature salinity Atlantic&#039;, limit=10)&lt;br /&gt;
 for result in q.results:&lt;br /&gt;
     print(result[&#039;doi&#039;], result[&#039;title&#039;])&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Access to Restricted Datasets ===&lt;br /&gt;
A bearer token obtained from the PANGAEA user profile can be passed to &amp;lt;code&amp;gt;PanDataSet&amp;lt;/code&amp;gt; to enable access to moratorium datasets (note that the token only grants access to your own restricted datasets, not to those of other authors):&lt;br /&gt;
 &amp;lt;code&amp;gt;ds = PanDataSet(&#039;10.1594/PANGAEA.841672&#039;, token=&#039;&amp;lt;your-bearer-token&amp;gt;&#039;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further Training Materials ===&lt;br /&gt;
Jupyter notebooks with worked examples for data discovery, loading, and analysis with pangaeapy are available in the PANGAEA Community Workshop GitHub repository at &amp;lt;nowiki&amp;gt;https://github.com/pangaea-data-publisher/community-workshop-material&amp;lt;/nowiki&amp;gt; (see the &amp;lt;code&amp;gt;Python/&amp;lt;/code&amp;gt; directory).&lt;br /&gt;
&lt;br /&gt;
== Programmatic Access with R: pangaear ==&lt;br /&gt;
&#039;&#039;&#039;pangaear&#039;&#039;&#039; is a community-developed R client for PANGAEA, maintained as part of the rOpenSci ecosystem. It provides equivalent functionality to pangaeapy for R users.&lt;br /&gt;
&lt;br /&gt;
* CRAN: &amp;lt;nowiki&amp;gt;https://cran.r-project.org/package=pangaear&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* GitHub: &amp;lt;nowiki&amp;gt;https://github.com/ropensci/pangaear&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Introduction slides: &amp;lt;nowiki&amp;gt;https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Intro_PangaeaR.pdf&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Installation ===&lt;br /&gt;
 &amp;lt;code&amp;gt;install.packages(&amp;quot;pangaear&amp;quot;)&lt;br /&gt;
 # or the development version:&lt;br /&gt;
 # devtools::install_github(&amp;quot;ropensci/pangaear&amp;quot;)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading a Dataset ===&lt;br /&gt;
 &amp;lt;code&amp;gt;library(pangaear)&lt;br /&gt;
 &lt;br /&gt;
 # Download and load a dataset by DOI&lt;br /&gt;
 ds &amp;lt;- pg_data(doi = &#039;10.1594/PANGAEA.841672&#039;)&lt;br /&gt;
 &lt;br /&gt;
 # Access the data as a data frame&lt;br /&gt;
 head(ds&amp;lt;nowiki&amp;gt;[[1]]&amp;lt;/nowiki&amp;gt;$data)&lt;br /&gt;
 &lt;br /&gt;
 # Metadata is accessible from the list object&lt;br /&gt;
 ds&amp;lt;nowiki&amp;gt;[[1]]&amp;lt;/nowiki&amp;gt;$metadata&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Searching PANGAEA from R ===&lt;br /&gt;
 &amp;lt;code&amp;gt;res &amp;lt;- pg_search(query = &#039;temperature salinity Atlantic&#039;, count = 10)&lt;br /&gt;
 print(res$doi)&amp;lt;/code&amp;gt;&lt;br /&gt;
R scripts and worked examples are available in the &amp;lt;code&amp;gt;R/&amp;lt;/code&amp;gt; directory of the PANGAEA Community Workshop GitHub repository.&lt;br /&gt;
&lt;br /&gt;
== The PANGAEA Data Warehouse ==&lt;br /&gt;
The PANGAEA data warehouse (based on ClickHouse) provides a powerful complementary access path for users who need to compile and aggregate data across large numbers of datasets — potentially across hundreds of thousands of individual publications — without having to download and merge files manually.&lt;br /&gt;
&lt;br /&gt;
The data warehouse supports spatially and chronologically constrained queries at the parameter level, returning data together with the DOI of each contributing dataset to maintain full provenance traceability. It is accessible in two ways:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Through the PANGAEA website:&#039;&#039;&#039; The data warehouse interface is integrated into the search results page. After performing a search, users can select &amp;quot;Data Warehouse&amp;quot; to configure and download a parameter-level aggregation from all datasets in the result set.&lt;br /&gt;
&lt;br /&gt;
Documentation: &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_warehouse&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Programmatically via REST API:&#039;&#039;&#039; The data warehouse is also accessible through the PANGAEA web services at &amp;lt;nowiki&amp;gt;https://ws.pangaea.de/&amp;lt;/nowiki&amp;gt;. This allows automated, scripted aggregations and is supported by both pangaeapy and pangaear.&lt;br /&gt;
&lt;br /&gt;
Note that data warehouse exports represent compiled data products and do not replace the need to consult individual dataset landing pages to assess the fitness of individual studies for a specific scientific application. Every export includes the DOI name for each data point, ensuring that citation and provenance are preserved.&lt;br /&gt;
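Because every exported data point carries its source DOI, downstream analyses can partition a warehouse export back into its contributing datasets for citation and fitness assessment. A minimal Python sketch, assuming rows have been read into dicts and that the provenance column is named &amp;lt;code&amp;gt;doi&amp;lt;/code&amp;gt; (the column name here is illustrative, not a documented warehouse field):&lt;br /&gt;

```python
from collections import defaultdict

def group_by_doi(rows):
    """Group warehouse export rows by the DOI of their source dataset,
    preserving provenance so each contribution can be cited."""
    by_doi = defaultdict(list)
    for row in rows:
        by_doi[row["doi"]].append(row)
    return dict(by_doi)
```

The resulting mapping gives, for each DOI, the subset of aggregated data points it contributed, which is exactly the information needed to consult and cite the individual landing pages.&lt;br /&gt;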
&lt;br /&gt;
== Summary of Access Methods ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Use case&lt;br /&gt;
!Method&lt;br /&gt;
!Entry point&lt;br /&gt;
|-&lt;br /&gt;
|Interactive discovery&lt;br /&gt;
|PANGAEA search&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://www.pangaea.de/&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Single dataset access&lt;br /&gt;
|Browser / landing page&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://doi.pangaea.de/10.1594/PANGAEA.xxxxx&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic data download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: text/tab-separated-values&#039; &amp;lt;nowiki&amp;gt;https://doi.org/&amp;lt;/nowiki&amp;gt;...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Programmatic metadata download&lt;br /&gt;
|HTTP content negotiation&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -H &#039;Accept: application/ld+json&#039; &amp;lt;nowiki&amp;gt;https://doi.org/&amp;lt;/nowiki&amp;gt;...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Link and format discovery&lt;br /&gt;
|HTTP HEAD / Signposting&lt;br /&gt;
|&amp;lt;code&amp;gt;curl -I &amp;lt;nowiki&amp;gt;https://doi.pangaea.de/&amp;lt;/nowiki&amp;gt;...&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Restricted dataset access&lt;br /&gt;
|Bearer token authentication&lt;br /&gt;
|Via &amp;lt;nowiki&amp;gt;https://pangaea.de/user/&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Bulk metadata harvesting&lt;br /&gt;
|OAI-PMH&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://ws.pangaea.de/oai/&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (Python)&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://pypi.org/project/pangaeapy/&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Scripted data access (R)&lt;br /&gt;
|pangaear&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://cran.r-project.org/package=pangaear&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Cross-dataset aggregation&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_warehouse&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|Discovery across repositories&lt;br /&gt;
|DataCite Search API&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://support.datacite.org/docs/api&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Further Resources ==&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;[[PANGAEA search]]&amp;lt;/nowiki&amp;gt; — documentation on search syntax and faceted filters&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;[[Data warehouse]]&amp;lt;/nowiki&amp;gt; — data warehouse documentation&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;[[PANGAEA Community Workshops]]&amp;lt;/nowiki&amp;gt; — hands-on training workshops on finding and using PANGAEA data&lt;br /&gt;
* PANGAEA Community Workshop GitHub — Jupyter notebooks, R scripts, and slide decks&lt;br /&gt;
* pangaeapy on PyPI — Python client&lt;br /&gt;
* pangaear on CRAN — R client&lt;br /&gt;
* PANGAEA web services — REST and OAI-PMH endpoints&lt;br /&gt;
* Signposting standard — link discovery for FAIR data&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16830</id>
		<title>Data Access and Reuse</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Access_and_Reuse&amp;diff=16830"/>
		<updated>2026-04-22T08:54:10Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Created page with &amp;quot;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.  Hands-on training materials covering m...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This article describes the different methods available for discovering, accessing, and reusing data published in PANGAEA. It is intended for researchers and data practitioners who want to interact with PANGAEA data beyond the standard web interface, including scripted and automated workflows. The methods range from interactive web-based search to fully programmatic access via HTTP content negotiation and dedicated client libraries.&lt;br /&gt;
&lt;br /&gt;
Hands-on training materials covering many of the topics below are available in the [[PANGAEA Community Workshops|PANGAEA Community Workshop Series]], including Jupyter notebooks, R scripts, and slide decks in the [https://github.com/pangaea-data-publisher/community-workshop-material PANGAEA community workshop GitHub repository].&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16806</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16806"/>
		<updated>2026-04-13T09:49:47Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Middleware: Processing and Transformation */ Made links work&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hardware and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 5 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backup copies (`pg_basebackup`) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;ClickHouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk is replicated to tape nightly and saved to snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data is captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA operates a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org with Croissant extension), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (https://signposting.org/), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (https://www.panfmp.org/) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
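&lt;br /&gt;
The release gating described above can be pictured with a GitLab CI configuration fragment; job names and scripts are illustrative stand-ins, not the actual PANGAEA pipeline.&lt;br /&gt;

```yaml
# Illustrative pipeline: deployment runs only when every test job succeeds.
stages: [test, deploy]

unit-tests:
  stage: test
  script:
    - npx nx run-many --target=test   # Nx runs unit tests across the monorepo

e2e-tests:
  stage: test
  script:
    - npx cypress run                 # Cypress end-to-end suite

deploy:
  stage: deploy
  script:
    - ./deploy.sh                     # placeholder deployment step
  when: on_success                    # gated on all prior stages passing
  only: [main]
```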
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
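&lt;br /&gt;
Content negotiation on this layer can be exercised with a few lines of Python. The DOI below is a placeholder, and the requested media type is an assumption; in practice the supported types should be taken from the repository&#039;s advertised links rather than guessed. The request is constructed but deliberately not sent, to keep the sketch offline.&lt;br /&gt;

```python
import urllib.request

# Sketch: ask a dataset landing page for machine-readable metadata via HTTP
# content negotiation. Placeholder DOI; the Accept media type is an assumption.
req = urllib.request.Request(
    "https://doi.pangaea.de/10.1594/PANGAEA.000000",
    headers={"Accept": "application/ld+json"},
)
# urllib.request.urlopen(req) would perform the negotiated fetch.
print(req.full_url, req.get_header("Accept"))
```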
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at https://wiki.pangaea.de/wiki/PANGAEA_search.&lt;br /&gt;
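&lt;br /&gt;
The facet mechanics can be sketched as an Elasticsearch query body, combining a full-text clause, a facet filter, and the aggregations that back the facet counts. The field names below are illustrative, not the actual PANGAEA index mapping.&lt;br /&gt;

```python
# Illustrative Elasticsearch query DSL body; field names are hypothetical.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"fulltext": "sea surface temperature"}}],
            "filter": [{"term": {"region": "Arctic Ocean"}}],
        }
    },
    "aggs": {  # aggregations provide the per-facet document counts
        "by_topic": {"terms": {"field": "topic"}},
        "by_device": {"terms": {"field": "deviceType"}},
    },
    "size": 10,
}
print(sorted(query["aggs"]))
```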
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD extended with Croissant properties, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
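&lt;br /&gt;
A Signposting-aware client consumes typed links such as the following. The target URLs are placeholders; the relation names cite-as, describedby, and item are standard Signposting relations.&lt;br /&gt;

```python
# Typed links a client might discover on a landing page (placeholder targets).
links = [
    {"target": "https://doi.org/10.1594/PANGAEA.000000", "rel": "cite-as"},
    {"target": "https://example.org/metadata.jsonld", "rel": "describedby",
     "type": "application/ld+json"},
    {"target": "https://example.org/data.tab", "rel": "item",
     "type": "text/tab-separated-values"},
]

def find(rel, links):
    """Return all targets advertised with a given link relation."""
    return [link["target"] for link in links if link["rel"] == rel]

print(find("describedby", links))
```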
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
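&lt;br /&gt;
Harvesting via OAI-PMH reduces to standard protocol parameters. In the sketch below the endpoint path is an assumption to verify against the service documentation, while the verb and metadataPrefix parameters are part of the OAI-PMH protocol itself.&lt;br /&gt;

```python
from urllib.parse import urlencode

# Sketch: building a selective OAI-PMH ListRecords request. The base URL is
# an assumption; oai_dc (Dublin Core) is mandatory for any OAI-PMH endpoint.
BASE = "https://ws.pangaea.de/oai/provider"
params = {
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
    "from": "2026-01-01",  # optional selective-harvesting window
}
url = BASE + "?" + urlencode(params)
print(url)
```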
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
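&lt;br /&gt;
The weekly link check boils down to classifying response codes, as in the minimal sketch below; in production the status codes would come from real HTTP requests against each recorded URL.&lt;br /&gt;

```python
# Classify link-check outcomes as described above: 404 = broken,
# 301 = permanently redirected, anything else treated as healthy here.
def classify(status):
    if status == 404:
        return "broken"
    if status == 301:
        return "redirected"
    return "ok"

checked = [
    ("https://example.org/paper", 200),
    ("https://example.org/old-id", 301),
    ("https://example.org/gone", 404),
]
results = {url: classify(code) for url, code in checked}
print(results)
```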
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at https://wiki.pangaea.de/wiki/Data_Usage_Statistics.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ) that is reachable from outside, but only through restricted, firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through the use of asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, short-cycle security patching for all hardware and software components, professional monitoring tools for hardware, firewall, software, services, performance, and attacks, and regular security training for all technical and non-technical staff. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
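&lt;br /&gt;
The application-layer rule can be stated as a small predicate. The function name and the two-year default follow the policy above; a real check would also consult per-user grants and any editorially extended moratorium dates.&lt;br /&gt;

```python
from datetime import date, timedelta

# Metadata stay public in all cases; data access requires an expired
# moratorium or individual authorization (two-year default as stated above).
def data_accessible(submitted, user_is_authorized, today):
    moratorium_end = submitted + timedelta(days=2 * 365)
    return today >= moratorium_end or user_is_authorized

print(data_accessible(date(2024, 5, 1), False, today=date(2026, 6, 1)))
```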
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff in an adjacent building approximately 15 m from the datacenter, and is equipped with automated fire alarms, a redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || Clickhouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (https://wiki.pangaea.de/). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. https://doi.org/10.1038/s41597-023-02269-x&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. https://doi.org/10.1016/j.jbiotec.2017.07.016&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. https://doi.org/10.5334/dsj-2026-006&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16805</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16805"/>
		<updated>2026-04-13T09:15:16Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Off-Site Replica */ cosmetic&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hardware and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 5 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
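&lt;br /&gt;
The effect of normalization can be illustrated with a toy relational model, shown here in SQLite rather than PostgreSQL for self-containment; table and column names are simplified stand-ins, not the real schema.&lt;br /&gt;

```python
import sqlite3

# Toy normalized model: events hold the observational context, datasets
# reference them, and a description is compiled dynamically by joining.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE event   (id INTEGER PRIMARY KEY, label TEXT, campaign TEXT);
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, title TEXT,
                          event_id INTEGER REFERENCES event(id));
    INSERT INTO event   VALUES (1, 'PS85/401-1', 'PS85');
    INSERT INTO dataset VALUES (10, 'CTD profile', 1);
""")
row = con.execute("""
    SELECT d.title, e.label, e.campaign
    FROM dataset d JOIN event e ON e.id = d.event_id
""").fetchone()
print(row)
```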
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backups (&amp;lt;code&amp;gt;pg_basebackup&amp;lt;/code&amp;gt;) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
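&lt;br /&gt;
Operationally, such a weekly base backup can be sketched as follows; the host, paths, and retention handling are placeholders, since the actual AWI setup is not public.&lt;br /&gt;

```shell
# Hypothetical weekly job: take a compressed base backup, then prune copies
# older than three weeks. All names and paths are placeholders.
pg_basebackup --host=db-primary --pgdata=/backup/base/week-$(date +%G-%V) \
    --format=tar --gzip --checkpoint=fast --progress
find /backup/base -maxdepth 1 -name 'week-*' -mtime +21 -exec rm -r {} +
```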
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;Clickhouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
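&lt;br /&gt;
The kind of question the warehouse answers can be expressed as a parameter-level query. The SQL below is purely illustrative, with hypothetical table and column names; public access goes through the website and the REST API, not direct SQL.&lt;br /&gt;

```python
# Illustrative warehouse query: one parameter, constrained in space and time,
# with the contributing dataset DOI carried along for provenance.
query = """
    SELECT dataset_doi, event_date, latitude, longitude, value
    FROM warehouse.measurements
    WHERE parameter = 'Temperature, water'
      AND latitude BETWEEN 70 AND 80
      AND event_date BETWEEN '2020-01-01' AND '2020-12-31'
    ORDER BY event_date
"""
print(" ".join(query.split()))
```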
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly and captured in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
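&lt;br /&gt;
The principle behind erasure-coded redundancy can be shown in its simplest form, single XOR parity. Production systems use more general codes (such as Reed-Solomon) tolerating multiple simultaneous failures, so this is an illustration of the idea, not the AWI configuration.&lt;br /&gt;

```python
# One parity block computed over equal-sized data blocks lets any single lost
# block be reconstructed from the survivors (the simplest erasure code).
blocks = [b"data-a", b"data-b", b"data-c"]
parity = bytes(x ^ y ^ z for x, y, z in zip(*blocks))

# Simulate losing blocks[1] and recovering it from the other blocks + parity.
recovered = bytes(x ^ y ^ p for x, y, p in zip(blocks[0], blocks[2], parity))
print(recovered == blocks[1])
```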
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours of a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and covers all dataset metadata — including individual landing pages and all harvesting endpoints — as well as full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with a formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org with Croissant extension), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD extended with Croissant properties, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ) that is reachable from outside, but only through restricted, firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through:&lt;br /&gt;
* asymmetric key infrastructures;&lt;br /&gt;
* mandatory minimum-length passwords for all user classes;&lt;br /&gt;
* short-cycle security patching for all hardware and software components;&lt;br /&gt;
* professional monitoring tools for hardware, firewall, software, services, performance, and attacks;&lt;br /&gt;
* regular security training for all technical and non-technical staff.&lt;br /&gt;
Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff stationed in an adjacent building 15 m from the datacenter, with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || ClickHouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16804</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16804"/>
		<updated>2026-04-10T16:39:04Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Removed construction flag&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 5 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backup copies are created each weekend with &#039;&#039;&#039;pg_basebackup&#039;&#039;&#039; and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
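A weekly base backup of the kind described above can be taken with PostgreSQL's pg_basebackup tool. The following is a sketch only; host, user, and target directory are placeholders, not PANGAEA's actual configuration:

```shell
# Weekly full base backup with WAL streaming so the copy is
# self-consistent without manual WAL archiving. All connection
# details and paths below are placeholders.
pg_basebackup --host=db-primary.example.org --username=replicator \
    --pgdata=/backup/pg_base_$(date +%G-W%V) \
    --wal-method=stream --checkpoint=fast --progress
```

Retaining three weekly directories and deleting older ones reproduces the three-week retention window mentioned above.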
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;ClickHouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly and preserved in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data is captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org with Croissant extension), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
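Harvesting via OAI-PMH reduces to plain HTTP requests. The sketch below builds a ListRecords request; the base URL is an assumption about PANGAEA's endpoint, and "datacite3" stands in for whichever metadataPrefix the endpoint actually advertises via its ListMetadataFormats response:

```python
from urllib.parse import urlencode

# Assumed OAI-PMH endpoint and an illustrative metadataPrefix; a real
# harvester would first call ListMetadataFormats to discover prefixes.
base = "https://ws.pangaea.de/oai/provider"
params = {"verb": "ListRecords", "metadataPrefix": "datacite3"}
url = base + "?" + urlencode(params)
print(url)
```

Subsequent pages of results are fetched by replacing the parameters with the resumptionToken returned in each response, as defined by the OAI-PMH protocol.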
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four instances are operated in parallel, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD extended with Croissant properties, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
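PANGAEA DOIs follow the pattern 10.1594/PANGAEA.NNNNNN, where NNNNNN is the internal dataset ID that client libraries accept. A sketch of extracting it, using a placeholder DOI:

```python
# Extract the numeric dataset ID from a PANGAEA DOI string
# (the DOI below is a placeholder, not a real dataset).
def pangaea_id(doi):
    # e.g. "https://doi.org/10.1594/PANGAEA.123456" -> 123456
    return int(doi.rsplit("PANGAEA.", 1)[1])

assert pangaea_id("https://doi.org/10.1594/PANGAEA.123456") == 123456

# With pangaeapy installed, the dataset could then be loaded with e.g.:
#   from pangaeapy import PanDataSet
#   ds = PanDataSet(pangaea_id(doi))   # network access required
```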
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ) that is reachable from outside, but only through restricted, firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through:&lt;br /&gt;
* asymmetric key infrastructures;&lt;br /&gt;
* mandatory minimum-length passwords for all user classes;&lt;br /&gt;
* short-cycle security patching for all hardware and software components;&lt;br /&gt;
* professional monitoring tools for hardware, firewall, software, services, performance, and attacks;&lt;br /&gt;
* regular security training for all technical and non-technical staff.&lt;br /&gt;
Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff stationed in an adjacent building 15 m from the datacenter, with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || ClickHouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16803</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16803"/>
		<updated>2026-04-10T16:27:48Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Dataset Landing Pages */ added Croissant extension&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{construction}}&lt;br /&gt;
&lt;br /&gt;
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 5 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backup copies (&amp;lt;code&amp;gt;pg_basebackup&amp;lt;/code&amp;gt;) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
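&lt;br /&gt;
For illustration, the weekend backup cycle and its three-week retention window could be driven by a crontab fragment of the following shape; the paths, host name, and database role are assumptions, not PANGAEA&#039;s actual configuration.&lt;br /&gt;

```shell
# Illustrative crontab fragment; paths, host, and role are assumptions.
# Saturday 02:00: full base backup into a dated, compressed tar directory
# (cron requires % to be escaped as \%).
0 2 * * 6 pg_basebackup -h db.example.org -U replicator -D /backup/pg/base_$(date +\%Y\%m\%d) -Ft -z -P
# Daily 03:00: enforce the three-week retention window (drop after 21 days).
0 3 * * * find /backup/pg -maxdepth 1 -name 'base_*' -mtime +21 -exec rm -rf {} +
```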
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;ClickHouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly and captured in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org with Croissant extension), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
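&lt;br /&gt;
As a simplified illustration of one marshaling step, the sketch below maps a toy internal-style XML record onto a schema.org/Dataset JSON-LD document; the element names and the mapping are hypothetical and much reduced compared to the real pipeline.&lt;br /&gt;

```python
import json
import xml.etree.ElementTree as ET

# Build a toy internal-style record programmatically; the element names
# are hypothetical, not PANGAEA's actual internal schema.
root = ET.Element("dataset")
ET.SubElement(root, "title").text = "Sea surface temperature, station X"
ET.SubElement(root, "doi").text = "10.1594/PANGAEA.000000"  # placeholder DOI
ET.SubElement(root, "author").text = "Doe, J."

def to_schema_org(record: ET.Element) -> dict:
    """Map the internal record onto a minimal schema.org/Dataset document."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": record.findtext("title"),
        "identifier": "https://doi.org/" + record.findtext("doi"),
        "creator": {"@type": "Person", "name": record.findtext("author")},
    }

print(json.dumps(to_schema_org(root), indent=2))
```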
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
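&lt;br /&gt;
The dual-backend split behind the reverse proxy could be expressed, in outline, by an NGINX configuration fragment of the following shape; all server names, ports, and certificate paths are placeholders, not the production setup.&lt;br /&gt;

```nginx
# Illustrative fragment; names, ports, and certificate paths are placeholders.
server {
    listen 443 ssl;
    listen 443 quic reuseport;        # HTTP/3 (QUIC) alongside HTTP/1.1
    http2 on;                         # and HTTP/2 on the same port
    server_name www.example.org;
    ssl_certificate     /etc/ssl/example.pem;   # placeholder
    ssl_certificate_key /etc/ssl/example.key;   # placeholder
    add_header Alt-Svc 'h3=":443"; ma=86400' always;  # advertise HTTP/3

    # General information pages and the Wiki: PHP environment.
    location /wiki/ { proxy_pass http://php-backend:8080; }

    # Landing pages, DOI resolution, content negotiation: Java/Jetty layer.
    location / { proxy_pass http://jetty-backend:8081; }
}
```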
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
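&lt;br /&gt;
Conceptually, each facet maps onto an Elasticsearch terms aggregation; the sketch below shows the shape of such a query with hypothetical field names, since the actual index mapping is not documented here.&lt;br /&gt;

```python
import json

# Shape of a faceted search request (Elasticsearch query DSL); the field
# names "device.keyword" and "region.keyword" are hypothetical.
query = {
    "query": {"query_string": {"query": "sea surface temperature"}},
    "aggs": {  # one bucket aggregation per facet shown in the UI
        "by_device": {"terms": {"field": "device.keyword", "size": 10}},
        "by_region": {"terms": {"field": "region.keyword", "size": 10}},
    },
    "size": 20,  # number of dataset hits to return alongside the facets
}
print(json.dumps(query, indent=2))
```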
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD extended with Croissant properties, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ) accessible from outside with restricted but still firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through the use of asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, short-cycle security patching for all hardware and software components, professional monitoring tools for hardware, firewall, software, services, performance, and attacks, and regular security training for all technical and non-technical staff. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff located in very close proximity to the datacenter (15 m, adjacent building), with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || ClickHouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16802</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16802"/>
		<updated>2026-04-10T16:25:19Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Middleware: Processing and Transformation */ added Croissant extension&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{construction}}&lt;br /&gt;
&lt;br /&gt;
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 5 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backup copies (&amp;lt;code&amp;gt;pg_basebackup&amp;lt;/code&amp;gt;) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;ClickHouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly and captured in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org with Croissant extension), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
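The content negotiation described above can be sketched from the client side as follows. This is a minimal illustration: the example DOI URL and the Accept media type are placeholder assumptions, not values taken from PANGAEA documentation.&lt;br /&gt;

```python
# Sketch: requesting a specific metadata serialization from a dataset
# landing page via HTTP content negotiation. The DOI URL and the media
# type below are placeholders, not confirmed PANGAEA values.
import urllib.request

def metadata_request(doi_url: str, media_type: str) -> urllib.request.Request:
    """Build a GET request asking the landing page for one metadata format."""
    return urllib.request.Request(doi_url, headers={"Accept": media_type})

req = metadata_request("https://doi.pangaea.de/10.1594/PANGAEA.000000",
                       "application/ld+json")  # hypothetical DOI and type
print(req.get_header("Accept"))  # application/ld+json
```

Passing the request to urllib.request.urlopen would then return the negotiated representation, subject to the media types the server actually supports.&lt;br /&gt;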
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
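A full-text query can be composed as a simple URL against the public website. In this sketch the q parameter mirrors the website search box, while any facet parameter names would have to be taken from the search documentation; the keyword arguments below are purely illustrative.&lt;br /&gt;

```python
# Sketch: building a PANGAEA full-text search URL. Facet keyword
# arguments are hypothetical parameter names for illustration only.
from urllib.parse import urlencode

def search_url(query: str, **facets: str) -> str:
    """Compose a search URL for the public website."""
    return "https://www.pangaea.de/?" + urlencode({"q": query, **facets})

print(search_url("sea surface temperature"))
# https://www.pangaea.de/?q=sea+surface+temperature
```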
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
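Under the Signposting convention, machine clients discover such metadata links from typed Link response headers. A simplified parser, run here on a synthetic header value rather than a real PANGAEA response, might look like this:&lt;br /&gt;

```python
# Sketch: extracting typed links from an HTTP "Link" response header, as
# used by Signposting. The header value is synthetic, and the parser is
# simplified (it only handles target plus a single rel parameter).
import re

def parse_link_header(value: str) -> dict:
    """Map each rel type to its target URL."""
    links = {}
    for target, rel in re.findall(r'<([^>]+)>\s*;\s*rel="([^"]+)"', value):
        links[rel] = target
    return links

header = ('<https://example.org/meta.jsonld>; rel="describedby", '
          '<https://example.org/data.tab>; rel="item"')
print(parse_link_header(header)["describedby"])  # https://example.org/meta.jsonld
```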
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
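Independent of the client libraries, tabular datasets can also be downloaded as tab-delimited text files whose metadata header is enclosed in /* ... */ ahead of the data table. A minimal sketch of splitting such a file, using a synthetic sample rather than a real download:&lt;br /&gt;

```python
# Sketch: splitting the PANGAEA tab-delimited text format into its
# metadata header and data table. The sample string is synthetic.
import csv, io

def split_pangaea_text(text: str):
    """Return (metadata_comment, rows) for a '/* ... */'-headed tab file."""
    meta, _, table = text.partition("*/")
    rows = list(csv.reader(io.StringIO(table.lstrip("\n")), delimiter="\t"))
    return meta.lstrip("/* \n"), rows

sample = "/* DATA DESCRIPTION:\nTitle: synthetic example\n*/\nDepth [m]\tTemp [°C]\n10\t4.2\n"
meta, rows = split_pangaea_text(sample)
print(rows[0])  # ['Depth [m]', 'Temp [°C]']
```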
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are checked automatically every week for broken (HTTP 404) or permanently redirected (HTTP 301) responses.&lt;br /&gt;
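The status classification applied by such a link check can be sketched as a small helper; the mapping below covers only the two problem classes named above and is not the actual checker implementation.&lt;br /&gt;

```python
# Sketch: classifying HTTP status codes the way a weekly link check
# might flag problems (404 broken, 301 permanently moved). Illustrative
# only; the real checker is not published here.
def classify_link(status: int) -> str:
    if status == 404:
        return "broken"
    if status == 301:
        return "permanent-redirect"
    return "ok"

print([classify_link(s) for s in (200, 301, 404)])
# ['ok', 'permanent-redirect', 'broken']
```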
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ), reachable from outside only through restricted, firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, and short-cycle security patching for all hardware and software components. Professional monitoring tools cover hardware, firewalls, software, services, performance, and attacks, and all technical and non-technical staff receive regular security training. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participate in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff located in very close proximity to the datacenter (15 m, adjacent building), with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || Clickhouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16801</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16801"/>
		<updated>2026-04-10T15:25:43Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* System Architecture */ corrected Relational database storage figures&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{construction}}&lt;br /&gt;
&lt;br /&gt;
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 5 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backups (&lt;code&gt;pg_basebackup&lt;/code&gt;) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
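A weekend base backup of this kind could be assembled along the following lines. The target directory and host are placeholders; the flags are standard pg_basebackup options (tar output, compression, streamed WAL, progress reporting), not PANGAEA&#039;s actual configuration.&lt;br /&gt;

```python
# Sketch: composing a pg_basebackup invocation like the weekly full
# backup described above. Directory and host are placeholders.
import shlex

def base_backup_cmd(target_dir: str, host: str = "localhost") -> list:
    """Assemble the command line for a compressed tar-format base backup."""
    return ["pg_basebackup", "-h", host, "-D", target_dir,
            "--format=tar", "--gzip", "--wal-method=stream", "--progress"]

print(shlex.join(base_backup_cmd("/backup/pg/weekly")))
```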
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;Clickhouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly, and the corresponding snapshots are retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
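The Signposting relations mentioned above are exposed as standard HTTP &lt;code&gt;Link&lt;/code&gt; headers. As a minimal sketch (the header values below are hypothetical, not an actual PANGAEA response), such a header can be parsed with a few lines of Python:&lt;br /&gt;

```python
# Minimal parser for an HTTP Link header as used by Signposting
# (https://signposting.org/). The example header below is illustrative,
# not an actual PANGAEA response.
def parse_link_header(header):
    """Return a mapping of link relation -> list of target URLs."""
    rels = {}
    for part in header.split(","):
        segments = [s.strip() for s in part.split(";")]
        url = segments[0].strip("<> ")
        for seg in segments[1:]:
            if seg.startswith("rel="):
                rel = seg[4:].strip('"')
                rels.setdefault(rel, []).append(url)
    return rels

# Hypothetical Signposting relations on a landing page:
header = ('<https://doi.org/10.1594/PANGAEA.000000>; rel="cite-as", '
          '<https://example.org/metadata.jsonld>; rel="describedby"')
links = parse_link_header(header)
print(links["cite-as"])
```

In practice the same relations can be read from a live response through any HTTP client that exposes response headers.&lt;br /&gt;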
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in any of the available metadata standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) let researchers load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
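Because all access is DOI-based, URLs for machine-readable representations can be derived from the DOI alone. The sketch below is illustrative only: the DOI is a placeholder and the &lt;code&gt;?format=...&lt;/code&gt; parameter names are assumptions to be checked against the PANGAEA documentation; for loading data directly into analysis environments, the pangaeapy and pangaear clients are the documented routes.&lt;br /&gt;

```python
# Sketch: derive access URLs from a PANGAEA dataset DOI.
# The DOI below is a placeholder and the "?format=..." parameter names
# are assumptions for illustration, not a documented API contract.
def access_urls(doi):
    """Map a DOI suffix like '10.1594/PANGAEA.000000' to access URLs."""
    base = "https://doi.pangaea.de/" + doi
    return {
        "landing_page": base,                                 # human-readable
        "tab_delimited": base + "?format=textfile",           # assumed parameter
        "metadata_jsonld": base + "?format=metadata_jsonld",  # assumed parameter
    }

urls = access_urls("10.1594/PANGAEA.000000")
print(urls["landing_page"])
```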
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
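The weekly link check described above reduces to classifying HTTP status codes. A minimal illustration of that classification logic (the function name and category labels are ours, not PANGAEA's internal tooling):&lt;br /&gt;

```python
# Sketch of the response classification behind a weekly link check:
# HTTP 404 is flagged as broken, HTTP 301 as permanently redirected,
# everything else passes. Illustrative only.
def classify_link(status_code):
    """Classify an HTTP status code for link-check reporting."""
    if status_code == 404:
        return "broken"
    if status_code == 301:
        return "permanently redirected"
    return "ok"

for code in (200, 301, 404):
    print(code, classify_link(code))
```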
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall, while frontend systems operate in a demilitarized zone (DMZ) that is reachable from outside but remains behind restrictive firewall rules. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through the use of asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, short-cycle security patching for all hardware and software components, professional monitoring tools for hardware, firewall, software, services, performance, and attacks, and regular security training for all technical and non-technical staff. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff located in very close proximity to the datacenter (15 m, adjacent building), with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || Clickhouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=PANGAEA_Community_Workshops&amp;diff=16782</id>
		<title>PANGAEA Community Workshops</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=PANGAEA_Community_Workshops&amp;diff=16782"/>
		<updated>2026-04-07T13:47:58Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Next Editions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:PANGAEA Banner CWS&amp;amp;Partner sans.png|thumb|773x773px]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== About ==&lt;br /&gt;
The &#039;&#039;&#039;PANGAEA Community Workshop Series (CWS)&#039;&#039;&#039; is an established training initiative launched in 2021. It was developed to strengthen research data management (RDM) competences among users of PANGAEA and to promote good scientific practice in the preparation, publication, and reuse of research data. Combining theoretical instruction with practical, hands-on components, the workshop series provides participants with a comprehensive understanding of how to work effectively with the PANGAEA data repository. It is primarily designed for early career scientists but is open for participation by anyone interested.&lt;br /&gt;
&lt;br /&gt;
The overarching goals of the series are to raise awareness of [https://www.go-fair.org/fair-principles/ FAIR data principles] and high-quality data management, empower researchers to make best use of PANGAEA services, and improve the overall efficiency of the data publication process. By educating authors on how to prepare their data appropriately, the workshops not only enhance participants’ individual scientific visibility but also contribute to reducing editorial workload and improving productivity within PANGAEA. &lt;br /&gt;
&lt;br /&gt;
The CWS is a contribution to the [https://www.denbi.de/training de.NBI training network] and covers the two major topics relevant to the use of a data repository.&lt;br /&gt;
&lt;br /&gt;
=== FAIR Data Publications with PANGAEA ===&lt;br /&gt;
The course &#039;&#039;&#039;“FAIR Data Publications with PANGAEA”&#039;&#039;&#039;, usually held in November, focuses on the submission and publication aspects of data management. Participants gain detailed insights into how to prepare datasets and metadata for submission, understand editorial procedures, and learn how to ensure their outputs meet FAIR principles. The benefits for researchers are manifold: publishing FAIR data increases citation rates, visibility, and scientific impact, while promoting transparency and encouraging collaboration. Data authors also gain recognition independent of traditional paper authorship, ensuring proper credit for data generation and processing.&lt;br /&gt;
&lt;br /&gt;
=== Finding and Retrieving Data from PANGAEA ===&lt;br /&gt;
The second course, &#039;&#039;&#039;“Finding and Retrieving Data from PANGAEA”&#039;&#039;&#039;, is devoted to the discovery and reuse of published data and is usually offered in May. It introduces participants to the repository’s powerful search capabilities, standardized metadata structures, and integration with major data portals such as [https://explore.openaire.eu/ OpenAIRE], [https://commons.datacite.org/ DataCite], and [https://datasetsearch.research.google.com/ Google Dataset Search]. Through practical exercises, attendees learn to efficiently locate, combine, and analyze datasets using standard web protocols and scripting tools based on [https://pypi.org/project/pangaeapy/ Python] and [https://github.com/ropensci/pangaear R]. High-quality, harmonized data products curated by domain experts ensure that users can confidently employ PANGAEA data in new studies and analytical workflows, thereby fostering reproducibility, innovation, and cross-disciplinary research.&lt;br /&gt;
&lt;br /&gt;
== Format Details ==&lt;br /&gt;
The workshops are conducted &#039;&#039;&#039;online via web conference&#039;&#039;&#039;, allowing participants from around the world to join easily. Each workshop spans &#039;&#039;&#039;two days&#039;&#039;&#039;, typically held on &#039;&#039;&#039;Thursday and Friday&#039;&#039;&#039;, with &#039;&#039;&#039;two-hour sessions per day&#039;&#039;&#039;. The format is designed to be compact yet interactive, combining presentations with live demonstrations and short exercises. &#039;&#039;&#039;No prior experience or prerequisites&#039;&#039;&#039; are required—participants of all backgrounds are welcome to attend.&lt;br /&gt;
&lt;br /&gt;
== Upcoming and Previous Editions and Staying in the Loop ==&lt;br /&gt;
New editions of the &#039;&#039;&#039;PANGAEA Community Workshop Series&#039;&#039;&#039; are held regularly, typically twice a year, featuring updated content that reflects emerging developments in open science, FAIR data practices, and digital research infrastructures. Researchers interested in participating or staying informed about future events are encouraged to subscribe to the &#039;&#039;&#039;PANGAEA Trainings mailing list&#039;&#039;&#039;. The list provides timely updates on upcoming workshops and other training activities and materials. You can subscribe to the list [https://lists.pangaea.de/listinfo/training &#039;&#039;&#039;here&#039;&#039;&#039;]. &lt;br /&gt;
&lt;br /&gt;
=== Next Editions ===&lt;br /&gt;
The next editions are scheduled for:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Date&lt;br /&gt;
!Topic&lt;br /&gt;
!Registration&lt;br /&gt;
|-&lt;br /&gt;
|07.-08.05.2026&lt;br /&gt;
|Finding and Retrieving Data from PANGAEA&lt;br /&gt;
|[https://events.hifis.net/e/CWS2605 Link] (open until May 03 2026)&lt;br /&gt;
|-&lt;br /&gt;
|12.-13.11.2026&lt;br /&gt;
|FAIR Data Publications with PANGAEA&lt;br /&gt;
|[https://events.hifis.net/e/CWS2611 Link] (reg. opens Oct. 2026)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Slides and Materials for previous editions ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Date&lt;br /&gt;
!Title&lt;br /&gt;
!Slides&lt;br /&gt;
!Recordings&lt;br /&gt;
|-&lt;br /&gt;
|Nov. 2025&lt;br /&gt;
|FAIR Data Publications with PANGAEA&lt;br /&gt;
|[https://events.hifis.net/event/3172/attachments/4911/10323/202511_PANGAEA_CommunityWorkshop_DataIn_slides.pdf Link] &lt;br /&gt;
|[https://youtube.com/playlist?list=PLJpsMTFHswiaPYWCfKSxTTk1V-Hu8hDW5&amp;amp;si=iTUFnA_b_-eWEHGD @Youtube]&lt;br /&gt;
|-&lt;br /&gt;
|May 2025&lt;br /&gt;
|Finding and Retrieving Data from PANGAEA&lt;br /&gt;
|[https://events.hifis.net/event/2356/attachments/3920/8228/202505_SlideDeck_PANGAEA_CommunityWorkshop_Find&amp;amp;RetrieveData.pdf Link] &lt;br /&gt;
|n/a&lt;br /&gt;
|-&lt;br /&gt;
|Nov. 2024&lt;br /&gt;
|FAIR Data Publications with PANGAEA&lt;br /&gt;
|[https://nextcloud.awi.de/s/DRcFGpd6DMdjLja Link]&lt;br /&gt;
|[https://www.youtube.com/playlist?list=PLJpsMTFHswia0jH3XXHC-L2ZpXCYgv0xX @Youtube]&lt;br /&gt;
|-&lt;br /&gt;
|May 2024&lt;br /&gt;
|Finding and Retrieving Data from PANGAEA&lt;br /&gt;
|[https://nextcloud.awi.de/s/cfxZDarjzBj9mjr Link]&lt;br /&gt;
|[https://www.youtube.com/playlist?list=PLJpsMTFHswibjvzdC1yns2FFMVnxdiu5S @Youtube]&lt;br /&gt;
|}&lt;br /&gt;
You should also check out our [https://github.com/pangaea-data-publisher/community-workshop-material CWS GitHub repository] for training materials, Jupyter notebooks, and scripts for interacting with PANGAEA content via R and Python.&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=PANGAEA_Community_Workshops&amp;diff=16772</id>
		<title>PANGAEA Community Workshops</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=PANGAEA_Community_Workshops&amp;diff=16772"/>
		<updated>2026-04-07T13:29:50Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Next Editions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:PANGAEA Banner CWS&amp;amp;Partner sans.png|thumb|773x773px]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== About ==&lt;br /&gt;
The &#039;&#039;&#039;PANGAEA Community Workshop Series (CWS)&#039;&#039;&#039; is an established training initiative launched in 2021. It was developed to strengthen research data management (RDM) competences among users of PANGAEA and to promote good scientific practice in the preparation, publication, and reuse of research data. Combining theoretical instruction with practical, hands-on components, the workshop series provides participants with a comprehensive understanding of how to work effectively with the PANGAEA data repository. It is primarily designed for early career scientists but is open for participation by anyone interested.&lt;br /&gt;
&lt;br /&gt;
The overarching goals of the series are to raise awareness of [https://www.go-fair.org/fair-principles/ FAIR data principles] and high-quality data management, empower researchers to make best use of PANGAEA services, and improve the overall efficiency of the data publication process. By educating authors on how to prepare their data appropriately, the workshops not only enhance participants’ individual scientific visibility but also contribute to reducing editorial workload and improving productivity within PANGAEA. &lt;br /&gt;
&lt;br /&gt;
The CWS is a contribution to the [https://www.denbi.de/training de.NBI training network] and covers the two major topics relevant to the use of a data repository.&lt;br /&gt;
&lt;br /&gt;
=== FAIR Data Publications with PANGAEA ===&lt;br /&gt;
The course &#039;&#039;&#039;“FAIR Data Publications with PANGAEA”&#039;&#039;&#039;, usually held in November, focuses on the submission and publication aspects of data management. Participants gain detailed insights into how to prepare datasets and metadata for submission, understand editorial procedures, and learn how to ensure their outputs meet FAIR principles. The benefits for researchers are manifold: publishing FAIR data increases citation rates, visibility, and scientific impact, while promoting transparency and encouraging collaboration. Data authors also gain recognition independent of traditional paper authorship, ensuring proper credit for data generation and processing.&lt;br /&gt;
&lt;br /&gt;
=== Finding and Retrieving Data from PANGAEA ===&lt;br /&gt;
The second course, &#039;&#039;&#039;“Finding and Retrieving Data from PANGAEA”&#039;&#039;&#039;, is devoted to the discovery and reuse of published data and is usually offered in May. It introduces participants to the repository’s powerful search capabilities, standardized metadata structures, and integration with major data portals such as [https://explore.openaire.eu/ OpenAIRE], [https://commons.datacite.org/ DataCite], and [https://datasetsearch.research.google.com/ Google Dataset Search]. Through practical exercises, attendees learn to efficiently locate, combine, and analyze datasets using standard web protocols and scripting tools based on [https://pypi.org/project/pangaeapy/ Python] and [https://github.com/ropensci/pangaear R]. High-quality, harmonized data products curated by domain experts ensure that users can confidently employ PANGAEA data in new studies and analytical workflows, thereby fostering reproducibility, innovation, and cross-disciplinary research.&lt;br /&gt;
&lt;br /&gt;
== Format Details ==&lt;br /&gt;
The workshops are conducted &#039;&#039;&#039;online via web conference&#039;&#039;&#039;, allowing participants from around the world to join easily. Each workshop spans &#039;&#039;&#039;two days&#039;&#039;&#039;, typically held on &#039;&#039;&#039;Thursday and Friday&#039;&#039;&#039;, with &#039;&#039;&#039;two-hour sessions per day&#039;&#039;&#039;. The format is designed to be compact yet interactive, combining presentations with live demonstrations and short exercises. &#039;&#039;&#039;No prior experience or prerequisites&#039;&#039;&#039; are required—participants of all backgrounds are welcome to attend.&lt;br /&gt;
&lt;br /&gt;
== Upcoming and Previous Editions and Staying in the Loop ==&lt;br /&gt;
New editions of the &#039;&#039;&#039;PANGAEA Community Workshop Series&#039;&#039;&#039; are held regularly, typically twice a year, featuring updated content that reflects emerging developments in open science, FAIR data practices, and digital research infrastructures. Researchers interested in participating or staying informed about future events are encouraged to subscribe to the &#039;&#039;&#039;PANGAEA Trainings mailing list&#039;&#039;&#039;. The list provides timely updates on upcoming workshops and other training activities and materials. You can subscribe to the list [https://lists.pangaea.de/listinfo/training &#039;&#039;&#039;here&#039;&#039;&#039;]. &lt;br /&gt;
&lt;br /&gt;
=== Next Editions ===&lt;br /&gt;
The next editions are scheduled for:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Date&lt;br /&gt;
!Topic&lt;br /&gt;
!Registration&lt;br /&gt;
|-&lt;br /&gt;
|07.-08.05.2026&lt;br /&gt;
|Finding and Retrieving Data from PANGAEA&lt;br /&gt;
|[https://events.hifis.net/e/CWS2605 Link] (reg. open until May 03 2026)&lt;br /&gt;
|-&lt;br /&gt;
|12.-13.11.2026&lt;br /&gt;
|FAIR Data Publications with PANGAEA&lt;br /&gt;
|[https://events.hifis.net/e/CWS2611 Link] (reg. opens Oct. 2026)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Slides and Materials for previous editions ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Date&lt;br /&gt;
!Title&lt;br /&gt;
!Slides&lt;br /&gt;
!Recordings&lt;br /&gt;
|-&lt;br /&gt;
|Nov. 2025&lt;br /&gt;
|FAIR Data Publications with PANGAEA&lt;br /&gt;
|[https://events.hifis.net/event/3172/attachments/4911/10323/202511_PANGAEA_CommunityWorkshop_DataIn_slides.pdf Link] &lt;br /&gt;
|[https://youtube.com/playlist?list=PLJpsMTFHswiaPYWCfKSxTTk1V-Hu8hDW5&amp;amp;si=iTUFnA_b_-eWEHGD @Youtube]&lt;br /&gt;
|-&lt;br /&gt;
|May 2025&lt;br /&gt;
|Finding and Retrieving Data from PANGAEA&lt;br /&gt;
|[https://events.hifis.net/event/2356/attachments/3920/8228/202505_SlideDeck_PANGAEA_CommunityWorkshop_Find&amp;amp;RetrieveData.pdf Link] &lt;br /&gt;
|n/a&lt;br /&gt;
|-&lt;br /&gt;
|Nov. 2024&lt;br /&gt;
|FAIR Data Publications with PANGAEA&lt;br /&gt;
|[https://nextcloud.awi.de/s/DRcFGpd6DMdjLja Link]&lt;br /&gt;
|[https://www.youtube.com/playlist?list=PLJpsMTFHswia0jH3XXHC-L2ZpXCYgv0xX @Youtube]&lt;br /&gt;
|-&lt;br /&gt;
|May 2024&lt;br /&gt;
|Finding and Retrieving Data from PANGAEA&lt;br /&gt;
|[https://nextcloud.awi.de/s/cfxZDarjzBj9mjr Link]&lt;br /&gt;
|[https://www.youtube.com/playlist?list=PLJpsMTFHswibjvzdC1yns2FFMVnxdiu5S @Youtube]&lt;br /&gt;
|}&lt;br /&gt;
You should also check out our [https://github.com/pangaea-data-publisher/community-workshop-material CWS GitHub repository] for training materials, Jupyter notebooks, and scripts for interacting with PANGAEA content via R and Python.&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16771</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16771"/>
		<updated>2026-04-03T16:06:29Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Software Inventory */ Added revised Table&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{construction}}&lt;br /&gt;
&lt;br /&gt;
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 130 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full database backup copies (&lt;code&gt;pg_basebackup&lt;/code&gt;) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
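A weekly full backup of the kind described can be scripted around PostgreSQL's &lt;code&gt;pg_basebackup&lt;/code&gt; tool. The host, user, paths, and retention below are illustrative examples, not PANGAEA's actual configuration:&lt;br /&gt;

```shell
# Illustrative weekly base backup (host, user, paths, and retention are
# example values, not PANGAEA's actual configuration).
pg_basebackup -h db.example.internal -U replicator \
    -D /backup/pg/$(date +%Y%m%d) -Ft -z -Xs -P
# Drop backup directories older than three weeks:
find /backup/pg -maxdepth 1 -mtime +21 -type d -exec rm -r {} +
```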
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;Clickhouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly, with snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
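Signposting exposes such machine-readable alternatives through typed links in the HTTP &lt;code&gt;Link&lt;/code&gt; response header. As a minimal sketch, the snippet below parses a Link header value into its typed links; the sample header value is invented for illustration and is not an actual PANGAEA response.

```python
# Minimal parser for an HTTP Link header as used by Signposting.
# The sample header value below is invented for illustration; it is not
# an actual PANGAEA response.

def parse_link_header(value):
    """Split a Link header value into (url, rel, type) tuples."""
    links = []
    for part in value.split(","):
        segments = [s.strip() for s in part.split(";")]
        url = segments[0].strip("<> ")
        params = {}
        for seg in segments[1:]:
            if "=" in seg:
                key, val = seg.split("=", 1)
                params[key.strip()] = val.strip().strip('"')
        links.append((url, params.get("rel"), params.get("type")))
    return links

sample = ('<https://example.org/meta.jsonld>; rel="describedby"; '
          'type="application/ld+json", '
          '<https://example.org/data.tab>; rel="item"; '
          'type="text/tab-separated-values"')

for url, rel, ctype in parse_link_header(sample):
    print(rel, ctype, url)
```

A real client would read this header from the HTTP response of a DOI landing page and follow, for example, the describedby link to retrieve metadata in the advertised format.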
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
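A faceted query against such an index can be sketched with the Elasticsearch query DSL, where facet counts are returned as aggregation buckets. The field names below (&lt;code&gt;fulltext&lt;/code&gt;, &lt;code&gt;topic&lt;/code&gt;, &lt;code&gt;device&lt;/code&gt;) are hypothetical placeholders, not PANGAEA's actual index mapping.

```python
# Sketch of a faceted query body in the Elasticsearch query DSL.
# Field names ("fulltext", "topic", "device") are hypothetical placeholders.
import json

def faceted_query(text, topic=None):
    must = [{"match": {"fulltext": text}}]
    if topic:
        # A selected facet becomes an exact-match filter clause.
        must.append({"term": {"topic": topic}})
    return {
        "query": {"bool": {"must": must}},
        "aggs": {  # facet counts come back as aggregation buckets
            "by_topic": {"terms": {"field": "topic"}},
            "by_device": {"terms": {"field": "device"}},
        },
    }

body = faceted_query("sea surface temperature", topic="Oceans")
print(json.dumps(body, indent=2))
```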
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
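The embedded JSON-LD can be recovered from a landing page with any HTML parser. The sketch below uses only the Python standard library; the HTML snippet is a toy stand-in for a real landing page, not actual PANGAEA markup.

```python
# Extract embedded schema.org JSON-LD from an HTML page using only the
# standard library. The HTML snippet below is a toy stand-in for a landing page.
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            # Parse the accumulated script body as one JSON document.
            self.blocks.append(json.loads("".join(self.buf)))
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.buf.append(data)

html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body>...</body></html>"""

parser = JsonLdExtractor()
parser.feed(html)
print(parser.blocks[0]["@type"])
```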
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
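OAI-PMH harvesting is plain HTTP with query parameters. As a sketch, the snippet below constructs a &lt;code&gt;ListRecords&lt;/code&gt; request URL with the Python standard library; the endpoint URL is an assumption for illustration.

```python
# Sketch of constructing an OAI-PMH harvesting request with the standard
# library. The endpoint URL is an assumption for illustration.
from urllib.parse import urlencode

OAI_ENDPOINT = "https://ws.pangaea.de/oai/provider"  # assumed endpoint

def list_records_url(metadata_prefix, from_date=None, until_date=None):
    """Build a ListRecords request, optionally limited to a date range."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return OAI_ENDPOINT + "?" + urlencode(params)

url = list_records_url("oai_dc", from_date="2026-01-01")
print(url)
```

Fetching that URL returns an XML page of records plus a resumption token for paging through the remainder of the result set.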
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
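The weekly link check reduces to classifying the HTTP status code recorded for each external URL. A minimal sketch, with status codes supplied as test data instead of live requests:

```python
# Sketch of the weekly link-check classification: flag broken (404) and
# permanently redirected (301) links. Status codes here are supplied as
# test data; a real checker would issue HEAD requests per URL.

def classify(status):
    if status == 404:
        return "broken"
    if status == 301:
        return "redirected"
    return "ok"

checked = {
    "https://example.org/paper": 200,
    "https://example.org/old-dataset": 301,
    "https://example.org/gone": 404,
}

report = {url: classify(code) for url, code in checked.items()}
for url, verdict in sorted(report.items()):
    print(verdict, url)
```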
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ), reachable from outside but with restricted, firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, and short-cycle security patching of all hardware and software components. Professional monitoring tools cover hardware, firewalls, software, services, performance, and attacks, and all technical and non-technical staff receive regular security training. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participate in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff located in very close proximity to the datacenter (15 m, adjacent building), with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Component !! Software !! Development Model&lt;br /&gt;
|-&lt;br /&gt;
| Primary database || PostgreSQL || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Data warehouse || Clickhouse || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Search and metadata index || Elasticsearch || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system backend || Java 17 / Dropwizard || In-house development&lt;br /&gt;
|-&lt;br /&gt;
| Editorial system frontend || React / Ant Design || In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
| Source control || GitLab || Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
| CI/CD pipeline || GitLab CI/CD || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web server / reverse proxy (editorial) || Apache || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Web traffic management || NGINX || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| General web and Wiki frontend || PHP 8 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Repository services backend || Java 17 / Eclipse Jetty 12 || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Map visualization || leaflet.js || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Metadata framework || PanFMP || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| Issue tracking || JIRA (Atlassian) || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Monitoring stack || Grafana / Telegraf || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| External uptime monitoring || UptimeRobot || Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
| Analytics || Matomo || Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
| Infrastructure virtualization || VMware || Commercial&lt;br /&gt;
|-&lt;br /&gt;
| Tape archive hardware || SpectraLogic TFinity ExaScale || Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
| Python data client || pangaeapy || In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
| R data client || pangaear || Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16770</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16770"/>
		<updated>2026-04-03T16:03:22Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Updated according to new technical details introduced to the CTS re-certification 2026&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{construction}}&lt;br /&gt;
&lt;br /&gt;
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 130 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
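The effect of this normalization can be illustrated with a toy relational schema: observational context lives in its own tables and is joined back in whenever a data description is compiled. All table, column, and label names below are invented for this sketch and do not reflect PANGAEA's actual data model.

```python
# Toy illustration of a normalized model: events carry the observational
# context (where/when), and measurements reference events and parameters.
# All names and values here are invented for this sketch.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, label TEXT,
                    latitude REAL, longitude REAL);
CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT, unit TEXT);
CREATE TABLE measurement (
    event_id INTEGER REFERENCES event(id),
    parameter_id INTEGER REFERENCES parameter(id),
    value REAL
);
""")
con.execute("INSERT INTO event VALUES (1, 'EXAMPLE-1', 85.1, 134.5)")
con.execute("INSERT INTO parameter VALUES (1, 'Temperature, water', 'degC')")
con.execute("INSERT INTO measurement VALUES (1, 1, -1.8)")

# A data description is compiled dynamically by joining the context back in.
row = con.execute("""
SELECT e.label, p.name, p.unit, m.value
FROM measurement m
JOIN event e ON e.id = m.event_id
JOIN parameter p ON p.id = m.parameter_id
""").fetchone()
print(row)
```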
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event. In addition, full base backups of the database (&lt;code&gt;pg_basebackup&lt;/code&gt;) are created each weekend and retained for three weeks, providing an independent recovery layer supplementary to the streaming replica.&lt;br /&gt;
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;Clickhouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
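A spatially and temporally constrained parameter-level query of this kind can be sketched as follows; the table layout and SQL are assumptions for illustration, not the actual Clickhouse schema or the REST API's request syntax.

```python
# Sketch of a spatially and temporally constrained parameter-level query
# of the kind the data warehouse supports. The table name, columns, and
# SQL shape are assumptions for illustration only.
# NOTE: real code should use parameter binding, not string interpolation.

def warehouse_query(parameter, bbox, start, end):
    """Build a query for one parameter inside a bounding box and date range."""
    west, south, east, north = bbox
    return (
        "SELECT doi, longitude, latitude, date, value "
        "FROM measurements "
        f"WHERE parameter = '{parameter}' "
        f"AND longitude BETWEEN {west} AND {east} "
        f"AND latitude BETWEEN {south} AND {north} "
        f"AND date BETWEEN '{start}' AND '{end}'"
    )

sql = warehouse_query("Temperature, water", (-10.0, 60.0, 10.0, 80.0),
                      "2020-01-01", "2020-12-31")
print(sql)
```

Note that the result carries the DOI of each contributing dataset alongside the values, matching the provenance guarantee described above.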
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk are replicated to tape nightly and captured in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA has operated a &#039;&#039;&#039;minimal viable repository service&#039;&#039;&#039; as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. Disaster recovery and switchover procedures involving this facility are regularly exercised. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — full representations of tabular data publications, the Elasticsearch search index, and the relevant web frontend. Extension of the replica to include binary data files is under active development, with formal commitment to be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD (Schema.org), DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, and DIF/FGDC. Dissemination occurs via OAI-PMH, HTTP content negotiation following the Signposting standard (&amp;lt;nowiki&amp;gt;https://signposting.org/&amp;lt;/nowiki&amp;gt;), and other protocols. Metadata marshaling from the PostgreSQL database to the Elasticsearch search index is handled asynchronously by dedicated middleware components, which also manage automated DOI minting at DataCite upon dataset publication.&lt;br /&gt;
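Signposting exposes such machine-readable alternatives through typed links in the HTTP &lt;code&gt;Link&lt;/code&gt; response header. As a minimal sketch, the snippet below parses a Link header value into its typed links; the sample header value is invented for illustration and is not an actual PANGAEA response.

```python
# Minimal parser for an HTTP Link header as used by Signposting.
# The sample header value below is invented for illustration; it is not
# an actual PANGAEA response.

def parse_link_header(value):
    """Split a Link header value into (url, rel, type) tuples."""
    links = []
    for part in value.split(","):
        segments = [s.strip() for s in part.split(";")]
        url = segments[0].strip("<> ")
        params = {}
        for seg in segments[1:]:
            if "=" in seg:
                key, val = seg.split("=", 1)
                params[key.strip()] = val.strip().strip('"')
        links.append((url, params.get("rel"), params.get("type")))
    return links

sample = ('<https://example.org/meta.jsonld>; rel="describedby"; '
          'type="application/ld+json", '
          '<https://example.org/data.tab>; rel="item"; '
          'type="text/tab-separated-values"')

for url, rel, ctype in parse_link_header(sample):
    print(rel, ctype, url)
```

A real client would read this header from the HTTP response of a DOI landing page and follow, for example, the describedby link to retrieve metadata in the advertised format.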
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
&lt;br /&gt;
=== Web Delivery Stack ===&lt;br /&gt;
The public web infrastructure is fronted by an &#039;&#039;&#039;NGINX&#039;&#039;&#039; reverse proxy that manages incoming traffic and supports &#039;&#039;&#039;HTTP/3 (QUIC)&#039;&#039;&#039; and &#039;&#039;&#039;HTTP/2&#039;&#039;&#039; alongside HTTP/1.1 for broad client compatibility. Behind this entry point, a dual-backend architecture separates concerns by function: a &#039;&#039;&#039;PHP 8&#039;&#039;&#039; environment handles general information pages and the PANGAEA Wiki, while a &#039;&#039;&#039;Java 17&#039;&#039;&#039; application layer running on &#039;&#039;&#039;Eclipse Jetty 12&#039;&#039;&#039; manages core repository services — including dataset landing pages, DOI and handle resolution, and content negotiation for both human- and machine-readable data access. The Jetty-based layer also manages delivery of high-volume and binary files from the tape archive by staging requested files to a local hard-disk cache before serving them to users.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. PANGAEA&#039;s approach to data discovery complies with the recent recommendations of the Research Data Alliance (RDA) Interest Group &amp;quot;Data Discovery Paradigms&amp;quot; (Wu, M. et al., 2026, &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;). Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
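A faceted query against such an index can be sketched with the Elasticsearch query DSL, where facet counts are returned as aggregation buckets. The field names below (&lt;code&gt;fulltext&lt;/code&gt;, &lt;code&gt;topic&lt;/code&gt;, &lt;code&gt;device&lt;/code&gt;) are hypothetical placeholders, not PANGAEA's actual index mapping.

```python
# Sketch of a faceted query body in the Elasticsearch query DSL.
# Field names ("fulltext", "topic", "device") are hypothetical placeholders.
import json

def faceted_query(text, topic=None):
    must = [{"match": {"fulltext": text}}]
    if topic:
        # A selected facet becomes an exact-match filter clause.
        must.append({"term": {"topic": topic}})
    return {
        "query": {"bool": {"must": must}},
        "aggs": {  # facet counts come back as aggregation buckets
            "by_topic": {"terms": {"field": "topic"}},
            "by_device": {"terms": {"field": "device"}},
        },
    }

body = faceted_query("sea surface temperature", topic="Oceans")
print(json.dumps(body, indent=2))
```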
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation following the Signposting standard. Minor editorial corrections to datasets that do not affect scientific content are documented in a &amp;quot;Change history&amp;quot; section of the landing page, recording the date and a summary of each change applied.&lt;br /&gt;
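The embedded markup can be consumed with ordinary JSON tooling. A minimal sketch over an invented &lt;code&gt;schema.org/Dataset&lt;/code&gt; record (all field values are placeholders, not a real PANGAEA dataset):&lt;br /&gt;

```python
import json

# An invented schema.org/Dataset record in JSON-LD, shaped like the markup
# embedded in PANGAEA landing pages; all field values are placeholders.
jsonld = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example water temperature profile",
  "identifier": "https://doi.org/10.1594/PANGAEA.000000",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""

record = json.loads(jsonld)
print(record["name"], "->", record["identifier"])
```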
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (REST and SOAP). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
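An OAI-PMH harvesting request can be sketched as plain URL construction. The endpoint path below is an assumption for illustration; &lt;code&gt;oai_dc&lt;/code&gt; is the Dublin Core metadata prefix that every OAI-PMH provider must support:&lt;br /&gt;

```python
from urllib.parse import urlencode

# Assumed OAI-PMH endpoint path, for illustration only.
ENDPOINT = "https://ws.pangaea.de/oai/provider"

def list_records_url(metadata_prefix="oai_dc", harvested_from=None):
    """Build a ListRecords harvesting URL; the OAI-PMH "from" argument
    enables selective harvesting by record datestamp."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if harvested_from:
        params["from"] = harvested_from
    return ENDPOINT + "?" + urlencode(params)

url = list_records_url("oai_dc", "2026-01-01")
print(url)
```

Harvesters page through large result sets using the resumptionToken returned in each OAI-PMH response.&lt;br /&gt;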
&lt;br /&gt;
== Monitoring and Analytics ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
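The weekly link check's status handling can be sketched as a small classification step (the category labels are illustrative, not PANGAEA's internal terminology):&lt;br /&gt;

```python
# Map an HTTP status code to the link-check outcome described above:
# 404 marks a broken link, 301 a permanently redirected one.
def classify_link(status_code: int) -> str:
    if status_code == 404:
        return "broken"
    if status_code == 301:
        return "permanently redirected"
    if 200 <= status_code < 400:
        return "ok"
    return "needs review"

print(classify_link(404))  # broken
```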
&lt;br /&gt;
User engagement and download metrics are captured by an integrated &#039;&#039;&#039;Matomo&#039;&#039;&#039; analytics instance configured to produce usage statistics compliant with the &#039;&#039;&#039;COUNTER&#039;&#039;&#039; (Counting Online Usage of Networked Electronic Resources) standard, ensuring that data impact is measured according to international scholarly norms. Usage statistics are publicly visible on each dataset landing page; details are documented in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/Data_Usage_Statistics&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ) that is reachable from outside but still subject to restrictive firewall rules. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through the use of asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, short-cycle security patching for all hardware and software components, professional monitoring tools for hardware, firewall, software, services, performance, and attacks, and regular security training for all technical and non-technical staff. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff located in very close proximity to the datacenter (15 m, adjacent building), with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier and are transparently documented in the &amp;quot;Change history&amp;quot; section of the dataset landing page, including the date and a brief summary of each change applied.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Wu, M. et al. (2026). RDA Interest Group on Data Discovery Paradigms. &#039;&#039;Data Science Journal&#039;&#039;. &amp;lt;nowiki&amp;gt;https://doi.org/10.5334/dsj-2026-006&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Preservation_Plan&amp;diff=16769</id>
		<title>Preservation Plan</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Preservation_Plan&amp;diff=16769"/>
		<updated>2026-04-03T11:25:55Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Finalized updating&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA is committed to the long-term preservation of all data and metadata entrusted to it by the research community. This commitment extends beyond bit-level storage integrity to encompass active management of format usability, semantic consistency over time, and formally documented procedures for all stages of the archival lifecycle. The following article describes the technical and organizational measures that together constitute PANGAEA&#039;s preservation strategy. It supplements the information provided in Felden et al. (2023) and is one of the reference documents for PANGAEA&#039;s CoreTrustSeal certification.&lt;br /&gt;
== Principles ==&lt;br /&gt;
PANGAEA&#039;s preservation approach is grounded in three guiding principles. First, data are not merely stored as files but are ingested into a structured, normalized relational database that preserves the full semantic context of each measurement — ensuring that data remain interpretable independently of any external documentation. Second, preservation is an active process: PANGAEA monitors the long-term usability of archived formats and takes preventive action against obsolescence, including the creation of format-migrated copies when required. Third, institutional commitment is formally secured: the AWI/MARUM cooperation agreement (AMAR) guarantees that all archived data and metadata will remain accessible for a minimum of ten years following any formal decommissioning of PANGAEA, and that the host institutions will maintain the necessary infrastructure and expertise to honor this commitment.&lt;br /&gt;
&lt;br /&gt;
PANGAEA&#039;s ingest and archiving workflow is compliant with the Open Archival Information System (OAIS) standard (ISO 14721).&lt;br /&gt;
== Metadata Preservation ==&lt;br /&gt;
PANGAEA treats metadata as essential for the long-term reusability of data. Metadata are stored in a highly normalized PostgreSQL relational database, whose schema is modeled to be compatible with international standards including Schema.org and ISO 19115. This normalized structure allows dataset representations to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived records.&lt;br /&gt;
&lt;br /&gt;
The following metadata categories are collected and preserved for every published dataset:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Citation metadata:&#039;&#039;&#039; author and contributor names with ORCID iDs; institutional affiliations with ROR identifiers; dataset title; publication year; publisher; DOI name; resource type according to Stall et al. (2023).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Funding information:&#039;&#039;&#039; project names, grant numbers, and funder identifiers (Crossref Funder IDs or ROR identifiers).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Event information:&#039;&#039;&#039; detailed spatial and temporal coverage of sampling or measurement events, including methods, devices, and campaign context.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Related documentation:&#039;&#039;&#039; links (using DOIs or other persistent identifiers) to related scientific articles, reports, and supplementary materials. Where related documentation is not held in an external repository with a persistent identifier, PANGAEA stores a local copy in PDF/A format. PDF/A is preferred for its long-term stability; copies will be migrated to successor standards if PDF/A itself becomes obsolete.&lt;br /&gt;
&lt;br /&gt;
PANGAEA&#039;s database schema is continuously adapted to accommodate new and evolving metadata standards. When schema extensions are introduced, the metadata of existing datasets are reviewed and updated accordingly. All such changes are managed carefully to avoid incompatible modifications to existing records.&lt;br /&gt;
== Data Object Preservation ==&lt;br /&gt;
&lt;br /&gt;
=== Preservation Responsibility by Data Type ===&lt;br /&gt;
PANGAEA differentiates its preservation responsibility according to data type. For structured tabular data stored in the relational database, PANGAEA assumes full responsibility for structural and semantic integrity over time. For binary objects, PANGAEA commits to monitoring format viability and taking remedial action — including format migration — when long-term accessibility is threatened; the original submitted files are always retained regardless of any migration.&lt;br /&gt;
&lt;br /&gt;
=== Tabular Data ===&lt;br /&gt;
Except for binary objects, all submitted data values are imported into the PANGAEA relational database as structured data series. Each data entry in a data series carries metadata about its type (numeric, date/time, string), the responsible scientist (PI), the methodology applied, and, for numerical values, format information including significant digits. This structured representation decouples the data from any particular file format, ensuring long-term interpretability regardless of changes in software environments.&lt;br /&gt;
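The per-entry metadata described above can be sketched as a simple record type (field names are illustrative, not PANGAEA's internal schema):&lt;br /&gt;

```python
from dataclasses import dataclass
from typing import Optional

# Field names are illustrative, not PANGAEA's internal schema.
@dataclass
class DataSeriesEntry:
    value: str                    # serialized data value
    value_type: str               # "numeric", "date/time", or "string"
    pi: str                       # responsible scientist (PI)
    method: Optional[str] = None  # methodology applied
    fmt: Optional[str] = None     # e.g. significant digits (numeric only)

entry = DataSeriesEntry("4.2", "numeric", "J. Doe", method="CTD", fmt="#0.0")
print(entry)
```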
&lt;br /&gt;
At the time of archival, a copy of each dataset — together with a checksum and timestamp — is additionally marshaled to disk as a tab-delimited text file. These reference copies serve as the basis for fixity verification: in the event of suspected data loss or corruption, the relational database contents can be compared against these checksummed disk copies. The data submission system independently retains the original files uploaded by data providers, providing a further reference point for integrity checking.&lt;br /&gt;
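Fixity verification against such a checksummed reference copy can be sketched as follows; SHA-256 is an assumption, since the text does not name the checksum algorithm used:&lt;br /&gt;

```python
import hashlib

# A tiny tab-delimited reference copy (invented example data).
reference_copy = "Depth [m]\tTemp [degC]\n10\t4.2\n20\t3.9\n"
# Digest recorded at archival time; SHA-256 is an assumption here.
recorded_digest = hashlib.sha256(reference_copy.encode("utf-8")).hexdigest()

def verify_fixity(content: str, expected_digest: str) -> bool:
    """Recompute the digest and compare with the archived value."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest() == expected_digest

print(verify_fixity(reference_copy, recorded_digest))  # True
```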
&lt;br /&gt;
=== Binary Data ===&lt;br /&gt;
Not all data held in PANGAEA is available in tabular form. Some datasets are archived in compact, community-specific binary formats — including NetCDF files, images, video recordings, and geophysical data products. High-volume and binary files are stored on hard-disk arrays and robotic tape archives; when a user requests access to a file held on tape, it is staged to a local disk cache before delivery, ensuring that tape-based archival does not compromise accessibility for the Designated Community.&lt;br /&gt;
&lt;br /&gt;
For binary objects, long-term usability is an active responsibility: the PANGAEA team monitors software dependencies, version changes, and backward compatibility issues for all archived binary formats. Where continued readability requires it, new format-migrated copies are created and archived; the original submitted file is always retained alongside any migrated copy. Data deposited in non-preferred formats are either transformed to PANGAEA-compatible formats during editorial processing or, if transformation is not feasible, stored as static binary files with appropriate metadata documentation.&lt;br /&gt;
&lt;br /&gt;
PANGAEA applies format rules before accepting binary data for archival. Where possible, uncompressed or widely supported open formats are preferred. Currently accepted formats are:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Images:&#039;&#039;&#039; JPEG, PNG, TIFF&lt;br /&gt;
* &#039;&#039;&#039;Documents:&#039;&#039;&#039; PDF/A (preferred), ODF, OOXML&lt;br /&gt;
* &#039;&#039;&#039;Media containers:&#039;&#039;&#039; MP4, MPG, OGG, Matroska; audio and video content within containers must comply with the following codecs:&lt;br /&gt;
** &#039;&#039;Video:&#039;&#039; uncompressed, MPEG-1, MPEG-4 Part 2, AVC/H.264, H.265&lt;br /&gt;
** &#039;&#039;Audio:&#039;&#039; uncompressed, MPEG Layer III (MP3), MPEG-4 Part 3/AAC&lt;br /&gt;
* &#039;&#039;&#039;Scientific data:&#039;&#039;&#039; NetCDF, preferably using the Climate and Forecast (CF) Metadata Conventions; detailed documentation is required in all other cases&lt;br /&gt;
&lt;br /&gt;
This list is not exhaustive and is updated as community standards evolve. If any accepted format becomes deprecated or is superseded, PANGAEA will migrate archived copies to modern equivalents while retaining the originals.&lt;br /&gt;
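A pre-archival check against the accepted-format list above can be sketched as a simple extension lookup (a simplification, since acceptance of media containers also depends on the codecs they contain):&lt;br /&gt;

```python
# Extension-to-category mapping derived from the accepted-format list;
# a simplification of the actual editorial checks.
ACCEPTED = {
    "images": {".jpg", ".jpeg", ".png", ".tif", ".tiff"},
    "documents": {".pdf", ".odt", ".docx"},
    "media containers": {".mp4", ".mpg", ".ogg", ".mkv"},
    "scientific data": {".nc"},  # NetCDF
}

def accepted_category(filename: str):
    """Return the accepted-format category for a filename, or None."""
    ext = ("." + filename.rsplit(".", 1)[-1].lower()) if "." in filename else ""
    for category, extensions in ACCEPTED.items():
        if ext in extensions:
            return category
    return None

print(accepted_category("profile_2026.nc"))  # scientific data
```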
&lt;br /&gt;
Raw data — defined as level-0 data without any accompanying metadata — are not accepted for archival. Data at processing level 1 (raw data with a minimum set of metadata) may be accepted if adequate contextual information is provided. No guarantees are given for the long-term usability of level-0 or level-1 datasets. Processing levels are documented at [[Processing levels]].&lt;br /&gt;
== Storage and Physical Infrastructure ==&lt;br /&gt;
PANGAEA&#039;s storage infrastructure is operated by the computing center of the Alfred Wegener Institute (AWI) in Bremerhaven, in accordance with the AMAR cooperation agreement. PANGAEA currently uses approximately 5 TB for its relational databases and approximately 1 PB for data on tape. A comprehensive set of technical and organizational measures (TOM) is in place:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Redundant storage:&#039;&#039;&#039; all data are stored using erasure coding across disk and tape, with write caches battery-backed to ensure integrity at the point of write. Data on disk is replicated to tape nightly and saved to snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data is captured in nightly machine snapshots.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Tape archive:&#039;&#039;&#039; the central archival storage system consists of two SpectraLogic TFinity ExaScale robotic tape libraries, housed in separate buildings at AWI, with a combined capacity of up to 60 PB and using high-capacity LTO-tape drives.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Database integrity:&#039;&#039;&#039; PostgreSQL streaming replication to a dedicated backup system enables point-in-time recovery to any moment prior to a failure event. In addition, full base backups created with &lt;code&gt;pg_basebackup&lt;/code&gt; are taken each weekend and retained for three weeks, providing an additional recovery layer independent of the streaming replica.&lt;br /&gt;
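Point-in-time recovery with such a setup boils down to a few recovery settings; a sketch for PostgreSQL 12 or later (archive path and target timestamp are placeholders, and an empty &lt;code&gt;recovery.signal&lt;/code&gt; file must additionally exist in the data directory):&lt;br /&gt;

```ini
# postgresql.conf — recovery settings for point-in-time recovery
restore_command = 'cp /backup/wal_archive/%f "%p"'   # fetch archived WAL segments
recovery_target_time = '2026-04-01 11:55:00'         # replay up to just before the failure
recovery_target_action = 'promote'                   # resume normal operation at the target
```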
&lt;br /&gt;
&#039;&#039;&#039;Facility measures:&#039;&#039;&#039; fire and smoke detection systems; server room monitoring of temperature and humidity; server room air-conditioning; uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation; RAID and hard disk mirroring in the virtualization environment; user permission management; network firewall and intrusion detection systems; anti-virus email filtering.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Documentation:&#039;&#039;&#039; all systems are documented in an internal Confluence Wiki kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes. A ticket system is used to track and manage incidents.&lt;br /&gt;
&lt;br /&gt;
Hardware is typically renewed every three to four years through the AWI computing center&#039;s lifecycle management program, implemented transparently via virtualization. AWI holds support contracts for all relevant hardware and non-open-source software, with five business-day response times and one-day part replacement guarantees, ensuring that hardware failures can be remedied without extended service interruption.&lt;br /&gt;
&lt;br /&gt;
== Off-Site Replica at MARUM ==&lt;br /&gt;
Since 2025, PANGAEA has operated a minimal viable repository service as an off-site replica at MARUM/University of Bremen, hosted in the Green IT Housing Center (Rechenzentrum) of Bremen University. This facility provides geographic and institutional separation from the primary AWI infrastructure and is a key component of PANGAEA&#039;s resilience strategy against both technical failure and cyberattack scenarios. The replica enables recovery of data delivery services within 24 hours following a catastrophic failure at AWI. Disaster recovery and switchover procedures involving this facility are regularly exercised.&lt;br /&gt;
&lt;br /&gt;
The off-site installation currently covers:&lt;br /&gt;
&lt;br /&gt;
* All dataset metadata, including individual DOI landing pages and all metadata serializations available via OAI-PMH harvesting endpoints (Schema.org, DataCite, Dublin Core, DIF, ISO 19139)&lt;br /&gt;
* Full representations of tabular data publications&lt;br /&gt;
* The Elasticsearch search index&lt;br /&gt;
* The relevant web frontend&lt;br /&gt;
&lt;br /&gt;
Extension of the replica to include binary data files is under active development; a formal commitment will be finalized through the ongoing MARUM hosting agreement negotiations.&lt;br /&gt;
&lt;br /&gt;
The Green IT Housing Center maintains the following physical and technical safeguards: two separate fire sections with servers distributed across both for site redundancy; 24/7 on-site monitoring by Bremen University staff located in very close proximity to the datacenter (15 m, adjacent building); automated fire alarms with a fire brigade station less than 1 km away; redundant power supply with battery backup; and multilevel physical access control ensuring that only authorized personnel can access the physical hardware. The off-site installation operates in a fully isolated network environment (Layer 2 separation), accessible only via a firewalled VPN gateway at AWI. The number of access tokens and keys is strictly limited to the corresponding gateway host and PANGAEA DevOps staff. Replication is unidirectional from AWI to MARUM, using snapshot-based transfers executed multiple times per day.&lt;br /&gt;
&lt;br /&gt;
== Versioning and Persistent Identifiers ==&lt;br /&gt;
Every published dataset is assigned a universally unique Digital Object Identifier (DOI) minted at DataCite. DOI minting is automated as part of the archival workflow. DOI resolution is actively maintained: PANGAEA keeps its authoritative metadata records at DataCite synchronized with dataset landing pages, and all external links in metadata records are checked automatically on a weekly basis for broken (HTTP 404) or permanently redirected (HTTP 301) responses.&lt;br /&gt;
&lt;br /&gt;
New DOI names are created under clearly defined conditions:&lt;br /&gt;
&lt;br /&gt;
* A new identifier is issued upon the initial publication of each dataset.&lt;br /&gt;
* A new identifier is issued when a published dataset undergoes a substantive revision of data or metadata that would affect reproducibility or scientific interpretation. The prior version remains accessible and is cross-referenced in the metadata record of the new version.&lt;br /&gt;
* Minor editorial corrections that do not affect scientific content are applied without creating a new identifier. Such corrections are transparently documented in the “Change history” section of the dataset landing pages, including the date and a brief summary of the changes applied.&lt;br /&gt;
&lt;br /&gt;
All versions are linked in the metadata record, ensuring full traceability of the publication history for data users and citing authors.&lt;br /&gt;
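The versioning rules above can be summarized as a small decision helper (the change-type labels paraphrase the policy text; they are not an internal PANGAEA vocabulary):&lt;br /&gt;

```python
# Change-type labels paraphrase the policy text; illustrative only.
def doi_action(change: str) -> str:
    if change in ("initial publication", "substantive revision"):
        # a substantive revision gets a new DOI; the prior version
        # remains accessible and cross-referenced
        return "mint new DOI"
    if change == "minor editorial correction":
        return "keep DOI, document in change history"
    raise ValueError(f"unknown change type: {change}")

print(doi_action("substantive revision"))  # mint new DOI
```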
== Deletion and Tombstone Records ==&lt;br /&gt;
PANGAEA does not routinely delete or remove published datasets. In exceptional cases where retraction is required — for example, due to confirmed scientific error, misconduct, copyright infringement, or data privacy obligations under applicable law (e.g., GDPR Art. 17) — the following procedure applies: the data itself is made inaccessible, but the DOI and the dataset landing page are retained as a tombstone record. The tombstone record clearly indicates the dataset&#039;s status and the documented reason for its withdrawal, in accordance with DataCite&#039;s metadata practices for retracted records. All such actions are logged in the editorial system&#039;s change history. &lt;br /&gt;
&lt;br /&gt;
== Custody Transfer and Decommissioning ==&lt;br /&gt;
Should PANGAEA cease operations, the host institutions guarantee that all data and metadata will remain accessible for a minimum of ten years following any formal decommissioning. In such a scenario, only the submission and editorial system would be terminated; the database and data delivery services would remain operational. A full transition of custody to another repository could be supported by the off-site replica at MARUM as a concrete technical starting point, providing immediate access to all metadata and tabular data holdings. As a further fallback, PANGAEA can be reduced to a file-based repository: a complete file-based copy of all datasets, including binary objects, can be assembled and made available independently by AWI and/or the University of Bremen. The legal and institutional basis for these guarantees is the AWI/MARUM cooperation agreement (AMAR); a summary of its key commitments is available in the [[Continuity Plan]].&lt;br /&gt;
&lt;br /&gt;
== Community-Specific Preservation Documentation ==&lt;br /&gt;
In coordination with scientific communities, PANGAEA has developed detailed documentation on the harmonization and preservation of specific data types. These include guidance on CTD data, Thermosalinograph (TSG) underway data, and bathymetric data, among others. Where applicable, these documents contain information on format choices and long-term preservation handling specific to the relevant data type. See [[Best practice manuals and templates]] for the full collection.&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. https://doi.org/10.1038/s41597-023-02269-x&lt;br /&gt;
* Stall, S., Bilder, G., Cannon, M. et al. (2023). Journal Production Guidance for Software and Data Citations. &#039;&#039;Scientific Data&#039;&#039;, 10, 656. https://doi.org/10.1038/s41597-023-02491-7&lt;br /&gt;
* Consultative Committee for Space Data Systems (2012). Reference Model for an Open Archival Information System (OAIS). Recommended Practice CCSDS 650.0-M-2. https://public.ccsds.org/Pubs/650x0m2.pdf&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
* [[Technology]]&lt;br /&gt;
* [[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
* [[Authors Guides]]&lt;br /&gt;
* [[Format]]&lt;br /&gt;
* [[Processing levels]]&lt;br /&gt;
* [[Curation levels]]&lt;br /&gt;
* [[Best practice manuals and templates]]&lt;br /&gt;
* [[PANGAEA XML schema]]&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Preservation_Plan&amp;diff=16768</id>
		<title>Preservation Plan</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Preservation_Plan&amp;diff=16768"/>
		<updated>2026-04-03T11:01:40Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Start updating the article with recent reviewed adaptations to the CTS  certification documentation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA is committed to the long-term preservation of all data and metadata entrusted to it by the research community. This commitment extends beyond bit-level storage integrity to encompass active management of format usability, semantic consistency over time, and formally documented procedures for all stages of the archival lifecycle. The following article describes the technical and organizational measures that together constitute PANGAEA&#039;s preservation strategy. It supplements the information provided in Felden et al. (2023) and is one of the reference documents for PANGAEA&#039;s CoreTrustSeal certification.&lt;br /&gt;
== Principles ==&lt;br /&gt;
PANGAEA&#039;s preservation approach is grounded in three guiding principles. First, data are not merely stored as files but are ingested into a structured, normalized relational database that preserves the full semantic context of each measurement — ensuring that data remain interpretable independently of any external documentation. Second, preservation is an active process: PANGAEA monitors the long-term usability of archived formats and takes preventive action against obsolescence, including the creation of format-migrated copies when required. Third, institutional commitment is formally secured: the AWI/MARUM cooperation agreement (AMAR) guarantees that all archived data and metadata will remain accessible for a minimum of ten years following any formal decommissioning of PANGAEA, and that the host institutions will maintain the necessary infrastructure and expertise to honor this commitment.&lt;br /&gt;
&lt;br /&gt;
PANGAEA&#039;s ingest and archiving workflow is compliant with the Open Archival Information System (OAIS) standard (ISO 14721).&lt;br /&gt;
== Metadata Preservation ==&lt;br /&gt;
PANGAEA treats metadata as essential for the long-term reusability of data. Metadata are stored in a highly normalized PostgreSQL relational database, whose schema is modeled to be compatible with international standards including Schema.org and ISO 19115. This normalized structure allows dataset representations to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived records.&lt;br /&gt;
&lt;br /&gt;
The following metadata categories are collected and preserved for every published dataset:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Citation metadata:&#039;&#039;&#039; author and contributor names with ORCID iDs; institutional affiliations with ROR identifiers; dataset title; publication year; publisher; DOI name; resource type according to Stall et al. (2023).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Funding information:&#039;&#039;&#039; project names, grant numbers, and funder identifiers (Crossref Funder IDs or ROR identifiers).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Event information:&#039;&#039;&#039; detailed spatial and temporal coverage of sampling or measurement events, including methods, devices, and campaign context.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Related documentation:&#039;&#039;&#039; links (using DOIs or other persistent identifiers) to related scientific articles, reports, and supplementary materials. Where related documentation is not held in an external repository with a persistent identifier, PANGAEA stores a local copy in PDF/A format. PDF/A is preferred for its long-term stability; copies will be migrated to successor standards if PDF/A itself becomes obsolete.&lt;br /&gt;
&lt;br /&gt;
PANGAEA&#039;s database schema is continuously adapted to accommodate new and evolving metadata standards. When schema extensions are introduced, the metadata of existing datasets are reviewed and updated accordingly. All such changes are managed carefully to avoid incompatible modifications to existing records.&lt;br /&gt;
== Data Object Preservation ==&lt;br /&gt;
&lt;br /&gt;
=== Tabular Data ===&lt;br /&gt;
Except for binary objects, all submitted data values are imported into the PANGAEA relational database as structured data series. Each data entry carries metadata about its type (numeric, date/time, string), the responsible scientist (PI), the methodology applied, and, for numerical values, format information including significant digits. This structured representation decouples the data from any particular file format, ensuring long-term interpretability regardless of changes in software environments.&lt;br /&gt;
&lt;br /&gt;
At the time of archival, a copy of each dataset (with checksum and timestamp) is additionally marshaled to disk as a tab-delimited text file. These copies serve as reference files for integrity verification, and the tab-delimited format ensures readability without specialized software.&lt;br /&gt;
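&lt;br /&gt;
As an illustration (this is a sketch, not PANGAEA&#039;s actual tooling; file name and contents are hypothetical), such a reference copy can be verified by streaming the file through a cryptographic hash and comparing the result with the checksum recorded at archival time:&lt;br /&gt;

```python
import hashlib
import tempfile

def sha256_of(path, chunk_size=65536):
    """Stream a file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write a tiny, hypothetical tab-delimited reference copy to disk.
with tempfile.NamedTemporaryFile(suffix=".tab", delete=False) as tmp:
    tmp.write(b"Depth [m]\tTemp\n0\t12.5\n")
    ref_file = tmp.name

stored = sha256_of(ref_file)          # checksum recorded at archival time
assert sha256_of(ref_file) == stored  # later integrity verification
```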
&lt;br /&gt;
=== Binary Data ===&lt;br /&gt;
Not all data held in PANGAEA is available in tabular form. Some datasets are archived in compact, community-specific binary formats — including NetCDF files, images, video recordings, and geophysical data products. For these, long-term usability is an active responsibility: the PANGAEA team monitors software dependencies, version changes, and backward compatibility issues for all archived binary formats. Where continued readability requires it, new format-migrated copies are created and archived; the original submitted file is always retained alongside any migrated copy.&lt;br /&gt;
&lt;br /&gt;
PANGAEA applies format rules before accepting binary data for archival. Where possible, uncompressed or widely supported open formats are preferred. Currently accepted formats are:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Images:&#039;&#039;&#039; JPEG, PNG, TIFF&lt;br /&gt;
* &#039;&#039;&#039;Documents:&#039;&#039;&#039; PDF/A (preferred), ODF, OOXML&lt;br /&gt;
* &#039;&#039;&#039;Media containers:&#039;&#039;&#039; MP4, MPG, OGG, Matroska; audio and video content within containers must use one of the following codecs:&lt;br /&gt;
** &#039;&#039;Video:&#039;&#039; uncompressed, MPEG-1, MPEG-4 Part 2, AVC/H.264, H.265&lt;br /&gt;
** &#039;&#039;Audio:&#039;&#039; uncompressed, MPEG Layer III (MP3), MPEG-4 Part 3/AAC&lt;br /&gt;
* &#039;&#039;&#039;Scientific data:&#039;&#039;&#039; NetCDF, preferably using the Climate and Forecast (CF) Metadata Conventions; detailed documentation is required in all other cases&lt;br /&gt;
&lt;br /&gt;
This list is not exhaustive and is updated as community standards evolve. If any accepted format becomes deprecated or is superseded, PANGAEA will migrate archived copies to modern equivalents while retaining the originals.&lt;br /&gt;
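&lt;br /&gt;
For orientation, a minimal NetCDF header following the CF Metadata Conventions might look like the following (CDL notation as printed by ncdump; the variable names and values are illustrative assumptions, not a prescribed template):&lt;br /&gt;

```
netcdf example {
dimensions:
    time = UNLIMITED ;
variables:
    double time(time) ;
        time:standard_name = "time" ;
        time:units = "days since 2020-01-01 00:00:00" ;
    float sst(time) ;
        sst:standard_name = "sea_surface_temperature" ;
        sst:units = "K" ;
// global attributes:
    :Conventions = "CF-1.8" ;
}
```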
&lt;br /&gt;
Raw data — defined as level-0 data without any accompanying metadata — are not accepted for archival. Data at processing level 1 (raw data with a minimum set of metadata) may be accepted if adequate contextual information is provided. No guarantees are given for the long-term usability of level-0 or level-1 datasets. Processing levels are documented in [[Processing levels]].&lt;br /&gt;
== Storage and Physical Infrastructure ==&lt;br /&gt;
PANGAEA&#039;s storage infrastructure is operated by the computing center of the Alfred Wegener Institute (AWI) in Bremerhaven, in accordance with the AMAR cooperation agreement. A comprehensive set of technical and organizational measures (TOM) is in place:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Redundant storage:&#039;&#039;&#039; all data are stored using erasure coding across disk and tape, with battery-backed write caches to ensure integrity at the point of write. Data on disk is replicated to tape nightly and preserved in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data is captured in nightly machine snapshots.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Tape archive:&#039;&#039;&#039; the central archival storage system consists of two SpectraLogic TFinity ExaScale robotic tape libraries, housed in separate buildings at AWI, with a combined capacity of up to 60 PB and using high-capacity LTO-tape drives.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Database integrity:&#039;&#039;&#039; PostgreSQL streaming replication to a dedicated backup system enables point-in-time recovery to any moment prior to a failure event.&lt;br /&gt;
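&lt;br /&gt;
The PostgreSQL settings involved in such a setup look roughly as follows (illustrative values only; the paths, commands, and recovery time are assumptions, not PANGAEA&#039;s actual configuration):&lt;br /&gt;

```
# postgresql.conf on the primary: stream and archive WAL segments
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

# on the recovery system: replay archived WAL up to a chosen point in time
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2026-04-01 03:00:00+00'
```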
&lt;br /&gt;
&#039;&#039;&#039;Facility measures:&#039;&#039;&#039; fire and smoke detection systems; server room monitoring of temperature and humidity; server room air-conditioning; uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation; RAID and hard disk mirroring in the virtualization environment; user permission management; network firewall and intrusion detection systems; anti-virus email filtering.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Documentation:&#039;&#039;&#039; all systems are documented in an internal Confluence Wiki kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes. A ticket system is used to track and manage incidents.&lt;br /&gt;
&lt;br /&gt;
Hardware is typically renewed every three to four years through the AWI computing center&#039;s lifecycle management program, implemented transparently via virtualization.&lt;br /&gt;
== Off-Site Replica at MARUM ==&lt;br /&gt;
Since 2025, PANGAEA has operated an off-site replica of the relational database and web frontend at MARUM/University of Bremen, hosted in the Green IT Housing Center (Rechenzentrum) of the University of Bremen. This facility provides geographic and institutional separation from the primary AWI infrastructure and is a key component of PANGAEA&#039;s resilience strategy against both technical failure and cyberattack scenarios.&lt;br /&gt;
&lt;br /&gt;
The off-site installation currently covers:&lt;br /&gt;
&lt;br /&gt;
* All dataset metadata, including individual DOI landing pages and all metadata serializations available via harvesting endpoints (OAI-PMH, schema.org, DataCite, Dublin Core, DIF, ISO 19139)&lt;br /&gt;
* Full representations of tabular data publications&lt;br /&gt;
&lt;br /&gt;
Extension of the replica to include binary data files is planned as part of the ongoing development of this facility. The replica enables recovery of data delivery services within 24 hours following a catastrophic failure at AWI.&lt;br /&gt;
&lt;br /&gt;
The MARUM Green IT Housing Center maintains the following physical and technical safeguards: two separate fire sections with servers distributed across both for site redundancy; 24/7 on-site monitoring by Bremen University staff; automated fire alarms with a fire brigade station less than 1 km away; redundant power supply with battery backup; and multilevel physical access control. The off-site installation operates in a fully isolated network environment (Layer 2 separation), accessible only via a firewalled VPN gateway at AWI. The number of access tokens and keys is strictly limited to the corresponding gateway host and PANGAEA DevOps staff. Replication is unidirectional from AWI to MARUM, using snapshot-based transfers executed multiple times per day.&lt;br /&gt;
== Versioning and Persistent Identifiers ==&lt;br /&gt;
Every published dataset is assigned a universally unique Digital Object Identifier (DOI) minted at DataCite. DOI resolution is actively maintained: PANGAEA keeps its authoritative metadata records at DataCite synchronized with dataset landing pages, and all external links in metadata records are checked automatically on a weekly basis for broken (HTTP 404) or permanently redirected (HTTP 301) responses.&lt;br /&gt;
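&lt;br /&gt;
A sketch of such a link check (plain Python; PANGAEA&#039;s actual checker is not published, so the function names here are hypothetical) classifies HTTP responses without silently following redirects:&lt;br /&gt;

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Report redirects instead of following them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # causes an HTTPError carrying the 301/302/... status

def classify(status):
    """Map an HTTP status code to a link-health category."""
    if status in (404, 410):
        return "broken"
    if status in (301, 308):
        return "permanently redirected"
    return "ok"

def check_link(url, timeout=10):
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:  # non-2xx statuses raise here
        return classify(err.code)
```

For example, `check_link("https://doi.org/10.1594/PANGAEA.934148")` would report a redirect, since DOI resolution is itself an HTTP redirect.&lt;br /&gt;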
&lt;br /&gt;
New DOI names are created under clearly defined conditions:&lt;br /&gt;
&lt;br /&gt;
* A new identifier is issued upon the initial publication of each dataset.&lt;br /&gt;
* A new identifier is issued when a published dataset undergoes a substantive revision of data or metadata that would affect reproducibility or scientific interpretation. The prior version remains accessible and is cross-referenced in the metadata record of the new version.&lt;br /&gt;
* Minor editorial corrections that do not affect scientific content are applied without creating a new identifier. Instead, such corrections are transparently documented in the “Change history” section of the dataset landing page, including the date and a short summary of the changes applied.&lt;br /&gt;
&lt;br /&gt;
All versions are linked in the metadata record, ensuring full traceability of the publication history for data users and citing authors.&lt;br /&gt;
== Deletion and Tombstone Records ==&lt;br /&gt;
PANGAEA does not routinely delete or remove published datasets. In exceptional cases where retraction is required — for example, due to demonstrated scientific error, misconduct, copyright infringement, or data privacy obligations under applicable law (e.g., GDPR Art. 17) — the following procedure applies: the data itself is made inaccessible, but the DOI and the dataset landing page are retained as a tombstone record. The tombstone record clearly indicates the dataset&#039;s status and the documented reason for its withdrawal, in accordance with DataCite&#039;s tombstone policy. All such actions are logged in the editorial system&#039;s change history. &lt;br /&gt;
== Custody Transfer and Decommissioning ==&lt;br /&gt;
Should PANGAEA cease operations, the host institutions guarantee that all data and metadata will remain accessible for a minimum of ten years following any formal decommissioning. In such a scenario, only the submission and editorial system would be terminated; the database and data delivery services would remain operational. A full transition of custody to another repository could be supported by the off-site replica at MARUM as a concrete technical starting point. As a further fallback, PANGAEA can be reduced to a file-based repository: a complete file-based copy of all datasets, including binary objects, can be assembled and made available independently by AWI and/or the University of Bremen. The legal and institutional basis for these guarantees is the AWI/MARUM cooperation agreement (AMAR); a summary of its key commitments is available in the [[Continuity Plan]].&lt;br /&gt;
== Community-Specific Preservation Documentation ==&lt;br /&gt;
In coordination with scientific communities, PANGAEA has developed detailed documentation on the harmonization and preservation of specific data types, including guidance on CTD data, thermosalinograph (TSG) underway data, and bathymetric data. Where applicable, these documents contain information on format choices and long-term preservation handling specific to the relevant data type. See [[Best practice manuals and templates]] for the full collection.&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science.  &#039;&#039;Scientific Data&#039;&#039;, 10, 347. https://doi.org/10.1038/s41597-023-02269-x&lt;br /&gt;
* Stall, S., Bilder, G., Cannon, M. &#039;&#039;et al.&#039;&#039; (2023). Journal Production Guidance for Software and Data Citations. &#039;&#039;Scientific Data&#039;&#039;, 10, 656. https://doi.org/10.1038/s41597-023-02491-7&lt;br /&gt;
* Consultative Committee for Space Data Systems (2012). Reference Model for an Open Archival Information System (OAIS). Recommended Practice CCSDS 650.0-M-2. https://public.ccsds.org/Pubs/650x0m2.pdf&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
* [[Technology]]&lt;br /&gt;
* [[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
* [[Authors Guides]]&lt;br /&gt;
* [[Format]]&lt;br /&gt;
* [[Processing levels]]&lt;br /&gt;
* [[Curation levels]]&lt;br /&gt;
* [[Best practice manuals and templates]]&lt;br /&gt;
* [[PANGAEA XML schema]]&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Format&amp;diff=16757</id>
		<title>Format</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Format&amp;diff=16757"/>
		<updated>2026-03-31T15:09:01Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Proprietary formats */ changed wording a bit&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA accepts and publishes a wide spectrum of data formats. We classify these file formats into three categories – [[Format#Documentation|Documentation formats]], and [[Format#Tabular data|Tabular]] and [[Format#Binary data|Binary data]] formats – based on how they are treated and processed during our editorial work, and how they are represented in our data publications.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Important:&#039;&#039;&#039; If you are considering submitting data and supplementary documentation to PANGAEA, please provide open (ideally non-proprietary) formats that are widely accepted and endorsed in your scientific community, in order to support accessibility and (re-)usability over long time-scales and with openly available tools.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
As documentation we consider all files in typical text formats that supplement or further describe data submissions (whether tabular or binary), such as processing reports, instrument calibration protocols, and standard operating procedures.&lt;br /&gt;
&lt;br /&gt;
Accepted formats for documentation are:&lt;br /&gt;
* &#039;&#039;&#039;PDF/A&#039;&#039;&#039; (ISO 19005) - http://en.wikipedia.org/wiki/PDF/A&lt;br /&gt;
* &#039;&#039;&#039;RTF&#039;&#039;&#039; or &#039;&#039;&#039;ODF&#039;&#039;&#039; (ISO 26300) - http://en.wikipedia.org/wiki/OpenDocument&lt;br /&gt;
* Microsoft &#039;&#039;&#039;Office files&#039;&#039;&#039; such as &#039;&#039;&#039;*.docx&#039;&#039;&#039; (MS Word) and &#039;&#039;&#039;*.xlsx&#039;&#039;&#039; (Excel spreadsheets), compliant with the OOXML standard (ISO/IEC 29500:2008, since Office 2013) – https://de.wikipedia.org/wiki/Microsoft_Office,&lt;br /&gt;
* or (our favourite and recommended) &#039;&#039;&#039;plain UTF-8 encoded text files&#039;&#039;&#039; - https://en.wikipedia.org/wiki/UTF-8&lt;br /&gt;
&lt;br /&gt;
== Tabular data ==&lt;br /&gt;
PANGAEA is particularly well suited to tabular field observation data. Because we harmonize parameter/variable names, methods, dimensions, and units during editorial processing, and because this kind of data is stored in a relational database (PostgreSQL), users can easily compile specific parameters/variables of interest from many similar studies (i.e. related individual publications) into meta-studies targeting new research questions and contexts using our [[PANGAEA search#Data warehouse|Data Warehouse]]. Similar functionality is available through the Python module pangaeapy or its R counterpart pangaear (see our [https://www.pangaea.de/tools/ Tools site] for details).&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;preferred formats&#039;&#039;&#039; for data tables are:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;TAB-delimited TEXT-files&#039;&#039;&#039; (UTF-8 encoded), or &lt;br /&gt;
* &#039;&#039;&#039;(open) spreadsheet file formats&#039;&#039;&#039; (MS Excel .xlsx, OpenOffice &amp;amp; LibreOffice Calc .ods, etc.).&lt;br /&gt;
&lt;br /&gt;
Please note that data tables are &#039;&#039;&#039;not&#039;&#039;&#039; accepted as encapsulated objects (e.g., in .mat files).&lt;br /&gt;
&lt;br /&gt;
Example for a tabular dataset: https://doi.org/10.1594/PANGAEA.934148&lt;br /&gt;
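&lt;br /&gt;
PANGAEA&#039;s tab-delimited downloads place a metadata header between /* and */ markers ahead of the data table. A minimal reader (a sketch on a synthetic sample, not an official parser; the pangaeapy module provides a full implementation) can skip that header like this:&lt;br /&gt;

```python
import csv
import io

# Synthetic, abbreviated example of a PANGAEA-style .tab file.
SAMPLE = (
    "/* DATA DESCRIPTION:\n"
    "Citation: Example citation of the dataset\n"
    "*/\n"
    "Depth [m]\tTemp [degC]\n"
    "0\t12.5\n"
    "10\t11.9\n"
)

def read_pangaea_tab(text):
    """Return (header_row, data_rows) from a PANGAEA-style .tab file."""
    lines = text.splitlines()
    start = 0
    if lines and lines[0].startswith("/*"):
        # skip the metadata block, which ends with a line closing in */
        for i, line in enumerate(lines):
            if line.strip().endswith("*/"):
                start = i + 1
                break
    reader = csv.reader(io.StringIO("\n".join(lines[start:])), delimiter="\t")
    rows = list(reader)
    return rows[0], rows[1:]
```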
&lt;br /&gt;
== Binary data ==&lt;br /&gt;
Binary objects and documentation files are usually stored in a combination of hard-drive arrays (for immediate, performant access) and tape archives. File formats should follow ISO standards or at least &#039;&#039;de facto&#039;&#039; standards. Online preview is available for raster graphics and videos (e.g. .tif, .png, .jpeg, .mp4).&lt;br /&gt;
&lt;br /&gt;
Example: https://doi.org/10.1594/PANGAEA.936185&lt;br /&gt;
&lt;br /&gt;
=== Images ===&lt;br /&gt;
* tiff&lt;br /&gt;
** http://en.wikipedia.org/wiki/Tagged_Image_File_Format&lt;br /&gt;
** [http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=2181 ISO12639:1998]&lt;br /&gt;
* jpeg&lt;br /&gt;
** http://en.wikipedia.org/wiki/Jpeg&lt;br /&gt;
** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=18902 ISO10918-1:1994]&lt;br /&gt;
*png&lt;br /&gt;
**https://en.wikipedia.org/wiki/Portable_Network_Graphics&lt;br /&gt;
**[https://www.iso.org/standard/29581.html ISO/IEC 15948:2004]&lt;br /&gt;
&lt;br /&gt;
=== Video ===&lt;br /&gt;
see:  http://de.wikipedia.org/wiki/Digital_Video&lt;br /&gt;
* &#039;&#039;&#039;MPG Container&#039;&#039;&#039;&lt;br /&gt;
** MP3&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MP3&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45524 ISO/IEC 14496-4:2004]?&lt;br /&gt;
** MPEG2 (for PAL)&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MPEG_2&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=37680 ISO/IEC 13818-11:2004]&lt;br /&gt;
**** Software [http://www.videolan.org/ VLC media player]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;MP4 Container&#039;&#039;&#039;&lt;br /&gt;
** AAC&lt;br /&gt;
*** http://en.wikipedia.org/wiki/Advanced_Audio_Coding&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43345 ISO/IEC 13818-7:2006]&lt;br /&gt;
** MPEG-4 (for HDTV)&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MPEG-4&lt;br /&gt;
*** [http://www.iso.org/iso/search.htm?qt=14496&amp;amp;published=on&amp;amp;active_tab=standards ISO/IEC 14496]&lt;br /&gt;
&lt;br /&gt;
=== Audio ===&lt;br /&gt;
* MP3&lt;br /&gt;
* WAVE (WAV)&lt;br /&gt;
** description http://en.wikipedia.org/wiki/WAV&lt;br /&gt;
** example {{doi|10.1594/PANGAEA.339110}}&lt;br /&gt;
&lt;br /&gt;
=== Seismic data ===&lt;br /&gt;
* segy&lt;br /&gt;
&lt;br /&gt;
=== ADCP ===&lt;br /&gt;
* proprietary binary &#039;&#039;ping&#039;&#039;-format, archived on hs, linked to the metadata description in PANGAEA&lt;br /&gt;
** ping: http://currents.soest.hawaii.edu/docs/doc/codas_doc/CODAS_pingdemo.html&lt;br /&gt;
* final processed data in UTF-8, archived in &#039;&#039;data numeric&#039;&#039; of PANGAEA (file size 100-500 MB!)&lt;br /&gt;
** Example {{doi|10.1594/PANGAEA.701279}}&lt;br /&gt;
&lt;br /&gt;
=== Large array-oriented, scientific data (no models - please see our corresponding Wiki article &amp;quot;[[Model data and PANGAEA]]&amp;quot;!) ===&lt;br /&gt;
* Network Common Data Form (NetCDF), &lt;br /&gt;
** description http://en.wikipedia.org/wiki/NetCDF&lt;br /&gt;
** Unidata/NSF http://www.unidata.ucar.edu/software/netcdf/&lt;br /&gt;
** example https://doi.org/10.1594/PANGAEA.940846&lt;br /&gt;
** viewer &#039;&#039;panoply&#039;&#039; http://www.giss.nasa.gov/tools/panoply/&lt;br /&gt;
&lt;br /&gt;
=== Compression ===&lt;br /&gt;
* zip is an ISO standard and is supported – *.tar, *.rar, and *.7z are not standardized and (at least the latter two) are not supported by PANGAEA&lt;br /&gt;
&lt;br /&gt;
=== Proprietary formats ===&lt;br /&gt;
If proprietary data formats cannot be avoided, please include a reference to open source software, preferably with a DOI, that can be used to open such files (e.g. at GitHub, pypi.org).&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Format&amp;diff=16755</id>
		<title>Format</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Format&amp;diff=16755"/>
		<updated>2026-03-31T15:04:11Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: minor changes to wording&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA accepts and publishes a wide spectrum of data formats. We classify these file formats into three categories – [[Format#Documentation|Documentation formats]], and [[Format#Tabular data|Tabular]] and [[Format#Binary data|Binary data]] formats – based on how they are treated and processed during our editorial work, and how they are represented in our data publications.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Important:&#039;&#039;&#039; If you are considering submitting data and supplementary documentation to PANGAEA, please provide open (ideally non-proprietary) formats that are widely accepted and endorsed in your scientific community, in order to support accessibility and (re-)usability over long time-scales and with openly available tools.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
As documentation we consider all files in typical text formats that supplement or further describe data submissions (whether tabular or binary), such as processing reports, instrument calibration protocols, and standard operating procedures.&lt;br /&gt;
&lt;br /&gt;
Accepted formats for documentation are:&lt;br /&gt;
* &#039;&#039;&#039;PDF/A&#039;&#039;&#039; (ISO 19005) - http://en.wikipedia.org/wiki/PDF/A&lt;br /&gt;
* &#039;&#039;&#039;RTF&#039;&#039;&#039; or &#039;&#039;&#039;ODF&#039;&#039;&#039; (ISO 26300) - http://en.wikipedia.org/wiki/OpenDocument&lt;br /&gt;
* Microsoft &#039;&#039;&#039;Office files&#039;&#039;&#039; such as &#039;&#039;&#039;*.docx&#039;&#039;&#039; (MS Word) and &#039;&#039;&#039;*.xlsx&#039;&#039;&#039; (Excel spreadsheets), compliant with the OOXML standard (ISO/IEC 29500:2008, since Office 2013) – https://de.wikipedia.org/wiki/Microsoft_Office,&lt;br /&gt;
* or (our favourite and recommended) &#039;&#039;&#039;plain UTF-8 encoded text files&#039;&#039;&#039; - https://en.wikipedia.org/wiki/UTF-8&lt;br /&gt;
&lt;br /&gt;
== Tabular data ==&lt;br /&gt;
PANGAEA is particularly well suited to tabular field observation data. Because we harmonize parameter/variable names, methods, dimensions, and units during editorial processing, and because this kind of data is stored in a relational database (PostgreSQL), users can easily compile specific parameters/variables of interest from many similar studies (i.e. related individual publications) into meta-studies targeting new research questions and contexts using our [[PANGAEA search#Data warehouse|Data Warehouse]]. Similar functionality is available through the Python module pangaeapy or its R counterpart pangaear (see our [https://www.pangaea.de/tools/ Tools site] for details).&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;preferred formats&#039;&#039;&#039; for data tables are:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;TAB-delimited TEXT-files&#039;&#039;&#039; (UTF-8 encoded), or &lt;br /&gt;
* &#039;&#039;&#039;(open) spreadsheet file formats&#039;&#039;&#039; (MS Excel .xlsx, OpenOffice &amp;amp; LibreOffice Calc .ods, etc.).&lt;br /&gt;
&lt;br /&gt;
Please note that tables are &#039;&#039;&#039;not&#039;&#039;&#039; accepted as encapsulated objects (e.g., in .mat files).&lt;br /&gt;
&lt;br /&gt;
Example for a tabular dataset: https://doi.org/10.1594/PANGAEA.934148&lt;br /&gt;
&lt;br /&gt;
== Binary data ==&lt;br /&gt;
Binary objects and documentation files are usually stored in a combination of hard-drive arrays (for immediate, performant access) and tape archives. File formats should follow ISO standards or at least &#039;&#039;de facto&#039;&#039; standards. Online preview is available for raster graphics and videos (e.g. .tif, .png, .jpeg, .mp4).&lt;br /&gt;
&lt;br /&gt;
Example: https://doi.org/10.1594/PANGAEA.936185&lt;br /&gt;
&lt;br /&gt;
=== Images ===&lt;br /&gt;
* tiff&lt;br /&gt;
** http://en.wikipedia.org/wiki/Tagged_Image_File_Format&lt;br /&gt;
** [http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=2181 ISO12639:1998]&lt;br /&gt;
* jpeg&lt;br /&gt;
** http://en.wikipedia.org/wiki/Jpeg&lt;br /&gt;
** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=18902 ISO10918-1:1994]&lt;br /&gt;
*png&lt;br /&gt;
**https://en.wikipedia.org/wiki/Portable_Network_Graphics&lt;br /&gt;
**[https://www.iso.org/standard/29581.html ISO/IEC 15948:2004]&lt;br /&gt;
&lt;br /&gt;
=== Video ===&lt;br /&gt;
see:  http://de.wikipedia.org/wiki/Digital_Video&lt;br /&gt;
* &#039;&#039;&#039;MPG Container&#039;&#039;&#039;&lt;br /&gt;
** MP3&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MP3&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45524 ISO/IEC 14496-4:2004]?&lt;br /&gt;
** MPEG2 (for PAL)&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MPEG_2&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=37680 ISO/IEC 13818-11:2004]&lt;br /&gt;
**** Software [http://www.videolan.org/ VLC media player]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;MP4 Container&#039;&#039;&#039;&lt;br /&gt;
** AAC&lt;br /&gt;
*** http://en.wikipedia.org/wiki/Advanced_Audio_Coding&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43345 ISO/IEC 13818-7:2006]&lt;br /&gt;
** MPEG-4 (for HDTV)&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MPEG-4&lt;br /&gt;
*** [http://www.iso.org/iso/search.htm?qt=14496&amp;amp;published=on&amp;amp;active_tab=standards ISO/IEC 14496]&lt;br /&gt;
&lt;br /&gt;
=== Audio ===&lt;br /&gt;
* MP3&lt;br /&gt;
* WAVE (WAV)&lt;br /&gt;
** description http://en.wikipedia.org/wiki/WAV&lt;br /&gt;
** example {{doi|10.1594/PANGAEA.339110}}&lt;br /&gt;
&lt;br /&gt;
=== Seismic data ===&lt;br /&gt;
* segy&lt;br /&gt;
&lt;br /&gt;
=== ADCP ===&lt;br /&gt;
* proprietary binary &#039;&#039;ping&#039;&#039; format, archived in hierarchical storage (hs), linked to the metadata description in PANGAEA&lt;br /&gt;
** ping: http://currents.soest.hawaii.edu/docs/doc/codas_doc/CODAS_pingdemo.html&lt;br /&gt;
* final processed data as UTF-8 text, archived in the &#039;&#039;data numeric&#039;&#039; store of PANGAEA (file sizes of 100-500 MB!)&lt;br /&gt;
** Example {{doi|10.1594/PANGAEA.701279}}&lt;br /&gt;
&lt;br /&gt;
=== Large array-oriented scientific data (not model data; see our Wiki article &amp;quot;[[Model data and PANGAEA]]&amp;quot;) ===&lt;br /&gt;
* Network Common Data Form (NetCDF)&lt;br /&gt;
** description http://en.wikipedia.org/wiki/NetCDF&lt;br /&gt;
** Unidata/NSF http://www.unidata.ucar.edu/software/netcdf/&lt;br /&gt;
** example https://doi.org/10.1594/PANGAEA.940846&lt;br /&gt;
** viewer &#039;&#039;panoply&#039;&#039; http://www.giss.nasa.gov/tools/panoply/&lt;br /&gt;
&lt;br /&gt;
=== Compression ===&lt;br /&gt;
* ZIP is an ISO standard and is supported; *.tar, *.rar, and *.7z are not ISO standards and (at least the latter two) are not supported by PANGAEA&lt;br /&gt;
&lt;br /&gt;
=== Proprietary formats ===&lt;br /&gt;
Include a reference, preferably with a DOI, to open source software (e.g. at GitHub, pypi.org) that can be used to open such files.&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Format&amp;diff=16754</id>
		<title>Format</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Format&amp;diff=16754"/>
		<updated>2026-03-31T14:55:59Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: /* Tabular data */ Den Bandwurmsatz rund ums DataWareHouse gekürzt&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PANGAEA accepts and publishes a wide spectrum of data formats. We classify these file formats into three categories – [[Format#Documentation|Documentation formats]], and [[Format#Tabular data|Tabular]] and [[Format#Binary data|Binary data]] formats – based on how they are treated during editorial processing and how they are represented in our data publications.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Important:&#039;&#039;&#039; If you are considering submitting data and supplementary documentation to PANGAEA, please provide open (ideally non-proprietary) formats that are widely accepted and endorsed in your scientific community, so that the data remain accessible and (re)usable on long time scales and with openly available tools.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
As documentation we consider all files in typical text formats meant to supplement or further describe a data submission (whether tabular or binary), such as processing reports, instrument calibration protocols, and standard operating procedures.&lt;br /&gt;
&lt;br /&gt;
Accepted formats for documentation are:&lt;br /&gt;
* &#039;&#039;&#039;PDF&#039;&#039;&#039;/A (ISO19005) - http://en.wikipedia.org/wiki/PDF/A&lt;br /&gt;
* &#039;&#039;&#039;RTF&#039;&#039;&#039; or &#039;&#039;&#039;ODF&#039;&#039;&#039; (ISO26300) - http://en.wikipedia.org/wiki/OpenDocument&lt;br /&gt;
* MS Office files - standard OOXML (ISO/IEC 29500:2008, used since Office 2013 - https://de.wikipedia.org/wiki/Microsoft_Office), e.g. &#039;&#039;&#039;*.docx&#039;&#039;&#039; for Word documents and *.xlsx for Excel spreadsheets&lt;br /&gt;
* or (our favourite and recommended) &#039;&#039;&#039;plain UTF-8 encoded text files&#039;&#039;&#039; - https://en.wikipedia.org/wiki/UTF-8&lt;br /&gt;
&lt;br /&gt;
== Tabular data ==&lt;br /&gt;
PANGAEA is particularly well suited to tabular field observation data. Because we harmonize parameter/variable names, methods, dimensions, and units during editorial processing, and because this kind of data is stored in a relational database (PostgreSQL), users can easily compile specific parameters/variables of interest from many similar studies (i.e. related individual publications) into meta-studies targeting new research questions and contexts using our [[PANGAEA search#Data warehouse|Data Warehouse]]. Similar functionality is available through the Python module pangaeapy and its R counterpart pangaear (see our [https://www.pangaea.de/tools/ Tools site] for details).&lt;br /&gt;
&lt;br /&gt;
The preferred &#039;&#039;&#039;formats&#039;&#039;&#039; for data tables are&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;TAB-delimited TEXT-files&#039;&#039;&#039; (UTF-8 encoded), or &lt;br /&gt;
* &#039;&#039;&#039;(open) spreadsheet file formats&#039;&#039;&#039; (MS Excel .xlsx, OpenOffice &amp;amp; LibreOffice Calc .ods, etc.).&lt;br /&gt;
&lt;br /&gt;
Please note that tables are &#039;&#039;&#039;not&#039;&#039;&#039; accepted as encapsulated objects (e.g., in .mat files).&lt;br /&gt;
&lt;br /&gt;
Example for a tabular dataset: https://doi.org/10.1594/PANGAEA.934148&lt;br /&gt;
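Downloaded tab-delimited text files typically carry a citation/metadata header in a comment block between &lt;code&gt;/*&lt;/code&gt; and &lt;code&gt;*/&lt;/code&gt;, followed by the tab-separated table itself. The following sketch shows how such a file could be split and parsed with the Python standard library; the sample content is made up for illustration and the exact header layout should be checked against a real download:

```python
import csv
import io

def read_pangaea_tab(text):
    """Split a tab-delimited text file into its metadata header
    (the comment block between '/*' and '*/') and the data table
    that follows. Layout assumed from typical .tab downloads."""
    header, table = text.split("*/", 1)
    header = header.lstrip().removeprefix("/*").strip()
    rows = list(csv.reader(io.StringIO(table.strip()), delimiter="\t"))
    columns = rows[0]
    records = [dict(zip(columns, row)) for row in rows[1:]]
    return header, records

# Tiny, invented sample in the same layout (not real PANGAEA data)
sample = (
    "/* DATA DESCRIPTION:\n"
    "Title: Example dataset\n"
    "*/\n"
    "Depth [m]\tTemp [deg C]\n"
    "0\t4.2\n"
    "10\t3.9\n"
)
header, records = read_pangaea_tab(sample)
```

Values are returned as strings here; a real workflow would convert numeric columns (e.g. with pandas), which pangaeapy does automatically.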
&lt;br /&gt;
== Binary data ==&lt;br /&gt;
Binary objects and documentation files are usually stored in a combination of hard-drive arrays (for immediate, performant access) and tape archives. File formats should follow ISO standards or at least &#039;&#039;de facto&#039;&#039; standards. An online preview is available for raster graphics and videos (e.g. .tif, .png, .jpeg, .mp4).&lt;br /&gt;
&lt;br /&gt;
Example: https://doi.org/10.1594/PANGAEA.936185&lt;br /&gt;
&lt;br /&gt;
=== Images ===&lt;br /&gt;
* tiff&lt;br /&gt;
** http://en.wikipedia.org/wiki/Tagged_Image_File_Format&lt;br /&gt;
** [http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=2181 ISO12639:1998]&lt;br /&gt;
* jpeg&lt;br /&gt;
** http://en.wikipedia.org/wiki/Jpeg&lt;br /&gt;
** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=18902 ISO10918-1:1994]&lt;br /&gt;
*png&lt;br /&gt;
**https://en.wikipedia.org/wiki/Portable_Network_Graphics&lt;br /&gt;
**[https://www.iso.org/standard/29581.html ISO/IEC 15948:2004]&lt;br /&gt;
&lt;br /&gt;
=== Video ===&lt;br /&gt;
see:  http://de.wikipedia.org/wiki/Digital_Video&lt;br /&gt;
* &#039;&#039;&#039;MPG Container&#039;&#039;&#039;&lt;br /&gt;
** MP3&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MP3&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=45524 ISO/IEC 14496-4:2004]?&lt;br /&gt;
** MPEG2 (for PAL)&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MPEG_2&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=37680 ISO/IEC 13818-11:2004]&lt;br /&gt;
**** Software [http://www.videolan.org/ VLC media player]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;MP4 Container&#039;&#039;&#039;&lt;br /&gt;
** AAC&lt;br /&gt;
*** http://en.wikipedia.org/wiki/Advanced_Audio_Coding&lt;br /&gt;
*** [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43345 ISO/IEC 13818-7:2006]&lt;br /&gt;
** MPEG-4 (for HDTV)&lt;br /&gt;
*** http://en.wikipedia.org/wiki/MPEG-4&lt;br /&gt;
*** [http://www.iso.org/iso/search.htm?qt=14496&amp;amp;published=on&amp;amp;active_tab=standards ISO/IEC 14496]&lt;br /&gt;
&lt;br /&gt;
=== Audio ===&lt;br /&gt;
* MP3&lt;br /&gt;
* WAVE (WAV)&lt;br /&gt;
** description http://en.wikipedia.org/wiki/WAV&lt;br /&gt;
** example {{doi|10.1594/PANGAEA.339110}}&lt;br /&gt;
&lt;br /&gt;
=== Seismic data ===&lt;br /&gt;
* segy&lt;br /&gt;
&lt;br /&gt;
=== ADCP ===&lt;br /&gt;
* proprietary binary &#039;&#039;ping&#039;&#039; format, archived in hierarchical storage (hs), linked to the metadata description in PANGAEA&lt;br /&gt;
** ping: http://currents.soest.hawaii.edu/docs/doc/codas_doc/CODAS_pingdemo.html&lt;br /&gt;
* final processed data as UTF-8 text, archived in the &#039;&#039;data numeric&#039;&#039; store of PANGAEA (file sizes of 100-500 MB!)&lt;br /&gt;
** Example {{doi|10.1594/PANGAEA.701279}}&lt;br /&gt;
&lt;br /&gt;
=== Large array-oriented scientific data (not model data; see our Wiki article &amp;quot;[[Model data and PANGAEA]]&amp;quot;) ===&lt;br /&gt;
* Network Common Data Form (NetCDF)&lt;br /&gt;
** description http://en.wikipedia.org/wiki/NetCDF&lt;br /&gt;
** Unidata/NSF http://www.unidata.ucar.edu/software/netcdf/&lt;br /&gt;
** example https://doi.org/10.1594/PANGAEA.940846&lt;br /&gt;
** viewer &#039;&#039;panoply&#039;&#039; http://www.giss.nasa.gov/tools/panoply/&lt;br /&gt;
&lt;br /&gt;
=== Compression ===&lt;br /&gt;
* ZIP is an ISO standard and is supported; *.tar, *.rar, and *.7z are not ISO standards and (at least the latter two) are not supported by PANGAEA&lt;br /&gt;
&lt;br /&gt;
=== Proprietary formats ===&lt;br /&gt;
Include a reference, preferably with a DOI, to open source software (e.g. at GitHub, pypi.org) that can be used to open such files.&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16715</id>
		<title>Technology</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Technology&amp;diff=16715"/>
		<updated>2026-03-26T17:10:33Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Edited label&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{construction}}&lt;br /&gt;
&lt;br /&gt;
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system&#039;s founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system&#039;s long-term stability as a trustworthy data repository.&lt;br /&gt;
&lt;br /&gt;
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview oriented toward the general structure and key components.&lt;br /&gt;
&lt;br /&gt;
== System Architecture ==&lt;br /&gt;
&lt;br /&gt;
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hard- and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on virtual machines (VMware) operated with Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal on a typical cycle of three to four years.&lt;br /&gt;
[[File:PANGAEA overview architecture.png|alt=Schema illustrating the system architecture according to Felden et al., 2023|thumb|Schema illustrating the system architecture according to Felden et al., 2023|none|600x600px]]&lt;br /&gt;
&lt;br /&gt;
PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 130 TB, with approximately 1 PB of data stored on tape.&lt;br /&gt;
&lt;br /&gt;
== Backend: Storage and Databases ==&lt;br /&gt;
&lt;br /&gt;
=== Relational Database ===&lt;br /&gt;
The primary store for all structured data and metadata in PANGAEA is a &#039;&#039;&#039;PostgreSQL&#039;&#039;&#039; relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.&lt;br /&gt;
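The idea of the normalized model can be illustrated with a drastically simplified, hypothetical schema: table and column names below are invented and do not reflect the actual PANGAEA database, but the join-based recovery of observational context is the same principle.

```python
import sqlite3

# Hypothetical, drastically simplified sketch of a normalized model;
# real PANGAEA table and column names differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE campaign (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    campaign_id INTEGER REFERENCES campaign(id),
    latitude REAL, longitude REAL, date TEXT);
CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT, unit TEXT);
CREATE TABLE measurement (
    event_id INTEGER REFERENCES event(id),
    parameter_id INTEGER REFERENCES parameter(id),
    value REAL);
""")
con.execute("INSERT INTO campaign VALUES (1, 'PS122')")
con.execute("INSERT INTO event VALUES (1, 1, 85.1, 120.3, '2020-01-15')")
con.execute("INSERT INTO parameter VALUES (1, 'Temp', 'deg C')")
con.execute("INSERT INTO measurement VALUES (1, 1, -1.8)")

# Context (campaign, position, parameter, unit) is stored once and
# recovered by joining, not duplicated on every data row.
row = con.execute("""
    SELECT c.label, e.latitude, p.name, p.unit, m.value
    FROM measurement m
    JOIN event e ON m.event_id = e.id
    JOIN campaign c ON e.campaign_id = c.id
    JOIN parameter p ON m.parameter_id = p.id
""").fetchone()
```

Because the context lives in its own tables, a serializer can walk these relations to emit any output format without touching the archived values.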
&lt;br /&gt;
Database integrity is continuously maintained through &#039;&#039;&#039;PostgreSQL streaming replication&#039;&#039;&#039; to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event.&lt;br /&gt;
&lt;br /&gt;
=== Data Warehouse ===&lt;br /&gt;
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a &#039;&#039;&#039;Clickhouse&#039;&#039;&#039; data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.&lt;br /&gt;
&lt;br /&gt;
=== Archival Storage ===&lt;br /&gt;
For high-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — data are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two &#039;&#039;&#039;SpectraLogic TFinity ExaScale&#039;&#039;&#039; robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO-tape drives within this environment.&lt;br /&gt;
&lt;br /&gt;
All data are stored redundantly using &#039;&#039;&#039;erasure coding&#039;&#039;&#039; across disk and tape. Data on disk is replicated to tape nightly and saved to snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data is captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.&lt;br /&gt;
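The production erasure-coding scheme is not specified here; the simplest possible instance of the idea is single-parity XOR, sketched below purely as an illustration of why redundant coding lets one lost copy be rebuilt from the survivors:

```python
def xor_parity(blocks):
    """Compute an XOR parity block over equal-length blocks: with the
    parity, any single lost block can be reconstructed from the rest
    (the simplest erasure code; real schemes tolerate more losses)."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, block))
    return parity

blocks = [b"data-one", b"data-two", b"data-thr"]
parity = xor_parity(blocks)

# Simulate losing the second block and rebuilding it from the
# surviving blocks plus the parity block.
survivors = [blocks[0], blocks[2], parity]
rebuilt = xor_parity(survivors)
```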
&lt;br /&gt;
=== Off-Site Replica ===&lt;br /&gt;
Since 2025, PANGAEA operates an &#039;&#039;&#039;off-site replica&#039;&#039;&#039; of the relational database and web frontend at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — as well as full representations of tabular data publications. Extension of the replica to include binary data files is planned as a further development of this facility.&lt;br /&gt;
&lt;br /&gt;
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.&lt;br /&gt;
== Middleware: Processing and Transformation ==&lt;br /&gt;
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.&lt;br /&gt;
&lt;br /&gt;
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via &#039;&#039;&#039;XSLT and XML-to-JSON pipelines&#039;&#039;&#039; into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD according to schema.org, DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, DIF (Directory Interchange Format), and Darwin Core. Dissemination occurs via OAI-PMH, HTTP content negotiation based on HTTP standards and technical FAIR recommendations, and other protocols.&lt;br /&gt;
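The marshal-and-transform step can be sketched as a toy mapping from an internal record onto Dublin Core terms. The element names and the DOI below are invented for illustration and are not the actual PANGAEA internal XML schema; the real pipeline uses XSLT rather than hand-written Python:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the internal metadata record (not the real
# PANGAEA internal schema).
internal_xml = """
<dataset>
  <title>Example dataset</title>
  <author><lastName>Doe</lastName><firstName>Jane</firstName></author>
  <doi>10.1594/PANGAEA.000000</doi>
</dataset>
"""

def to_dublin_core(xml_text):
    """Map a few fields of the internal record onto Dublin Core terms."""
    root = ET.fromstring(xml_text)
    author = root.find("author")
    return {
        "dc:title": root.findtext("title"),
        "dc:creator": "{}, {}".format(
            author.findtext("lastName"), author.findtext("firstName")),
        "dc:identifier": "https://doi.org/" + root.findtext("doi"),
    }

record = to_dublin_core(internal_xml)
```

Each supported output standard (DataCite, ISO 19139, DIF, ...) is one such mapping from the same internal representation.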
&lt;br /&gt;
The marshaled metadata are stored and indexed in &#039;&#039;&#039;Elasticsearch&#039;&#039;&#039;, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.&lt;br /&gt;
&lt;br /&gt;
The flexible metadata framework &#039;&#039;&#039;PanFMP&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://www.panfmp.org/&amp;lt;/nowiki&amp;gt;) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.&lt;br /&gt;
&lt;br /&gt;
Data submissions, user requests, and bug reports are managed through a &#039;&#039;&#039;JIRA&#039;&#039;&#039; (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.&lt;br /&gt;
== Frontend: Editorial System ==&lt;br /&gt;
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a &#039;&#039;&#039;web-based client/server application&#039;&#039;&#039; developed entirely in-house, operating directly on the PostgreSQL databases.&lt;br /&gt;
&lt;br /&gt;
The backend of the editorial system is built on &#039;&#039;&#039;Java 17&#039;&#039;&#039; using the &#039;&#039;&#039;Dropwizard&#039;&#039;&#039; framework, exposing a REST API. The frontend is implemented in &#039;&#039;&#039;React&#039;&#039;&#039; with the &#039;&#039;&#039;Ant Design&#039;&#039;&#039; component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional &#039;&#039;&#039;GitLab&#039;&#039;&#039; instance hosted at AWI, structured as an &#039;&#039;&#039;Nx monorepo&#039;&#039;&#039;. The repository encompasses automated unit tests and &#039;&#039;&#039;Cypress&#039;&#039;&#039; end-to-end tests. All new versions are deployed exclusively through a &#039;&#039;&#039;GitLab CI/CD pipeline&#039;&#039;&#039;, with releases gated on successful completion of the full automated test suite.&lt;br /&gt;
&lt;br /&gt;
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. &#039;&#039;&#039;Apache&#039;&#039;&#039; serves the React frontend and acts as a reverse proxy to the backend services. Four parallel instances are operated simultaneously, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.&lt;br /&gt;
&lt;br /&gt;
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a &#039;&#039;&#039;Terminology Catalogue (TC)&#039;&#039;&#039;, which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).&lt;br /&gt;
== Frontend: Public Web Interface and Search ==&lt;br /&gt;
The public PANGAEA website and search interface is served through Apache web servers in the frontend tier, backed by the Elasticsearch index for fast metadata retrieval.&lt;br /&gt;
&lt;br /&gt;
=== Search ===&lt;br /&gt;
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. Search documentation is available in the Wiki at &amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/wiki/PANGAEA_search&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Map Search ===&lt;br /&gt;
Geographic search and visualization are implemented using &#039;&#039;&#039;leaflet.js&#039;&#039;&#039;, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.&lt;br /&gt;
&lt;br /&gt;
=== Dataset Landing Pages ===&lt;br /&gt;
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured &#039;&#039;&#039;schema.org&#039;&#039;&#039; markup in JSON-LD, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation.&lt;br /&gt;
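A consumer of the embedded markup only needs a JSON parser. The snippet below is a minimal, hand-written example in the style of a landing page's JSON-LD block (not copied from a real PANGAEA page; the DOI is a placeholder):

```python
import json

# Minimal, invented schema.org Dataset snippet for illustration.
jsonld = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example dataset",
  "identifier": "https://doi.org/10.1594/PANGAEA.000000",
  "creator": {"@type": "Person", "name": "Jane Doe"}
}
"""
meta = json.loads(jsonld)
citation = "{}: {}. {}".format(
    meta["creator"]["name"], meta["name"], meta["identifier"])
```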
&lt;br /&gt;
=== Programmatic Access ===&lt;br /&gt;
PANGAEA offers programmatic access to data and metadata through a range of web services (SOAP and REST). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for &#039;&#039;&#039;Python&#039;&#039;&#039; (pangaeapy, developed by PANGAEA) and &#039;&#039;&#039;R&#039;&#039;&#039; (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.&lt;br /&gt;
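An OAI-PMH harvesting request is just a URL with protocol-defined arguments. The verbs and argument names below follow the OAI-PMH 2.0 specification; the endpoint URL and the &lt;code&gt;datacite4&lt;/code&gt; metadata prefix are assumptions that should be verified against the current PANGAEA service documentation:

```python
from urllib.parse import urlencode

# Assumed endpoint; check against current PANGAEA documentation.
OAI_ENDPOINT = "https://ws.pangaea.de/oai/provider"

def list_records_url(metadata_prefix="oai_dc", **kwargs):
    """Build a ListRecords harvesting request (OAI-PMH 2.0 arguments,
    e.g. from/until for selective harvesting by datestamp)."""
    params = {"verb": "ListRecords",
              "metadataPrefix": metadata_prefix, **kwargs}
    return OAI_ENDPOINT + "?" + urlencode(params)

# 'from' is a Python keyword, so pass it via an unpacked dict.
url = list_records_url("datacite4", **{"from": "2026-01-01"})
```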
== Monitoring ==&lt;br /&gt;
Service health across the PANGAEA infrastructure is monitored through the &#039;&#039;&#039;AWI Grafana/Telegraf&#039;&#039;&#039; stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via &#039;&#039;&#039;UptimeRobot&#039;&#039;&#039;, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.&lt;br /&gt;
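The two response classes the weekly link check looks for can be expressed as a small classifier; this is a simplified re-implementation for illustration, not the production checker:

```python
def classify_link(status):
    """Flag the cases named above: broken (404) and permanently
    redirected (301) links; everything else passes."""
    if status == 404:
        return "broken"
    if status == 301:
        return "permanently redirected"
    return "ok"

# Hypothetical results of one checking pass (URLs are placeholders).
report = {url: classify_link(code) for url, code in {
    "https://example.org/a": 200,
    "https://example.org/b": 404,
    "https://example.org/c": 301,
}.items()}
```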
== Security ==&lt;br /&gt;
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ), reachable from outside through restricted, firewalled access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.&lt;br /&gt;
&lt;br /&gt;
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Security of the technical infrastructure is further maintained through the use of asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, short-cycle security patching for all hardware and software components, professional monitoring tools for hardware, firewall, software, services, performance, and attacks, and regular security training for all technical and non-technical staff. Security risks are assessed on an ongoing basis by AWI&#039;s institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participates in regular security reviews.&lt;br /&gt;
&lt;br /&gt;
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remains publicly accessible in all cases; access to the data itself is restricted to authorized users at the individual level for the duration of the moratorium, typically a maximum of two years from submission.&lt;br /&gt;
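The application-layer rule can be sketched as a single predicate; the function and parameter names below are invented for illustration and do not reflect the actual implementation:

```python
from datetime import date

def data_accessible(moratorium_until, user_authorized, today):
    """Sketch of the access rule described above (names invented):
    metadata stays public in all cases; the data itself requires
    individual authorization until the moratorium expires."""
    if moratorium_until is None or today >= moratorium_until:
        return True
    return user_authorized

# During the moratorium, only authorized users can retrieve the data;
# after expiry, access is open to everyone.
during = data_accessible(date(2027, 1, 1), False, today=date(2026, 6, 1))
after = data_accessible(date(2027, 1, 1), False, today=date(2027, 2, 1))
```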
&lt;br /&gt;
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff, with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.&lt;br /&gt;
== Software Inventory ==&lt;br /&gt;
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Component&lt;br /&gt;
!Software&lt;br /&gt;
!Development Model&lt;br /&gt;
|-&lt;br /&gt;
|Primary database&lt;br /&gt;
|PostgreSQL&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|Data warehouse&lt;br /&gt;
|Clickhouse&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|Search and metadata index&lt;br /&gt;
|Elasticsearch&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|Editorial system backend&lt;br /&gt;
|Java 17 / Dropwizard&lt;br /&gt;
|In-house development&lt;br /&gt;
|-&lt;br /&gt;
|Editorial system frontend&lt;br /&gt;
|React / Ant Design&lt;br /&gt;
|In-house development (open source framework)&lt;br /&gt;
|-&lt;br /&gt;
|Source control&lt;br /&gt;
|GitLab&lt;br /&gt;
|Open source (community-supported, self-hosted)&lt;br /&gt;
|-&lt;br /&gt;
|CI/CD pipeline&lt;br /&gt;
|GitLab CI/CD&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|Web server / reverse proxy&lt;br /&gt;
|Apache&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|Map visualization&lt;br /&gt;
|leaflet.js&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|Metadata framework&lt;br /&gt;
|PanFMP&lt;br /&gt;
|In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
|Issue tracking&lt;br /&gt;
|JIRA (Atlassian)&lt;br /&gt;
|Commercial&lt;br /&gt;
|-&lt;br /&gt;
|Monitoring stack&lt;br /&gt;
|Grafana / Telegraf&lt;br /&gt;
|Open source (community-supported)&lt;br /&gt;
|-&lt;br /&gt;
|External uptime monitoring&lt;br /&gt;
|UptimeRobot&lt;br /&gt;
|Commercial SaaS&lt;br /&gt;
|-&lt;br /&gt;
|Infrastructure virtualization&lt;br /&gt;
|VMware&lt;br /&gt;
|Commercial&lt;br /&gt;
|-&lt;br /&gt;
|Tape archive hardware&lt;br /&gt;
|SpectraLogic TFinity ExaScale&lt;br /&gt;
|Commercial hardware&lt;br /&gt;
|-&lt;br /&gt;
|Python data client&lt;br /&gt;
|pangaeapy&lt;br /&gt;
|In-house development (open source)&lt;br /&gt;
|-&lt;br /&gt;
|R data client&lt;br /&gt;
|pangaear&lt;br /&gt;
|Community development (open source)&lt;br /&gt;
|}&lt;br /&gt;
== Documentation and Change Management ==&lt;br /&gt;
General documentation of PANGAEA systems and services is maintained in the public &#039;&#039;&#039;PANGAEA Wiki&#039;&#039;&#039; (&amp;lt;nowiki&amp;gt;https://wiki.pangaea.de/&amp;lt;/nowiki&amp;gt;). A separate internal &#039;&#039;&#039;Confluence&#039;&#039;&#039; Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.&lt;br /&gt;
&lt;br /&gt;
All changes to published data and metadata are recorded in the editorial system&#039;s version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier.&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. &amp;amp; Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth &amp;amp; Environmental Science. &#039;&#039;Scientific Data&#039;&#039;, 10, 347. &amp;lt;nowiki&amp;gt;https://doi.org/10.1038/s41597-023-02269-x&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
* Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. &amp;amp; Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. &#039;&#039;Journal of Biotechnology&#039;&#039;, 261, 177–186. &amp;lt;nowiki&amp;gt;https://doi.org/10.1016/j.jbiotec.2017.07.016&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
*[[PANGAEA search]]&lt;br /&gt;
*[[Data submission]]&lt;br /&gt;
*[[Authors Guides]]&lt;br /&gt;
*[[Curation levels]]&lt;br /&gt;
*[[Processing levels]]&lt;br /&gt;
*[[Data Usage Statistics]]&lt;br /&gt;
*[[Format]]&lt;br /&gt;
*[[PANGAEA XML schema]]&lt;br /&gt;
*[[Best practice manuals and templates]]&lt;br /&gt;
*[[Continuity Plan]] — https://www.pangaea.de/about/continuity.php&lt;br /&gt;
*[[Preservation Plan]] — https://www.pangaea.de/about/preservation.php&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Version&amp;diff=16711</id>
		<title>Version</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Version&amp;diff=16711"/>
		<updated>2026-03-26T17:07:05Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Lmoeller moved page Version to Intern:Version&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Intern:Version]]&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Data_Seal_of_Approval&amp;diff=16708</id>
		<title>Data Seal of Approval</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Data_Seal_of_Approval&amp;diff=16708"/>
		<updated>2026-03-26T17:05:55Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Lmoeller moved page Data Seal of Approval to Intern:Data Seal of Approval&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Intern:Data Seal of Approval]]&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=Portal&amp;diff=16703</id>
		<title>Portal</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=Portal&amp;diff=16703"/>
		<updated>2026-03-26T16:41:43Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Lmoeller moved page Portal to Intern:Portal: outdated and content better explained elsewhere&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Intern:Portal]]&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=DOI&amp;diff=16698</id>
		<title>DOI</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=DOI&amp;diff=16698"/>
		<updated>2026-03-26T16:13:02Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: deleted link to 3 sentence wiki Backup article&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Digital Object Identifier (DOI) ==&lt;br /&gt;
&lt;br /&gt;
DOIs provide [[persistent identifier|persistent links]] to scholarly content, helping users reach the authoritative, published version of the content they are looking for, even when that content changes location or ownership. With about 35 million DOIs registered for publications (as of 2009), the system is well established and widely used by scientific publishers and organisations.&lt;br /&gt;
&lt;br /&gt;
Through the project &#039;&#039;&#039;[[STD-DOI]]&#039;&#039;&#039;, the TIB (German National Library of Science and Technology, Hannover) was established as a registration agency for &#039;&#039;&#039;data DOIs&#039;&#039;&#039;. Among four data providers, PANGAEA was the first system to use DOIs for the automated persistent identification of data sets. A data DOI has the prefix &#039;&#039;&#039;10.1594&#039;&#039;&#039;, which the TIB assigns to publications of primary data. The suffix, separated by a slash, is composed of the acronym of the data system or center and a system-specific part. In a PANGAEA DOI, this part is equivalent to the &#039;&#039;&#039;internal ID&#039;&#039;&#039; automatically assigned to a data set by the relational database management system during import; this assures the uniqueness of each DOI.&lt;br /&gt;
&lt;br /&gt;
A valid PANGAEA DOI has the syntax &#039;&#039;&#039;10.1594/PANGAEA.738357&#039;&#039;&#039;&lt;br /&gt;
* spelled {{doi|10.1594/PANGAEA.738357}}&lt;br /&gt;
* and resolved as https://doi.org/10.1594/PANGAEA.738357&lt;br /&gt;
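&lt;br /&gt;
The DOI structure described above can be sketched in a few lines of Python (an illustrative example, not an official PANGAEA tool; the helper name &#039;&#039;parse_pangaea_doi&#039;&#039; is hypothetical):&lt;br /&gt;

```python
def parse_pangaea_doi(doi: str) -> dict:
    """Split a DOI of the form 10.1594/PANGAEA.738357 into its parts."""
    prefix, suffix = doi.split("/", 1)           # prefix assigned via the TIB: 10.1594
    acronym, internal_id = suffix.split(".", 1)  # center acronym + internal ID
    return {
        "prefix": prefix,
        "acronym": acronym,
        "internal_id": internal_id,
        "global_url": "https://doi.org/" + doi,          # global DOI resolver
        "pangaea_url": "https://doi.pangaea.de/" + doi,  # PANGAEA's own resolver
    }
```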
----&lt;br /&gt;
Data citation and DOI are defined in three steps during the publication process:&lt;br /&gt;
&lt;br /&gt;
* registry status &#039;&#039;will be registered&#039;&#039;, then&lt;br /&gt;
* registry status &#039;&#039;registration is in the lead time&#039;&#039;, with DOI registration in progress for &#039;&#039;&#039;30 days&#039;&#039;&#039;, followed by&lt;br /&gt;
* registry status &#039;&#039;registered&#039;&#039; after transfer of the DOI to the DOI registry; the DOI can then be resolved globally, e.g. at https://doi.org/&lt;br /&gt;
#If a data set is imported and its status is set to &#039;&#039;validated&#039;&#039;, its internal ID can only be resolved as a &#039;&#039;&#039;preliminary DOI&#039;&#039;&#039; through &#039;&#039;doi.pangaea.de&#039;&#039; (PANGAEA&#039;s own DOI resolver). In the citation, the data set is identified as &#039;&#039;&#039;Dataset #738509&#039;&#039;&#039;&lt;br /&gt;
#If a data set&#039;s status is set to &#039;&#039;published&#039;&#039;, the internal ID is changed to a globally resolvable &#039;&#039;&#039;technical DOI&#039;&#039;&#039; 4 weeks after the last edit, and the data set gets the status [[Citation|citable]]. In the citation, the data set is identified as &#039;&#039;&#039;Dataset #738509 (DOI registration in progress)&#039;&#039;&#039;, changing after 4 weeks to &#039;&#039;&#039;doi:10.1594/PANGAEA.82361&#039;&#039;&#039;, which can be resolved globally.&lt;br /&gt;
#On request, the data set can be defined as an official &#039;&#039;&#039;data publication&#039;&#039;&#039; and is then added to the library catalog of the TIB, see [[citation]].&lt;br /&gt;
&lt;br /&gt;
[[PangaVista]] and the [http://doi.pangaea.de/ DOI resolver of PANGAEA] can be used for any registered DOI, including preliminary DOIs of PANGAEA.&lt;br /&gt;
&lt;br /&gt;
If a data set to be archived in PANGAEA already has a DOI from another repository, that DOI will be indicated with its citation as &amp;quot;other version&amp;quot; in the metadata header.&lt;br /&gt;
&lt;br /&gt;
If a registered data set has to be deleted, the link/DOI of the substitute must be entered in the field &#039;&#039;other version&#039;&#039; &#039;&#039;&#039;before&#039;&#039;&#039; deletion.&lt;br /&gt;
* Example: [http://doi.pangaea.de/10.1594/PANGAEA.58757 doi:10.1594/PANGAEA.58757]&lt;br /&gt;
&lt;br /&gt;
== Prerequisites to become an agent for the registration of scientific primary data ==&lt;br /&gt;
Any data provider interested in assigning DOIs for data may use one of the agents listed below or become a new agent of the &#039;&#039;data DOI&#039;&#039; agency TIB. When establishing a data system/center, new agents need to assure the following points, defined through a &#039;&#039;&#039;concept&#039;&#039;&#039; and a &#039;&#039;&#039;[[data policy]]&#039;&#039;&#039;:&lt;br /&gt;
* &#039;&#039;&#039;Metadata&#039;&#039;&#039;&lt;br /&gt;
** metadata are mandatory and should follow standards of the specific scientific field the data are covering (e.g. ISO19115 for geo-data)&lt;br /&gt;
** data sets must be accompanied by a citation, consisting of bibliographic fields according to the STD-DOI application profile&lt;br /&gt;
* &#039;&#039;&#039;Access and availability&#039;&#039;&#039;&lt;br /&gt;
** long-term availability must be assured, stable linking is provided by means of a DOI&lt;br /&gt;
** data must be available online, assuring Open Access to metadata; Open Access to data is highly recommended (access restrictions may apply for a moratorium period); data should be provided under a CC [[license]]&lt;br /&gt;
** it is highly recommended that data are &#039;&#039;machine readable&#039;&#039;, giving data in the repository an &#039;&#039;added value&#039;&#039;. This means that&lt;br /&gt;
*** (1) data are provided in a standard technical format (ASCII and ISO formats are best)&lt;br /&gt;
*** (2) data are organized in a way that further processing of any part of the repository can easily be performed (data model, relational database)&lt;br /&gt;
** a full backup of the data repository must be assured&lt;br /&gt;
* &#039;&#039;&#039;Data review and integrity&#039;&#039;&#039;&lt;br /&gt;
** once registered, data sets are static&lt;br /&gt;
** versioning is allowed, different versions should be linked to each other&lt;br /&gt;
** data curation must include an editorial process with proofreading by the author/principal investigator (the author is responsible for the scientific quality of the data!)&lt;br /&gt;
** an external peer-review of data publications is recommended&lt;br /&gt;
&lt;br /&gt;
== Links to agents for archiving geoscientific primary data with DOI ==&lt;br /&gt;
&lt;br /&gt;
*[http://www.mad.zmaw.de/wdc-for-climate/ World Data Center for Climate (&#039;&#039;&#039;WDCC&#039;&#039;&#039;)] for climate models&lt;br /&gt;
** &#039;&#039;example {{doi|10.1594/WDCC/EH5-T63L31_OM_1CO2_2_MM}}&#039;&#039;&lt;br /&gt;
** contact [mailto:lautenschlager@dkrz.de &#039;&#039;&#039;Michael Lautenschlager&#039;&#039;&#039;]&lt;br /&gt;
----&lt;br /&gt;
*[http://www.pangaea.de &#039;&#039;&#039;PANGAEA&#039;&#039;&#039; data library] for georeferenced observational data (including [http://www.wdc-mare.org WDC-MARE])&lt;br /&gt;
** &#039;&#039;example: {{doi|10.1594/PANGAEA.484677}}&#039;&#039;&lt;br /&gt;
** [https://www.pangaea.de/contact/ Contact]&lt;br /&gt;
----&lt;br /&gt;
*[http://wdc.dlr.de/ World Data Center for Remote Sensing of the Atmosphere (&#039;&#039;&#039;WDC-RSAT&#039;&#039;&#039;)] for remote sensing&lt;br /&gt;
** &#039;&#039;example {{doi|10.1594/WDCRSAT.5Q6Q9Q9B}}&#039;&#039;&lt;br /&gt;
** contact [mailto:michael.bittner@dlr.de &#039;&#039;&#039;Michael Bittner&#039;&#039;&#039;]&lt;br /&gt;
----&lt;br /&gt;
*[http://www.gfz-potsdam.de/ GeoForschungszentrum Potsdam (&#039;&#039;&#039;GFZ&#039;&#039;&#039;) with &#039;&#039;&#039;ICDP&#039;&#039;&#039;]&lt;br /&gt;
** &#039;&#039;example {{doi|10.1594/GFZ/ICDP/KTB/ktb-geoch-gaschr-p}}&#039;&#039;&lt;br /&gt;
** contact [mailto:jklump@gfz-potsdam.de &#039;&#039;&#039;Jens Klump&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
== DOI provision service for reports and grey literature by PANGAEA/TIB ==&lt;br /&gt;
For reports and grey literature such as Master&#039;s or PhD theses, a DOI can be assigned by the TIB. The DOI prefix for these kinds of documents is 10.2312. &#039;&#039;&#039;Update 2020-12&#039;&#039;&#039;: a new prefix was provided by the TIB to account for inconsistencies with the DataCite policy; please use 10.48433 from now on.&lt;br /&gt;
&lt;br /&gt;
For submitting documents as PDF to TIB:&amp;lt;br /&amp;gt;&lt;br /&gt;
1) Put all PDF files into one directory. The file names of the PDF files have to be the suffix of the DOI (case sensitive!).&amp;lt;br /&amp;gt;&lt;br /&gt;
2) Create a control file for [[Intern:PanXML]]. See also [[File:Metadata_grey_literature_v3.pdf|Metadata for grey literature]]&amp;lt;br /&amp;gt;&lt;br /&gt;
3) Execute the control file with [[Intern:PanXML]].&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
PanXML creates an XML file for each DOI. Send all PDF files together with XML files to TIB (Frauke Ziedorn or Britta Dreyer) in one zip archive.&lt;br /&gt;
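&lt;br /&gt;
Because the PDF file names must exactly match the case-sensitive DOI suffixes, a quick check before zipping the files for TIB can catch mismatches early. The following Python sketch is a hypothetical helper (not part of PanXML); the function name &#039;&#039;missing_pdfs&#039;&#039; is an assumption for illustration:&lt;br /&gt;

```python
from pathlib import Path

def missing_pdfs(pdf_dir: str, dois: list[str]) -> list[str]:
    """Return the DOIs whose expected PDF file (named exactly like the
    case-sensitive DOI suffix, plus '.pdf') is not present in pdf_dir."""
    present = {p.name for p in Path(pdf_dir).glob("*.pdf")}
    return [doi for doi in dois
            if doi.split("/", 1)[1] + ".pdf" not in present]
```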
&lt;br /&gt;
== Links ==&lt;br /&gt;
*[http://www.tib-hannover.de/de/die-tib/doi-registrierungsagentur/ DOI-Registrieragentur @ TIB]&lt;br /&gt;
*The International DOI Foundation (IDF) https://doi.org/&lt;br /&gt;
*shortDOI service http://shortdoi.org&lt;br /&gt;
*DOI handbook: {{Doi|10.1000/182}}&lt;br /&gt;
*DOI Project for scientific primary data http://www.std-doi.de&lt;br /&gt;
*DOI of the DOI system: {{Doi|10.1000/1}}&lt;br /&gt;
*DOI of the Pangaea data library: {{Doi|10.1594/PANGAEA}}&lt;br /&gt;
*[http://doi.pangaea.de DOI resolver of Pangaea]&lt;br /&gt;
*[http://de.wikipedia.org/wiki/Uniform_Resource_Identifier What is a resource identifier?]&lt;br /&gt;
*Publication using a &#039;&#039;child-&#039;&#039;DOI for an image: {{doi|10.1371/journal.pbio.0020449}}, see &#039;&#039;Reconstruction of Neanderthal woman&#039;&#039;&lt;br /&gt;
*[http://hdl.handle.net Handle resolver]&lt;br /&gt;
*[http://handle.net/ Handle system]&lt;br /&gt;
**to check handle values, type in a DOI and check &#039;&#039;Don&#039;t Redirect to URLs&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
* DOI provider for the German-speaking countries: http://www.mvb-online.de&lt;br /&gt;
* DOI provider for Europe: http://www.medra.org/&lt;br /&gt;
----&lt;br /&gt;
* [http://kopal.langzeitarchivierung.de &#039;&#039;&#039;kopal&#039;&#039;&#039; - Kooperativer Aufbau eines Langzeitarchivs digitaler Informationen]&lt;br /&gt;
* [http://www.langzeitarchivierung.de/ &#039;&#039;&#039;nestor&#039;&#039;&#039; - Kompetenznetzwerk Langzeitarchivierung]&lt;br /&gt;
** [http://nestor.sub.uni-goettingen.de/handbuch/index.php nestor-Handbuch]&lt;br /&gt;
* [http://www.tib.uni-hannover.de/ueber_uns/projekte/vascoda/ &#039;&#039;&#039;vascoda&#039;&#039;&#039; - Internet-Portal für wissenschaftliche Information]&lt;br /&gt;
** [http://www.vascoda.de/ vascoda search]&lt;br /&gt;
* [http://www.parse-insight.eu/ &#039;&#039;&#039;PARSE.Insight&#039;&#039;&#039; - Permanent access to the records of science in Europe]&lt;br /&gt;
&lt;br /&gt;
== Resolver ==&lt;br /&gt;
* The CNRI Handle Extension for Firefox is part of the official [http://addons.mozilla.org/addon/10820 Resolver Add-ons for Firefox]&lt;br /&gt;
* &#039;&#039;&#039;https://doi.org&#039;&#039;&#039; - to be used with priority!&lt;br /&gt;
** https://doi.org/10.1007/s00367-006-0049-8&lt;br /&gt;
*** spelling {{doi|10.1007/s00367-006-0049-8}}&lt;br /&gt;
* https://doi.pangaea.de - resolves DOIs, handles, and unregistered &amp;quot;DOIs&amp;quot; of PANGAEA&lt;br /&gt;
** https://doi.pangaea.de/10.1594/PANGAEA.547989&lt;br /&gt;
*** spelling {{doi|10.1594/PANGAEA.547989}}&lt;br /&gt;
* http://hdl.handle.net&lt;br /&gt;
** http://hdl.handle.net/10013/epic.32128&lt;br /&gt;
*** spelling {{hdl|10013/epic.32128}}&lt;br /&gt;
* http://nbn-resolving.de&lt;br /&gt;
** http://nbn-resolving.de/urn:nbn:de:gbv:46-ep000103869&lt;br /&gt;
** http://nbn-resolving.de/urn/resolver.pl?urn=urn:nbn:de:tib-10.1594/PANGAEA.5479896 (special case with doi as part of the urn)&lt;br /&gt;
* http://www.sref.org (introduced by the publisher &#039;&#039;Copernicus&#039;&#039;; out of service, superseded by DOIs in 2009)&lt;br /&gt;
** http://direct.sref.org/1814-9332/cp/2005-1-19&lt;br /&gt;
*** spelling {{sref|1814-9332/cp/2005-1-19}}&lt;br /&gt;
* http://lsid.tdwg.org/ (life science identifier)&lt;br /&gt;
** http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:2659717&lt;br /&gt;
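&lt;br /&gt;
All of the resolvers above follow the same pattern: a base URL followed by the identifier. A minimal Python sketch, assuming the URL patterns listed above (the helper &#039;&#039;resolver_url&#039;&#039; is hypothetical, not an official service):&lt;br /&gt;

```python
# Base URLs taken from the resolver list above.
RESOLVERS = {
    "doi": "https://doi.org/",             # preferred global DOI resolver
    "pangaea": "https://doi.pangaea.de/",  # also resolves preliminary PANGAEA DOIs
    "hdl": "http://hdl.handle.net/",       # Handle system
    "urn": "http://nbn-resolving.de/",     # German national URN resolver
}

def resolver_url(scheme: str, identifier: str) -> str:
    """Build the resolver URL for a persistent identifier of the given scheme."""
    return RESOLVERS[scheme] + identifier
```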
&lt;br /&gt;
for &#039;&#039;&#039;ISBN&#039;&#039;&#039; there is no online redirect and thus no direct resolver, see&lt;br /&gt;
* http://en.wikipedia.org/wiki/Special:BookSources?isbn=9783000050282&lt;br /&gt;
* http://en.wikipedia.org/wiki/Wikipedia:ISBN&lt;br /&gt;
&lt;br /&gt;
IGSN (International Geo Sample Number) &#039;&#039;development stage&#039;&#039;&lt;br /&gt;
* IGSN:	ODP010MEY&lt;br /&gt;
** http://app.geosamples.org/sample/igsn/ODP010MEY&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
	<entry>
		<id>https://wiki.pangaea.de/w/handler?title=NADC&amp;diff=16696</id>
		<title>NADC</title>
		<link rel="alternate" type="text/html" href="https://wiki.pangaea.de/w/handler?title=NADC&amp;diff=16696"/>
		<updated>2026-03-26T16:06:22Z</updated>

		<summary type="html">&lt;p&gt;Lmoeller: Lmoeller moved page NADC to Intern:NADC: last edit 2016 - relevance for PANGAEA docu questionable&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Intern:NADC]]&lt;/div&gt;</summary>
		<author><name>Lmoeller</name></author>
	</entry>
</feed>