Technology
PANGAEA is built on a three-tiered client/server architecture that has evolved continuously since the system's founding in the early 1990s. The architecture separates backend storage and database systems, middleware processing and transformation components, and frontend delivery and user interfaces into distinct layers that are individually maintainable and replaceable. This design philosophy has enabled PANGAEA to migrate core components — including its primary database engine, its editorial system, and its search infrastructure — without disruption to archived data or to users, and underpins the system's long-term stability as a trustworthy data repository.
A detailed description of the PANGAEA information system and its workflows is provided in Felden et al. (2023) and Diepenbroek et al. (2017); this article provides an up-to-date overview of the system's general structure and key components.
System Architecture
The technical architecture of PANGAEA follows a three-tiered model comprising a backend, a middleware layer, and a frontend. All hardware and software services are hosted and operated by the data and computing center of the Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI) in Bremerhaven. Most backend and middleware systems, as well as all frontend web servers and search engines, run on VMware virtual machines under Ubuntu Linux. Virtualization provides sufficient capacity and performance while enabling high availability and transparent hardware renewal, typically on a three- to four-year cycle.

PANGAEA currently operates nine virtual machines at the AWI computing center, using a total of 53 CPUs, 162 GB RAM, and 28 TB of disk space. The relational database holdings currently occupy approximately 130 TB, with approximately 1 PB of data stored on tape.
Backend: Storage and Databases
Relational Database
The primary store for all structured data and metadata in PANGAEA is a PostgreSQL relational database management system (RDBMS). The data model is highly normalized, reflecting the full observational context of each measurement: events (when and where data were collected), campaigns (cruises or field expeditions), methods and devices, parameters, references, and institutional provenance information are all stored in defined relational structures. This normalized model allows data descriptions to be compiled dynamically and serialized into a wide range of output formats on demand, without modifying the underlying archived data.
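To make the idea of a normalized model with on-demand serialization concrete, the following sketch uses an in-memory SQLite database with a drastically simplified stand-in schema (the real PANGAEA model has many more entities and attributes; table and column names here are illustrative assumptions, not the production schema):

```python
import json
import sqlite3

# Illustrative only: a simplified stand-in for the normalized PANGAEA schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE campaign (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE event (id INTEGER PRIMARY KEY,
                    campaign_id INTEGER REFERENCES campaign(id),
                    latitude REAL, longitude REAL, date_time TEXT);
CREATE TABLE dataset (id INTEGER PRIMARY KEY, title TEXT,
                      event_id INTEGER REFERENCES event(id));
""")
con.execute("INSERT INTO campaign VALUES (1, 'PS124')")
con.execute("INSERT INTO event VALUES (1, 1, -75.0, -27.5, '2021-02-14T12:00:00')")
con.execute("INSERT INTO dataset VALUES (1, 'Water temperature profile', 1)")

def describe(dataset_id: int) -> dict:
    """Compile a dataset description dynamically by joining the normalized tables."""
    row = con.execute("""
        SELECT d.title, c.label, e.latitude, e.longitude, e.date_time
        FROM dataset d JOIN event e ON d.event_id = e.id
                       JOIN campaign c ON e.campaign_id = c.id
        WHERE d.id = ?""", (dataset_id,)).fetchone()
    title, campaign, lat, lon, when = row
    # Serialize on demand; other serializers (XML, JSON-LD, ...) could reuse
    # the same join without touching the archived rows.
    return {"title": title, "campaign": campaign,
            "coverage": {"lat": lat, "lon": lon, "dateTime": when}}

print(json.dumps(describe(1)))
```

The key design point is that the stored rows are never rewritten: each output format is just another view compiled from the same joins.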
Database integrity is continuously maintained through PostgreSQL streaming replication to a dedicated backup system, which enables point-in-time recovery to any moment prior to a failure event.
Data Warehouse
Fast access to large, aggregated compilations of data across the full PANGAEA holdings is provided by a ClickHouse data warehouse, which mirrors the numerical data inventory from the relational database. The data warehouse supports spatially and chronologically constrained queries at the parameter level across potentially hundreds of thousands of individual datasets, and is accessible both through the PANGAEA website and programmatically via a REST API. All data warehouse exports include the DOI for each contributing dataset, ensuring that provenance and citation remain intact in compiled data products.
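A spatially and temporally constrained warehouse request might be assembled as follows. This is a sketch only: the endpoint path and parameter names are assumptions for illustration, not the documented PANGAEA API.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint path, for illustration only.
BASE = "https://www.pangaea.de/dw/query"

def build_query(parameter: str, bbox: tuple, start: str, end: str) -> str:
    """bbox = (min_lon, min_lat, max_lon, max_lat); start/end as ISO 8601 dates."""
    params = {
        "parameter": parameter,
        "bbox": ",".join(str(v) for v in bbox),
        "start": start,
        "end": end,
    }
    return f"{BASE}?{urlencode(params)}"

url = build_query("water temperature", (-30.0, -80.0, -20.0, -70.0),
                  "2020-01-01", "2020-12-31")
print(url)
```

A real client would send this request and receive tabular results in which every row carries the DOI of its source dataset.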
Archival Storage
High-volume and binary data — including geophysical datasets, images, video, and community-specific formats such as NetCDF — are stored in consistent formats on hard-disk arrays and robotic tape archives. The central archival storage system consists of two SpectraLogic TFinity ExaScale robotic tape libraries housed in separate buildings at AWI, with a combined capacity of up to 60 PB. Backup operations use high-capacity LTO tape drives within this environment.
All data are stored redundantly using erasure coding across disk and tape. Data on disk are replicated to tape nightly and captured in snapshots retained for six months. Tape copies are replicated to a physically separate building within two hours of creation; decommissioned tapes are retained for one year before reuse. Virtual machine working data are captured in nightly machine snapshots. Write caches are battery-backed to ensure data integrity at the point of write.
Off-Site Replica
Since 2025, PANGAEA operates an off-site replica of the relational database and web frontend at MARUM/University of Bremen, hosted in the Green IT Housing Center of Bremen University. This facility provides geographic and infrastructural separation from the primary AWI systems, enabling recovery of data delivery services within 24 hours following a technical breakdown or cyberattack. The replica is kept current through snapshot-based replication several times per day and currently covers all dataset metadata — including individual landing pages and all harvesting endpoints — as well as full representations of tabular data publications. Extension of the replica to include binary data files is planned as a further development of this facility.
The off-site installation operates in a fully isolated environment with Layer 2 network separation. Access is restricted to VPN connections from a single designated gateway host at AWI, with access tokens and keys limited to the corresponding gateway and PANGAEA DevOps staff.
Middleware: Processing and Transformation
The middleware layer manages the flow of data and metadata between the backend storage systems and the frontend interfaces. Its core functions are marshaling, indexing, transformation, and dissemination.
Metadata are dynamically marshaled from the relational database to a PANGAEA-specific internal XML format, from which they are transformed via XSLT and XML-to-JSON pipelines into a range of content standards for delivery to users and harvesting services. Currently supported output standards include JSON-LD according to schema.org, DataCite XML, Dublin Core XML, ISO 19115/ISO 19139, DIF (Directory Interchange Format), and Darwin Core. Dissemination occurs via OAI-PMH, via HTTP content negotiation in line with HTTP standards and FAIR technical recommendations, and via other protocols.
The marshaled metadata are stored and indexed in Elasticsearch, which serves as the primary index for all public search and metadata access interfaces. This architecture separates the authoritative relational record from the search-optimized representation, allowing the search index to be rebuilt or updated without modifying archived data.
The flexible metadata framework PanFMP (https://www.panfmp.org/) underpins the mapping between the PANGAEA internal schema and the various external output standards, ensuring that new or updated standards can be accommodated without structural changes to the underlying data.
Data submissions, user requests, and bug reports are managed through a JIRA (Atlassian) issue tracking system, which serves as the primary communication channel between data providers and the PANGAEA editorial team throughout the submission and publication process.
Frontend: Editorial System
The PANGAEA editorial system is the primary tool through which data editors review, curate, harmonize, and publish submitted data and metadata. It is a web-based client/server application developed entirely in-house, operating directly on the PostgreSQL databases.
The backend of the editorial system is built on Java 17 using the Dropwizard framework, exposing a REST API. The frontend is implemented in React with the Ant Design component library. Source code for both components, together with shared libraries and test suites, is maintained in an institutional GitLab instance hosted at AWI, structured as an Nx monorepo. The repository encompasses automated unit tests and Cypress end-to-end tests. All new versions are deployed exclusively through a GitLab CI/CD pipeline, with releases gated on successful completion of the full automated test suite.
In production, the editorial system runs on Ubuntu virtual machines hosted on VMware infrastructure. Apache serves the React frontend and acts as a reverse proxy to the backend services. Four instances are operated in parallel, providing dedicated test and production environments for two editorial system versions at any given time. This setup supports controlled feature validation alongside uninterrupted service operation. Infrastructure components — including VMware, file systems, monitoring, GitLab, and the database platform — are managed by AWI, while the PANGAEA group is responsible for the Apache web server, Java services, and React frontend.
The relational database model underlying the editorial system enforces structural and semantic consistency between submitted metadata and the existing data inventory. The editorial system also integrates a Terminology Catalogue (TC), which manages controlled vocabularies and ontologies used to harmonize data and metadata during ingest. Supported terminologies include WoRMS, ITIS, ChEBI, EnvO, PATO, QUDT, and the NERC vocabulary server, among others. A detailed description of the Terminology Catalogue and its role in data archiving and publication is given in Diepenbroek et al. (2017).
Frontend: Public Web Interface and Search
The public PANGAEA website and search interface is served through Apache web servers in the frontend tier, backed by the Elasticsearch index for fast metadata retrieval.
Search
The PANGAEA search engine supports full-text and faceted search across all published metadata. Faceted navigation is enabled by semantic metadata enrichment performed during the marshaling process: terminology-based annotations are added to metadata records, allowing consistent filtering by topic, device type, geographic region, and other dimensions. Search documentation is available in the Wiki at https://wiki.pangaea.de/wiki/PANGAEA_search.
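An Elasticsearch request combining full-text search with a facet aggregation has roughly the following shape. This is an illustrative sketch: the field names (`fulltext`, `device`) are assumptions, not the actual PANGAEA index mapping.

```python
import json

def faceted_query(text: str, facet_field: str, size: int = 10) -> dict:
    """Build an Elasticsearch body: full-text match plus a terms facet."""
    return {
        # Full-text part: match the query string against an indexed text field.
        "query": {"match": {"fulltext": text}},
        # Facet part: count the top `size` values of the facet field,
        # which the UI renders as clickable filters.
        "aggs": {facet_field: {"terms": {"field": facet_field, "size": size}}},
    }

body = faceted_query("sea ice thickness", "device")
print(json.dumps(body))
```

Selecting a facet value in the UI then adds a filter clause to the query, narrowing both the hit list and the remaining facet counts.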
Map Search
Geographic search and visualization are implemented using leaflet.js, serving map tiles from four configurable sources: the AWI basemap, OpenStreetMap, Google (hybrid), and ESRI.
Dataset Landing Pages
Each published dataset is represented by a landing page resolved through its DOI. Landing pages present the full dataset metadata in human-readable form and are enriched with structured schema.org markup in JSON-LD, ensuring machine-actionability and compatibility with generic web search engines and data registries. The schema.org metadata is also accessible via HTTP content negotiation.
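A consumer of the embedded markup can extract citation and variable information with a few lines of standard JSON handling. The record below is a minimal, abbreviated illustration of a schema.org Dataset (the DOI is a placeholder, not a real identifier):

```python
import json

# Minimal schema.org Dataset of the kind embedded in landing pages
# (abbreviated for illustration; the DOI is a placeholder).
record = json.loads("""
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "identifier": "https://doi.org/10.1594/PANGAEA.000000",
  "name": "Example dataset title",
  "variableMeasured": [
    {"@type": "PropertyValue", "name": "Temperature, water"},
    {"@type": "PropertyValue", "name": "Salinity"}
  ]
}
""")

doi = record["identifier"]
variables = [v["name"] for v in record["variableMeasured"]]
print(doi, variables)
```

Generic web crawlers and data registries read exactly this structure, which is why the markup makes datasets discoverable outside PANGAEA's own search.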
Programmatic Access
PANGAEA offers programmatic access to data and metadata through a range of web services (SOAP and REST). The OAI-PMH endpoint supports metadata harvesting in all supported standards. Client libraries for Python (pangaeapy, developed by PANGAEA) and R (pangaear, developed by the community) allow researchers to load and transform PANGAEA data directly into native data structures for analysis in environments such as Jupyter notebooks.
Monitoring
Service health across the PANGAEA infrastructure is monitored through the AWI Grafana/Telegraf stack, covering service availability, application logs, and resource saturation for all production systems. This is supplemented by external availability checks via UptimeRobot, which provides independent verification of public-facing service endpoints from outside the AWI network. All external links in PANGAEA metadata records — references to related literature, other dataset versions, and external resources — are automatically checked for broken (HTTP 404) or permanently redirected (HTTP 301) responses on a weekly basis.
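The weekly link check reduces to classifying the HTTP response for each external link. A minimal sketch of that classification step (the function and categories are illustrative, not the actual monitoring code):

```python
def classify(status: int, permanent_redirect: bool = False) -> str:
    """Classify a link-check result by final HTTP status."""
    if status == 404:
        return "broken"        # target gone; record needs editorial attention
    if permanent_redirect or status == 301:
        return "redirected"    # target moved; stored URL should be updated
    if 200 <= status < 300:
        return "ok"
    return "other"             # timeouts, server errors, etc.

print([classify(200), classify(404), classify(301)])
```

In practice the checker must follow redirect chains and remember whether any hop was permanent, which is why `permanent_redirect` is tracked separately from the final status.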
Security
Backend and middleware systems are protected behind a firewall; frontend systems operate in a demilitarized zone (DMZ) that is reachable from outside but remains firewalled with restricted access. Frontend systems have no write access and only limited read access to the backend database and tape archives. Public access from the website and REST APIs is served from data replicas hosted in Elasticsearch or through read-only remote filesystem access. Data curators access production systems via virtual private networks (VPN), which enforce a basic check that client operating systems are up to date.
Physical access to the AWI computing center is managed through an electronic access control system for all relevant entrances; key distribution is documented, and guest policies are formally established. The AWI maintains uninterruptible power supplies (UPS) capable of sustaining all PANGAEA-relevant hardware for up to 60 minutes, backed by a diesel-powered emergency generator providing a further 23 hours of operation. Fire alerts are automatically forwarded to the fire department with a contractual response time under 15 minutes.
Security of the technical infrastructure is further maintained through asymmetric key infrastructures, mandatory minimum-length passwords for all user classes, short-cycle security patching for all hardware and software components, professional monitoring of hardware, firewalls, software, services, performance, and attacks, and regular security training for all technical and non-technical staff. Security risks are assessed on an ongoing basis by AWI's institutional IT security team, which coordinates incident response and monitors threat landscapes relevant to research data infrastructure. PANGAEA technical staff participate in regular security reviews.
Access controls for restricted or moratorium datasets are enforced at the application layer. Dataset metadata remain publicly accessible in all cases; access to the data themselves is restricted to individually authorized users for the duration of the moratorium, typically a maximum of two years from submission.
The off-site replica at MARUM operates in a fully isolated network environment (Layer 2 separation) with access restricted to a firewalled VPN gateway at AWI. The Green IT Housing Center is monitored 24/7 by Bremen University staff, with automated fire alarms, redundant power supply with battery backup, and multilevel physical access control.
Software Inventory
The following table summarizes the principal software components of the PANGAEA infrastructure and their development model.
| Component | Software | Development Model |
|---|---|---|
| Primary database | PostgreSQL | Open source (community-supported) |
| Data warehouse | ClickHouse | Open source (community-supported) |
| Search and metadata index | Elasticsearch | Open source (community-supported) |
| Editorial system backend | Java 17 / Dropwizard | In-house development |
| Editorial system frontend | React / Ant Design | In-house development (open source framework) |
| Source control | GitLab | Open source (community-supported, self-hosted) |
| CI/CD pipeline | GitLab CI/CD | Open source (community-supported) |
| Web server / reverse proxy | Apache | Open source (community-supported) |
| Map visualization | leaflet.js | Open source (community-supported) |
| Metadata framework | PanFMP | In-house development (open source) |
| Issue tracking | JIRA (Atlassian) | Commercial |
| Monitoring stack | Grafana / Telegraf | Open source (community-supported) |
| External uptime monitoring | UptimeRobot | Commercial SaaS |
| Infrastructure virtualization | VMware | Commercial |
| Tape archive hardware | SpectraLogic TFinity ExaScale | Commercial hardware |
| Python data client | pangaeapy | In-house development (open source) |
| R data client | pangaear | Community development (open source) |
Documentation and Change Management
General documentation of PANGAEA systems and services is maintained in the public PANGAEA Wiki (https://wiki.pangaea.de/). A separate internal Confluence Wiki, kept operationally isolated from the main PANGAEA infrastructure for disaster-safety purposes, contains detailed documentation of server configurations, installation and maintenance procedures, relationships between system components, VM snapshot procedures, backup routines, and service restart priorities for incident response.
All changes to published data and metadata are recorded in the editorial system's version history. Substantive revisions to published datasets result in a new dataset version with a new DOI; all prior versions remain accessible and cross-referenced. Minor editorial corrections that do not affect scientific content are applied without creating a new identifier.
References
- Felden, J., Möller, L., Schindler, U., Huber, R., Schumacher, S., Koppe, R., Diepenbroek, M. & Glöckner, F.O. (2023). PANGAEA — Data Publisher for Earth & Environmental Science. Scientific Data, 10, 347. https://doi.org/10.1038/s41597-023-02269-x
- Diepenbroek, M., Schindler, U., Huber, R., Pesant, S., Stocker, M., Felden, J., Buss, M. & Weinrebe, M. (2017). Terminology supported archiving and publication of environmental science data in PANGAEA. Journal of Biotechnology, 261, 177–186. https://doi.org/10.1016/j.jbiotec.2017.07.016