Techdoc:Access to data and metadata

The figure shows how PANGAEA data and metadata are accessed, distributed, and preserved.



The data and metadata stored in the RDB are cached on disk, currently on the local web server file system for performance reasons. Each file has a unique database identifier (an integer), from which a canonical path to the file is constructed: a configurable prefix (currently /pangaea/middleware/cache) followed by the data set identifier, padded with zeroes to 8 digits and grouped into path components of 2 digits. For example, data set ID 80968 maps to the file path /pangaea/middleware/cache/00/08/09/68.
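The path construction described above can be sketched in a few lines of Python. This is an illustration only, not the PANGAEA middleware code: the function name `cache_path` and the rejection of identifiers longer than 8 digits are assumptions, since the text does not say how such identifiers are handled.

```python
from pathlib import PurePosixPath

# Prefix quoted in the text; it is described as configurable.
CACHE_PREFIX = "/pangaea/middleware/cache"


def cache_path(dataset_id: int, prefix: str = CACHE_PREFIX) -> PurePosixPath:
    """Map a data set identifier to its canonical cache path (hypothetical helper)."""
    # Assumption: identifiers must fit into 8 digits; the source does not
    # describe what happens for larger values.
    if not 0 <= dataset_id <= 99_999_999:
        raise ValueError(f"identifier out of range: {dataset_id}")
    padded = f"{dataset_id:08d}"                          # 80968 -> "00080968"
    parts = [padded[i:i + 2] for i in range(0, 8, 2)]     # ["00", "08", "09", "68"]
    return PurePosixPath(prefix, *parts)


print(cache_path(80968))  # /pangaea/middleware/cache/00/08/09/68
```

Grouping into two-digit components keeps each directory level at no more than 100 entries.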

File metadata are kept in the RDB, including MIME type, title, size, MD5 hash, and optional thumbnail information stored as a blob (generated, for example, during archival of image data). Like the files in the data cache, each file has a unique database identifier (also an integer). This applies to data files as well as to additional files documenting the data content (further details etc.). In this way, access to files and to RDB content can be handled consistently. The sole difference is the location of the files, expressed in the prefix, which points to the hs.
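As an illustration of the metadata fields listed above, the following sketch builds such a record for a file's content. The class `FileMetadata`, the helper `describe_file`, and the fallback MIME type are assumptions made for this example; the actual RDB schema is not given in the text.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileMetadata:
    """Hypothetical record mirroring the fields named in the text."""
    file_id: int                       # unique database identifier
    title: str
    mimetype: str
    size: int                          # size in bytes
    md5: str                           # hex digest of the file content
    thumbnail: Optional[bytes] = None  # optional blob, e.g. for images


def describe_file(file_id: int, title: str, filename: str, content: bytes,
                  thumbnail: Optional[bytes] = None) -> FileMetadata:
    # Guess the MIME type from the file name; the fallback is an assumption.
    guessed, _ = mimetypes.guess_type(filename)
    return FileMetadata(
        file_id=file_id,
        title=title,
        mimetype=guessed or "application/octet-stream",
        size=len(content),
        md5=hashlib.md5(content).hexdigest(),
        thumbnail=thumbnail,
    )


meta = describe_file(80968, "Example readme", "readme.txt", b"hello")
print(meta.mimetype, meta.size, meta.md5)
```

The MD5 hash serves here as an integrity checksum for the stored file, not as a security feature.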

Access to the files on hs is realized using hard links, which allow multiple names to be created for the same file. Hard links ensure that files persist in case of conflicting usage (e.g. files being moved or deleted by other authorized groups). The canonical 'view' of PANGAEA will be maintained exclusively by a generic PANGAEA user. The corresponding web service, currently the 'hs.pangaea.de' web server, which provides access to the underlying hs file system, will be replaced by a REST-based service on 'doi.pangaea.de'. This service resolves the PANGAEA internal identifiers and handles access rights via the information stored in the RDB. It will also allow single files to be extracted on the fly from zipped containers, as well as the download of complete containers. Files are preferably stored as zipped containers.
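The persistence property of hard links can be demonstrated with a small, self-contained sketch. The file names are hypothetical and the demonstration runs in a temporary directory; it only shows the general file system behaviour, not the actual layout on hs.

```python
import os
import tempfile


def demo_hard_link_persistence() -> str:
    """Create a file, hard-link it, delete the first name, read via the link."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical names: one used by another group, one for the canonical view.
        original = os.path.join(root, "group_upload.dat")
        canonical = os.path.join(root, "canonical_view.dat")

        with open(original, "w", encoding="utf-8") as fh:
            fh.write("measurement data")

        # Second name for the same underlying file (same inode on POSIX systems).
        os.link(original, canonical)

        # Another party removes "its" name; the content stays reachable.
        os.remove(original)

        with open(canonical, encoding="utf-8") as fh:
            return fh.read()


print(demo_hard_link_persistence())  # measurement data
```

The file content is only released once the last name pointing to it has been removed, which is why the canonical name survives deletions elsewhere. Note that hard links require both names to reside on the same file system.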

Files are backed up onto tertiary media (tape silo) by default. Migration is based on file size, available free hard disk space, and access frequency, ensuring that smaller and frequently used files stay permanently on disk.

Data and information resources outside of PANGAEA will be referenced by URI (DOI, Handle, URL). This also includes references to gray literature held in institutional repositories. The URIs are maintained in a separate table in the RDB. All URIs are checked regularly; irregularities such as broken links or invalid content are logged on the PANGAEA background services queue.
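A regular URI check of this kind might look roughly like the following sketch. Everything here is an assumption for illustration: the function names, the prefix-based classification into DOI, Handle, and URL, and the rule that an HTTP status of 400 or above counts as an irregularity. The `fetch` callable is injected so that the logic can be exercised without network access; the real background service and its log format are not described in the text.

```python
from typing import Callable, Iterable, List, Tuple


def classify_uri(uri: str) -> str:
    """Rough classification by prefix or resolver host (assumed heuristic)."""
    lowered = uri.lower()
    if lowered.startswith("doi:") or "doi.org/" in lowered:
        return "DOI"
    if lowered.startswith("hdl:") or "hdl.handle.net/" in lowered:
        return "HANDLE"
    return "URL"


def check_uris(uris: Iterable[str],
               fetch: Callable[[str], int]) -> List[Tuple[str, str, str]]:
    """Return (uri, kind, problem) entries for every URI that looks broken.

    `fetch` is a caller-supplied function returning an HTTP status code
    and raising OSError when the resource cannot be reached.
    """
    problems = []
    for uri in uris:
        try:
            status = fetch(uri)
        except OSError as exc:
            problems.append((uri, classify_uri(uri), f"unreachable: {exc}"))
            continue
        if status >= 400:  # assumed threshold for a broken link
            problems.append((uri, classify_uri(uri), f"HTTP {status}"))
    return problems


# Usage with canned status codes instead of real network requests.
statuses = {
    "https://doi.org/10.1000/example": 200,
    "https://example.org/gone.pdf": 404,
}
for entry in check_uris(statuses, lambda u: statuses[u]):
    print(entry)
```

In a real deployment the returned entries would be written to the background services queue rather than printed.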