Data policy

The aim of this data policy is to facilitate operation and use of the data library PANGAEA - Publisher for Earth & Environmental Data. The system is operated as an archive, publisher and library for data from earth system research. This policy recognises the benefits of providing Open Access to documented data from earth and environmental sciences for future use by the scientific community.

Principles

The guiding principle of the data library PANGAEA is Open Access to its content for the scientific community.

  • The content is defined as data from earth system research which can be georeferenced in time and space.
  • Data are distributed under a Creative Commons Attribution licence.
  • Data may be password-protected for a moratorium period; defining that period is the responsibility of the source project or institute.
  • Format and description of data (metadata) must ensure their widest and easiest possible use.
  • Data include a bibliographic citation. Users are urged to properly quote this citation when using data from the system.
  • Reliable long-term access to the data is assured by using persistent identifiers (DOIs), which are part of each data set citation.
  • The system is open to individual scientists, institutes or projects for data archiving and publication. In principle, data can be submitted free of charge. However, financial contributions are appreciated. Costs are a matter of negotiation.

Operation

Long-term availability (>10 years) of data in PANGAEA is assured through a commitment of the host institutes AWI and MARUM. The Pangaea department in both institutes is responsible for the technical quality, operation and consistency of the content. Persistent identification, data publication and widespread distribution are performed through networking functionality and web services on the Internet, using international standards.

Backup

Daily incremental backups and weekly full backups in two mirrored tape drive archives (capacity >1 PB), located in different buildings 1 km apart, ensure data safety and integrity.

The data flow is organized from the PI via the data curator of the project or institute to the archive for upload. Data availability should be monitored by the project or institute management and reviewed by the PI. Data curators are supervised by the data librarian. Individual scientists may send data directly to the librarian for archiving (mailto:info@pangaea.de).

PANGAEA is operated by the Alfred Wegener Institute for Polar and Marine Research (AWI), Bremerhaven and the Center for Marine Environmental Sciences (MARUM), Bremen, Germany for the benefit of the scientific community. The operating institutes encourage the widest possible use of Pangaea as a data library, in order to best realise its potential value. 

Data provision for upload

Data archiving includes:

  1. Metadata(*) of expeditions, stations, samples and activities;
  2. Metadata related to the factual data (authors, PI, reference, method, comment);
  3. Factual data from (a) archives/collections/databases, (b) expeditions/monitoring/time series, (c) supplements to publications;
  4. Products resulting from compilations and interpretations.

Specific to marine research: chief scientists are requested to send reports, including a station list, to the project management office shortly after an expedition. Station labels as published in the cruise report's station list must remain unchanged whenever they are used in data submissions or publications.

The data librarian maintains a dictionary of parameter definitions with units, to be used as the agreed standard for all project data. Parameters are grouped into categories according to their related scientific field. Data submissions are required to use parameters and units as defined in the dictionary. New parameters are defined by the data librarian on request.
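
The dictionary rule above can be illustrated with a minimal sketch. The dictionary entries, the function name `check_columns` and the message wording are all hypothetical and are not taken from the PANGAEA system; the sketch only shows the idea of checking a submission's parameters and units against an agreed list.

```python
# Hypothetical excerpt of a parameter dictionary: parameter name -> agreed unit.
# These entries are illustrative only, not copied from the real dictionary.
PARAMETER_DICTIONARY = {
    "Depth, water": "m",
    "Temperature, water": "deg C",
    "Salinity": "",
}


def check_columns(columns):
    """Return a list of problems for (parameter, unit) pairs in a submission."""
    problems = []
    for name, unit in columns:
        if name not in PARAMETER_DICTIONARY:
            # Unknown parameters would have to be requested from the data librarian.
            problems.append(f"unknown parameter: {name!r}")
        elif PARAMETER_DICTIONARY[name] != unit:
            expected = PARAMETER_DICTIONARY[name]
            problems.append(f"{name!r}: unit {unit!r} differs from agreed unit {expected!r}")
    return problems
```

An empty result means every column uses a parameter and unit from the dictionary; any other result lists what the provider would need to correct or request.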

Data are archived in a relational database, georeferenced in space and time; if a data set is very large or, for format reasons, must remain in a proprietary format, it is archived as a binary object in a file system, with only a metadata description linked to the file. As soon as data become available and are validated, the providers are urged to submit the data in agreement with the import format. Any type of data must always be accompanied by a description (metadata) allowing future users to understand and process the data at any time. The granularity and format of data sets have to be defined in agreement between the principal investigator (PI) and the data librarian. The export format in principle is tab-delimited ASCII, headed by metadata fields according to the ISO 19115, GCMD DIF and Dublin Core standards.
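
As a rough illustration of the export format described above, the following sketch separates a metadata header from a tab-delimited table. It assumes that the header is wrapped in `/* ... */` and that the first table row holds the column names; this policy does not specify that layout, so treat it, the function name `split_export` and the sample content as assumptions.

```python
import csv
import io


def split_export(text):
    """Split an export into (header_lines, rows).

    Assumption: the metadata header is wrapped in /* ... */ and the
    tab-delimited table (column names first) follows it.
    """
    header_lines = []
    body = text
    if text.lstrip().startswith("/*"):
        start = text.index("/*") + 2
        end = text.index("*/", start)
        header_lines = [ln.strip() for ln in text[start:end].splitlines() if ln.strip()]
        body = text[end + 2:]
    # QUOTE_NONE keeps quote characters in the data as ordinary text.
    reader = csv.reader(io.StringIO(body.strip("\n")), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    return header_lines, rows


# Made-up sample; the citation and values are placeholders.
SAMPLE = (
    "/* DATA DESCRIPTION:\n"
    "Citation:\tHypothetical, A (2005): Example data set\n"
    "*/\n"
    "Depth water [m]\tTemp [deg C]\n"
    "10\t4.2\n"
    "20\t3.9\n"
)
header, rows = split_export(SAMPLE)
```

Keeping the header lines separate from the table means the description travels with the values, which is the point of requiring metadata with every data set.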

Quality assurance

Data submitted for archiving have to be documented properly; documentation is archived together with each data set. The scientific quality is always the responsibility of the PI or the authors. Fields for documenting it, such as quality flags for single values, adjustable precision or documentation of methods, are available in the Pangaea data model. Technical quality control, i.e. completeness of metadata, consistency of formats and correctness of download, is the responsibility of the data managers. After import, the PI or authors are requested to proofread data sets on the Internet and submit corrections to the data manager.

Access and Publication

The project data management provides an up-to-date list of publications with links to the related data. Any scientific primary data* related to publications shall be submitted to the data management at the same time as the manuscript is submitted to the editor. In return, authors will receive a persistent identifier (DOI, Digital Object Identifier) for each data set, which can be cited in the publication. Likewise, one or more data sets can be made citable with a reference added to a public library catalog and will receive a DOI. Those data publications may also be added to personal or project publication lists.

Higher level data products* in electronic form can also be archived through Pangaea and will receive a persistent identifier and citation on request. Partner institutes and data providers agree that data archived in Pangaea are made publicly available through appropriate technical setups on the Internet (e.g. portals, search engines, library catalogs, GIS) without further notification. Unpublished data are password-protected by default; password protection for published data is set on request for a moratorium period. Providers may decide to withdraw data from the archive as long as they are not published. Metadata are always freely accessible. According to EU data policy, all data collected during the lifetime of the project are made public two years after the termination of the project; regulations may differ in agreement between coordinator, partners and funding organization. Following recommendations of the EU (Colour of Ocean Data Symposium, Brussels 2002), metadata are archived only in relation to available factual data. Only the metadata may be mirrored to other systems like the Global Change Master Directory (GCMD).
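
The access rules in this section can be read as a small decision procedure. The sketch below is one possible reading, not the system's implementation: the function names, the argument names and the order in which the rules are applied are assumptions, and the two-year period is only the default that project agreements may change. Metadata are not modelled because they are always freely accessible.

```python
from datetime import date


def add_years(d, years):
    """Return the date `years` later, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def data_access(published, today, moratorium_end=None, project_end=None):
    """Return 'public' or 'password-protected' for the factual data.

    Assumed precedence: the two-year rule after project termination first,
    then the default for unpublished data, then a requested moratorium.
    """
    if project_end is not None and today >= add_years(project_end, 2):
        return "public"
    if not published:
        return "password-protected"
    if moratorium_end is not None and today < moratorium_end:
        return "password-protected"
    return "public"
```

For example, under these assumptions a published data set with a moratorium ending on 1 January 2007 would be reported as password-protected on 1 January 2006 and as public once the moratorium has passed.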

(*) Depending on the level of processing, scientific data can be differentiated into raw data, primary (or factual) data and secondary data. Raw data are provided by a measuring system and are unprocessed; scientific primary data result from the processing of raw data and are the basis for scientific interpretations and publications. Primary data have the highest priority for archiving; the related raw data files may be added if appropriate. Secondary data are higher level products resulting from compilations and interpretations of primary data, i.e. maps, profiles, statistics, graphics, models or any material produced for education and outreach. All information describing any of these three data types is metadata.

This text may be used as a draft for project-specific data policies. A policy for using PANGAEA in marine research projects may be found at doi:10.1594/PANGAEA.327791.
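
A DOI written in the `doi:` form used in this policy can be turned into a web address through the public doi.org resolver. The helper below is a minimal sketch; the function name `doi_to_url` is hypothetical.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Build a resolver URL from a DOI given with or without a 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Keep the slash between DOI prefix and suffix; escape anything unusual.
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("doi:10.1594/PANGAEA.327791"))
```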