Intern:NKGCF

Text draft NKGCF brochure Global Change Research in Germany 2011 (chapter Data Centre and Data Availability)

Motivation
Observations, measurements and models are the lifeblood of Global Change research; the resulting data sets are the basis for any scientific publication. But only if the underlying primary data are accessible can the findings be verified by reviewers and readers. Data availability is one point; the form of the data is the other important prerequisite for proper future use of the content. Questions focusing on changes in the earth system call for a global perspective and consequently for global data sets. In practice, most findings are fragmented across the individual research of scientists and projects. An immense added value could be given to these distributed data if they were not only available but also provided in harmonized standard formats that allow exchange and compilation.

Besides the curiosity to know how the earth works, a driving force in science that should not be neglected is the credit a scientist gains for his/her work. Consequently, data archiving needs to become an integral part of the already established workflow for scientific publications - data must be fully citable. The infrastructure for the search and distribution of publications is also established, i.e. through the global system of libraries, including legally mandated long-term archiving. The Internet, with the general availability of content in digital form, has added new search capabilities through publisher catalogs, science-specific portals and search engines. Once data citations become available through this growing infrastructure, the integration of data into the publication workflow and infrastructure will be complete. As an outcome of a nationally funded project of the German Research Foundation (DFG), a reliable system for sustainable archiving and citation of data was established and the registration agency DataCite was founded.

DataCite (box)
DataCite is an international association that aims to support researchers by enabling them to locate, identify, and cite research datasets with confidence. DataCite was launched on December 1st 2009 in London. As of December 2010, DataCite has 15 members from 10 countries, including the initiator, the German National Library of Science and Technology (TIB, Hannover). DataCite takes a global leadership role in promoting the use of persistent identifiers for datasets as an integral part of the data citation. Through its members, it establishes and promotes common methods, best practices, and guidance. The member organisations work independently with data centres and other holders of research data sets in their own domains. Just as science operates globally while individual researchers work and publish locally, DataCite is global, with local partners offering services and advice where required. Further organisations are encouraged to join the association.
 * http://www.datacite.org

A globally distributed data system
Libraries have existed for 5000 years, data centers for 50 years. Since the creation of the World Data Center system during the International Geophysical Year (1957/58), data archiving has suffered from constantly changing technology, in particular storage media and formats. Many valuable data have been lost forever on degrading tapes or on broken discs without backup; in many disciplines data archiving was neglected anyway. In recent years this infancy of electronic data handling has been superseded by technology that avoids data loss through migration and provides capacities in the petabyte range. The evolution of computer technology will continue, but science now has the opportunity to assure the long-term availability of results from the perspective of a librarian. The Internet provides the network to interlink archives distributed around the globe and thus even allows the provision of harmonized data views through science-specific portals.

Capacity versus complexity


The amount of data and the growing storage capacity it requires is the point most often stressed in data discussions; e.g. a satellite like CryoSat, recently launched to observe the behaviour of the polar ice, will produce 50 GB of data per day. The solution to this problem is linear: a large amount of data requires a large amount of storage space. Another point, rarely mentioned in this context, is the variety of measurements in all parts of the geosphere (atmo-, hydro-, cryo-, bio-, lithosphere) and thus the complexity of the resulting data. Even leaving out anything related to biodiversity (e.g. species distribution), some tens of thousands of variables remain, as measured by all disciplines of Global Change related sciences. Smart data models are required to handle this fine-grained and highly diverse mass of data.
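The linear relation between data volume and storage can be illustrated with the CryoSat figure quoted above. This is only a back-of-the-envelope sketch: the function name is hypothetical, a constant daily rate is assumed, and decimal units (1 TB = 1000 GB) are used.

```python
GB_PER_DAY = 50  # CryoSat data rate as quoted in the text


def storage_gb(days: int, gb_per_day: float = GB_PER_DAY) -> float:
    """Storage needed after `days` of operation, assuming a constant rate."""
    return days * gb_per_day


# One year of operation, converted to decimal terabytes.
per_year_tb = storage_gb(365) / 1000
print(per_year_tb)  # 18.25
```

Doubling the observation period doubles the required capacity; the complexity discussed in the same paragraph does not scale in such a simple way.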

Through Internet technology it is now possible to store anything (any digital object) anywhere at any time - which is far from being a sustainable infrastructure in the sense of a librarian. Commercial publishers recognized the error 404 (file not found) problem only a few years after the opening of the Internet, and the system of persistent identifiers was invented.
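How a persistent identifier decouples a citation from the storage location can be sketched as follows. The citation carries only the DOI name; a resolver maps it to wherever the object currently lives. The function name and the example DOI are hypothetical and serve only as illustration; the sketch assumes the public resolver at doi.org and performs no network access.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver (assumed here)


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a DOI name (hypothetical helper)."""
    doi = doi.strip()
    # Strip common prefixes so that only the bare DOI name remains.
    for prefix in ("doi:", "https://doi.org/", "http://dx.doi.org/"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI name starts with "10." and separates prefix and suffix by "/".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode special characters but keep the "/" separator.
    return DOI_RESOLVER + quote(doi, safe="/")


# Made-up DOI, for illustration only.
print(doi_to_url("doi:10.1234/example.dataset.1"))
```

If the archive moves the data set, only the target registered for the DOI changes; the citation and the URL built here stay the same.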

The content of German data centers also contributes to international portals, e.g. for carbon (www.carboocean.org), biodiversity (GBIF, OBIS), IODP (sedis), or to the major portal of the WDS, which is still in its initial phase.

WDCC, WDC-RSAT and Pangaea were approved by ICSU as members of the World Data Center system.

Availability
The availability of data from Global Change research needs urgent improvement. To support the flow of data into archives, the most important factors are (1) reliable access and (2) assured credit for the data producer/provider. A national DFG-funded project has contributed substantially to both points. The project focused on making data sets citable and established reliable access through persistent identification. In 2010 DataCite, as the international registration agency for data DOIs, registered 100,000 data sets as an initial contribution and as a test of workflow, technical implementation and distribution mechanisms. Several data centers contributed their content under the leadership of the German National Library of Science and Technology (TIB).

In Germany, data DOIs are distributed through the TIB (German National Library of Science and Technology) as a partner of DataCite, which is the international registration agency and a member of the IDF (International DOI Foundation). Data centers have a cooperation contract with the TIB, which also assures the distribution of the data citations through library catalogs.

Data citations are distributed through library catalogs, search engines, and generic as well as science-specific portals.

Publication workflow
Centers in Germany archiving data related to Global Change provide their content in citable entities through an established workflow. An author who wants to archive a supplement related to a publication or to make results available to the scientific community, e.g. within a research project, is integrated into an editorial workflow similar to that of a scientific journal. Documented through an issue tracking system, the data publication process consists of the following steps: (1) submission (author), (2) check for consistency and completeness of metadata, archiving (editor), (3) proof-read (author), (4) corrections (editor), (5) peer-review for supplement to a publication (if applicable), (6) publication as citable entity with DOI as persistent identifier. Data supplements can be linked automatically to the splash page of the publication at the journal/publisher through a web service (e.g. ).
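The six steps listed above can be sketched as a simple sequence in which peer review is skipped when the data are not a supplement to a publication. This is an illustrative model only; the class and function names are hypothetical and do not describe the issue tracking system actually used by the data centers.

```python
from enum import Enum
from typing import List, Optional


class Step(Enum):
    SUBMISSION = "submission (author)"
    METADATA_CHECK = "metadata check and archiving (editor)"
    PROOF_READ = "proof-read (author)"
    CORRECTIONS = "corrections (editor)"
    PEER_REVIEW = "peer review (supplements only)"
    PUBLICATION = "publication as citable entity with DOI"


ORDER = list(Step)  # Enum members iterate in definition order


def next_step(current: Step, is_supplement: bool) -> Optional[Step]:
    """Return the step following `current`, or None after publication."""
    i = ORDER.index(current)
    if i + 1 == len(ORDER):
        return None
    following = ORDER[i + 1]
    # Peer review applies only to supplements of a publication.
    if following is Step.PEER_REVIEW and not is_supplement:
        return Step.PUBLICATION
    return following


def workflow(is_supplement: bool) -> List[Step]:
    """Full path from submission to publication."""
    steps = [Step.SUBMISSION]
    while True:
        following = next_step(steps[-1], is_supplement)
        if following is None:
            return steps
        steps.append(following)


for step in workflow(is_supplement=False):
    print(step.value)
```

For a stand-alone data publication the path has five steps; for a supplement to a journal article the peer-review step is included and the path has six.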

ICSU World Data System (box)
As a consequence of the International Geophysical Year of 1957-1958 (IGY), the International Council for Science (ICSU) established the World Data Center system to serve data from the IGY scientific disciplines. Until the end of the WDC system in 2008, more than 50 centers with a global distribution were established, covering most disciplines of Global Change research. With the advent of the Internet, many WDCs made their data accessible online. Since 2001 Germany has contributed three centers: World Data Center for Marine Environmental Sciences (WDC-MARE), World Data Center for Climate (WDCC), and World Data Center for Remote Sensing of the Atmosphere (WDC-RSAT).

In 2009 a new World Data System (WDS) was created through a decision of the 29th General Assembly of ICSU; WDS will replace the WDC system in 2011. The WDS concept aims at a transition from existing stand-alone WDCs and individual services to a common, globally interoperable distributed data system that incorporates emerging technologies and new scientific data activities. A prototype data portal is being considered as a proof of concept for an element of the new system. WDS will enjoy a broader disciplinary and geographic base than previous ICSU bodies and will strive to become a worldwide 'community of excellence' for scientific data. Any organization producing or holding data is encouraged to join the new WDS (http://icsu-wds.org).

WDCC (box)
In 2003 the present Data Management department of Deutsches Klimarechenzentrum (DKRZ, Hamburg) was approved as World Data Centre for Climate (WDCC) by ICSU. WDCC offers data management consulting for climate models over the whole lifetime of the data. Several catalogues allow a variety of metadata standards to be handled - an indispensable precondition for distributed and federated archives. Detailed quality management is part of the archiving workflow. In 2010 the amount of model data accessible online grew to more than 400 TB. As a Data Collection and Production Centre, the WDCC is also part of the World Meteorological Organization's information system. As part of the IPCC Assessment Reports, WDCC is one of three data nodes for climate model data collection, in cooperation with the Program for Climate Model Diagnosis and Intercomparison (PCMDI, USA) and the British Atmospheric Data Centre.
 * http://www.mad.zmaw.de/wdc-for-climate/

GFZ (box - needs a focus on data center)
The Helmholtz Centre Potsdam German Research Centre for Geosciences (GFZ, Potsdam) investigates "System Earth" at locations all over the world, with all geological, physical, chemical and biological processes occurring at its surface and in its interior. The resulting data are unique in their scientific profile as they encompass the entire planet, from global field models down to individual samples from scientific drilling operations. A large proportion of the data held at GFZ are accessible through web portals, and many datasets are identified, and thus citable, by Digital Object Identifiers (DOI). Data curation services are offered to geoscience research projects as a joint service of the GFZ Centre for Geoinformation Technology (CeGIT) and of the GFZ Library Albert Einstein through a common data portal, project portals and virtual research environments. This service also encompasses consulting on data management, data curation, publication and long-term preservation of digital research data.

WDC-RSAT (box)
?contribution still missing?

PANGAEA (box)
PANGAEA® - Publisher for Data from the Earth System is a unique publication system for supplementary data related to journals and for data publications. Data are published using a specific editorial system; the content is stored in a relational database and distributed via web services. The operating institutes AWI and MARUM also provide PANGAEA as a data archive and library infrastructure to the international scientific community. Data can be georeferenced in time and space and are accompanied by a description (metainformation); part of the latter is a bibliographic citation and a Digital Object Identifier (DOI) as persistent identification of the publication. Data are available in Open Access under Creative Commons licenses. Use of the system is open to any institute, project or scientist. Data from the oceans are also distributed via the World Data Center for Marine Environmental Sciences (WDC-MARE).

A journal for data publications


One reason for the insufficient availability of data is the missing credit for the data provider. As a contribution to solving this problem, scientists from Germany and the UK initiated the journal Earth System Science Data (ESSD) as the first journal solely aimed at the publication of data. ESSD has been online since 2009, published by the Open Access publisher Copernicus. ESSD improves the availability of data by integrating it into the established scientific publication process. Papers describing data, methods and quality are peer-reviewed and will be listed in the Science Citation Index; each paper must provide a persistent identifier (e.g. DOI) as a pointer to the full data set. The first publication made available an eight-year time series of ozone profiles from the Antarctic station of the former GDR (German Democratic Republic).
 * http://www.earth-system-science-data.net

Links

 * examples: wdc-paleo, pangaea, wdcc
 * http://visibleearth.nasa.gov/
 * http://earthobservatory.nasa.gov
 * http://science.gsfc.nasa.gov
 * WMO
 * GEOSS
 * Global Change Research in Germany 2008