Intern talk:Project data management/IODP

Rescue for DSDP/ODP/IODP post-cruise data
Report by Hannes Grobe, Evgeny Gurvich & Stefanie Schumacher (2009-04-28)

add:
 * 4 hours/supplement
 * scientific expertise required
 * own contributions
 * deadline for published post-cruise data -> 2013?

1. Motivation and aim
Ocean drilling started in 1968 and has since generated a huge number of datasets, published in many national and international journals. A summary of the engineering topics and the first scientific results of each cruise has been published in the Initial Reports since DSDP. The first scientific publications related to a given leg are also published in the "Initial Reports" of the DSDP programme and the "Scientific Results" of the ODP project. The majority of publications (with related primary data) are produced later, sometimes years after a leg. These "post-cruise" publications are spread across various marine geoscience journals of major publishers (among others Science, Nature, Elsevier, Springer, Wiley) and smaller journals, e.g. those of national societies. Almost none of the related primary data are available in machine-readable form on the Internet, and in very many cases no data are available at all. For scientists working in marine geology it is therefore nearly impossible to get an overview of the availability of research data.

Data from core documentation and shipboard scientific investigations were published through the Initial Reports of the ODP project (Legs 100-210) and IODP (Leg 310 ff) and are available via the ODP JANUS database. JANUS includes technical meta-information about the cores (length, sections, recovery etc.) and core images of the DSDP legs (1-96). Logging data were acquired and archived by the Borehole Research Group (BRG) at Lamont-Doherty Earth Observatory.

In 2007 IODP initiated a showcase project to store earlier post-cruise data, printed in individual publications, in an Open Access repository. The project started by screening previously published DSDP and ODP publications, extracting the data, reformatting them according to international standards and making these 'data supplements' available through a data system. In addition, each data table and each supplement had to be identified for the long term by a persistent identifier.

2. Work flow
2.1 Reference/Data search
The data search started at the journal level, considering all scientific fields relevant to ocean drilling (sedimentology, palaeontology, geochemistry, petrology, geophysics). The Ocean Drilling Citation Database of the GeoRef Information Service, operated by the AGI (American Geological Institute), was used as the central source of ocean-drilling-related publications. The GeoRef search can provide a list of all DSDP/ODP/IODP publications in a given journal. The systematic search for supplementary data started at the publisher level, going through all journals of the publisher. Each publication was accessed on the publisher's web page and searched for data, either in the online version or in the pdf version. Recent publications often have supplementary data available online in the publisher's repository; these files had to be downloaded. Data sets found in the printed and/or the online version of a paper were downloaded, harmonized and archived. All georeferenced data of a publication were considered, even if they are not directly related to a DSDP/ODP/IODP project.

2.2 Extraction of data from publications
Published data tables come in different technical qualities, depending on the publication year and on whether they appear in the printed paper or in a supplement archive. More recent publications often have supplementary data accessible online in Excel, text or pdf format. Tables in the paper itself are always embedded in the pdf. All georeferenced data are converted to Excel sheets. Converting Excel and txt files to the import format is easy; converting pdf files may require some editing.

Pdf files are opened with Adobe Acrobat Professional; pages with tables are isolated and saved as an MS Word document. From the document opened in MS Word, the table can be copied into an MS Excel sheet. If this workflow does not work, the table has to be extracted directly from the pdf file: the table is selected and copied into a plain text editor, blanks are replaced by tab stops, and the result is copied into an Excel sheet. In a few cases the tables are embedded in the pdf as a graphic object (tiff or gif). In this case the data have to be retrodigitized by typing them in by hand.
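
The replacement of blanks by tab stops can also be scripted. The following is a minimal sketch, not the procedure actually used in the project: it assumes that columns in the copied text are separated by runs of at least two blanks, so that single blanks inside species names or units survive. The function name `blanks_to_tabs`, the `min_blanks` parameter and the sample table are hypothetical.

```python
import re


def blanks_to_tabs(text, min_blanks=2):
    """Replace each run of at least `min_blanks` spaces with one tab.

    Assumption: a single blank belongs to a cell (e.g. a species name),
    a longer run of blanks separates two columns.
    """
    pattern = re.compile(r" {%d,}" % min_blanks)
    lines = (pattern.sub("\t", line.strip()) for line in text.splitlines())
    return "\n".join(lines)


# Hypothetical table text as it might look after copying from a pdf.
sample = (
    "Species        10-12 cm   20-22 cm\n"
    "G. bulloides   12         7\n"
)

print(blanks_to_tabs(sample))
```

The tab-separated result can be pasted into an Excel sheet. Tables whose columns are separated by single blanks only would still need manual editing.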

The Excel sheets are quality-controlled, edited and compared with the original document. Line breaks and tab stops in the wrong order may scramble the arrangement of rows and columns, and numbers and names may contain misspellings from the OCR process. The final editing and review can be quite time-consuming. On average, the data of one publication need about 4 hours to be transferred from their original published format to the machine-readable standard form provided by the data archive.
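
As a rough illustration of what the mean of 4 hours per publication implies, the sketch below multiplies it by a number of publications. The 615 used here is the count of publications with data given in the Statistics section; the function name `estimated_hours` is hypothetical, and the result is only a back-of-the-envelope figure, not a measured total.

```python
HOURS_PER_PUBLICATION = 4  # mean value stated in this report


def estimated_hours(n_publications, hours_each=HOURS_PER_PUBLICATION):
    """Rough processing effort in hours for a number of publications."""
    return n_publications * hours_each


hours = estimated_hours(615)  # 615 publications with data (see Statistics)
print(hours, "hours")
```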

2.3 Preparation of data for import
The corrected Excel sheets are then prepared for import. Sample information, i.e. the standard ODP sample designation, has to be added or completed. Metadata are defined in the database. References are completed with the DOI; for older publications without a DOI, the pdf file or page on the publisher's web site is linked. All data tables of a publication are imported, in principle one published table as one data set. If a table contains more than one Site or Hole, one data set is defined per Site/Hole. The data set title is mostly equivalent to the table/appendix number and caption. Multiple data sets (children) of one publication are merged into one parent set, which also includes the abstract of the publication (extracted from the original pdf file). The data set DOI or, in case of many data sets, the parent DOI becomes the official identifier of the supplement. A final check against the original publication is always part of the quality control and internal review process.
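
Splitting a table into one data set per Site/Hole can be sketched as follows. This is an illustration under assumptions, not the import routine of the data archive: the regular expression assumes sample labels of the form Leg-SiteHole-CoreType-Section, top-bottom (interval in cm), and the function name `split_by_hole` as well as the sample rows are hypothetical.

```python
import re
from collections import defaultdict

# Assumed label layout: Leg-SiteHole-CoreType-Section, top-bottom
LABEL = re.compile(
    r"^(?P<leg>\d+)-(?P<site>\d+)(?P<hole>[A-Z]?)-"
    r"(?P<core>\d+)(?P<type>[A-Z])-(?P<section>\d+|CC),\s*"
    r"(?P<top>\d+(?:\.\d+)?)-(?P<bottom>\d+(?:\.\d+)?)$"
)


def split_by_hole(rows):
    """Group (label, value) rows into one list per Site/Hole."""
    datasets = defaultdict(list)
    for label, value in rows:
        match = LABEL.match(label.strip())
        if match is None:
            # Labels that do not fit the assumed layout need manual review.
            raise ValueError(f"unrecognised sample label: {label!r}")
        datasets[match["site"] + match["hole"]].append((label, value))
    return dict(datasets)


# Hypothetical rows of one published table covering two holes.
rows = [
    ("108-658A-1H-1, 10-12", 1.5),
    ("108-658A-1H-2, 10-12", 1.7),
    ("108-658B-1H-1, 10-12", 1.4),
]
datasets = split_by_hole(rows)
for hole, members in sorted(datasets.items()):
    print(hole, len(members))
```

Raising an error on unrecognised labels, rather than skipping them, keeps incomplete sample designations visible so that they can be completed by hand.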

3. Examples
1. Parent set with several child data sets: http://doi.pangaea.de/10.1594/PANGAEA.678472 This publication contains three tables in the pdf file. Table 3 is split by the five Sites. All tables needed a time-consuming review after extraction from the pdf because of the species names. The tables were extracted via an MS Word document, therefore columns and rows were in the proper order.

2. Parent set with three child data sets: http://doi.pangaea.de/10.1594/PANGAEA.672082 Here we have one table in the pdf file (Table 1) and two Excel sheets as a supplement (Appendix A). Table 1 was extracted via copy-paste, and only little editing was needed. The Excel sheets were also in good shape for the Pangaea import.

3. A single data set referring to a publication: http://doi.pangaea.de/10.1594/PANGAEA.712516 This publication has no data tables in the pdf file, but a supplement that can be downloaded from the publisher's web page. The supplement pdf was converted and imported and has the same status as a parent set, with all information included.

4. A single data set referring to a publication: http://doi.pangaea.de/10.1594/PANGAEA.706057 Table II in the pdf file is in very poor condition: the table is inserted as a graphic and was scanned at low resolution. We used the table as a hard copy from the printed journal and had an Excel sheet created by a data typist.

5. Parent set with two child data sets: http://doi.pangaea.de/10.1594/PANGAEA.710844 The authors also include previously published data in the tables (EPSL). These data were imported together with all data of the primary publications, for which parent sets were also created (Init. Rep.): http://doi.pangaea.de/10.1594/PANGAEA.710841 and http://doi.pangaea.de/10.1594/PANGAEA.710824 The EPSL child data sets now carry the relevant DOI information of the Init. Rep. child data sets (example: For Sr and Cl data see Gieskes (1974) dataset: doi:10.1594/PANGAEA.710820).

6. In a few cases data sets were published in the ODP/DSDP Reports AND in a journal. First priority is given to the journal and the data report is listed as additional reference: http://doi.pangaea.de/10.1594/PANGAEA.706226

7. In cooperation with Elsevier, Supplementary Data available in Pangaea are now also visible on the splash page of a publication in ScienceDirect. (The functionality is provided through a test installation with a special login only: Oostende/Oostende): http://dx.doi.org/10.1016/S0031-0182(01)00497-7
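
The examples above cite data sets both in the short form (doi:10.1594/PANGAEA.710820) and as links starting with http://doi.pangaea.de/. A minimal sketch of converting between the two notations is given below; the function names `doi_to_url` and `url_to_doi` are hypothetical, and the resolver prefix is simply the one used in the links of this report.

```python
RESOLVER = "http://doi.pangaea.de/"  # prefix used in the links of this report


def doi_to_url(doi):
    """Turn 'doi:10.1594/PANGAEA.710820' into a resolvable link."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return RESOLVER + doi


def url_to_doi(url):
    """Turn a resolver link back into the short 'doi:' notation."""
    if not url.startswith(RESOLVER):
        raise ValueError(f"unexpected resolver in: {url!r}")
    return "doi:" + url[len(RESOLVER):]


print(doi_to_url("doi:10.1594/PANGAEA.710820"))
print(url_to_doi("http://doi.pangaea.de/10.1594/PANGAEA.710841"))
```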

4. Statistics
Table 1. Overview (April 2009) of processed journals with 1251 DSDP/ODP/IODP publications. In 615 publications, 2766 data sets were found in tables/appendices/supplements and made available in machine-readable form through Pangaea.
 * DSDP 329 supplements with 1042 children in 250 parents and 79 single data set supplements
 * ODP 446 supplements with 1503 children in 347 parents and 99 single data set supplements
 * IODP 13 supplements with 60 children in 12 parents and 1 single data set supplement
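
The figures in this list can be cross-checked with a few lines of code. The sketch below only tests one identity per programme, namely that parents plus single data set supplements add up to the number of supplements, and derives the mean number of children per parent. The class and field names are hypothetical; the numbers are those listed above.

```python
from dataclasses import dataclass


@dataclass
class Programme:
    name: str
    supplements: int
    children: int
    parents: int
    singles: int

    def is_consistent(self):
        """Every supplement is either a parent set or a single data set."""
        return self.parents + self.singles == self.supplements

    def children_per_parent(self):
        return self.children / self.parents


stats = [
    Programme("DSDP", 329, 1042, 250, 79),
    Programme("ODP", 446, 1503, 347, 99),
    Programme("IODP", 13, 60, 12, 1),
]

for p in stats:
    print(p.name, p.is_consistent(), round(p.children_per_parent(), 1))
```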

Links

 * ODP home
 * JANUS database
 * Borehole Research Group
 * Logging database
 * Ocean drilling citation database of GeoRef