Intern:Data set

A data set is a collection of data (often from one event) in a scientific context organized in one matrix. Data in Pangaea are organized in predefined data sets which are quite similar to the original files uploaded and exported from the archive.

The granularity of a data set depends on the type of data and the number of data points, and is primarily the decision of the data author. A data set may contain one to many Intern:data series. Two to many data sets may be grouped into one Intern:parent set. Intern:Access rights can be defined for a complete data set only. Each data set consists of the data accompanied by metadata according to ISO standard fields (ISO 19115). A data set appears on the Internet with a metaheader which contains the information as described below.

In principle a Pangaea data set can have an unlimited number of columns and lines (for comparison, lines x columns in Excel 2003: 65,536 x 256; in Excel 2008: >1 million x 16,384). Examples:
 * 507 columns
 * 2,000,000+ lines
 * 22,600,000+ lines (in ASCII: 551 MB; in ASE + index: 2.2 GB; export from IQ + DOI: 1.44 GB) (Fig. 1)
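
To put the spreadsheet limits quoted above into perspective, the following minimal sketch checks whether a matrix of a given size would fit on a single Excel worksheet. The function and dictionary names are illustrative only and are not part of Pangaea; 1,048,576 is the exact row limit behind the ">1 million" figure.

```python
# Worksheet limits as (lines, columns), following the figures quoted in the text.
EXCEL_LIMITS = {
    "excel2003": (65_536, 256),
    "excel2008": (1_048_576, 16_384),
}


def fits_in_excel(lines: int, columns: int, version: str = "excel2008") -> bool:
    """Return True if a lines x columns matrix fits on one worksheet."""
    max_lines, max_columns = EXCEL_LIMITS[version]
    return lines <= max_lines and columns <= max_columns


print(fits_in_excel(2_000_000, 10))           # too many lines for any version
print(fits_in_excel(1_000, 507))              # fits in the newer format
print(fits_in_excel(1_000, 507, "excel2003")) # 507 columns exceed the 2003 limit
```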

Opening a data set in 4D will show the frame title and five tabs named Basics, Config, References, Details and Specials with metadata fields as described below:

Frame title
The frame title of the data set window shows its ID and the responsible curator (Fig. 2). Below are buttons to  and  the data set or to open it via its URL in a browser window . The button  is offered when the data set is part of a Intern:parent set.

Basics tab (Fig. 2)

 * Author(s) of the data set; one to many authors may be added via a multiple choice list related to the table Intern:Staff
   * It is possible to add working group, team or institutional authors (collective authors), but their use should be kept to a minimum (there are reasonable exceptions, for instance for IODP)
   * No data set can be archived with collective authors only; there must always be a real, responsible contact person
   * Once the DOI is registered, no further authors can be added to the data set
 * Year, automatically set but can be changed
 * Title of the data set as free text; equivalent to the title of a publication. If the data set is a table from a publication, the title should be the table caption. Starting the title with the table number, e.g. doi:10.1594/PANGAEA.693592, is discouraged.
 * Source may contain the institution of the data origin and is relational to the table Intern:Institution; use only if data are not related to a reference.
 * Status of the data set; drop-down menu with choices: questionable, in review, validated, published
 * Registry: gives information about the registration process:
   * not to be registered if status is not published
   * registration is in the lead time for four weeks after the data set has been set to status published and finally edited
   * registered as the final status
 * Enforce DOI registration can be used for immediate DOI registration (only possible for data sets with status published).
 * Protection of the data set; drop-down menu with choices: unrestricted, signup required, access rights needed
   * unrestricted: open access data sets (default)
   * signup required: e.g. for BSRN data sets
   * access rights needed: data sets under moratorium. Login required may be checked for data sets which have status published but should still be protected (on request of the PI)
 * Access rights button to set individual access to data sets with access rights needed
 * Get temporary access key button for generating a link to access data sets with access rights needed. The lifetime of the key can be set to a maximum of 180 days.
 * Delete access keys button for canceling all access keys for the data set.
 * End of moratorium specifies the date and time when the moratorium on data sets with access rights needed is lifted. The end of moratorium is checked hourly (by middleware). When the moratorium has expired, Protection is set from "access rights needed" to "unrestricted" and, at the same time, Status is automatically set to "published" if it is not "published" yet.
 * License: drop-down menu with different Creative Commons (CC) licenses, CC-BY by default; see Creative Commons Attribution-Noncommercial.
 * Keywords is relational to the Terms from Ontology Keywords, PANGAEA and can be set individually; different types of Intern:keywords are available
 * Event(s) as used in the data set
 * Created date/time and curator of import; Updated date/time and curator of last change
 * Issue specifies the issue tracker key (e.g., PDI-23662), which can be opened in a browser window with the Open button.
 * Transfer button is used to transfer information from the related ticket when the Issue tracker key is specified; see Transfer of information from JIRA ticket
 * Project(s) allows adding one to many projects, as provided by the Project table, via a multiple choice list; when no project is relevant, the default is "not_given".
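
The moratorium rule and the access key lifetime described in the list above can be sketched as follows. This is only an illustration of the documented behaviour, not the middleware's actual code: the class, the function names and the choice to cap (rather than reject) a lifetime above 180 days are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_KEY_LIFETIME_DAYS = 180  # maximum lifetime of a temporary access key


@dataclass
class DataSet:  # hypothetical stand-in for a data set record
    status: str       # questionable, in review, validated, published
    protection: str   # unrestricted, signup required, access rights needed
    end_of_moratorium: Optional[datetime] = None


def lift_expired_moratorium(ds: DataSet, now: datetime) -> bool:
    """Apply the hourly check to one data set; return True if it was changed."""
    if ds.protection != "access rights needed":
        return False
    if ds.end_of_moratorium is None or now < ds.end_of_moratorium:
        return False
    ds.protection = "unrestricted"
    ds.status = "published"  # no effect if the data set is already published
    return True


def key_expiry(created: datetime, lifetime_days: int) -> datetime:
    """Expiry of a temporary access key; longer requests are capped (assumption)."""
    return created + timedelta(days=min(lifetime_days, MAX_KEY_LIFETIME_DAYS))


ds = DataSet("in review", "access rights needed", datetime(2024, 1, 1, 12, 0))
lift_expired_moratorium(ds, datetime(2024, 1, 1, 13, 0))
print(ds.protection, ds.status)
```

A data set whose moratorium has not yet ended, or whose Protection is not "access rights needed", is left untouched by the check.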

Selecting events
During data import:
 * Multiple events can be selected (or deselected) if no event data series is part of the import matrix and no metadata are configured. To do so, any selection (in another open Event table) can be dragged and dropped into the search field of the multiple choice dialog.
 * A single event can be selected if no event data series is part of the import matrix and metadata are configured.
 * If an event data series is part of the import matrix, single events can be replaced by clicking on the event to be replaced in the event list of the import dialog.

Archived parent data sets:
 * Multiple events can be selected (or deselected); there are no dependencies on child events. To do so, any selection (in another open Event table) can be dragged and dropped into the search field of the multiple choice dialog.
 * Events can be replicated from the child data sets using a button on the References tab.

Archived simple data sets:
 * Multiple events can be selected (or deselected) if no event data series is part of the data set and no metadata are configured. To do so, any selection (in another open Event table) can be dragged and dropped into the search field of the multiple choice dialog.
 * A single event can be selected if no event data series is part of the data set and metadata are configured.
 * If an event data series is part of the data set, nothing can be changed.
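
The selection rules for the three cases above can be condensed into one decision function. The sketch below is a summary aid only; the function name, the context labels and the return values are hypothetical and do not correspond to anything in 4D.

```python
def event_selection_mode(context: str,
                         has_event_series: bool,
                         metadata_configured: bool) -> str:
    """Return 'multiple', 'single', 'replace' or 'none'.

    context is 'import' (during data import), 'parent' (archived parent
    data set) or 'simple' (archived simple data set).
    """
    if context == "parent":
        # Parent sets have no dependencies on child events.
        return "multiple"
    if context not in ("import", "simple"):
        raise ValueError(f"unknown context: {context!r}")
    if has_event_series:
        # During import single events can still be replaced in the event
        # list; in an archived simple data set nothing can be changed.
        return "replace" if context == "import" else "none"
    return "single" if metadata_configured else "multiple"


print(event_selection_mode("import", False, False))
print(event_selection_mode("simple", True, False))
```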

Transfer of information from JIRA ticket
Metadata can be transferred from a JIRA ticket when the Issue field contains a valid PDI ticket number. The correctness of the transfer must always be checked.
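
Before a transfer is attempted, the Issue field could be given a purely syntactic pre-check. The sketch below assumes that a key has the form "PDI-" followed by a positive number, as in the example PDI-23662; it cannot tell whether the ticket actually exists.

```python
import re

# Assumed key format: project prefix "PDI", a hyphen and a positive number.
PDI_KEY = re.compile(r"PDI-[1-9][0-9]*")


def is_pdi_key(issue: str) -> bool:
    """Return True if the text looks like a PDI issue tracker key."""
    return PDI_KEY.fullmatch(issue.strip()) is not None


print(is_pdi_key("PDI-23662"))
print(is_pdi_key("PDI-"))
```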

The following fields from the ticket can be matched:

Config tab (Fig. 3)

 * Data series window shows the parameters used in the data set. The list contains data series label, parameter name, unit and original format. The button Add/Remove allows adding data series to the data set or removing them from it. Data series must be deleted from the configuration before they can be removed.
 * Geocodes window lists the geocodes used. A double-click adds a geocode to the data set Configuration.
 * Related metainformation window contains fields from the event table which can be added to the Configuration.
 * Configuration window lists geocodes (highlighted in blue), related metainformation (highlighted in yellow) and all parameters used, in the order in which they appear in the data set. The configuration list shows the data set label, parameter name, unit, format, parameter ID, PI, Method, Comment and no. items. Mark the Edit in list check-box to edit these items.
 * Format shows the number of digits before and after the decimal point of a numeric parameter when the parameter is selected in the Configuration window by a mouse click. Different formats can be selected from the pop-up menu or changed by hand. If the geocode Date/Time is selected, different types of ISO formats can be selected, depending on the required precision. The exponential format is also supported.
 * PI and Method are relational fields and can be set from the relational lists via the Choices button
 * Parameter can be changed in the same way via the Choices button. Be careful: never replace a text parameter with a numerical parameter or vice versa! Unit and Param.ID are linked with the parameter and cannot be changed.
 * Comment is a free text field
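
The effect of the Format setting can be illustrated with ordinary Python formatting. This sketch does not reproduce the format syntax used in 4D; the function names and the precision labels are assumptions chosen for the example.

```python
from datetime import datetime

# ISO 8601 layouts for different precisions (labels are illustrative).
ISO_FORMATS = {
    "day": "%Y-%m-%d",
    "minute": "%Y-%m-%dT%H:%M",
    "second": "%Y-%m-%dT%H:%M:%S",
}


def format_number(value: float, decimals: int, exponential: bool = False) -> str:
    """Render a number with a fixed count of digits after the decimal point."""
    if decimals < 0:
        raise ValueError("decimals must not be negative")
    kind = "E" if exponential else "f"
    return f"{value:.{decimals}{kind}}"


def format_datetime(value: datetime, precision: str) -> str:
    """Render a Date/Time value in the ISO layout for the given precision."""
    return value.strftime(ISO_FORMATS[precision])


print(format_number(3.14159, 2))                   # 3.14
print(format_number(12345.678, 3, True))           # 1.235E+04
print(format_datetime(datetime(2007, 3, 5, 14, 30, 15), "minute"))  # 2007-03-05T14:30
```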

References tab (Fig. 4a)



 * References (literature & data) is related to the reference list and offers all kinds of references from literature, web pages and external data repositories.
   * Add/Remove opens the relational reference list.
   * Choices is inoperable
   * The reference list contains Author(s), Title, Year and ID of the related reference. Comment is a free text field.
   * Relation opens a pop-up menu with the options:
     * Supplement to
     * Replaced by
     * Suppl. to (dependent): obsolete - do not use
     * Related to (default)
     * Other version
     * New version
     * Original version
     * Source data set
     * Further details
 * References (Pangaea data sets) is related to the Pangaea data sets list. Data sets can be referenced in the same way as the references (literature & data).
   * Add/Remove opens the data set list.
   * Choices is inoperable
   * Relation opens the same pop-up menu as described under References (literature & data)

Changing relations (many at once)
The relation of a reference to the data set can be changed individually in the Relation column using the drop-down menu. To change many relations at once, select all relevant references using the column ID (yes, the ID column, one to the right of Relation; see Fig. 4b). Once the references are selected, the column Relation is still highlighted. Press Choices to select the correct relation for all selected references.

Details tab (Fig. 5)

 * URL as defined by the system for event-related data sets or as defined by the user for static links to files. For static links the check-box static must be selected.
 * Export filename contains the name the data set receives when it is downloaded as a text file to the user's PC. The extension *.tab is added automatically.
   * File names usually start with the event, followed by a specification of the content (e.g. M24_3-5_sedimentology).
   * File names of supplements start with the first author's name followed by the year, equivalent to the citation of references in a publication text; e.g. Smith_1998, Smith-Sandwell_1987 or Smith-etal_2007.
 * Comment to add individual comments as plain text; field size up to 32 kbyte. URIs may be included and will be resolvable in the metaheader.
 * Abstract to add the (paper's) abstract. It is mandatory for data sets that are a supplement to a publication.
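
The file name conventions above can be sketched as two small helpers. The rule for one, two, and three or more authors is inferred from the examples Smith_1998, Smith-Sandwell_1987 and Smith-etal_2007 and is therefore an assumption; the function names are hypothetical.

```python
from typing import List


def supplement_filename(author_surnames: List[str], year: int) -> str:
    """Build a supplement file name in the style of a literature citation."""
    if not author_surnames:
        raise ValueError("at least one author is required")
    if len(author_surnames) <= 2:
        stem = "-".join(author_surnames)      # Smith or Smith-Sandwell
    else:
        stem = f"{author_surnames[0]}-etal"   # three or more authors
    return f"{stem}_{year}"


def export_filename(event_label: str, content: str) -> str:
    """Event-based name with the automatically added .tab extension."""
    return f"{event_label}_{content}.tab"


print(supplement_filename(["Smith"], 1998))
print(supplement_filename(["Smith", "Sandwell"], 1987))
print(supplement_filename(["Smith", "Jones", "Miller"], 2007))
print(export_filename("M24_3-5", "sedimentology"))
```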

Coverage tab (Fig. 6)

 * Spatial coverage: fields showing min/max of the three spatial dimensions of the data set. Set automatically; data sets without the three dimensions receive by default Latitude min./max. -90/90, Longitude min./max. -180/180 and 3. dimension min./max. -2000/20000. These values can be deleted.
 * Size of the data set: set automatically; gives Rows number, Columns number and Data points.
 * Temporal coverage: min/max of Date/Time or Age [ka]
 * Intern:Topologic type is used to define the extension of a data set
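
How the automatically set coverage and size fields could be derived is sketched below. The defaults are the ones listed above; applying the default per dimension and counting only non-empty cells as data points are assumptions, as are all names in the code.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Range = Tuple[float, float]

# Default ranges for data sets that lack the respective dimension.
DEFAULTS = {
    "latitude": (-90.0, 90.0),
    "longitude": (-180.0, 180.0),
    "third_dimension": (-2000.0, 20000.0),
}


def min_max(values: Iterable[Optional[float]], default: Range) -> Range:
    """Min/max of the non-empty values, or the default if there are none."""
    present = [v for v in values if v is not None]
    return (min(present), max(present)) if present else default


def size(matrix: List[Sequence[Optional[float]]]) -> Tuple[int, int, int]:
    """Return (rows, columns, data points) of a data matrix."""
    rows = len(matrix)
    columns = max((len(row) for row in matrix), default=0)
    data_points = sum(1 for row in matrix for cell in row if cell is not None)
    return rows, columns, data_points


print(min_max([54.1, 53.5, None], DEFAULTS["latitude"]))
print(min_max([], DEFAULTS["third_dimension"]))
print(size([[1.0, None], [2.0, 3.0]]))
```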

Locked data set or other entries
When a window is opened to edit the details of a record, the record is locked and is not available to other users or to background processes. Because updated records are processed sequentially in the background queue, a locked record can prevent updates from showing up on the web. Close windows as soon as editing is done!

Deleting/Versioning PANGAEA data
A permanent DOI is attached to an imported data set 30 days after its last change, provided its status is published. Until then, the data set can be updated or deleted without problems. Once the DOI is registered, a data set can no longer be deleted.
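
The timing rule in this paragraph can be written down as a short sketch. It only restates the rule as described here; the function names are hypothetical and the actual registration is handled by the system, not by code like this.

```python
from datetime import datetime, timedelta

DOI_LEAD_TIME = timedelta(days=30)  # waiting period stated in the paragraph above


def doi_due(status: str, last_change: datetime, now: datetime) -> bool:
    """True once a published data set has been unchanged for 30 days."""
    return status == "published" and now - last_change >= DOI_LEAD_TIME


def can_delete(doi_registered: bool) -> bool:
    """A data set can be deleted only while no DOI is registered."""
    return not doi_registered


last_change = datetime(2024, 3, 1)
print(doi_due("published", last_change, datetime(2024, 3, 20)))
print(doi_due("published", last_change, datetime(2024, 4, 2)))
print(doi_due("validated", last_change, datetime(2024, 4, 2)))
```

Any further change to the data set moves `last_change` forward and so restarts the 30-day period in this sketch.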

Please have a look at the section on versioning:

Import of data
See: