PANGAEA provides a considerable amount of biodiversity data. Besides environmental data sets, a large number of taxon observation data sets are available, that is, georeferenced observations, abundance counts, etc. of living or dead organisms. Compared to traditional taxonomic databases such as ITIS, PANGAEA treats this information in a slightly different way. Those biodiversity databases concentrate on taxonomic issues such as taxon names, taxon author information, synonymy, and classification, and regard this information as the ‘data’.
In contrast, PANGAEA subsumes this information within its metadata and regards taxonomic information as ‘metadata’ for an underlying set of information, e.g. abundance counts. This diverging definition of data and metadata arises mainly because a considerable number of biodiversity databases originated from collection databases, in which the tuple of taxon (specimen) and coordinates was considered a ‘record’.
Taxonomic Information in PANGAEA
Due to PANGAEA’s generic design, the treatment of taxonomic information follows a pragmatic approach. For species occurrence data, taxon names are not stored in a specialised taxonomic database structure but in the usual ‘Parameter’ table, which documents the type of measurements or observations within a distinct dataset. In this table, taxonomic information is stored together with other parameters, for example physical parameters such as ‘Temperature’ and ‘Salinity’, or complex parameters such as isotope measurements on organisms or concentrations of chemical compounds.
It was therefore important to separate and reorganize data containing taxonomic information in order to contribute it to biodiversity portals such as GBIF or IOBIS. To ease this curatorial task, PANGAEA has developed specialised procedures and tools to recognize taxon names. Besides simple filters, which for example exclude parameters containing special characters, PANGAEA also uses external services to classify parameter definitions, for example uBio’s taxonomic intelligence (http://www.ubio.org) and its public web services for discovering known taxon names. However, a considerable number of taxon names used within PANGAEA have not yet been listed by the major taxonomic databases. Therefore, all parameters that potentially contain taxon names are double-checked by humans (data curators and librarians).
In questionable cases, the responsible specialist (PI) is asked for clarification. Parameters that are recognized with certainty as taxon names are organized in dedicated taxonomic Parameter Groups. In a later step, these parameters are used to transfer PANGAEA data to the standards required to feed the major biodiversity portals.
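The simple pre-filter step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not PANGAEA's actual tooling: the regular expression, the function names and the sample parameter names are all hypothetical. The pattern accepts a capitalised genus, an optional lowercase epithet and an optional qualifier such as "sp.", and rejects anything containing other characters.

```python
import re

# Hypothetical pattern (an assumption, not PANGAEA's real filter):
# capitalised genus, optional lowercase species epithet,
# optional qualifier such as "sp.", "spp.", "cf." or "aff.".
TAXON_LIKE = re.compile(r"^[A-Z][a-z]+(?: [a-z]+)?(?: (?:sp|spp|cf|aff)\.)?$")


def is_taxon_candidate(parameter_name: str) -> bool:
    """Return True if the parameter name passes the character-level pre-filter."""
    return TAXON_LIKE.match(parameter_name.strip()) is not None


def split_parameters(names):
    """Split parameter names into pre-filter candidates and rejected names."""
    candidates = [n for n in names if is_taxon_candidate(n)]
    rejected = [n for n in names if not is_taxon_candidate(n)]
    return candidates, rejected


# Invented sample parameter names, for illustration only.
sample = [
    "Globigerina bulloides",
    "Neogloboquadrina sp.",
    "Temperature, water",
    "Globigerina bulloides, δ13C",
    "Salinity",
]
candidates, rejected = split_parameters(sample)
```

Note that a name such as "Salinity" passes this purely character-based filter. A filter of this kind can only narrow the list, which is why the lookup against external name services and the final check by curators remain necessary.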
Supported Biodiversity Standards
The Darwin Core XML format is a simple and commonly used format for distributing biodiversity data. It is a flat XML format similar to the Dublin Core format and consists of a simple set of core elements. It is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples, and related information. Detailed documentation can be found at http://wiki.tdwg.org/twiki/bin/view/DarwinCore/WebHome.
PANGAEA uses Darwin Core to feed the GBIF as well as the IOBIS portal. Whereas GBIF uses Darwin Core as specified in the documentation mentioned above, IOBIS uses a slightly different profile, which is an extension of this format. Detailed documentation of this profile can be found at: http://www.iobis.org/tech/provider/schemadef1.html.
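To illustrate what "flat" means here, the following Python sketch builds one record in which every element is a direct child of the record and holds a plain text value, with no nesting. It is a sketch under assumptions: the element names are modelled on commonly used Darwin Core terms and should be checked against the schema documentation, the wrapper element name `record` and the function name are hypothetical, namespaces are omitted, and the values are invented.

```python
import xml.etree.ElementTree as ET


def to_flat_record(observation: dict) -> ET.Element:
    """Build a flat record: one child element per field, no nested structure."""
    record = ET.Element("record")  # hypothetical wrapper element name
    for name, value in observation.items():
        if value is None:
            continue  # omit fields without a value
        child = ET.SubElement(record, name)
        child.text = str(value)
    return record


# Invented example values; element names are assumptions modelled on
# Darwin Core terms, not taken from the schema.
observation = {
    "ScientificName": "Globigerina bulloides",
    "Latitude": -43.5,
    "Longitude": 147.25,
    "Collector": None,
}
record = to_flat_record(observation)
xml_text = ET.tostring(record, encoding="unicode")
```

Because each record is just a list of name and value pairs, a table row of a relational database maps onto it directly, which is part of what makes the format easy to serve.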
DiGIR stands for Distributed Generic Information Retrieval and is one of the most important data exchange protocols for biodiversity. As the name suggests, DiGIR was intended as a protocol for distributed searches, allowing complex queries to be formulated and then performed on several repositories. It is therefore not ideal for data harvesting, but it is suitable for it, and harvesting is the primary purpose of its use for PANGAEA’s data submission to GBIF. Detailed documentation of DiGIR can be found here: http://digir.sourceforge.net/.
PANGAEA uses a modified PHP DiGIR provider (http://digir.net/prov/prov_manual.html) originally written by Dave Vieglais. The PANGAEA DiGIR provider can be found at http://digir.pangaea.de. This provider delivers biodiversity data in both standard Darwin Core and IOBIS format. The transformation of PANGAEA data to fit the structure of this provider is done by a simple script, which queries the PANGAEA database for all parameters that data curators have previously tagged as representing a taxon name. The data of all data sets containing such a taxon parameter are queried and loaded into the DiGIR provider’s database. The mapping of PANGAEA metadata and data to the Darwin Core model is as follows:
The original purpose of DiGIR was to support data delivery of a distinct collection representing one dataset of biodiversity data (= DiGIR resource). DiGIR allows a provider to maintain more than one resource and to provide some metadata on each resource (dataset) via a ‘Metadata Request’. This is an advantage, as the capabilities of the Darwin Core format for citing a data set are very limited. However, the response lists the metadata of every resource (dataset) the provider contains. As PANGAEA provides a significantly finer granularity of data sets, we deliver a high number of data sets containing biodiversity data. Therefore, the PANGAEA response to a ‘Metadata Request’ is a very large XML file, which takes some time to deliver.
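A client that has to read such a large metadata response may prefer to process it incrementally rather than load the whole document into memory. The following Python sketch shows one way to do this with the standard library. It rests on assumptions: the element names `resource` and `name` and the sample document are illustrative only and are not taken from the DiGIR schema, and a real response would also use XML namespaces, which would have to be included in the tag names.

```python
import io
import xml.etree.ElementTree as ET


def iter_resource_names(xml_stream, resource_tag="resource", name_tag="name"):
    """Yield the name of each resource while streaming through the XML."""
    for _event, element in ET.iterparse(xml_stream, events=("end",)):
        if element.tag == resource_tag:
            yield element.findtext(name_tag)
            element.clear()  # drop the children of a processed resource


# Invented miniature response, for illustration only.
sample = io.BytesIO(
    b"<metadata><provider><name>Example provider</name>"
    b"<resource><name>Dataset A</name></resource>"
    b"<resource><name>Dataset B</name></resource>"
    b"</provider></metadata>"
)
names = list(iter_resource_names(sample))
```

Streaming does not shorten the time the provider needs to produce the response, but it keeps the memory use of the harvesting client low even when thousands of data sets are listed.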