Topic

From PANGAEA Wiki

The new PANGAEA web site shows several "Topics" on the start page. When a user clicks on one of those topics, a faceted drill-down starts that shows all datasets that have that topic assigned. In addition, the user can select a topic from the drop-down list next to the search bar.

Overview

The assignment of topics to datasets is completely automatic and is done using the terms catalogue. Additional terms and topics can be assigned manually to datasets (through the keyword user interface in 4D, which gives access to keywords, technical keywords and terms from "PANGAEA Classifying Terms"), but removing automatically assigned topics is currently impossible. Please note: Manually assigning terms or topics from the "PANGAEA Classifying Terms" should only be done as a last resort, as it defeats the whole purpose of the terms catalogue. It may be useful for datasets with very little metadata (e.g. static URLs), no journal on the reference and no parameters.

Every dataset may be assigned to multiple topics, which is the norm!

The terms catalogue is a hierarchical list of terms with many connections between those terms. The most common connections are "broader term" (which produces something like a tree view), "synonyms" (e.g. a foraminifer in WoRMS that is known under different names) and "same-as" relations (same-as is used to link different terminologies). The "broader term" relation is the most important one, as it allows very fine-grained terms (like foram names or device names) to be classified under the topics on the highest level.

The "PANGAEA Classifying Terms" terminology is the structure behind the topics and facets shown in the PANGAEA search engine. It is a very coarse structure of the world, consisting only of terms that should appear in the facets on the left side of PANGAEA search (under "Topic"). It contains the 15 top-level topics (internally called "main topics"), which are static because the program code of PANGAEA depends on their name and spelling. Changing them is theoretically possible, but it breaks the search engine and prevents the release of a new version of the terms catalogue for the search engine (reindex). The other terms in this terminology are related to those topics through "broader term" relations. They are mainly scientific disciplines, but the terminology also contains a very coarse WoRMS clone (only kingdoms, family,...). It also contains all terms from the Thomson Reuters Journal list (Thomson Reuters assigned keywords to every scientific journal to classify it).

Topic Assignment to datasets and Term Catalogue Index

When datasets are marshalled to XML, several parts of the PANGAEA metadata schema (Klorolle) carry attributes that refer to term IDs (journal list, parameters, devices, locations). The terms themselves are not part of the XML; they are only listed as IDs. For a journal like "Journal of Marine Systems", the XML contains:

<md:source id="ref81872.journal9123" relatedTermIds="33974,33987,34022">Journal of Marine Systems</md:source>

This list of term IDs is stored as-is in the metadata. When the dataset is indexed for search, those term IDs are looked up in a separate Elasticsearch "helper" index ("facet helper index"), which is built from a release of the PANGAEA terms catalogue (it is not maintained live!). Any change to this index requires reindexing all PANGAEA datasets, so it is done only weekly.
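As a rough illustration of the first step, the sketch below pulls the relatedTermIds attribute out of such an XML fragment. The namespace URI and the wrapper element md:MetaData are placeholders chosen for this example (neither is given on this page), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI; the real schema URI is not stated here.
MD_NS = "urn:example:pangaea-metadata"

# The <md:source> element is the journal example from above; the wrapper
# element <md:MetaData> is hypothetical and only makes the snippet parseable.
SAMPLE = (
    f'<md:MetaData xmlns:md="{MD_NS}">'
    '<md:source id="ref81872.journal9123" relatedTermIds="33974,33987,34022">'
    "Journal of Marine Systems</md:source>"
    "</md:MetaData>"
)


def related_term_ids(xml_text: str) -> dict[str, list[int]]:
    """Map each element id to the term IDs in its relatedTermIds attribute."""
    root = ET.fromstring(xml_text)
    result: dict[str, list[int]] = {}
    for elem in root.iter():
        raw = elem.get("relatedTermIds")
        if raw:
            # The attribute is a comma-separated list of numeric term IDs.
            result[elem.get("id")] = [int(t) for t in raw.split(",") if t.strip()]
    return result


ids_by_element = related_term_ids(SAMPLE)
print(ids_by_element)
```

Each ID found this way would then be looked up in the helper index described above.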

This index contains the complete relationships of the attached term. For example, Thomson Reuters assigned the term "Oceanography" (ID 34022) to this journal: http://ws.pangaea.de/es/pangaea-terms/term/34022. The information at this URL shows the whole relationship of this term up to the "main topics":

{
  "_index" : "pangaea-terms_v1",
  "_type" : "term",
  "_id" : "34022",
  "_version" : 12,
  "found" : true,
  "_source" : {
    "internal-source" : "standard",
    "internal-datestamp" : "2016-02-10T11:16:04.680Z",
    "name" : "Oceanography",
    "terminology" : 8,
    "mainTopics" : "Oceans",
    "topics" : [ "Oceanography", "Oceans" ],
    "searchTerms" : [ "Oceanography", "Oceans" ]
  }
}

In this case it is very simple: the broader term of "Oceanography" is "Oceans". All those terms are added to the full-text index so that people can search for them, but those marked as "topics" and "mainTopics" are also put into the faceting fields of datasets. The "mainTopics" value is responsible for the assignment to the 15 main topics on PANGAEA's start page.
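The mapping from a helper-index document to index fields could look roughly like the sketch below. It uses the "Oceanography" document shown above as input; the target field names (fulltext, topic_facet, main_topic_facet) and the function name are hypothetical and do not reflect the actual PANGAEA index mapping.

```python
import json

# Trimmed copy of the helper-index document for term 34022 shown above.
TERM_DOC = json.loads("""
{
  "_id": "34022",
  "found": true,
  "_source": {
    "name": "Oceanography",
    "terminology": 8,
    "mainTopics": "Oceans",
    "topics": ["Oceanography", "Oceans"],
    "searchTerms": ["Oceanography", "Oceans"]
  }
}
""")


def facet_fields(term_doc: dict) -> dict[str, list[str]]:
    """Split a term document into full-text terms and facet values.

    Assumption: the output field names are made up for this example.
    """
    source = term_doc["_source"]
    main = source.get("mainTopics", [])
    if isinstance(main, str):
        # In the examples above "mainTopics" is a single string.
        main = [main]
    return {
        "fulltext": list(source.get("searchTerms", [])),
        "topic_facet": list(source.get("topics", [])),
        "main_topic_facet": list(main),
    }


fields = facet_fields(TERM_DOC)
print(fields)
```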

A more complex example is a parameter related to the term "Globigerina bulloides":

<md:parameter id="col5.ds2114989.param9186" relatedTermIds="1047579">
 <md:name>Globigerina bulloides dissolution index (Volbers)</md:name>
 <md:shortName>BDX´</md:shortName>
 <md:group>Components, biogeneous</md:group>
</md:parameter>

If you look up the related term in the terms index, you get a lot more information: http://ws.pangaea.de/es/pangaea-terms/term/1047579

{
  "_index" : "pangaea-terms_v1",
  "_type" : "term",
  "_id" : "1047579",
  "_version" : 12,
  "found" : true,
  "_source" : {
    "internal-source" : "standard",
    "internal-datestamp" : "2015-05-05T15:06:18.680Z",
    "name" : "Globigerina bulloides",
    "terminology" : 1,
    "mainTopics" : "Biological Classification",
    "topics" : [ "Foraminifera", "Chromista", "Biological Classification" ],
    "searchTerms" : [ "Globigerina bulloides", "Globigerina", "Globigerininae", "Globigerinidae", "Globigerinoidea",
      "Globigerinina", "Rotaliida", "Globothalamea", "Foraminifera", "Rhizaria", "Harosa", "Chromista" ]
  }
}

The source of this term ID is WoRMS (see the terminology ID). WoRMS holds a lot of detailed information about this term, up to the kingdom. For display and faceting purposes this information is more or less useless (far too detailed). The algorithm starts at the requested term "Globigerina bulloides", follows broader terms and synonyms upwards, and collects every term it visits. Those terms are listed as "searchTerms" in the JSON and are added to the search index in the field they were collected from (e.g. the "parameter:" field in search). So instead of a search query like parameter:"Globigerina bulloides", one could also find this parameter using any of the broader or synonym terms, e.g. parameter:Rhizaria (a broader term of the "Glob bulloides" term/parameter, on the infrakingdom level). This lets users enter broader keywords or synonyms into the PANGAEA search engine and still find the dataset, which prevents the horrible "nothing found" message from the search engine! Synonyms (alternate names) are also added to the list of expanded terms.
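The expansion described above can be pictured as a simple graph traversal. The following is a minimal sketch, not the actual indexer code: it assumes that the order of the terms in the JSON example reflects the broader-term chain, and the synonym entry is a made-up placeholder, not a real WoRMS name.

```python
from collections import deque

# Assumption: the order of terms in the JSON example mirrors the broader-term chain.
CHAIN = [
    "Globigerina bulloides", "Globigerina", "Globigerininae", "Globigerinidae",
    "Globigerinoidea", "Globigerinina", "Rotaliida", "Globothalamea",
    "Foraminifera", "Rhizaria", "Harosa", "Chromista",
]
# Toy catalogue: each term points to its broader term(s).
BROADER = {narrow: [broad] for narrow, broad in zip(CHAIN, CHAIN[1:])}
# Hypothetical placeholder synonym, for illustration only.
SYNONYMS = {"Globigerina bulloides": ["<alternate name of G. bulloides>"]}


def expand_term(start: str) -> list[str]:
    """Collect the start term plus everything reachable via broader terms and synonyms."""
    collected = [start]
    visited = {start}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        for neighbour in BROADER.get(term, []) + SYNONYMS.get(term, []):
            if neighbour not in visited:  # guards against cycles in the catalogue
                visited.add(neighbour)
                collected.append(neighbour)
                queue.append(neighbour)
    return collected


expanded = expand_term("Globigerina bulloides")
print(expanded)
```

With such an expanded list stored in the parameter field, a query for parameter:Rhizaria would match the dataset even though only the species name appears in the original metadata.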

If any of the terms visited while moving up or sideways in the hierarchy matches a term in the special "PANGAEA Classifying Terms" terminology (or has a "same-as" relationship to one), the relation is followed across terminologies. All terms marked as "topic" in the "PANGAEA Classifying Terms" terminology are added to the index as facets (for the previous example "Glob bulloides", this adds "Foraminifera", "Chromista" and "Biological Classification" as "Topic Facet"). The top-level term "Biological Classification" is one of the 15 "main topics" (added separately to the index).
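A simplified way to picture the cross-terminology step is sketched below. The fragment of "PANGAEA Classifying Terms" and its broader-term links are assumptions made for this example, terms are matched by name only ("same-as" links are left out), and the set of main topics contains only the two named on this page.

```python
# Hypothetical fragment of "PANGAEA Classifying Terms": term -> broader term.
CLASSIFYING_BROADER = {
    "Foraminifera": "Chromista",
    "Chromista": "Biological Classification",
    "Biological Classification": None,
}
# Only two of the 15 main topics appear on this page; the rest are omitted.
MAIN_TOPICS = {"Biological Classification", "Oceans"}


def collect_topics(expanded_terms: list[str]) -> tuple[list[str], list[str]]:
    """Return (topics, main topics) reachable from the expanded term list."""
    topics: list[str] = []
    for term in expanded_terms:
        # Cross over only if the term also exists in the classifying terminology.
        current = term if term in CLASSIFYING_BROADER else None
        while current is not None:
            if current not in topics:
                topics.append(current)
            current = CLASSIFYING_BROADER[current]
    main_topics = [t for t in topics if t in MAIN_TOPICS]
    return topics, main_topics


topics, main_topics = collect_topics(
    ["Globigerina bulloides", "Globigerina", "Foraminifera", "Rhizaria", "Chromista"]
)
print(topics)
print(main_topics)
```

Terms such as "Rhizaria" that have no counterpart in the classifying terminology contribute nothing to the facets; they remain searchable only.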

Collecting terms, topics and main topics during indexing for facets

Currently, all related terms used for assigning topics are collected from the following parts of the Klorolle:

  • Journal of the first "Supplement to" or "Related to" reference (checked for Thomson Reuters terms)
  • Event's Method/Device
  • Event's Location (it is a term on its own)
  • Data series' Parameter and Method/Device
  • Data set keywords

All terms from these XML parts are looked up in the term helper index (see above). All resulting terms and topics are added to the search engine. Currently all terms are treated equally; there are no special cases!
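Put together, the collection step might be sketched as follows. The helper index is replaced here by an in-memory dictionary holding the two term documents shown earlier, and the part names, function name and output layout are hypothetical.

```python
# Stand-in for the helper index, filled with the two example documents from above.
HELPER_INDEX = {
    34022: {
        "mainTopics": "Oceans",
        "topics": ["Oceanography", "Oceans"],
    },
    1047579: {
        "mainTopics": "Biological Classification",
        "topics": ["Foraminifera", "Chromista", "Biological Classification"],
    },
}


def collect_dataset_facets(term_ids_by_part: dict[str, list[int]],
                           helper_index: dict[int, dict]) -> dict[str, list[str]]:
    """Merge topics and main topics of all terms, treating every part equally."""
    topics: list[str] = []
    main_topics: list[str] = []
    for term_ids in term_ids_by_part.values():
        for term_id in term_ids:
            doc = helper_index.get(term_id)
            if doc is None:
                continue  # term not present in this toy index
            for topic in doc["topics"]:
                if topic not in topics:
                    topics.append(topic)
            if doc["mainTopics"] not in main_topics:
                main_topics.append(doc["mainTopics"])
    return {"topics": topics, "mainTopics": main_topics}


# Hypothetical part names; the IDs are those from the XML examples above.
facets = collect_dataset_facets(
    {"journal": [33974, 33987, 34022], "parameter": [1047579]},
    HELPER_INDEX,
)
print(facets)
```

Because no part gets special treatment, a dataset typically ends up with topics from several sources at once, which matches the remark that multiple topics per dataset are the norm.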

In addition, every element in the whole Klorolle XML that has related terms is expanded for term search purposes!