Intern:Chemical data in PANGAEA

This is a preliminary page that is still a work in progress!

This wiki page serves as an entry point for chemical data. It offers an overview of the most important rules for PANGAEA, as well as chemical nomenclatures, databases, identifiers and ontologies. Furthermore, a list of compound groups is intended to help identify molecules and find the relevant guideline pages in this wiki.

=PANGAEA-rules for chemical data and features=

=Sources of error and ambiguities=

Many different errors can occur in PANGAEA parameters, and also in external databases, which may complicate the correct mapping of terms. Generally speaking, the difficulties stem not only from semantic issues themselves but also from the complex nature of chemistry.

=Workflow on chemical data=

=Chemical nomenclatures=

=Chemical identifiers=

PANGAEA uses InChI-keys (InChI = International Chemical Identifier) as the standard to identify molecules, because they are non-proprietary and openly accessible. All chemical compound features in the terminology catalogue should contain InChI-keys (if applicable). This allows crosslinking to other databases and ontologies (e.g. PubChem, BRENDA, ChEBI). InChI-keys are calculated from InChI-strings.

For more comprehensive information on the International Chemical Identifier (descriptions, advantages, drawbacks), please read the corresponding PANGAEA wiki-article.

Other identifiers (such as SMILES or CAS registry numbers) exist, but are not used by PANGAEA.

InChI-string
An InChI-string encodes the chemical structure of a molecule and is computer readable (not human readable). The structure can be derived from the InChI-string and vice versa. InChI-strings serve as unique identifiers for molecules.

However, the more complex the molecule, the longer the InChI-string. This is a disadvantage for search engines and databases. Therefore, the InChI-key was developed.
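To illustrate how an InChI-string is organized, the following minimal sketch splits a standard InChI-string into its slash-separated layers. The function name `split_inchi_layers` is hypothetical, and the handling of layers (formula layer without a prefix, all other layers identified by a one-letter prefix) is a simplification that assumes a standard InChI-string. It is not a replacement for the official InChI software.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split a standard InChI-string into version, formula and prefixed layers.

    Simplified sketch: assumes each one-letter layer prefix occurs at most once.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI-string")
    version, *layers = inchi[len(prefix):].split("/")
    result = {"version": version}
    for layer in layers:
        if not layer:
            continue
        if layer[0].isupper():
            # The formula layer carries no prefix letter (e.g. "C2H6O").
            result["formula"] = layer
        else:
            # Other layers start with a one-letter prefix (e.g. "c", "h").
            result[layer[0]] = layer[1:]
    return result


# Ethanol as a small example; larger molecules produce longer layers.
ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

The connectivity layer (`c`) and the hydrogen layer (`h`) grow with the size of the molecule, which is why InChI-strings of complex compounds become long.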

InChI-key
The InChI-key serves as a barcode for chemical structures. In contrast to the InChI-string, each InChI-key always has the same fixed length of 27 characters, making it easily searchable via Google and within databases.

An algorithm converts the InChI-string into an InChI-key. However, this conversion only works one way: the structural information is hashed and cannot be deduced from the InChI-key again. Resolving an InChI-key back to a structure therefore requires a lookup service within a database.
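Because the InChI-key has a fixed length, its format can be checked before a key is entered into the terminology catalogue. The sketch below is one possible check. The block layout (14 letters, hyphen, 10 letters, hyphen, 1 letter) comes from the InChI technical documentation rather than from this page, and the function name `is_wellformed_inchikey` is hypothetical. The check is purely syntactic: it does not prove that the key belongs to an existing compound.

```python
import re

# Assumed layout of a standard InChI-key:
# 14 letters + "-" + 10 letters + "-" + 1 letter = 27 characters.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key: str) -> bool:
    """Return True if `key` has the 27-character InChI-key layout."""
    return len(key) == 27 and INCHIKEY_PATTERN.fullmatch(key) is not None


# InChI-key of acetylsalicylic acid (aspirin)
print(is_wellformed_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
# Truncated key (first block only)
print(is_wellformed_inchikey("BSYNRYMUTXBXSQ"))
```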

=Chemical databases=

General databases
The following table gives an overview of general chemistry databases in descending order of precedence for referencing purposes:

Table columns: Database name (with link), number of data entries (size), curation (manual, none, community-based), cross-references (to other databases), additional features (e.g. spell correction, partial InChI-key matches), correctness/reliability, access rights (open, proprietary, payment?)

PubChem
PubChem Compound is a large, publicly available reference database hosted by the National Center for Biotechnology Information (NCBI). The compound database contains 93.6 million unique structures and associated information, which are extracted from the 234.7 million records of the PubChem Substance database (containing external depositor entries). Thus, a PubChem Compound entry summarizes all available information on a unique structure and assigns a unique and permanent compound ID (CID) to it. CID-containing URLs are always stable.

PubChem Compound does not employ curators. However, PubChem Substance structures undergo an automated validation and standardization process before they are mapped to existing CID entries (or before new entries are created). Molecular formulas, identifiers (InChI, SMILES) and the IUPAC name are created automatically. Other information is taken over from the depositors' substance entries. The names collected in the Compound entry header represent depositor-supplied substance names ordered by the weighted frequency of use (see here). The synonym list therefore provides useful hints on commonly used names, but the names are not validated. Please note that not all compounds necessarily represent valid structures (some structures have become obsolete due to new research results; read this discussion for more information).

A list of all depositors/data sources, including when their data was last updated, can be found here.

More comprehensive information can be found in this paper.

ChemSpider
ChemSpider is hosted by the Royal Society of Chemistry. The database contains 59 million deduplicated molecules. ChemSpider combines crowd-sourced curation, manual curation (by staff) and automated validation.

Registered users can comment on entries, add synonyms or mark existing synonyms as obsolete. Furthermore, crowd-sourced discussions help to identify the correct chemical structure for compound names that are still ambiguous. Changes made by members must be validated by employed curators. Synonyms written in bold letters have been validated by a curator as a correct mapping of name to structure. The FAQ answers most questions about the operating principles.

Each ChemSpider entry receives a unique CSID (ChemSpider ID), which is part of the URL (replace # with the CSID): http://www.chemspider.com/Chemical-Structure.#.html
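The URL pattern above can be filled in programmatically, for example when generating links for a list of CSIDs. This is a minimal sketch. The function name `chemspider_url` is hypothetical, and the CSID in the usage line is only an illustrative number.

```python
def chemspider_url(csid: int) -> str:
    """Build the ChemSpider entry URL by replacing '#' with the CSID."""
    if csid <= 0:
        raise ValueError("a CSID must be a positive integer")
    return f"http://www.chemspider.com/Chemical-Structure.{csid}.html"


print(chemspider_url(2157))
```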

Identifiers (InChI etc.) are hidden under the "more details"-tab.

Crossreferences can be found by clicking on the "More"-tab and choosing "Data Sources". A complete list of all data sources can be found here.

Remarks: ChemSpider used CaffeineFix for automated spell correction in the past. However, this service seems to have been discontinued.

Pay attention to the search result statement at the top of the page:
 * For InChI-key searches: does the displayed entry represent a full match or a skeleton match?
 * For name-based searches: was the displayed entry found via an approved synonym?
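The difference between a full match and a skeleton match can also be checked by comparing the queried InChI-key with the InChI-key of the displayed entry. The sketch below assumes that a "skeleton match" means the first 14-character block of the InChI-key (which encodes the connectivity) is identical while the rest differs. This interpretation and the function name `classify_inchikey_match` are assumptions, not ChemSpider's documented definition.

```python
def classify_inchikey_match(query: str, hit: str) -> str:
    """Compare two InChI-keys and return 'full', 'skeleton' or 'none'.

    Assumption: a skeleton match shares only the first block of the key.
    """
    q = query.strip().upper()
    h = hit.strip().upper()
    if q == h:
        return "full"
    if q.split("-")[0] == h.split("-")[0]:
        return "skeleton"
    return "none"


l_alanine = "QNAYBMKLOCPYGJ-REOHZSJUSA-N"   # stereochemistry defined
alanine = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"     # no stereochemistry given
aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

print(classify_inchikey_match(l_alanine, l_alanine))
print(classify_inchikey_match(l_alanine, alanine))
print(classify_inchikey_match(l_alanine, aspirin))
```

A skeleton match usually means that the entry differs from the query in stereochemistry, isotopes or protonation, so the entry should be checked manually before it is used for a mapping.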

Official publication on ChemSpider: Link

Scifinder
eq. scifinder?

Specific databases
Table columns: Compound group (e.g. lipids, carbohydrates), Database name, Usability

=Chemical ontologies and thesauri=

=Useful tools=

OPSIN Parser
The Open Parser for Systematic IUPAC Nomenclature (OPSIN) translates systematic IUPAC names into chemical structures and simultaneously creates the following structure-based identifiers:
 * InChI-string
 * InChI-key
 * SMILES

The InChI-key is hyperlinked and can be used directly for a Google search.

Publication on OPSIN

BIOVIA Draw
The chemical structure drawing program BIOVIA Draw is free of charge for academic and non-commercial use. Registration is required before the program can be downloaded.

BIOVIA Draw offers more than drawing chemical structures. The most useful tools are:
 * Generate text from structure
 ** Convert structure into IUPAC name
 ** Convert structure into InChI-key
 ** Convert structure into InChI-string
 * Generate structure from text
 ** Convert systematic IUPAC name into structure
 ** Convert InChI-string into structure
 * Chemical Check (structure validation)

All tools can be found under the "chemistry"-tab of the program.

Cactvs - Chemical Identifier Resolver
https://cactus.nci.nih.gov/chemical/structure
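The resolver can be addressed by appending an identifier and a requested representation to the URL above. The following sketch only builds such a URL and sends no request. The path layout (`/identifier/representation`) and representation names such as `stdinchikey` and `smiles` are assumptions based on the resolver's public documentation and should be verified there. The function name `resolver_url` is hypothetical.

```python
from urllib.parse import quote

# Base URL as given on this page.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "stdinchikey") -> str:
    """Build a resolver URL for a name, SMILES or InChI-key (assumed layout)."""
    # Percent-encode the identifier so that spaces and slashes are URL-safe.
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"


print(resolver_url("aspirin"))
print(resolver_url("acetic acid", "smiles"))
```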

Cactvs - Chemical Structure Lookup Service
https://cactus.nci.nih.gov/cgi-bin/lookup/search

UniChem
UniChem uses identifiers (InChI, InChI-key or database identifiers) to automatically generate database crosslinkages.

The crosslinkable databases are listed here. Most important for PANGAEA: BRENDA, ChEBI and PubChem.

Data from BRENDA is received via download; ChEBI and PubChem offer FTP sites.

Links to some compound-specific databases (e.g. LipidMaps, CarotenoidsDB) are also possible. UniChem crosslinks not only to databases with chemistry as their primary focus, but also to databases with a secondary focus on chemistry (e.g. molecular biology databases).

In total, UniChem contains information on 151 million chemical structures from 33 sources. The total number of mappings/assignments is more than 250 million. In contrast to the spiderweb-like crosslinkages between other databases, UniChem offers a centralized view of the available crosslinkages, as this picture visualizes.

Update routines run every three months (for the date of the last update, see here). Obsolete crosslinkage information is not deleted, but is marked as obsolete (together with the date of the last current assignment).

UniChem calculates InChI-strings and InChI-keys for sources that provide only molfiles or SMILES. Moreover, to avoid errors, UniChem validates whether the InChI-string and InChI-key supplied by a source are consistent.

A web service can be used for larger queries.

Remarks:

Database identifiers are called "src_compound_IDs" in UniChem. Furthermore, the "src_ID" of the database must be chosen from the source list to perform a query. (more information)

Example:

PubChem CID: 2244

--> src_compound_ID = 2244

--> src_ID of database = 22
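The example above can be written down as a small data structure that keeps the two query values together. The sketch below also builds a query URL, but note the assumptions: the class name `UniChemQuery` is hypothetical, and the URL pattern follows UniChem's legacy REST interface, which may have changed or been replaced. Check the current UniChem web service documentation before using it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniChemQuery:
    """A UniChem query: identifier within a source plus the source's ID."""
    src_compound_id: str  # identifier within the source database, e.g. a PubChem CID
    src_id: int           # number of the source database in the UniChem source list

    def legacy_rest_url(self, base: str = "https://www.ebi.ac.uk/unichem/rest") -> str:
        # Assumed legacy pattern: <base>/src_compound_id/<src_compound_id>/<src_id>
        return f"{base}/src_compound_id/{self.src_compound_id}/{self.src_id}"


# Values taken from the example: PubChem CID 2244, PubChem src_ID 22
query = UniChemQuery(src_compound_id="2244", src_id=22)
print(query.legacy_rest_url())
```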

For statistics on how many structures come from which database and how many structures are unique to a certain database, see here. Notably, 75% of all UniChem entries have mappings to PubChem (the largest source), meaning that 25% of all compounds in UniChem cannot be found in PubChem.

For more information, read the publication of UniChem.

=Overview on compound groups=

This list serves as an overview of the compound groups that occur in PANGAEA datasets (in parameters or in the data matrix). In the future, each group will be hyperlinked to the corresponding guidelines for curators. The guidelines will contain essential naming rules, hints about common mistakes and links to further references. Furthermore, a table with identifying name components and example structures is planned to help recognize the right substance family.

(flat, preliminary list without hierarchy, alphabetically ordered)
 * Alcohols
 * Aldehydes
 * Amino acids
 * Bridged compounds
 * Carbohydrates
 * Carboxylic acids
 * Fatty acids
 * Fatty alcohols
 * Glycerol Dialkyl Glycerol Tetraethers (GDGTs)
 * Hopanoids
 * Isoprenoids
 * Ketones/Alkenones
 * Membrane lipids
 * Pigments
 * Polychlorinated and -brominated biphenyls, biphenyl ethers and naphthalenes
 * Polycyclic aromatic hydrocarbons/Fused rings
 * Sterols and Steroids