Intern:Molecular data in PANGAEA

This wiki article is not yet complete and is still under development. The page is currently being split into a short guideline (for curators) and a comprehensive guideline; see the following working documents:

 * Short guidelines (curator)
 * Comprehensive guidelines

This guideline is structured by data type (IDs/accessions, nomenclature, databases).

= Accession numbers and IDs =

Recognizing the origin of a set of IDs/accession numbers
First step: Check the number of prefix letters

1 letter:

2 letters: likely to be a simple nucleotide sequence (Check: ...)

2 letters and underscore: likely to be an NCBI RefSeq accession (Check: ...)

(to be continued)

Second step: Check the number of digits

Does the combination of letters and digits fit the scheme of the suspected data type? (see the description in the respective chapter)

Third step: Check the list of prefixes

Is the prefix listed in the chapter of the suspected data type?

Fourth step: Try to search for the accession number/ID in the suspected origin database

If you are unsure whether the IDs given by the scientists are INSDC accessions, you can check them with the link format http://www.ebi.ac.uk/ena/data/view/xxx (xxx = INSDC accession number). Attention: this only works if the data have already been published!
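The first and fourth steps can be sketched in Python as follows. This is only a minimal sketch and not part of any PANGAEA routine: the function names `split_accession`, `first_guess` and `ena_check_url` are hypothetical, and only the two prefix rules listed above are encoded.

```python
import re

# Link format quoted in the text above.
ENA_VIEW = "http://www.ebi.ac.uk/ena/data/view/"

def split_accession(acc):
    """Split an ID into (letter prefix, has_underscore, digits).

    Returns None if the ID is not 'letters + optional underscore + digits'
    (an optional '.version' suffix is ignored).
    """
    m = re.fullmatch(r"([A-Za-z]+)(_?)(\d+)(?:\.\d+)?", acc.strip())
    if m is None:
        return None
    return m.group(1).upper(), m.group(2) == "_", m.group(3)

def first_guess(acc):
    """Step 1 only: guess the data type from the number of prefix letters."""
    parts = split_accession(acc)
    if parts is None:
        return "unknown"
    letters, underscore, _digits = parts
    if len(letters) == 2 and underscore:
        return "likely NCBI RefSeq"
    if len(letters) == 2:
        return "likely simple nucleotide sequence"
    return "undecided, continue with steps 2 to 4"

def ena_check_url(acc):
    """Step 4: URL to open for a manual check (works only for published data)."""
    return ENA_VIEW + acc.strip()
```

For a made-up ID such as `AB123456`, `first_guess` reports a likely nucleotide sequence and `ena_check_url` returns the link to open in a browser. The guess is only a hint; steps 2 to 4 are still needed.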

Rules for using the parameter "Accession number, genetics" (ID = 153190)
Please use this parameter for all INSDC accession numbers (= accession numbers that are resolvable by NCBI, ENA or DDBJ). A routine will automatically link to the entry in ENA, using the URL http://www.ebi.ac.uk/ena/data/view/xxx (xxx = accession number). Please do not add a new column "Accession number, link" (ID = 130729), because this would be redundant.

Do not use the parameter for non-INSDC accession numbers. In that case, use the generic parameter "accession number" instead.

Standard INSDC accession numbers
INSDC accession numbers for single nucleotide sequences consist of 1 or 2 letters + 5 or 6 digits.

Protein sequence accession numbers consist of 3 letters + 5 digits.

WGS (Whole Genome Shotgun sequencing project) accession numbers consist of 4 letters + 8 or 9 digits.

Version numbers are separated by a dot: accession_number.#

A list of all standard INSDC accession number prefixes (sorted by data type and origin) can be found here.
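The formats above can be written down as regular expressions. The following is a minimal sketch that encodes only the rules stated in this section; the names `PATTERNS`, `classify_standard_accession` and `split_version` are hypothetical, and INSDC may use further formats that are not listed here.

```python
import re

# Patterns transcribed from the rules above; the optional ".#" suffix is the version.
PATTERNS = [
    ("nucleotide", re.compile(r"[A-Z]{1,2}\d{5,6}(\.\d+)?")),
    ("protein", re.compile(r"[A-Z]{3}\d{5}(\.\d+)?")),
    ("WGS", re.compile(r"[A-Z]{4}\d{8,9}(\.\d+)?")),
]

def classify_standard_accession(acc):
    """Return the first matching data type, or None if no pattern fits."""
    acc = acc.strip().upper()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(acc):
            return name
    return None

def split_version(acc):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    base, dot, version = acc.strip().partition(".")
    return base, (int(version) if dot and version.isdigit() else None)
```

A result of `None` only means that the ID does not fit one of the three schemes above. The prefix still has to be checked against the prefix list.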

Sequence Read Archive accession numbers
Sequence Read Archive (SRA) accession numbers (-omics data, high throughput) differ in format from standard INSDC accession numbers, but they are also INSDC accession numbers. The origin of most data can be deduced from a letter in the accession prefix: the letter following PRJ or SAM (N = NCBI, E = EBI/ENA, D = DDBJ) or the first letter of the three-letter prefixes (S = NCBI, E = ENA, D = DDBJ).
 * Project accession number prefixes:
 ** BioProject: PRJNA (NCBI), PRJE (EBI), PRJD (DDBJ) (sometimes plus an additional letter)
 ** Synonyms: SRP (NCBI), ERP (ENA), DRP (DDBJ)
 * Sample accession number prefixes:
 ** BioSample: SAMN (NCBI), SAME (EBI), SAMD (DDBJ) (sometimes plus an additional letter)
 ** Synonyms: SRS (NCBI), ERS (ENA), DRS (DDBJ)
 * Experiment accession number prefixes:
 ** SRX (NCBI), ERX (ENA), DRX (DDBJ)
 * Sequencing run accession number prefixes:
 ** SRR (NCBI), ERR (ENA), DRR (DDBJ)
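
The prefix list above can be turned into a small lookup. This is a minimal sketch under the assumption that every accession is a prefix from the list (optionally plus one additional letter for BioProject and BioSample) followed by digits; the function name `describe_sra_accession` and the table names are hypothetical.

```python
import re

# Letter codes transcribed from the prefix list above.
BIO_TYPES = {"PRJ": "project", "SAM": "sample"}
BIO_ORIGINS = {"N": "NCBI", "E": "EBI", "D": "DDBJ"}
SRA_TYPES = {"P": "project", "S": "sample", "X": "experiment", "R": "sequencing run"}
SRA_ORIGINS = {"S": "NCBI", "E": "ENA", "D": "DDBJ"}

# BioProject/BioSample: PRJ or SAM + origin letter + optional extra letter + digits.
BIO_PATTERN = re.compile(r"(PRJ|SAM)([NED])[A-Z]?\d+")
# Three-letter prefixes: origin letter + "R" + object type letter + digits.
SRA_PATTERN = re.compile(r"([SED])R([PSXR])\d+")

def describe_sra_accession(acc):
    """Return (object type, origin) for an SRA-style accession, else None."""
    acc = acc.strip().upper()
    m = BIO_PATTERN.fullmatch(acc)
    if m:
        return BIO_TYPES[m.group(1)], BIO_ORIGINS[m.group(2)]
    m = SRA_PATTERN.fullmatch(acc)
    if m:
        return SRA_TYPES[m.group(2)], SRA_ORIGINS[m.group(1)]
    return None
```

For example, a made-up run ID such as `SRR1234567` yields `("sequencing run", "NCBI")`, while an ID that fits neither scheme yields `None`.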

These pages summarize information on different accession number types and their prefixes:

http://www.ebi.ac.uk/ena/submit/accession-number-formats (formats are using regular expressions)

https://www.ncbi.nlm.nih.gov/books/NBK56913/#search.why_does_sra_have_so_many_differe   (scroll down for prefix tables)

For very comprehensive information about the SRA archive data structure, you may have a look at this page: https://trace.ddbj.nig.ac.jp/dra/submission_e.html

You can also have a look at the schematic overview (will be added)

Mapping SRA data to PANGAEA
As in PANGAEA, where data are substructured into projects, campaigns, events and datasets, SRA databases also substructure their data. However, the data granularity of PANGAEA is not exactly the same as that of the SRA archives.

How shall SRA-data be integrated into PANGAEA?

to be continued

Non-INSDC IDs/accession numbers
For a list of external, non-INSDC databases and example IDs, please have a look at this site: http://www.insdc.org/db_xref.html

NCBI GenInfo Identifiers
The GenInfo Identifiers (GI numbers) run in parallel to the accession.version system. However, GI numbers are NCBI-specific and cannot be resolved with GenBank.

The format is "GI:#######", i.e. the prefix "GI:" followed by digits. The number of digits can vary.

However, some scientists omit the prefix in their datasets and only provide the integer strings. This can lead to confusion, especially if the GenInfo Identifier is wrongly called an accession number. In case of uncertainty (other databases also use simple integer strings), please check the number in the NCBI database or ask the authors.
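When screening a column of IDs, values with and without the "GI:" prefix can be normalized as sketched below. This is a minimal sketch; the function name `parse_gi` and the returned keys are hypothetical. A bare integer is flagged for a manual check, because it cannot be told apart from the integer IDs of other databases.

```python
import re

# "GI:" prefix (optional, any case) followed by digits.
GI_PATTERN = re.compile(r"\s*(GI:)?\s*(\d+)\s*", re.IGNORECASE)

def parse_gi(value):
    """Return the GI number and whether a manual check is needed, else None."""
    m = GI_PATTERN.fullmatch(value)
    if m is None:
        return None
    had_prefix = m.group(1) is not None
    return {"gi": int(m.group(2)), "needs_check": not had_prefix}
```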

NCBI RefSeq database accessions
The RefSeq database by NCBI is a separate database from GenBank and only contains non-redundant reference sequences (DNA, RNA, protein), which are derived from INSDC sequences. The RefSeq accession number format can be distinguished from the INSDC format by an underscore, which separates the digits from the preceding two letters. A list of existing RefSeq prefixes can be found here.

RefSeq accession numbers cannot be resolved by ENA. However, the RefSeq entry contains information about identical INSDC sequences.

NCBI Entrez Gene database accessions
The Entrez Gene database contains additional information on genes found on RefSeq sequences. Entrez Gene Identifiers consist only of digits. They are not real accession numbers and cannot be resolved by ENA.

UniProt accession numbers
UniProt accession numbers have a quite complex format and consist of 6 or 10 characters. The accession numbers are not strictly separated into a prefix part (letters) and a suffix part (digits). Digits and letters can be mixed.

The construction schema can be found in the UniProt manual.
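
As a quick format check, the following sketch uses the regular expression that the UniProt help pages give for accession numbers, quoted here from memory, so please verify it against the manual before relying on it. The function name `looks_like_uniprot_accession` is hypothetical.

```python
import re

# Regular expression as published in the UniProt help (to be verified there).
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def looks_like_uniprot_accession(value):
    """True if the value has the shape of a 6- or 10-character UniProt accession."""
    return UNIPROT_AC.fullmatch(value.strip()) is not None
```

A match only confirms the format. Whether the accession exists still has to be checked in UniProt itself.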

The UniProt database consists of two sections: curated protein sequences (Swiss-Prot) and automatically annotated, non-reviewed sequences (TrEMBL = translated EMBL, i.e. translated from genetic sequence). The evidence level (evidence for protein existence) is also given in an entry (more information here and here). The UniProt database mainly serves as a reference database. Therefore, UniProt accessions will only rarely be given in PANGAEA datasets.

= Nomenclatures: Names and symbols (abbreviations) =

The information about proper naming and abbreviations of genes, proteins, enzymes etc. is very important for parameter names. The term "symbol" is often used synonymously for gene, protein or enzyme names. However, in contrast to names, symbols represent short abbreviations following a defined schema.

Gene names and symbols in parameter names
Common parameters in PANGAEA, containing gene names and symbols, are for example gene expression or copy number parameters.

However, in the past, parameters have only contained the gene name (often using the gene symbol in the abbreviation) or the gene symbol. Examples are:
 * "Functional gene dissimilatory sulfate reductase genes, presence" with abbreviation "F gene dsrB pres" (ID:132160)
 * "Functional gene nrfA" (ID:136910)

The recommendation is to use gene name and gene symbol simultaneously within the parameter name.

Using only the gene symbol in parameter names should especially be avoided, because symbols are not always unique and may be used ambiguously.

Using only a gene name without a symbol should also be avoided, because several proteins/enzymes can have the same function (the same EC number and recommended name for distinct enzymes).

Still needs to be discussed: only in exceptional cases (parameter names becoming too long) may the gene symbol be used alone.

Example: "Heterocystous cyanobacteria, abundance expressed in number of nifH gene copies" (ID:150930)

However, the correct features from the terminology catalogue (containing the gene name) should be directly created and added. If the gene product is not known, the scientist must be consulted.

Gene names
"Gene name" and "gene symbol" are often used as synonyms. However, gene symbols denote short names, identifying the gene of a specific protein. PANGAEA uses the term "gene name" to explicitly mean names describing the gene product, phenotype or function.

Preferably, a gene name should be assigned by the gene product. If the gene product is for example an enzyme, the corresponding gene name will consist of "official enzyme name" + "gene" e.g. "ammonia monooxygenase gene". The short name "amoA" is rather understood as a gene symbol (see next chapter).

Other gene names, assigned by phenotypes, should be avoided as far as possible (example: "beta-lactamase gene" is preferred over "ampicillin resistance gene").

The usage of descriptive gene names is important, because gene symbols are not always unique and unambiguous.

Gene symbols
Gene symbols always represent short names for a gene. Unfortunately, no encompassing gene symbol nomenclature exists, which might be used for every species. A general article on gene symbols can be found on Wikipedia.

Gene symbols for prokaryotes
For prokaryotes, a uniform nomenclature exists, which was established by Demerec et al. (1966).

Gene symbols are written in italics and consist of three lowercase letters followed by an uppercase letter. Example: amoA
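
The rule above (three lowercase letters followed by one uppercase letter) can be checked with a simple pattern. This is a minimal sketch that tests only the form described here; the function name `is_prokaryote_gene_symbol` is hypothetical, and italics cannot be checked in plain text.

```python
import re

def is_prokaryote_gene_symbol(symbol):
    """True if the symbol is three lowercase letters plus one uppercase letter."""
    return re.fullmatch(r"[a-z]{3}[A-Z]", symbol) is not None
```

Symbols such as amoA, nifH or dsrB pass this check, whereas "AmoA" or "amo" do not.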

Gene symbols for eukaryotes
For eukaryotes, no uniform nomenclature exists. However, specific nomenclatures do exist for some model organisms, e.g.:
 * Saccharomyces cerevisiae
 * Neurospora crassa
 * Drosophila melanogaster
 * Caenorhabditis elegans
 * etc.

Comprehensive lists for all nomenclatures are available at:

Styleguide

Uniprot

NCBI Entrez Gene resources

NCBI RefSeq resources

Gene symbols in UniProt
UniProt uses the terms gene name and gene symbol synonymously. Moreover, it distinguishes four name types:
 * Name
 * Synonym
 * Ordered Locus Name (OLN)
 * ORF Name

The section "Names and Taxonomy" of a protein entry contains the information about gene names. The name type is also indicated. Please note that "Name" stands for the approved (or recommended) gene symbol, while "Synonym" stands for alternative symbols. If no nomenclature is available for an organism (see this list), UniProt recommends gene symbols based on the symbols of orthologous genes. In the best case, UniProt additionally provides references for gene and protein names.

OLNs and ORF names do not represent gene names as understood by PANGAEA.

For more information, read:

http://www.uniprot.org/help/gene_name

Gene symbols in NCBI, Entrez Gene
The gene symbols in Entrez Gene are organism-specific symbols. If no symbol is available yet, a dummy gene symbol is assigned until an official symbol becomes available. Dummy gene symbols consist of the prefix "LOC" + Entrez Gene ID.

The guidelines on gene names/symbols can be found here
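
Dummy symbols of the form "LOC" + Entrez Gene ID can be recognized as sketched below, so that they are not mistaken for official gene symbols in parameter names. This is a minimal sketch; the function name `entrez_id_from_dummy_symbol` is hypothetical.

```python
import re

def entrez_id_from_dummy_symbol(symbol):
    """Return the Entrez Gene ID contained in a 'LOC...' dummy symbol, else None."""
    m = re.fullmatch(r"LOC(\d+)", symbol.strip())
    return int(m.group(1)) if m else None
```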

Locus tags
Locus tags are identifiers for coding and non-coding genes in a genome. One locus tag is used for all components of a single gene (exons, CDS, mRNA etc.). Each genome project needs to register its locus tag prefix at NCBI, EBI or DDBJ to ensure uniqueness. Prefixes are alphanumeric and at least 3 characters long, and are separated from the rest of the tag by an underscore. Often, prefixes are acronyms of the sequenced species.

Locus tags are added in sequential order to the genes on the genome. Additional information about chromosomes (I, II, ...) or RNA types ("r" for rRNA, "t" for tRNA) is added directly after the underscore.

Example for a locus tag: ABC_0001

Official NCBI guidelines for locus tags: https://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf

Please note that locus tags are different from accession numbers or gene names!
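
To tell locus tags apart from other IDs, the structure described above (alphanumeric prefix of at least 3 characters, underscore, tag value) can be checked as follows. This is a minimal sketch that encodes only the rules stated in this section; the function name `split_locus_tag` is hypothetical, and the official guidelines may contain further restrictions.

```python
import re

# Prefix (alphanumeric, at least 3 characters) + underscore + tag value.
LOCUS_TAG = re.compile(r"([A-Za-z0-9]{3,})_([A-Za-z0-9]+)")

def split_locus_tag(tag):
    """Return (prefix, tag value) for a locus tag, else None."""
    m = LOCUS_TAG.fullmatch(tag.strip())
    return (m.group(1), m.group(2)) if m else None
```

The example tag ABC_0001 splits into the prefix "ABC" and the value "0001". An ID with only two letters before the underscore is rejected, which helps to separate locus tags from RefSeq accessions.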

Databases for genes and gene products - finding proper names and symbols
KEGG: Kyoto Encyclopedia of Genes and Genomes

The best way to find proper gene names and symbols in KEGG is to search for the gene directly in the KEGG Orthology ("KO") database. The KEGG Orthology database has one entry for each group of orthologous genes (sharing a common ancestor), which therefore represents a "master entry" for a gene in all organisms. The entries contain common gene symbols and names (see also the help document). By having a look at the Genes section, you can check which gene symbols are used for which organism (in brackets). Example entry

KO entries can usually also be accessed from KEGG Genes entries by clicking on the KO number underneath the description. This is useful if other databases (NCBI, UniProt etc.) link to the gene entry.

GPSDB = Gene/Protein Synonyms finder (by Expasy) (publication)

The synonym finder is a very powerful tool for quickly finding many synonyms of gene and protein names and symbols. The results are ordered by species, so GPSDB takes into account that different species-specific nomenclatures and symbol assignments exist. Furthermore, GPSDB suggests preferred names and symbols and provides URLs to the source databases. A button redirects the user directly to all PubMed articles for the chosen synonyms. GPSDB serves as an ideal entry point to get an overview of existing synonyms. Moreover, it helps to avoid ambiguities (completely different genes with the same gene symbol). Updates to the database were discontinued in 2013; however, the service is still very useful.

Entrez Gene

Other databases:

genenames.org (only human and vertebrates, example)

Protein names in Uniprot
http://www.uniprot.org/help/protein_names

Enzyme Symbols
= Additional terminologies =

Enzyme activity quantities
= Open questions/topics =
 * NCBI RefSeq sequences - exchanged within INSDC or only internal to GenBank? (solved!)
 * Protein databases: Preferred databases? Proper XRefs to other protein DBs available?
 * Protein INSDC accessions: only translated entries?
 * Linking gene references in databases: Proper Identifier for XRef available?