Intern:Molecular data in PANGAEA

This wiki article is not yet complete and is still under development.

The internal section of this article serves as a guideline for curators and is structured by data type (IDs/Accessions, Nomenclature, Databases).

= Accession numbers and IDs =

Recognizing the origin of a set of IDs/accession numbers
First step: Check the number of prefix letters

1 letter:

2 letters: likely to be a simple nucleotide sequence (Check: ...)

2 letters and underscore: likely to be an NCBI RefSeq accession (Check: ...)

tbc

Second step: Check the number of digits

Does the combination of letters and digits fit the scheme of the suspected data type? (see the description in the respective chapter)

Third step: Check the list of prefixes

Is the prefix listed in the chapter of the suspected data type?

Fourth step: Try to search for the accession number/ID in the suspected origin database

If you are unsure whether the IDs provided by the scientists are INSDC accessions, you can check an ID with the link format http://www.ebi.ac.uk/ena/data/view/xxx (xxx = INSDC accession number). Attention: this only works if the data is already published!
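The first and fourth steps can be sketched in Python as follows. This is only a minimal sketch and not an existing PANGAEA routine: the function names `split_accession`, `first_guess` and `ena_check_url` are hypothetical, and the hints merely mirror the rules of thumb listed above (the entry for 1 letter is still open, so it is treated as undetermined).

```python
import re

# Leading letters, an optional underscore, then everything else.
_ID_PATTERN = re.compile(r"^([A-Za-z]*)(_?)(.*)$", re.DOTALL)

# Link format quoted in this article.
ENA_VIEW_URL = "http://www.ebi.ac.uk/ena/data/view/{}"


def split_accession(identifier):
    """Return (prefix_letters, has_underscore, remainder) for an ID string."""
    prefix, underscore, rest = _ID_PATTERN.match(identifier.strip()).groups()
    return prefix.upper(), underscore == "_", rest


def first_guess(identifier):
    """Hypothetical helper: a rough hint based only on the prefix (first step)."""
    prefix, has_underscore, _ = split_accession(identifier)
    if len(prefix) == 2 and has_underscore:
        return "likely an NCBI RefSeq accession"
    if len(prefix) == 2:
        return "likely a nucleotide sequence accession"
    return "undetermined, continue with the next steps"


def ena_check_url(identifier):
    """Build the ENA link for the fourth step (resolves only for published data)."""
    return ENA_VIEW_URL.format(identifier.strip())
```

For example, `first_guess("NM_000546")` returns the RefSeq hint, while `ena_check_url("AB123456")` returns the link to open in a browser. The hint is only a starting point; the digit count and the prefix lists still have to be checked.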

Rules for using the parameter "Accession number, genetics" (ID = 153190)
Please use this parameter for all INSDC accession numbers (= accession numbers that can be resolved by NCBI, ENA or DDBJ). A routine will automatically link to the entry in ENA, using the URL http://www.ebi.ac.uk/ena/data/view/xxx (xxx = accession number). Please do not add a new column "Accession number, link" (ID = 130729), because it would be redundant.

Do not use the parameter for non-INSDC accession numbers. In that case, use the generic parameter "accession number" instead.

Standard INSDC accession numbers
INSDC accession numbers for single nucleotide sequences consist of 1 or 2 letters + 5-6 digits.

Protein sequence accession numbers consist of 3 letters + 5 digits.

WGS (Whole Genome Shotgun sequencing project) accession numbers consist of 4 letters + 8-9 digits.

Version numbers are separated by a dot: accession_number.#

A list of all standard INSDC accession number prefixes (sorted by data type and origin) can be found here.
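The format rules above can be written as regular expressions. The following Python sketch only transcribes the rules as stated in this section (assuming upper-case letters and an optional ".version" suffix). It is not an official INSDC validator, the names `PATTERNS`, `classify_standard_accession` and `split_version` are hypothetical, and INSDC may use further formats that are not covered here.

```python
import re

# Patterns transcribed from the rules in this section (not an official list).
PATTERNS = {
    "nucleotide": re.compile(r"[A-Z]{1,2}[0-9]{5,6}(\.[0-9]+)?"),
    "protein": re.compile(r"[A-Z]{3}[0-9]{5}(\.[0-9]+)?"),
    "WGS": re.compile(r"[A-Z]{4}[0-9]{8,9}(\.[0-9]+)?"),
}


def classify_standard_accession(accession):
    """Return the matching data type, or None if no pattern fits."""
    text = accession.strip().upper()
    for data_type, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return data_type
    return None


def split_version(accession):
    """Split 'accession_number.#' into (accession_number, version or None)."""
    base, dot, version = accession.strip().partition(".")
    return base, (int(version) if dot and version.isdigit() else None)
```

A match only means that the format fits; whether the accession actually exists still has to be checked in the database.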

Sequence Read Archive accession numbers
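Sequence Read Archive (SRA) accession numbers (-omics data, high throughput) differ in format from standard INSDC accession numbers, but they are also INSDC accession numbers. The origin of most data can be deduced from the accession prefix: for BioProject and BioSample accessions from the letter following PRJ or SAM (N = NCBI, E = ENA, D = DDBJ), for the other SRA accessions from the first letter (S = NCBI, E = ENA, D = DDBJ).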
 * Project accession number prefixes:
 ** BioProject: PRJNA (NCBI), PRJE (EBI), PRJD (DDBJ) (sometimes plus an additional letter)
 ** Synonyms: SRP (NCBI), ERP (ENA), DRP (DDBJ)
 * Sample accession number prefixes:
 ** BioSample: SAMN (NCBI), SAME (EBI), SAMD (DDBJ) (sometimes plus an additional letter)
 ** Synonyms: SRS (NCBI), ERS (ENA), DRS (DDBJ)
 * Experiment accession number prefixes:
 ** SRX (NCBI), ERX (ENA), DRX (DDBJ)
 * Sequencing run accession number prefixes:
 ** SRR (NCBI), ERR (ENA), DRR (DDBJ)
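The prefix list above can be turned into a small lookup. The Python sketch below is an illustration only: the table is transcribed from this list, the names `SRA_PREFIXES` and `classify_sra_accession` are hypothetical, and the assumption that the prefix is followed by at most one extra letter and then digits is a simplification of the real format definitions (see the ENA page on accession number formats).

```python
import re

# (prefix, object type, archive), transcribed from the list in this section.
SRA_PREFIXES = [
    ("PRJNA", "BioProject", "NCBI"),
    ("PRJE", "BioProject", "EBI"),
    ("PRJD", "BioProject", "DDBJ"),
    ("SRP", "project (synonym)", "NCBI"),
    ("ERP", "project (synonym)", "ENA"),
    ("DRP", "project (synonym)", "DDBJ"),
    ("SAMN", "BioSample", "NCBI"),
    ("SAME", "BioSample", "EBI"),
    ("SAMD", "BioSample", "DDBJ"),
    ("SRS", "sample (synonym)", "NCBI"),
    ("ERS", "sample (synonym)", "ENA"),
    ("DRS", "sample (synonym)", "DDBJ"),
    ("SRX", "experiment", "NCBI"),
    ("ERX", "experiment", "ENA"),
    ("DRX", "experiment", "DDBJ"),
    ("SRR", "sequencing run", "NCBI"),
    ("ERR", "sequencing run", "ENA"),
    ("DRR", "sequencing run", "DDBJ"),
]

# Assumption: at most one additional letter, followed by digits only.
_REMAINDER = re.compile(r"[A-Z]?[0-9]+")


def classify_sra_accession(accession):
    """Return (object type, archive) for a known prefix, otherwise None."""
    text = accession.strip().upper()
    for prefix, object_type, archive in SRA_PREFIXES:
        if text.startswith(prefix) and _REMAINDER.fullmatch(text[len(prefix):]):
            return object_type, archive
    return None
```

For instance, `classify_sra_accession("PRJEB1234")` yields `("BioProject", "EBI")`, while a standard accession such as `AB123456` yields `None`.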

These pages summarize information on different accession number types and their prefixes:

http://www.ebi.ac.uk/ena/submit/accession-number-formats (formats are using regular expressions)

https://www.ncbi.nlm.nih.gov/books/NBK56913/#search.why_does_sra_have_so_many_differe (scroll down for prefix tables)

For very comprehensive information about the SRA data structure, you may have a look at this page: https://trace.ddbj.nig.ac.jp/dra/submission_e.html

You can also have a look at the schematic overview (will be added)

Mapping SRA data to PANGAEA
As in PANGAEA, where data is substructured into projects, campaigns, events and datasets, the SRA databases also substructure their data. However, the data granularity of PANGAEA is not exactly the same as that of the SRA archives.

How shall SRA-data be integrated into PANGAEA?

to be continued

Non-INSDC IDs/accession numbers
For a list of external, non-INSDC databases and example IDs, please have a look at this site: http://www.insdc.org/db_xref.html

NCBI GenInfo Identifiers
The GenInfo Identifiers (GI numbers) run in parallel to the accession-version system. However, GI numbers are NCBI-specific and cannot be resolved with GenBank.

The format is "GI:#######". The number of digits can vary.

However, some scientists omit the prefix in their datasets and only provide the integer strings. This can lead to confusion, especially if the GenInfo Identifier is wrongly termed an accession number. In case of uncertainty (other databases also use plain integer strings), please check the number in the NCBI database or ask the authors.
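A small Python sketch of how such values could be screened: it accepts the "GI:" form described above as well as bare integer strings and reports whether the prefix was present, so that bare integers can be flagged for a manual check. The function name `parse_gi` is hypothetical and the sketch cannot tell whether a bare integer really is a GI number.

```python
import re

# "GI:" prefix (optional), followed by digits only.
_GI_PATTERN = re.compile(r"(GI:\s*)?([0-9]+)", re.IGNORECASE)


def parse_gi(value):
    """Return (digits, had_prefix), or None if the value is not GI-like."""
    match = _GI_PATTERN.fullmatch(str(value).strip())
    if match is None:
        return None
    return match.group(2), match.group(1) is not None
```

A result with `had_prefix` set to `False` means that the value is only an integer string; it should be verified in the NCBI database or with the authors before it is treated as a GenInfo Identifier.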

NCBI RefSeq database accessions
The RefSeq database by NCBI is a database separate from GenBank and only contains non-redundant reference sequences (DNA, RNA, protein), which are derived from INSDC sequences. The RefSeq accession number format can be distinguished from the INSDC format by an underscore, which separates the numerals from the two preceding letters. A list of existing RefSeq prefixes can be found here.

RefSeq accession numbers cannot be resolved by ENA. However, a RefSeq entry contains information about identical INSDC sequences.
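The underscore rule can be checked with a short regular expression. The Python sketch below follows the description in this section (two letters, underscore, numerals, optional version) and is an assumption-based illustration: the function name `looks_like_refseq` is hypothetical, and some RefSeq accessions (for example those with the NZ_ prefix) contain additional letters after the underscore and are deliberately not covered.

```python
import re

# Two letters, an underscore, digits and an optional ".version" suffix.
_REFSEQ_PATTERN = re.compile(r"[A-Z]{2}_[0-9]+(\.[0-9]+)?")


def looks_like_refseq(accession):
    """True if the string has the simple RefSeq shape described above."""
    return _REFSEQ_PATTERN.fullmatch(accession.strip().upper()) is not None
```

A positive result should still be compared with the list of RefSeq prefixes, since the pattern only tests the shape of the string.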

NCBI Entrez Gene database accessions
The Entrez Gene database contains additional information on genes found on RefSeq sequences. Entrez Gene identifiers consist only of numerals; they are not real accession numbers and cannot be resolved by ENA.

UniProt accession numbers
UniProt accession numbers have a fairly complex format and consist of 6 or 10 characters. They are not strictly divided into a prefix (letters) and a suffix (numbers); numbers and letters can be mixed.

The construction schema can be found in the UniProt manual.
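For a quick format check, the pattern can be expressed as a regular expression. The expression in the Python sketch below is, to the best of our knowledge, the one quoted in the UniProt help on accession numbers; please verify it against the current manual before relying on it. The function name `looks_like_uniprot` is hypothetical.

```python
import re

# Regular expression as quoted in the UniProt help (verify before use).
_UNIPROT_PATTERN = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot(accession):
    """True if the string has the shape of a 6- or 10-character UniProt accession."""
    return _UNIPROT_PATTERN.fullmatch(accession.strip().upper()) is not None
```

As with the other formats, a match only shows that the string is well-formed; it does not prove that the entry exists in UniProt.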

The UniProt database consists of two sections: curated protein sequences (Swiss-Prot) and automatically annotated, non-reviewed sequences (TrEMBL = translated EMBL, i.e. translated from genetic sequences). The evidence level (evidence for protein existence) is also given in an entry (more information here and here). The UniProt database mainly serves as a reference database. Therefore, UniProt accessions will rarely be given in PANGAEA datasets.

= Nomenclatures: Names and symbols (abbreviations) =

Information about the proper naming and abbreviation of genes, proteins, enzymes etc. is very important for parameter names. The term "symbol" is often used synonymously with gene, protein or enzyme names. However, in contrast to names, symbols are short abbreviations that follow a defined schema.

Gene names
tbc

Gene symbols
Unfortunately, there is no universal gene symbol nomenclature that could be used for every species. However, specific nomenclatures exist for some model organisms, e.g.:
 * Saccharomyces cerevisiae
 * Neurospora crassa
 * Drosophila melanogaster
 * Caenorhabditis elegans
 * etc.

A comprehensive list is available in the Styleguide or from UniProt.

Gene symbols in UniProt
UniProt uses the terms gene name and gene symbol synonymously. Moreover, it distinguishes 4 name types:
 * Name
 * Synonym
 * Ordered Locus Name (OLN)
 * ORF Name

The section "Names and Taxonomy" of a protein entry contains the information about gene names. The name type is also indicated. Please note that "Name" stands for the approved (or recommended) gene symbol, while "Synonym" stands for alternative symbols. If no nomenclature is available for an organism (see this list), UniProt recommends gene symbols based on the symbols of orthologous genes. In the best case, UniProt additionally provides references for gene and protein names.

OLNs and ORF names do not represent gene names as understood by PANGAEA.

For more information, read:

http://www.uniprot.org/help/gene_name

Locus tags
tbc

Databases for genes
KEGG: Kyoto Encyclopedia of Genes and Genomes

The best way to find proper gene names and symbols in KEGG is to search for the gene directly in the KEGG Orthology ("KO") database. The KEGG Orthology database has one entry for each group of orthologous genes (genes sharing a common ancestor); such an entry therefore represents a "master entry" for a gene in all organisms. The entries contain common gene symbols and names (see also the help document). In the Genes section, you can check which gene symbols are used for which organism (in brackets). Example entry

KO entries can usually also be accessed from KEGG Genes entries by clicking on the KO number underneath the description. This is useful if other databases (NCBI, UniProt etc.) link to the gene entry.

Entrez Gene

Other databases:

genenames.org (only human and vertebrates, example)

Protein names in UniProt
http://www.uniprot.org/help/protein_names

Enzyme Symbols

= Additional terminologies =

Enzyme activity quantities

= Open questions/topics =
 * NCBI RefSeq sequences: exchanged within INSDC or only internal to GenBank? (solved!)
 * Protein databases: Preferred databases? Proper XRefs to other protein DBs available?
 * Protein INSDC accessions: only translated entries?
 * Linking gene references in databases: Proper Identifier for XRef available?