Intern:Molecular data in PANGAEA

This wiki article is not yet complete and is still under development.

= I do not know the origin of a set of IDs/accessions given by the scientist. How can I recognize the origin? =

First step: Check the number of prefix letters

1 letter:

2 letters: likely to be a simple nucleotide sequence (Check: ...)

2 letters and underscore: likely to be an NCBI RefSeq accession (Check: ...)

tbc

Second step: Check the number of digits

Does the combination of letters and digits fit the scheme of the suspected data type? (See the description in the respective chapter.)

Third step: Check the list of prefixes

Is the prefix listed in the chapter of the suspected data type?

Fourth step: Try to search for the accession number/ID in the suspected origin database

If you are unsure whether the IDs given by the scientist are INSDC accessions, you can check an ID with the link format http://www.ebi.ac.uk/ena/data/view/xxx (xxx = INSDC accession number). Note that this only works if the data have already been published.
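The first two steps can be sketched in a few lines of Python. This is only an illustration of the checks described above, not an existing PANGAEA tool: the function name `describe_id` and the example IDs are assumptions, and the hints cover only the cases this article already lists.

```python
import re

# Hypothetical helper (not part of any PANGAEA tool): splits an ID into
# prefix letters, an optional underscore, digits, and an optional version.
_ID_PATTERN = re.compile(r"([A-Za-z]*)(_?)(\d+)(?:\.(\d+))?")


def describe_id(identifier):
    """Return (n_letters, has_underscore, n_digits, hint), or None if the ID does not parse."""
    match = _ID_PATTERN.fullmatch(identifier.strip())
    if match is None:
        return None
    letters, underscore, digits, _version = match.groups()
    if len(letters) == 2 and underscore:
        hint = "likely an NCBI RefSeq accession"
    elif len(letters) == 2:
        hint = "likely a simple nucleotide sequence accession"
    else:
        # The article does not describe these cases yet.
        hint = "no rule in this article yet; check the prefix lists"
    return len(letters), bool(underscore), len(digits), hint


if __name__ == "__main__":
    for example in ["AB123456", "NM_000546.6", "SRR1234567"]:
        print(example, describe_id(example))
```

The hint is only a starting point; steps three and four (prefix list and a search in the suspected database) are still needed to confirm the origin.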

= Rules for using the parameter "Accession number, genetics" (ID = 153190) =

Please use this parameter for all INSDC accession numbers (i.e. accession numbers that can be resolved by NCBI, ENA or DDBJ). A routine will automatically link to the entry in ENA, using the URL http://www.ebi.ac.uk/ena/data/view/xxx (xxx = accession number). Please do not add a new column "Accession number, link" (ID = 130729), because this would be redundant.

Do not use the parameter for non-INSDC-accession numbers. In that case, you should use the generic parameter "accession number" instead.
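As a rough illustration of these rules, the following sketch picks the parameter and builds the ENA link. It is not the routine PANGAEA actually runs: the function name `choose_parameter` and the returned dictionary are assumptions, and the ID of the generic parameter is left as `None` because it is not given in this article.

```python
from urllib.parse import quote

ENA_VIEW_URL = "http://www.ebi.ac.uk/ena/data/view/"  # link format given in this article
INSDC_PARAMETER_ID = 153190  # "Accession number, genetics"


def choose_parameter(accession, is_insdc):
    """Return the parameter to use for an accession (hypothetical helper)."""
    accession = accession.strip()
    if is_insdc:
        return {
            "parameter": "Accession number, genetics",
            "parameter_id": INSDC_PARAMETER_ID,
            # The link is derived from the accession, so no extra link column is needed.
            "link": ENA_VIEW_URL + quote(accession),
        }
    # Non-INSDC IDs go to the generic parameter; its ID is not stated in this article.
    return {"parameter": "accession number", "parameter_id": None, "link": None}


if __name__ == "__main__":
    print(choose_parameter("AB123456", is_insdc=True))
    print(choose_parameter("12345", is_insdc=False))
```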

= Further information on INSDC-accession numbers (genetics) =

Standard INSDC accession numbers
INSDC accession numbers for single sequences consist of 2 letters + 5-6 digits. Genome accession numbers consist of 4 letters + 8-9 digits. Version numbers are separated by a dot: accession_number.#
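The format rules above translate directly into two regular expressions. The sketch below is a minimal example that follows only the letter and digit counts stated in this section; the function name `parse_standard_accession` and the example accessions are assumptions, and a match does not prove that the accession exists in INSDC.

```python
import re

# Patterns follow the counts stated in this section, nothing more.
SINGLE_SEQUENCE = re.compile(r"[A-Z]{2}\d{5,6}")  # 2 letters + 5-6 digits
GENOME = re.compile(r"[A-Z]{4}\d{8,9}")           # 4 letters + 8-9 digits


def parse_standard_accession(text):
    """Return (kind, accession, version) or None if the text fits neither pattern."""
    base, dot, version = text.strip().partition(".")
    if dot and not version.isdigit():
        return None  # a dot must be followed by a numeric version
    if SINGLE_SEQUENCE.fullmatch(base):
        kind = "single sequence"
    elif GENOME.fullmatch(base):
        kind = "genome"
    else:
        return None
    return kind, base, int(version) if dot else None


if __name__ == "__main__":
    for example in ["AB123456.1", "ABCD01000001", "NM_000546"]:
        print(example, parse_standard_accession(example))
```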

Sequence Read Archive accession numbers
Sequence Read Archive accession numbers (-omics data, high throughput) differ in format from standard INSDC accession numbers, but they are INSDC accession numbers as well. The origin of most data can be deduced from the first letter of the accession prefix (N = NCBI, E = ENA, D = DDBJ).
 * Project accession number prefixes:
   * BioProject: PRJNA (NCBI), PRJD (DDBJ)
   * Synonyms: SRP (NCBI), ERP (ENA), DRP (DDBJ)
 * Sample accession number prefixes:
   * BioSample: SAMN (NCBI), SAMD (DDBJ)
   * Synonyms: SRS (NCBI), ERS (ENA), DRS (DDBJ)
 * Experiment accession number prefixes:
   * SRX (NCBI), ERX (ENA), DRX (DDBJ)
 * Sequencing run accession number prefixes:
   * SRR (NCBI), ERR (ENA), DRR (DDBJ)
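The prefix list above can be used as a simple lookup table. The following sketch contains only the prefixes named in this list (so it is incomplete by design); the function name `classify_sra_accession` and the example accessions are assumptions made for illustration.

```python
# Prefix -> (record type, archive), taken only from the list in this article.
SRA_PREFIXES = {
    "PRJNA": ("BioProject", "NCBI"),
    "PRJD": ("BioProject", "DDBJ"),
    "SRP": ("project", "NCBI"),
    "ERP": ("project", "ENA"),
    "DRP": ("project", "DDBJ"),
    "SAMN": ("BioSample", "NCBI"),
    "SAMD": ("BioSample", "DDBJ"),
    "SRS": ("sample", "NCBI"),
    "ERS": ("sample", "ENA"),
    "DRS": ("sample", "DDBJ"),
    "SRX": ("experiment", "NCBI"),
    "ERX": ("experiment", "ENA"),
    "DRX": ("experiment", "DDBJ"),
    "SRR": ("sequencing run", "NCBI"),
    "ERR": ("sequencing run", "ENA"),
    "DRR": ("sequencing run", "DDBJ"),
}


def classify_sra_accession(accession):
    """Return (record type, archive) for a known prefix, else None."""
    accession = accession.strip().upper()
    # Try longer prefixes first so that a longer prefix wins over a shorter one.
    for prefix in sorted(SRA_PREFIXES, key=len, reverse=True):
        if accession.startswith(prefix):
            return SRA_PREFIXES[prefix]
    return None


if __name__ == "__main__":
    for example in ["SRR1234567", "PRJNA12345", "ERX000001", "AB123456"]:
        print(example, classify_sra_accession(example))
```

A result of `None` only means that the prefix is not in this table; the pages linked below list further prefixes.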

These pages summarize information on different accession number types and their prefixes:

http://www.ebi.ac.uk/ena/submit/accession-number-formats

https://www.ncbi.nlm.nih.gov/books/NBK56913/#search.why_does_sra_have_so_many_differe (scroll down for prefix tables)

For very comprehensive information about the SRA data structure, have a look at this page: https://trace.ddbj.nig.ac.jp/dra/submission_e.html

You can also have a look at the schematic overview (will be added)

= Cross-reference IDs/accessions for genetics (non-INSDC) =

For a list of external, non-INSDC databases and example IDs, please have a look at this site: http://www.insdc.org/db_xref.html

NCBI RefSeq database
The RefSeq database by NCBI is a separate database from GenBank and only contains non-redundant reference sequences (DNA, RNA, protein), which are derived from INSDC sequences. The RefSeq accession number format can be distinguished from the INSDC format by its underscore, which separates the numerals from the preceding two letters. A list of existing RefSeq prefixes can be found here.

RefSeq accession numbers cannot be resolved by ENA. However, the RefSeq entry contains information about identical INSDC sequences.

NCBI Entrez Gene database
The Entrez Gene database contains additional information on genes found on RefSeq sequences. Entrez Gene identifiers consist only of numerals; they are not real accession numbers and cannot be resolved by ENA.

= Genes and ORFs (open reading frames) =

Distinction between gene name and symbol
Definition: tbc

= Open questions/topics =
 * NCBI RefSeq sequences - exchanged within INSDC or only internal to GenBank? (solved!)
 * Protein databases: Preferred databases? Proper XRefs to other protein DBs available?
 * Linking gene references in databases: Proper Identifier for XRef available?