Molecular data in PANGAEA

This page describes the types of molecular data available in PANGAEA and gives guidelines for integrating new molecular data into PANGAEA datasets.

Definition of molecular data
The term "molecular data" covers all kinds of data associated with molecular biology and its central dogma, mainly genetic data as well as protein and enzyme data and the related metadata.

Which kind of molecular data is accepted by PANGAEA?
In the context of molecular data, PANGAEA serves as a repository for environmental data and metadata that specialized databases do not store.

Sequencing data (nucleotides)
PANGAEA does not accept nucleotide sequences or directly related data. Such data should be submitted to one of the databases of the International Nucleotide Sequence Database Collaboration (INSDC):
 * DNA Data Bank of Japan (DDBJ)
 * GenBank
 * European Nucleotide Archive (ENA)

However, metadata related to nucleotide sequences and "omics" projects can be submitted to PANGAEA.

By default, PANGAEA links to the European Nucleotide Archive (ENA) using accession numbers. The following kinds of sequencing data can be linked to PANGAEA:
 * Nucleotide sequences
 * Gene sequences (enzymes, 16S/18S rRNA)
 * Sequence Read Archive "SRA" data ((meta)genomics/transcriptomics)
 * Sequencing projects (e.g. BioProject)
 * Samples (e.g. BioSample)
 * Sequencing runs
 * Whole genome shotgun sequencing project (WGS)
 * etc.

Protein/Enzyme data
PANGAEA does not accept data that characterize proteins and enzymes. However, references can be added to a dataset, e.g.:
 * Protein accession numbers (INSDC-databases, UniProt)
 * ORF/gene names
 * EC numbers
 * etc.

Experimental data
Experimental data refers to measurements and counts in the context of molecular biology. PANGAEA contains many different measurement quantities (simple and complex parameters). Some examples are:
 * Gene expression data
 * Protein content data
 * Protein production data
 * Enzyme activity data
 * FISH counts
 * etc.

Gene names and symbols
tbc

Locus tags
tbc

Database accession numbers
An accession number is a unique identifier, usually in an alphanumeric format. In the biosciences, the term is mainly used for IDs of genetic and protein sequences, each of which refers to a specific entry within a database.

INSDC accession numbers
All kinds of INSDC accession numbers for genetic data are grouped under the PANGAEA parameter "Accession number, genetics".

The INSDC databases (NCBI, ENA, DDBJ) provide simple nucleotide sequences as well as large sequencing projects, which structure the information and sequences into different entries: project, sample and experiment level (metadata) as well as run and genome-assembly level (raw and processed data). Because of this granularity, many different accession number formats exist to distinguish between the entry types.
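The granularity described here can be illustrated with a small sketch that sorts an accession number into an entry type by its pattern. The regular expressions are assumptions based on the publicly documented INSDC/ENA accession formats and are not exhaustive; the function name `classify_accession` and the type labels are hypothetical and not part of PANGAEA.

```python
import re

# Illustrative, non-exhaustive patterns for common INSDC accession formats.
# Assumption: formats follow the publicly documented ENA accession scheme.
# Order matters: the first matching pattern wins.
ACCESSION_PATTERNS = [
    ("project", re.compile(r"PRJ[EDN][A-Z]\d+")),
    ("biosample", re.compile(r"SAM[EDN][A-Z]?\d+")),
    ("study", re.compile(r"[EDS]RP\d{6,}")),
    ("sample", re.compile(r"[EDS]RS\d{6,}")),
    ("experiment", re.compile(r"[EDS]RX\d{6,}")),
    ("run", re.compile(r"[EDS]RR\d{6,}")),
    ("assembly", re.compile(r"GCA_\d{9}(\.\d+)?")),
    ("wgs", re.compile(r"[A-Z]{4}\d{2}S?\d{6,8}|[A-Z]{6}\d{2}S?\d{7,9}")),
    ("sequence", re.compile(r"([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(\.\d+)?")),
]


def classify_accession(accession):
    """Return a hypothetical entry-type label, or None if nothing matches."""
    text = accession.strip().upper()
    for entry_type, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(text):
            return entry_type
    return None


if __name__ == "__main__":
    for example in ["PRJEB1234", "ERR1234567", "AB123456", "12345678"]:
        print(example, classify_accession(example))
```

A purely numeric identifier or a gene symbol falls through all patterns and yields `None`, which can serve as a hint that the value is not an INSDC accession number.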

Since PANGAEA resolves all INSDC-accession numbers with ENA and provides a link to the entry, there is no need to distinguish between the different data entry types.
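As a minimal sketch of what resolving an accession number with ENA can look like, the snippet below builds a link to the ENA Browser from any accession number, regardless of entry type. The base URL is the public ENA Browser view address; how PANGAEA constructs its links internally is not described here, so the helper `ena_link` is a hypothetical illustration only.

```python
from urllib.parse import quote

# Public ENA Browser view address (assumption: PANGAEA's own link
# construction may differ from this sketch).
ENA_BROWSER_VIEW = "https://www.ebi.ac.uk/ena/browser/view/"


def ena_link(accession):
    """Build an ENA Browser URL for an accession number of any entry type."""
    cleaned = accession.strip()
    if not cleaned:
        raise ValueError("empty accession number")
    # Percent-encode anything that is not safe inside a URL path segment.
    return ENA_BROWSER_VIEW + quote(cleaned, safe="")


if __name__ == "__main__":
    print(ena_link("PRJEB1234"))
```

Because the same address scheme is used for every entry type, the helper needs no knowledge of whether the accession number points to a project, a sample, a run or a single sequence.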

Other accession numbers and identifiers
Accession numbers from databases other than the INSDC databases cannot be linked directly by PANGAEA. If available, INSDC accession numbers should be provided in preference when submitting data to PANGAEA. Examples of non-INSDC accession numbers are:
 * GOLD Study IDs from the Genomes OnLine Database (GOLD)
 * etc.

Furthermore, the NCBI GenInfo Identifier (GI) should not be confused with the accession number when submitting data. The GI number is not unique and is not used consistently by every INSDC database. It is therefore not suitable for cross-referencing between databases.
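Since GI numbers are plain integers while INSDC accession numbers start with letters, a simple pre-submission check can flag identifiers that are probably GI numbers. The sketch below is an illustration under that assumption; the function name `looks_like_gi_number` and the handling of a leading "GI:" prefix are hypothetical choices, not a PANGAEA feature.

```python
def looks_like_gi_number(identifier):
    """Return True if the identifier is purely numeric, as GI numbers are.

    Assumption: an optional leading "GI:" prefix is tolerated and removed
    before the check.
    """
    cleaned = identifier.strip()
    if cleaned.upper().startswith("GI:"):
        cleaned = cleaned[3:].strip()
    return cleaned.isascii() and cleaned.isdigit()


if __name__ == "__main__":
    submitted = ["AB123456", "15718680", "GI:15718680"]
    suspicious = [value for value in submitted if looks_like_gi_number(value)]
    print(suspicious)
```

Identifiers flagged this way should be replaced by the corresponding accession number before the data are submitted.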

Please note that gene names and gene symbols are not accession numbers (although some may resemble them in structure), because they are neither unique nor unambiguous. See the section on genes and gene data for more information.

Protein accession numbers
tbc