Molecular data in PANGAEA

This page contains information about the types of molecular data available in PANGAEA and describes guidelines for integrating new molecular data in PANGAEA datasets.

Definition of molecular data
The term "molecular data" covers all kinds of data associated with molecular biology and its central dogma: mainly genetic data, as well as protein and enzyme data and metadata.

Which kind of molecular data is accepted by PANGAEA?
In the context of molecular data, PANGAEA serves as a repository for environmental data and metadata that specialized databases do not store.

Sequencing data (nucleotides)
PANGAEA does not accept nucleotide sequences or directly related data. Please submit such data to one of the INSDC (International Nucleotide Sequence Database Collaboration) databases:
 * DNA Data Bank of Japan (DDBJ)
 * GenBank
 * European Nucleotide Archive (ENA)

However, metadata related to nucleotide sequences and "omics" projects can be submitted to PANGAEA.

By default, PANGAEA links to the European Nucleotide Archive (ENA) using INSDC accession numbers. The following kinds of sequencing data can be linked to PANGAEA:
 * Nucleotide sequences
 * Gene sequences (enzymes, 16S/18S rRNA)
 * Sequence Read Archive (SRA) data ((meta)genomics/transcriptomics)
 * Sequencing projects (e.g. BioProject)
 * Samples (e.g. BioSample)
 * Sequencing runs
 * Whole genome shotgun sequencing project (WGS)
 * etc.

Protein/Enzyme data
PANGAEA does not accept data that characterize proteins and enzymes. However, references can be added to a dataset, e.g.:
 * Protein accession numbers (INSDC-databases, UniProt)
 * ORF names, gene names, and gene symbols
 * EC numbers of the IUBMB (International Union of Biochemistry and Molecular Biology)
 * etc.

In the future, EC numbers will be used for cross-linking to BRENDA (BRaunschweig ENzyme DAtabase).
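
Before adding EC numbers to a dataset, it can help to check that each one is well formed. The following Python sketch assumes the usual notation of four dot-separated fields, with "-" marking unspecified lower levels; it does not handle preliminary EC numbers. The helper name `parse_ec_number` is hypothetical and not part of any PANGAEA tool.

```python
import re

# Assumed shape: optional "EC" prefix, then four dot-separated fields.
# A "-" marks a level that is left unspecified.
EC_NUMBER = re.compile(r"^(?:EC\s*)?(\d+)\.(\d+|-)\.(\d+|-)\.(\d+|-)$")

def parse_ec_number(text):
    """Return the four EC fields as strings, or None if the text is malformed."""
    match = EC_NUMBER.match(text.strip())
    if match is None:
        return None
    fields = match.groups()
    # Once a level is "-", every lower level must be "-" as well.
    seen_dash = False
    for field in fields:
        if field == "-":
            seen_dash = True
        elif seen_dash:
            return None
    return fields

for candidate in ["EC 3.2.1.4", "1.1.-.-", "1.-.1.1", "3.2.1"]:
    print(candidate, "->", parse_ec_number(candidate))
```

Returning the fields as a tuple keeps the enzyme class, subclass, and sub-subclass available separately, which may be convenient when grouping parameters.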

Experimental data
Experimental data refers to measurements and counts in the context of molecular biology. PANGAEA contains many different measurement quantities (simple and complex parameters). Some examples:

 * Gene expression data
 * Protein content data
 * Protein production data
 * Enzyme activity data
 * FISH (fluorescence in situ hybridization) counts
 * etc.

Submission guideline for molecular data
Depending on the dataset, please consider the following rules and recommendations.

Gene names and symbols

 * Preferably use approved/official gene symbols in accordance with the applicable nomenclatures.
 * Providing the full gene product name (functional RNA, protein, enzyme) is appreciated because gene symbols can be ambiguous.

Protein names and symbols

 * Gene and protein nomenclature are intertwined. As with gene symbols, follow the applicable nomenclatures.
 * Enzyme names: using the accepted names of the Nomenclature Committee of the IUBMB (International Union of Biochemistry and Molecular Biology) is strongly recommended.

Database accession numbers
An accession number is a unique identifier, usually in alphanumeric format. In the biosciences, the term is mainly used for IDs of genetic and protein sequences, each of which refers to a specific entry in a database.

Genetic accession numbers

 * Only provide INSDC accession numbers. (PANGAEA resolves all INSDC accession numbers via ENA.)
 * Do not use other kinds of accession numbers (for example, GOLD Study IDs from the Genomes OnLine Database, "GOLD").
 * The NCBI GenInfo Identifier (GI) is not an INSDC accession number and cannot be resolved by PANGAEA.
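
As a rough pre-submission check, these rules can be sketched in Python. The pattern below assumes the commonly documented formats for INSDC nucleotide sequence accessions (one letter plus five digits, or two letters plus six or eight digits, optionally followed by a version suffix). It is an illustration only, not PANGAEA's own validation, and it does not cover project, sample, run, or WGS identifiers. The helper name `classify_genetic_id` and the example values are made up.

```python
import re

# Assumed INSDC nucleotide accession formats:
# 1 letter + 5 digits, or 2 letters + 6 or 8 digits, optional ".version".
INSDC_NUCLEOTIDE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}(?:\d{6}|\d{8}))(?:\.\d+)?$")

def classify_genetic_id(identifier):
    """Return a rough label for an identifier (hypothetical helper)."""
    value = identifier.strip()
    if value.isdigit():
        # Purely numeric IDs look like GI numbers, which are not INSDC accessions.
        return "possible GI number (not resolvable)"
    if INSDC_NUCLEOTIDE.match(value.upper()):
        return "INSDC nucleotide accession"
    return "unrecognized"

for example in ["AB123456", "U12345.1", "15718680", "Gs0000001"]:
    print(example, "->", classify_genetic_id(example))
```

A pattern match only shows that an identifier has a plausible shape; whether the entry actually exists can only be confirmed by looking it up in ENA.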

Protein accession numbers

 * INSDC accession numbers for proteins can be resolved by PANGAEA in the same way as those for genetic sequences and are therefore preferred.
 * UniProt accession numbers are also allowed, but cannot be resolved.
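
To tell the two kinds of protein accession numbers apart, a sketch along the following lines may be useful. It assumes that INSDC protein accessions consist of three letters followed by five or seven digits (with an optional version suffix) and uses the accession pattern published in the UniProt help pages. The helper name `classify_protein_id` is hypothetical, and the check is purely syntactic.

```python
import re

# Assumed INSDC protein accession format: 3 letters + 5 or 7 digits, optional ".version".
INSDC_PROTEIN = re.compile(r"^[A-Z]{3}(?:\d{5}|\d{7})(?:\.\d+)?$")
# Accession pattern as given in the UniProt help pages.
UNIPROT = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

def classify_protein_id(identifier):
    """Return a rough label for a protein identifier (hypothetical helper)."""
    value = identifier.strip().upper()
    if INSDC_PROTEIN.match(value):
        return "INSDC protein accession (preferred)"
    if UNIPROT.match(value):
        return "UniProt accession (allowed, not resolved)"
    return "unrecognized"

for example in ["AAA12345.1", "P12345", "A0A023GPI8", "XYZ"]:
    print(example, "->", classify_protein_id(example))
```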