Molecular data in PANGAEA
This page describes the types of molecular data available in PANGAEA and gives guidelines for integrating new molecular data into PANGAEA datasets.
Definition of molecular data
The term "molecular data" covers all kinds of data that are associated with molecular biology or its central dogma: mainly genetic data, as well as protein and enzyme data and metadata.
Which kind of molecular data is accepted by PANGAEA?
In the context of molecular data, PANGAEA serves as a repository for environmental data and metadata that are not stored in specialized databases.
Sequencing data (nucleotides)
PANGAEA does not accept nucleotide sequences or directly related data. To archive such data, please submit them to one of the INSDC (International Nucleotide Sequence Database Collaboration) databases: DDBJ, ENA, or GenBank.
However, metadata related to nucleotide sequences and "omics" projects can be submitted to PANGAEA.
By default, PANGAEA links data to the European Nucleotide Archive (ENA) using INSDC accession numbers. The following kinds of sequencing data can be linked to PANGAEA:
- Nucleotide sequences
- Gene sequences (enzymes, 16S/18S rRNA)
- Sequence Read Archive (SRA) data ((meta)genomics/transcriptomics)
- Sequencing projects (e.g. BioProject)
- Samples (e.g. BioSample)
- Sequencing runs
- Whole genome shotgun sequencing project (WGS)
- etc.
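To illustrate the linking described above, here is a minimal Python sketch that builds an ENA Browser link from an accession number. The function name `ena_link` and the URL pattern `https://www.ebi.ac.uk/ena/browser/view/<accession>` are assumptions made for this example; they do not describe how PANGAEA creates its links internally.

```python
from urllib.parse import quote

# Assumed base URL of the public ENA Browser; not taken from PANGAEA itself.
ENA_VIEW_URL = "https://www.ebi.ac.uk/ena/browser/view/"


def ena_link(accession: str) -> str:
    """Return an ENA Browser URL for an INSDC accession number (hypothetical helper)."""
    cleaned = accession.strip()
    if not cleaned:
        raise ValueError("empty accession number")
    # Percent-encode the value so unexpected characters cannot break the URL.
    return ENA_VIEW_URL + quote(cleaned, safe="")


if __name__ == "__main__":
    # PRJEB1234 is a made-up project accession used only as sample input.
    print(ena_link(" PRJEB1234 "))
```

The same helper could be applied to any of the record types listed above (projects, samples, runs, sequences), since they all share the INSDC accession namespace.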
Protein/Enzyme data
PANGAEA does not accept data that characterize proteins and enzymes. However, references can be added to a dataset, e.g.:
- Protein accession numbers (INSDC-databases, UniProt)
- ORF names / gene names / gene symbols
- EC-numbers of IUBMB (International Union of Biochemistry and Molecular Biology)
- etc.
In the future, EC numbers will be used for crosslinking PANGAEA and BRENDA (BRaunschweig ENzyme DAtabase).
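Because EC numbers are meant to serve as crosslinking keys, it can help to check their format before submission. The following is a minimal sketch, assuming the usual four-field notation (for example 3.2.1.4) with "-" allowed for unspecified trailing fields and enzyme classes 1 to 7. The function name `normalize_ec_number` is hypothetical, and preliminary numbers containing "n" are deliberately not handled.

```python
import re
from typing import Optional

# Four dot-separated fields with an optional "EC" prefix.
# Assumption: the first field is an enzyme class from 1 to 7.
EC_PATTERN = re.compile(r"(?:EC\s*)?([1-7])\.(\d+|-)\.(\d+|-)\.(\d+|-)")


def normalize_ec_number(text: str) -> Optional[str]:
    """Return the EC number without prefix, or None if the format is not recognized."""
    match = EC_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    fields = match.groups()
    # Once a field is "-", every following field must be "-" as well.
    seen_dash = False
    for field in fields[1:]:
        if field == "-":
            seen_dash = True
        elif seen_dash:
            return None
    return ".".join(fields)


if __name__ == "__main__":
    for candidate in ["EC 3.2.1.4", "1.1.-.-", "1.-.1.1", "3.2.1"]:
        print(candidate, "->", normalize_ec_number(candidate))
```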
Experimental data
Experimental data refers to measurements and counts in the context of molecular biology. PANGAEA contains many different measurement quantities (simple and complex parameters). Some examples:
- Gene expression data
- Protein content data
- Protein production data
- Enzyme activity data
- FISH counts
- etc.
Submission guideline for molecular data
Depending on the dataset, please consider the following rules and recommendations.
Gene names and symbols
- Preferably use approved/official gene symbols in accordance with the applicable nomenclatures
- Please also provide the full gene product name (functional RNA, protein, enzyme), because gene symbols can be ambiguous
Protein names and symbols
- Gene and protein nomenclature are intertwined. As for gene symbols, follow the applicable nomenclatures
- Enzyme names: Use of the names accepted by the Nomenclature Committee of the IUBMB (International Union of Biochemistry and Molecular Biology) is strongly recommended.
Database accession numbers
An accession number is a unique identifier, usually in an alphanumeric format. In the biosciences, the term is mainly used for genetic and protein sequence IDs, each of which refers to a specific entry within a database.
Genetic accession numbers
- Only provide INSDC accession numbers (PANGAEA resolves all INSDC accession numbers via ENA)
- Do not use other kinds of accession numbers (for example, GOLD Study IDs from the Genomes OnLine Database "GOLD")
- The NCBI GenInfo Identifier (GI) is not an INSDC accession number and cannot be resolved by PANGAEA.
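The rules above can be sketched as a simple pre-submission check. The patterns below are simplified assumptions modelled on publicly documented INSDC accession formats; they are not the rules PANGAEA applies, and the names `INSDC_PATTERNS` and `insdc_kind` are hypothetical. A match only means the identifier looks like an INSDC accession number, not that the record exists.

```python
import re
from typing import Optional

# Simplified, illustrative patterns (assumptions, not PANGAEA's own rules).
INSDC_PATTERNS = {
    "project": r"PRJ[EDN][A-Z]\d+",
    "study": r"[EDS]RP\d{6,}",
    "biosample": r"SAM[EDN][A-Z]?\d+",
    "sample": r"[EDS]RS\d{6,}",
    "experiment": r"[EDS]RX\d{6,}",
    "run": r"[EDS]RR\d{6,}",
    "wgs": r"[A-Z]{4}\d{2}S?\d{6,8}|[A-Z]{6}\d{2}S?\d{7,9}",
    "sequence": r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?",
}


def insdc_kind(identifier: str) -> Optional[str]:
    """Return the kind of INSDC record the identifier looks like, or None."""
    value = identifier.strip()
    for kind, pattern in INSDC_PATTERNS.items():
        if re.fullmatch(pattern, value):
            return kind
    # Purely numeric values (such as GI numbers) and other schemes end up here.
    return None


if __name__ == "__main__":
    # All identifiers below are made-up sample inputs.
    for candidate in ["ERR123456", "PRJEB1234", "AB123456.1", "12345678", "Gs0000001"]:
        kind = insdc_kind(candidate)
        print(candidate, "->", kind if kind else "not an INSDC accession number")
```

Identifiers for which the function returns `None`, such as a purely numeric GI, would need to be replaced by the corresponding INSDC accession number before submission.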
Protein accession numbers
- INSDC accession numbers for proteins can be resolved by PANGAEA in the same way as those for genetic sequences and are therefore preferred.
- UniProt accession numbers are also allowed, but cannot be resolved.
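As a rough illustration of the two cases above, the following sketch distinguishes INSDC protein identifiers from UniProt accession numbers by format alone. The INSDC pattern (three letters followed by five or seven digits, with an optional version suffix) is a simplified assumption, the UniProt pattern follows the accession format published in the UniProt documentation, and the function name `protein_id_source` is hypothetical.

```python
import re
from typing import Optional

# Assumed INSDC protein identifier format: 3 letters + 5 or 7 digits, optional ".version".
INSDC_PROTEIN = re.compile(r"[A-Z]{3}(?:\d{5}|\d{7})(?:\.\d+)?")
# UniProtKB accession format (6 or 10 characters).
UNIPROT = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def protein_id_source(identifier: str) -> Optional[str]:
    """Return "insdc", "uniprot", or None based on the identifier's format only."""
    value = identifier.strip()
    if INSDC_PROTEIN.fullmatch(value):
        return "insdc"    # preferred: can be resolved
    if UNIPROT.fullmatch(value):
        return "uniprot"  # allowed, but cannot be resolved
    return None


if __name__ == "__main__":
    # Sample inputs chosen only to exercise each branch.
    for candidate in ["CAA12345.1", "P12345", "A0A022YWF9", "12345"]:
        print(candidate, "->", protein_id_source(candidate))
```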