Data standards
Overview
Standardisation of data entities is of increasing value in ensuring researchers and end-users are able to navigate the world of brassica big data. The adoption of FAIR (Findable, Accessible, Interoperable and Re-usable) data principles underlies the ability of humans, machines and cyborgs to make use of relevant data and make meaningful connections. Several initiatives relevant to the brassica research community are underway. These include the development of Trait Dictionaries and Ontologies, and the standardisation of gene nomenclature (names).
Standardisation of gene model nomenclature for reference Brassica genomes
- He et al. (2021) 'Genome structural evolution in Brassica crops' published in Nature Plants - aligned reference pan-genomes having consistent MBGP gene nomenclature
- Following MBGP Steering Committee discussion in Jan 2019, the following (version 2) convention for gene model nomenclature was proposed, and is now adopted:
- Adopt a diploid pan-genome based nomenclature, recognised by the inclusion of 'p' prior to the gene model number, and prefixed by Genome and canonical chromosome representation. Thus:
<GENOME 1 LETTER>|<X><Chromosome number (leading zero)>p<5 digit gene model number>.<version number>_<GENUS 1 letter>[<species 2 letters>]<3 LETTER designating genotype/line/cultivar>
Example: C01p010030.1_BnaDAR
- It is proposed that the final three-letter cultivar/genotype code be managed as a unique identifier (linked to DOI or Biosample?) by an online registry analogous to airport codes, which allow for 26^3 (17,576) combinations.
Xenologues (lateral transfer) would be represented as the canonical diploid gene, e.g. genes on the translocated bottom of B. napus cv. Westar A07 would be represented as a canonical C06p012345.1_BnaWES.
- The above requires a consensus pan-genome for each of the diploid A,B,C genomes.
- It will then be possible to provide a 'legacy lookup' from existing gene models to the identifiers assigned under this standard.
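As an illustration of the version 2 syntax above, the following is a minimal Python sketch that splits an identifier into its parts. It is not an official MBGP tool: the names `GeneModelV2`, `parse_v2` and `REGISTRY_CAPACITY` are hypothetical. Two assumptions are made: the syntax specifies a 5 digit gene model number while the example on this page shows six digits, so the pattern accepts either; and the genome, genus and cultivar codes are taken to be upper-case ASCII letters.

```python
import re
from typing import NamedTuple, Optional

# Number of possible three-letter cultivar/genotype codes (A-Z in each position).
REGISTRY_CAPACITY = 26 ** 3


class GeneModelV2(NamedTuple):
    """Parts of a version 2 (pan-genome) gene model identifier."""
    genome: str             # single letter, e.g. "C" (or "X")
    chromosome: int         # from the two-digit, zero-padded chromosome number
    model_number: int       # gene model number
    version: int            # version number after the "."
    genus: str              # single letter, e.g. "B"
    species: Optional[str]  # two letters, e.g. "na"; optional in the syntax
    line_code: str          # three-letter genotype/line/cultivar code


# Assumption: 5 or 6 digits are accepted for the gene model number.
_V2_PATTERN = re.compile(
    r"^(?P<genome>[A-Z])(?P<chrom>\d{2})p(?P<model>\d{5,6})"
    r"\.(?P<version>\d+)_(?P<genus>[A-Z])(?P<species>[a-z]{2})?"
    r"(?P<line>[A-Z]{3})$"
)


def parse_v2(identifier: str) -> GeneModelV2:
    """Parse an identifier such as 'C01p010030.1_BnaDAR'."""
    m = _V2_PATTERN.match(identifier)
    if m is None:
        raise ValueError(f"not a version 2 gene model identifier: {identifier!r}")
    return GeneModelV2(
        genome=m["genome"],
        chromosome=int(m["chrom"]),
        model_number=int(m["model"]),
        version=int(m["version"]),
        genus=m["genus"],
        species=m["species"],  # None when the species letters are omitted
        line_code=m["line"],
    )
```

For the example above, `parse_v2("C01p010030.1_BnaDAR")` yields genome `C`, chromosome `1`, version `1` and cultivar code `DAR`.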
Previous Version 1: 2014-18, now deprecated
A distinct nomenclature standard for gene-model annotation was established and provisionally ratified by the MBGP Steering Group in 2014. In July 2018, it was decided to review this in light of pan-genome complexity and the need for a pragmatic system that would be resilient for the coming decade (see version 2 above).
Following discussions within the MBGP steering group and publication of reference genomes in 2014, the following standard was adopted for naming of gene models assigned to pseudo-chromosome sequences.
For clarity, this convention is specifically for genome annotation, and is distinct from the established convention of functional genome annotation originally proposed by Ostergaard & King below.
Formally:
The genus/species and genome designations follow the convention of Ostergaard & King, with chromosome numbers assigned with a leading zero (thus BnaC01 for chromosome C01 in B. napus). Gene models are numbered in increments of ten (e.g. 10, 20, 30), as zero-padded 5-digit integers, from top to bottom of the correctly orientated pseudochromosome sequence. This allows for additional or alternative gene models to be inserted. A version number, 1 by default, is assigned after a "." (e.g. to distinguish different splicing models). Following this, in order to distinguish between reference sequences from different plant genotypes (e.g. TO1000 and O212 for B. oleracea), a single capital letter is allocated (e.g. "T" or "O"). Thus:
<GENUS 1 LETTER> [<species 2 letters>]<GENOME 1 LETTER>|<X><Chromosome number (leading zero)>g<5 digit gene model number>.<version number><1 LETTER designating Genotype/line/cultivar>
where < > surrounds categories, [ ] indicates an optional item and | denotes "or". When referring to gene names, the string is italicized, whilst the corresponding protein name is not.
Example: BnaC01g010030.1D
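The numbering and assembly rules above can be sketched in Python as follows. This is an illustration only: `model_number` and `format_v1` are hypothetical helper names, and the `width` parameter is an assumption added because the convention states 5 digits while the example above shows a six-character number.

```python
def model_number(rank: int, step: int = 10) -> int:
    """Number for the gene model at position `rank` (1-based, top to bottom).

    Numbering in steps of ten leaves gaps, so additional or alternative
    gene models can be inserted later without renumbering.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return rank * step


def format_v1(genus: str, species: str, genome: str, chromosome: int,
              number: int, version: int = 1, line: str = "",
              width: int = 5) -> str:
    """Assemble a version 1 identifier from its parts.

    `width` is the zero-padded length of the gene model number
    (assumption: 5 per the stated rule, 6 to reproduce the page's example).
    """
    return (f"{genus}{species}{genome}{chromosome:02d}"
            f"g{number:0{width}d}.{version}{line}")


# The first three gene models on a pseudochromosome.
first_three = [model_number(rank) for rank in (1, 2, 3)]

# Reproduces the example identifier when a six-character number is used.
example = format_v1("B", "na", "C", 1, 10030, version=1, line="D", width=6)
```

Here `first_three` is `[10, 20, 30]` and `example` is `BnaC01g010030.1D`.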
Shortened forms of the above are also used, in which the version number and its preceding '.' delimiter are omitted, so that the genotype letter directly follows the gene model number. Thus:
from Genoscope annotation: BnaC01g010030D
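Converting between the full and shortened forms is mechanical. The sketch below is a hypothetical helper (`to_short_form` is not part of any published tool), and it assumes a gene model number of 5 or 6 digits and a single upper-case genotype letter.

```python
import re

# Full form:  BnaC01g010030.1D   (version number present)
# Short form: BnaC01g010030D     (version number and '.' omitted)
_V1_PATTERN = re.compile(
    r"^(?P<prefix>[A-Z](?:[a-z]{2})?[A-Z]\d{2}g\d{5,6})"
    r"(?:\.(?P<version>\d+))?(?P<line>[A-Z])$"
)


def to_short_form(identifier: str) -> str:
    """Drop the version number and '.' from a version 1 identifier.

    Identifiers that are already in the short form are returned unchanged.
    """
    m = _V1_PATTERN.match(identifier)
    if m is None:
        raise ValueError(f"not a version 1 gene model identifier: {identifier!r}")
    return m["prefix"] + m["line"]
```

Note that the short form discards the version number, so the conversion cannot be reversed without assuming the default version 1.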
Functional gene nomenclature
- A standardised nomenclature was proposed by Lars Ostergaard (JIC, Norwich) and Graham King (Rothamsted Research) for genes described within the Brassica genus. This enables a distinction to be made between copies associated with the different haploid genomes, as well as at paralogous loci. The nomenclature convention was discussed at the January 2008 MBGP Steering Committee meeting, and then put out for wider consultation within the international research community. Useful feedback was obtained, and incorporated where possible into the subsequent publication. The standard nomenclature convention has now been circulated to editors of plant and genetics journals, as well as GenBank/EMBL/DDBJ so that there is consistency in use within the literature and database repositories.
Syntax:
<GENUS 1 LETTER> [<species 2 letters>]<GENOME 1 LETTER>|<X>.<NAME 3-6 LETTER CODE>.<locus assignment 1 letter>
where < > surrounds categories, [ ] indicates an optional item and | denotes "or". When referring to gene names, the string is italicized, whilst the corresponding protein name is not.
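The functional gene syntax can be checked in the same way. The following Python sketch is illustrative only: `FunctionalGeneName` and `parse_functional` are hypothetical names, and the identifier `BnaC.FUL.a` used below is constructed from the syntax purely as an example rather than taken from a database. It assumes that the 3-6 character name code starts with a letter and may contain digits, and that the locus letter may be upper or lower case.

```python
import re
from typing import NamedTuple, Optional


class FunctionalGeneName(NamedTuple):
    """Parts of a functional gene name in the syntax given above."""
    genus: str              # single letter, e.g. "B"
    species: Optional[str]  # two letters, e.g. "na"; optional in the syntax
    genome: str             # single letter, or "X"
    name: str               # 3-6 character gene name code
    locus: str              # single-letter locus assignment


# Assumption: the name code starts with a letter and may contain digits.
_FUNCTIONAL_PATTERN = re.compile(
    r"^(?P<genus>[A-Z])(?P<species>[a-z]{2})?(?P<genome>[A-Z])"
    r"\.(?P<name>[A-Za-z][A-Za-z0-9]{2,5})\.(?P<locus>[A-Za-z])$"
)


def parse_functional(name: str) -> FunctionalGeneName:
    """Split a functional gene name into its components."""
    m = _FUNCTIONAL_PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a functional gene name: {name!r}")
    return FunctionalGeneName(
        genus=m["genus"],
        species=m["species"],  # None when the species letters are omitted
        genome=m["genome"],
        name=m["name"],
        locus=m["locus"],
    )
```

For the illustrative name `BnaC.FUL.a`, the parts are genus `B`, species `na`, genome `C`, name `FUL` and locus `a`.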
For information -
Gene Class Symbol list for Arabidopsis (TAIR)