
Ensembl Gene Set

Background

Ensembl is a system that provides automated genome annotation and visualisation of the annotated genomes. The Ensembl analysis and annotation pipeline is based on a set of heuristic rules that a human annotator would use (Curwen et al., 2004). Because genomes are now sequenced on an industrial scale, labour-intensive manual curation can no longer keep pace with the amount of information generated. The Ensembl genome annotation pipeline was therefore conceived to annotate genome sequences in a timely fashion.

All Ensembl gene predictions are based on experimental evidence, which is imported from manually curated UniProt/Swiss-Prot, partially manually curated NCBI RefSeq and automatically annotated UniProt/TrEMBL records. Untranslated regions (UTRs) are annotated to the extent supported by EMBL mRNA records. As there is no guarantee that UTR sequences in EMBL records are complete, there is similarly no guarantee that the Ensembl genome analysis and annotation pipeline has enough biological evidence to predict complete UTRs. Promoter regions are currently not annotated by Ensembl, as the set of well-characterised promoters is still small and no algorithm currently yields reliable results on a genomic scale.

Sources of Biological Evidence

Gene-Build Procedure

Ensembl gene builds are complex and involve two main steps (Curwen et al., 2004). The initial targeted build aligns all species-specific protein and mRNA information to the genome sequence. An additional similarity build uses information from closely related species and aims to broaden the spectrum of transcript predictions. This second step is especially important for less-studied organisms, for which much less direct, species-specific protein and mRNA evidence is available.
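The split between the two build steps can be illustrated with a small Python sketch. It is not Ensembl pipeline code: the Evidence record, its fields, the split_evidence function and the accessions are all hypothetical, and the sketch only assumes that each evidence record carries the species it comes from. It shows nothing more than how the evidence would be partitioned between the targeted and the similarity build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record; not an Ensembl data structure."""
    accession: str  # made-up accession, for illustration only
    kind: str       # "protein" or "mRNA"
    species: str


def split_evidence(records, target_species):
    """Partition evidence by origin.

    Species-specific records would feed the targeted build; records
    from other (closely related) species would feed the similarity build.
    """
    targeted = [r for r in records if r.species == target_species]
    similarity = [r for r in records if r.species != target_species]
    return targeted, similarity


if __name__ == "__main__":
    records = [
        Evidence("ACC0001", "protein", "Mus musculus"),
        Evidence("ACC0002", "mRNA", "Mus musculus"),
        Evidence("ACC0003", "protein", "Rattus norvegicus"),
    ]
    targeted, similarity = split_evidence(records, "Mus musculus")
    print(len(targeted), "records for the targeted build")
    print(len(similarity), "records for the similarity build")
```

For a well-studied species most records would fall into the first list; for a less-studied one the second list would carry most of the weight, which is why the similarity build matters more there.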

External References Mapping Procedure

Transcripts are named at a later step, after the gene build is completed. If the transcript or protein models can be mapped to species-specific UniProt/Swiss-Prot, RefSeq or UniProt/TrEMBL entries, Ensembl refers to them as known genes. If they cannot (e.g. genes predicted on the basis of evidence from closely related species), they are called novel genes.

The distinction thus reflects where the evidence comes from: either a species-specific entry in a manually curated database (UniProt/Swiss-Prot), a partially manually curated database (RefSeq) or an automatically annotated database (UniProt/TrEMBL), or a gene model inferred from a closely related species. For this reason, known genes dominate in all established model organisms, while less-studied organisms display a significantly larger fraction of novel genes. In both cases, all Ensembl gene predictions are based on experimental evidence.
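The known/novel rule described above can be written down as a short Python sketch. This is an illustration under stated assumptions, not the Ensembl external-reference mapping code: the Xref and TranscriptModel classes, the classify function and all identifiers are hypothetical, and a mapping is treated as a plain record of source database and species.

```python
from dataclasses import dataclass, field

# Databases named in the text whose species-specific entries make a gene "known".
KNOWN_SOURCES = {"UniProt/Swiss-Prot", "RefSeq", "UniProt/TrEMBL"}


@dataclass(frozen=True)
class Xref:
    """Hypothetical external reference mapped to a transcript or protein model."""
    source: str     # database the entry comes from
    accession: str  # made-up accession, for illustration only
    species: str    # species of the external entry


@dataclass
class TranscriptModel:
    """Hypothetical transcript model with its mapped external references."""
    stable_id: str  # made-up identifier
    species: str
    xrefs: list = field(default_factory=list)


def classify(model):
    """Return "known" if any mapped entry is species-specific and comes
    from one of the three databases; otherwise return "novel"."""
    for xref in model.xrefs:
        if xref.source in KNOWN_SOURCES and xref.species == model.species:
            return "known"
    return "novel"


if __name__ == "__main__":
    mapped = TranscriptModel(
        "T0001", "Mus musculus",
        [Xref("RefSeq", "ACC0001", "Mus musculus")],
    )
    inferred = TranscriptModel(
        "T0002", "Mus musculus",
        [Xref("UniProt/Swiss-Prot", "ACC0002", "Rattus norvegicus")],
    )
    print(mapped.stable_id, classify(mapped))
    print(inferred.stable_id, classify(inferred))
```

The species check is the point of the sketch: a model supported only by an entry from a related species is still based on experimental evidence, but it is labelled novel rather than known.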

Supporting Evidence

All sequence records that the Ensembl analysis and annotation pipeline used to annotate a particular transcript model are available on a per-exon basis in the 'Supporting Evidence' section of the corresponding 'ExonView' pages. These pages are linked from the 'GeneView', 'TransView' and 'ProteinView' pages via the [Exon Information] links.

While Ensembl is a browser providing automatically annotated genomes, the Vertebrate Genome Annotation Browser (Vega) is its counterpart for manually curated genome annotation. Since manual curation is very labour-intensive, it is currently limited to certain chromosomes of certain species.

References


 

© 2024 Inserm. Hosted by genouest.org. This product includes software developed by Ensembl.

                
GermOnline based on Ensembl release 50 - Jul 2008