What is genome annotation in bioinformatics?

Genome annotation is the process of attaching biological information to genome sequences. It involves identifying the locations of genes and their coding regions, and determining what those genes do by establishing their structural features and linking them to the functions of the proteins they encode.

The importance of genome annotation

Genome projects are scientific undertakings that aim to determine an organism's complete genome sequence. Once a genome has been sequenced, it must be annotated before its meaning can be understood. Genome annotation has been a core task in molecular biology and bioinformatics since the 1980s. When a genome is annotated, researchers identify the protein-coding genes and assign a function to each protein. Although the deoxyribonucleic acid (DNA) nucleotide sequences of over a thousand individual humans (The 100,000 Genomes Project, UK) and of some model organisms have now been completed, genome annotation remains a key hurdle for scientists exploring the human genome.

The diagrammatic representation of genome annotation of a DNA sample is shown in the figure.
CC-BY | Image Credits: https://theg-cat.com

Manual curation and automatic annotation

Manual annotation, also known as curation, relies on human expertise, whereas automatic annotation tools attempt to perform the same steps through computational analysis. Ideally, the two approaches coexist and complement each other within the same annotation workflow. Computational methods can generate gene models and functional predictions, but they are prone to errors.
According to Terry Gaasterland and Christoph Sensen, annotating genomic sequence manually could take up to one year per person per megabase. With the experience gained from genome annotation, researchers now consider this an overestimate by a factor of five or six. Nonetheless, annotation has become the limiting step in most genome studies. Humans also tend to be inconsistent and prone to making mistakes. As a result, there are financial incentives to automate as much of the annotation process as possible.

Genome annotation databases

In recent years, a variety of genome annotation databases have been built by industrial, educational, and governmental organizations to accommodate the growing volume of genomic data collected for commercial and public use. These databases make it possible to find and annotate genes and their functions. Annotation can be done automatically, but users can also annotate genes manually. Some examples of genome annotation databases are Mouse Genome Informatics (MGI), WormBase (a nematode information resource), and FlyBase (the Drosophila database).

How does genome annotation operate?

The two main steps involved in genome annotation are:

Structural annotation (gene prediction): Structural annotation determines which parts of the genome encode proteins and other functional elements. It involves gene prediction, or gene finding, which is the process of recognizing these elements in the genome.

Functional annotation: This involves assigning biological information to these recognized elements.

Structural genome annotation

The first step is to identify the genomic structures that encode proteins. This step of the annotation process is called ‘structural annotation’. It covers the identification and positioning of open reading frames (ORFs), gene architecture, coding sequences, and regulatory motifs. Many bioinformatics tools are available for structural annotation. AUGUSTUS (for eukaryotes) and Glimmer 3 (for prokaryotes) are two commonly used gene prediction tools.
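To make the idea of locating open reading frames concrete, the following is a minimal sketch in Python. It is a deliberately naive scan and not the method used by AUGUSTUS or Glimmer, which rely on statistical models. It assumes a DNA string, searches only the forward strand, and treats ATG as the only start codon. The function name find_orfs and the min_codons parameter are illustrative choices.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(sequence: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for ORFs on the forward strand.

    start is the 0-based index of the ATG; end is the index just past
    the stop codon. min_codons counts the start codon but not the stop.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open start codon, if any
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None  # close the ORF and look for the next start
    return orfs


if __name__ == "__main__":
    # Toy sequence made up for illustration: ATG AAA TTT GGG TAA in frame 2.
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A real pipeline would also scan the reverse complement and, for eukaryotes, account for introns, which this sketch ignores.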

Gene prediction or gene finding

The process of discovering the sections of the genome that encode genes is known as gene finding or gene prediction. This comprises both protein-coding genes and RNA (ribonucleic acid)-coding genes, as well as the prediction of other functional elements like regulatory regions. Once a species' genome has been sequenced, discovering genes is one of the first and most crucial steps in comprehending it.

Structural annotation tools for genes

AUGUSTUS: A free program that predicts genes in eukaryotic genome sequences. Its protein profile extension (PPX) uses protein family-specific conservation, supplied as a block profile, to identify members of a protein family and their exon-intron organization. It can also predict alternative splicing and alternative transcripts by using mRNA (messenger RNA) alignments, EST (expressed sequence tag) alignments, conservation, and other sources of information.
GENEID: A program that predicts genes, genomic untranslated regions, splice sites, and other features in genomic DNA.
RepeatMasker: A program that screens DNA (deoxyribonucleic acid) sequences for interspersed repeats and low-complexity sequences.
Codon Usage Database (Kazusa): A database of codon usage tables for a variety of species.
AtGDB GeneSeqer web server: A web server for determining splice junctions in Arabidopsis sequences.
GENEMARK: A collection of algorithms for predicting genes in genomic DNA, offered by the Georgia Institute of Technology's Bioinformatics Group.
TSSP-TCM (TSSplant-transductive confidence machine): A tool for plant promoter identification.
WISE2: A tool that aligns a protein sequence to a genomic DNA sequence, accounting for introns and frameshift errors.
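As an illustration of what a codon usage table contains, here is a minimal Python sketch that counts codons in a single coding sequence and reports each as a fraction of the total. It is not the Codon Usage Database's own procedure. It assumes the input is an in-frame coding sequence and ignores any trailing partial codon. The function name codon_usage is hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_usage(cds: str) -> Dict[str, float]:
    """Return the relative frequency of each codon in a coding sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    total = len(codons)
    if total == 0:
        return {}
    counts = Counter(codons)
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy coding sequence made up for illustration: ATG GCT GCT TAA.
    for codon, fraction in sorted(codon_usage("ATGGCTGCTTAA").items()):
        print(codon, fraction)
```

Published tables are usually aggregated over many genes of a species and often report frequencies per thousand codons, so the fractions here would need to be pooled and rescaled to be comparable.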

Functional genome annotation

The term ‘functional gene annotation’ refers to the description of a protein's biochemical and biological activity. Functional annotation analyses include similarity searches, the identification of transmembrane domains in polypeptide sequences, the prediction of secondary metabolite gene clusters, and searches for Gene Ontology terms. For similarity searches, researchers use BLASTP, the protein-protein search program of NCBI BLAST+ (Basic Local Alignment Search Tool), to find similar proteins in a protein database.

Functional annotation tools

Some examples of functional annotation tools used in bioinformatics are Blast2GO (used to find Gene Ontology (GO) annotation terms), WoLF PSORT (used to predict the subcellular localization of eukaryotic proteins), and TMHMM, short for Transmembrane Helices; Hidden Markov Model (used to find transmembrane domains in protein sequences).
Using BLAST to detect similarities and then annotate genome sequences based on those is the most basic level of annotation in bioinformatics. However, the annotation platform is now receiving an increasing amount of supplementary information. Manual annotators can use the additional information to deconvolute differences between genes that have the same annotation.
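The similarity-based annotation described above can be sketched in Python as follows. The sketch assumes BLAST+ tabular output (-outfmt 6) with its default twelve columns, keeps the highest-scoring hit per query below an E-value cutoff, and copies that hit's description to the query. The function names, the 1e-5 cutoff, the example rows, and the description lookup are all illustrative assumptions, not part of any BLAST tool.

```python
import csv
import io
from typing import Dict, Tuple

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Map each query ID to (subject ID, bit score) of its best hit."""
    best: Dict[str, Tuple[str, float]] = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) > max_evalue:
            continue  # hit is not significant enough to trust
        query, bitscore = row["qseqid"], float(row["bitscore"])
        if query not in best or bitscore > best[query][1]:
            best[query] = (row["sseqid"], bitscore)
    return best


def transfer_annotation(hits: Dict[str, Tuple[str, float]],
                        descriptions: Dict[str, str]) -> Dict[str, str]:
    """Give each query the description of its best-hit subject."""
    return {query: descriptions.get(subject, "hypothetical protein")
            for query, (subject, _) in hits.items()}


# Made-up rows for illustration only; IDs and scores are not real results.
EXAMPLE = "\n".join([
    "\t".join(["geneA", "subj1", "98.5", "200", "3", "0",
               "1", "200", "1", "200", "1e-100", "400"]),
    "\t".join(["geneA", "subj2", "60.0", "180", "72", "2",
               "5", "184", "10", "189", "1e-30", "150"]),
    "\t".join(["geneB", "subj3", "30.0", "50", "35", "1",
               "1", "50", "1", "50", "0.5", "25"]),
])

if __name__ == "__main__":
    hits = best_hits(EXAMPLE)
    print(transfer_annotation(hits, {"subj1": "DNA polymerase"}))
```

Queries with no significant hit, such as geneB in the example, are left unannotated here. This is one place where the supplementary information and manual curation mentioned above come in.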

The diagrammatic representation of structural annotation is shown in the figure.
CC-BY | Image Credits: https://www.slideshare.net

Context and Applications

This topic is significant for exams at the school, graduate, and postgraduate levels, especially for Bachelor's and Master's programs in Zoology, Genetics, and Biotechnology.

Practice Problems

Question 1: Which of the following is used as a tool in gene prediction in genome annotation?

  1. AUGUSTUS
  2. WormBase
  3. FlyBase
  4. All of the above

Answer: Option 1 is correct.

Explanation: The AUGUSTUS is a tool for gene prediction, and others are annotation databases.

Question 2: Which of the following is used for plant promoter identification?

  2. TSSP-TCM
  3. WISE2
  4. None of the above

Answer: Option 2 is correct.

Explanation: TSSP-TCM (TSSplant-transductive confidence machine) is a structural annotation tool. It offers plant promoter identification.

Question 3: NCBI BLAST+BLASTP is used for _____.

  1. Similarity search
  2. Finding transmembrane domains in proteins
  3. Finding splice junctions
  4. None of the above

Answer: Option 1 is correct.

Explanation: Researchers use NCBI BLAST+ BLASTP to find similar proteins in a protein database.

Question 4: What is the function of structural genome annotation?

  1. Identifying and positioning of open reading frames (ORFs)
  2. Finding gene architecture
  3. Finding coding sequences
  4. All of the above

Answer: Option 4 is correct.

Explanation: The annotation process involves identifying and positioning open reading frames (ORFs), gene architecture and coding sequences, and regulatory motifs.

Question 5: Which of the following is an example of the database used to find and annotate genes and their functions?

  1. WormBase
  3. WISE2
  4. None of the above

Answer: Option 1 is correct.

Explanation: WormBase is an example of an annotation database, and others are gene prediction tools.
