HCG: a database for hierarchical classification of functionally equivalent genes in prokaryotes
Fenglou Mao*, Hongwei Wu*, Victor Olman, Ying Xu1
Computational Systems Biology Laboratory Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics
University of Georgia, Athens, GA 30602, USA *These authors contributed equally to this paper
1Correspondence author
Abstract
Background: Existing gene annotation schemes generally classify genes into two levels
of parallel and unrelated homologous and/or orthologous gene groups, limiting our
capabilities for gene function prediction at higher resolution. While homology and
orthology are useful concepts for evolutionary studies of genes, they may not be the most
appropriate ones for functional classification of genes, especially at a high-resolution
level.
Results: We present a new gene annotation database: the hierarchical classification
system of genes (HCG), which provides functional annotation of prokaryotic genes in
general at higher resolution than the existing functional classification schemes. The HCG
database consists of clusters, hierarchically organized, of functionally equivalent genes at
varying levels of resolution. Gene clusters at the top of the HCG hierarchy represent
homologous gene groups, and descendant gene clusters represent functionally
equivalent genes at increasingly higher resolution going down from the top to the
leaf-level clusters along the classification hierarchy. We also provide several examples to
demonstrate how HCG can be used to make specific gene function annotations. For each
HCG cluster, we provide a p-value assessing the statistical significance in grouping its
genes together, based on the functional relationship among its genes and their
relationship with genes outside of the cluster.
Conclusion: The HCG database, implemented using MySQL, currently consists of
658,174 genes, 51,205 clusters organized into 21,109 trees, from 224 prokaryotic
genomes. The online database supports four search capabilities, namely (1) browsing
HCG classification by trees, (2) browsing HCG classification by organisms, (3) querying
a gene against the HCG database to find its gene cluster at the highest resolution possible
and its parent clusters, if any, and (4) annotating sequences provided by a user.
1. Background
With the rapid accumulation of genome sequences along with their genes accurately
predicted, numerous efforts have been devoted to the computer-aided functional
annotation of genes, which have led to the development of a number of functional
classification schemes and associated databases such as Clusters of Orthologous Groups
(COG) [1], Pfam [2], and InterPro [3]. There are also other databases that integrate gene
annotation information with pathway information, such as Kyoto Encyclopedia of Genes
and Genomes (KEGG) [4], BioCyc[5] and the subsystem annotation environment SEED
[6]. While these and other functional classification schemes and databases provide highly
useful information for functional annotation of genomes, they are generally limited to
classification of genes into homologous and/or orthologous gene groups, although
homology and orthology are defined in evolutionary terms and do not by themselves
indicate a functional relationship between genes. The classification result of such
schemes is generally represented as
a collection of parallel and unrelated functionally “equivalent” gene groups, providing a
two-level classification of functionally equivalent genes. We believe that the functional
relationship between genes can be better represented using a hierarchical system, a view
supported by the recent development of the Gene Ontology (GO) [7], whose directed
acyclic graph (DAG) structure is more general than a strict hierarchy. Gene function
classifications can generally be grouped into two classes: two-level classifications
such as COG, KEGG orthologs and Pfam, and multi-level classifications such as GOA and
our classification scheme, HCG.
The Gene Ontology Annotation (GOA) Database [8] is so far the only database that
employs a multi-level classification of gene functions. GOA annotates
genes using GO terms, so it rests on solid ground for function classification. However,
most annotations in GOA are extracted from UniProt and InterPro by using three scripts
(ec2go, spkw2go and InterPro2go), and the others are annotated manually with the help of
annotation tools such as GOAnnotator; it is therefore hard to evaluate the annotation quality.
There are other genome databases with gene annotation information, such as the
integrated microbial genomes (IMG) system [9] and Integr8 [10]. While useful, the gene
annotation in IMG is created using rather simple methods, namely RPS-BLAST
(reverse position-specific BLAST) and bidirectional best hits, which are widely thought to
be inaccurate [11], to have low sensitivity [12] and to yield high false positive rates [13].
IMG also adopts two-level classification schemes such as Pfam and COG. Integr8 likewise
uses annotations from other databases such as InterPro and Pfam.
We have developed a functional classification scheme for prokaryotic genes,
based on both sequence similarity information and genomic neighborhood information
[14]. A key unique feature of this classification scheme is that it classifies genes into
functionally equivalent clusters at multiple resolution levels, and these clusters are either
parallel to each other or nested inside one another, hence giving rise to a multi-level
hierarchical structure, under which genes could have “equivalent” functions measured at
varying resolution. For example, genes in any root-level cluster, in this functional
hierarchy, are functionally equivalent in the sense that they are homologous, and genes in
any lower-level cluster represent a group of functionally equivalent genes with higher
specificity (or higher resolution). The functional equivalence relationships among genes
at different resolution are derived based on a two-level classification scheme [14]. The
algorithm first derives the functional relationships among individual gene pairs based on
their sequence similarity and their co-location information in genomes, and then derives
the functional relationships among a group of genes by detecting the groups of genes with
high densities of pair-wise functional relationships within each group versus the
(relatively) lower densities of relationships between each gene group and genes outside of
the group. For each predicted gene cluster (group), we also provide a p-value to measure
how strongly the cluster stands out from the background in which its genes sit. In some
sense, this value also reflects the consistency of the annotation within the gene group,
that is, the annotation quality.
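The idea behind the second step can be illustrated with a toy density comparison. The sketch below is not the graph-partition algorithm of [14]; it only shows, for hypothetical gene names and an assumed unweighted set of pairwise relationships, what it means for a group to have a higher density of relationships inside the group than across its boundary.

```python
from itertools import combinations

def densities(edges, group, all_genes):
    """Return (internal, external) relationship densities for a candidate group.

    edges: pairs of genes judged functionally related (assumed, unweighted).
    group: the candidate cluster; all_genes: every gene in the toy graph.
    """
    group = set(group)
    outside = set(all_genes) - group
    edge_set = {frozenset(e) for e in edges}

    # Fraction of gene pairs inside the group that are related.
    internal_pairs = list(combinations(sorted(group), 2))
    internal = sum(frozenset(p) in edge_set for p in internal_pairs)
    internal_density = internal / len(internal_pairs) if internal_pairs else 0.0

    # Fraction of (inside, outside) gene pairs that are related.
    n_cross = len(group) * len(outside)
    external = sum(frozenset((g, o)) in edge_set for g in group for o in outside)
    external_density = external / n_cross if n_cross else 0.0
    return internal_density, external_density

# Hypothetical toy data: g1, g2, g3 are fully connected; only g3 links outward.
genes = ["g1", "g2", "g3", "g4", "g5"]
edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4")]
inside, across = densities(edges, ["g1", "g2", "g3"], genes)
```

A group such as {g1, g2, g3} above, whose internal density is much higher than its boundary density, is the kind of group the clustering step looks for; the p-value reported in HCG quantifies this contrast statistically, which this sketch does not attempt.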
By applying this classification scheme to genes of 224 prokaryotic genomes, we
have established a database, HCG, of functionally equivalent gene clusters. Intuitively,
the HCG system can be viewed as a “forest” of trees, where each tree consists of a root-
level cluster and its descendent clusters, possibly at different levels. For each cluster in
the HCG system, we have provided an annotation to characterize the common biological
function of the cluster, based on the Gene Ontology (GO) annotation (GOA Proteome
Sets) and NCBI gene-product description. Other information such as Pfam and COG
annotation is also provided for cross-reference purposes.
2. Construction and Content
2.1 The Construction of the Database
The HCG database currently consists of the classification results for 224 complete
prokaryotic genomes (NCBI release of 03/05/2005). While a detailed description of
the clustering algorithm and an analysis of the data have been published elsewhere [14],
we outline here the procedure for database construction and application. The HCG
system has been created using the following steps:
(a) All homologous gene pairs are identified using reciprocal BLASTP [15] with
E-values < 1 for both directions of the search against all the 658,174 genes.
(b) The Smith-Waterman algorithm [16] is performed on all homologous gene pairs
selected from (a) to obtain a multi-value feature vector for each homologous
gene pair, representing the quality of their sequence alignment.
(c) A positive training set consisting of orthologous gene pairs and a negative
training set consisting of homologous but non-orthologous gene pairs are created
for the purpose of training a classifier (see [14] for details).
(d) A parameterized linear classification function is employed to discriminate
orthologous genes from homologous but non-orthologous genes, whose
parameters are selected so that the classification function optimally
discriminates the positive from the negative training data.
(e) A scoring scheme is developed to measure the functional equivalence between
two genes based on the sequence similarity information derived from (d) and
genomic neighborhood information derived from three operon prediction
programs, namely (i) VIMSS [17], (ii) JPOP [18, 19], and (iii) GeneChords [20].
(f) A graph representation is constructed to represent all the 658,174 genes from
224 prokaryotic genomes and their functional equivalence relationship defined
in (e).
(g) A graph-partition algorithm is applied to the graph representing these genes
and their functional relationships to generate a collection of dense sub-graphs
(and sub-sub-graphs, etc), each of which represents a gene cluster. These gene
clusters form a hierarchical structure. For each cluster, a p-value is calculated to
assess its statistical significance.
(h) Each gene cluster is annotated using a set of keywords and GO terms, based on
common features of the NCBI and GO annotations [10] of individual genes of
the cluster, where the keywords are extracted from the NCBI description of each
gene product, and the GO terms for each cluster are selected based on a
majority-rule vote among GO assignments to individual genes in the cluster.
(i) All gene-classification data is integrated into a MySQL database; and a web
server is created at http://csbl.bmb.uga.edu/HCG to facilitate searching and
accessing the database.
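Step (a) can be sketched as a simple filter over search hits. The example below is a minimal illustration, assuming the BLASTP results have already been parsed into (query, subject, E-value) tuples; the function name, the tuple layout and the gene identifiers are hypothetical and are not taken from the HCG pipeline.

```python
def reciprocal_pairs(hits, max_evalue=1.0):
    """Keep gene pairs that hit each other with E-value < max_evalue in both directions.

    hits: iterable of (query, subject, evalue) tuples (assumed, pre-parsed format).
    Returns a set of unordered pairs, each stored as a sorted tuple.
    """
    # Directed hits that pass the threshold, ignoring self-hits.
    passing = {(q, s) for q, s, e in hits if q != s and e < max_evalue}
    # A pair is kept only if the reverse direction also passed.
    return {tuple(sorted((q, s))) for q, s in passing if (s, q) in passing}

# Hypothetical hits: A and B find each other; C finds A only above the threshold.
hits = [
    ("geneA", "geneB", 1e-30),
    ("geneB", "geneA", 1e-28),
    ("geneA", "geneC", 0.5),
    ("geneC", "geneA", 3.0),
]
pairs = reciprocal_pairs(hits)
```

Only the geneA/geneB pair survives here, because the geneC-to-geneA hit fails the threshold in one direction. The pairs kept by such a filter are the ones passed on to the Smith-Waterman alignment in step (b).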
The validity of the predicted gene clusters is checked by comparing the HCG
classification against the genome taxonomy, the COG classification [1] and the Pfam
classification [2] of genes. The detailed validation procedure and results are given in [14].
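The majority-rule vote used in step (h) to select GO terms for a cluster can be sketched as follows. This is an illustration only: the exact voting threshold is not stated here, so the sketch assumes that a term is kept when more than half of the genes in the cluster carry it, and the gene names and GO identifiers are placeholder values.

```python
from collections import Counter

def majority_go_terms(gene_to_go, threshold=0.5):
    """Return GO terms assigned to more than `threshold` of the genes in a cluster.

    gene_to_go: dict mapping gene id -> list of GO term ids (assumed input format).
    threshold: assumed majority cutoff; not taken from the HCG description.
    """
    n_genes = len(gene_to_go)
    # Count each term at most once per gene.
    counts = Counter(term for terms in gene_to_go.values() for term in set(terms))
    return sorted(t for t, c in counts.items() if c / n_genes > threshold)

# Hypothetical cluster of three genes with placeholder GO assignments.
cluster = {
    "gene1": ["GO:0000155", "GO:0016020"],
    "gene2": ["GO:0000155"],
    "gene3": ["GO:0000155", "GO:0005524"],
}
selected = majority_go_terms(cluster)
```

In this toy cluster only the term shared by all three genes passes the vote; terms carried by a single gene are dropped.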
2.2 Database Tables
To store the tree structure of the HCG system in a MySQL relational database, we have
designed two tables, Node and Edge (Figure 1), to represent the HCG clusters
and their parent-child relationships. Other information, such as gene attributes, cluster
annotations, and the p-value of each cluster, is also stored in MySQL tables. Figure 1
shows the relationships among the tables. The table “Gene” stores the
information of individual genes, such as gene attributes. The tables “GO”, “Gene_GO”
and “Node_GO” store GO terms, GO annotations for individual genes and GO
term-based annotations for individual clusters, respectively. The table “Gene_Node”
stores the genes in each cluster, and the table “Species” stores the species
information of each genome. Several additional internal tables are not
shown in Figure 1 and are omitted from further discussion.
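A node table plus an edge table is a common way to store a forest in a relational database. The sketch below illustrates the idea with Python's built-in sqlite3 module rather than MySQL, so that it runs without a server. Only the table names Node, Edge and Gene_Node come from the text; every column name, the sample rows, and the choice to record a gene only in its most specific cluster are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node (node_id INTEGER PRIMARY KEY, p_value REAL);
CREATE TABLE Edge (parent_id INTEGER REFERENCES Node(node_id),
                   child_id  INTEGER REFERENCES Node(node_id));
CREATE TABLE Gene_Node (gene_id TEXT, node_id INTEGER REFERENCES Node(node_id));
""")

# Hypothetical tree: cluster 1 (root) -> cluster 2 -> cluster 3 (leaf).
conn.executemany("INSERT INTO Node VALUES (?, ?)", [(1, 1e-9), (2, 1e-6), (3, 1e-4)])
conn.executemany("INSERT INTO Edge VALUES (?, ?)", [(1, 2), (2, 3)])
conn.executemany("INSERT INTO Gene_Node VALUES (?, ?)", [("geneX", 3)])

def ancestors(conn, node_id):
    """Return the parent clusters of node_id, nearest parent first."""
    rows = conn.execute("""
        WITH RECURSIVE up(id, depth) AS (
            SELECT parent_id, 1 FROM Edge WHERE child_id = ?
            UNION ALL
            SELECT e.parent_id, up.depth + 1
            FROM Edge e JOIN up ON e.child_id = up.id
        )
        SELECT id FROM up ORDER BY depth
    """, (node_id,)).fetchall()
    return [r[0] for r in rows]

# Find the cluster holding geneX, then walk up to the root of its tree.
leaf = conn.execute(
    "SELECT node_id FROM Gene_Node WHERE gene_id = ?", ("geneX",)
).fetchone()[0]
path = ancestors(conn, leaf)
```

This mirrors the third search capability of the online database: locate the highest-resolution cluster of a gene, then follow parent-child edges to retrieve its parent clusters.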
Authors' contributions
Fenglou Mao designed the database and implemented the online server; Fenglou Mao and
Hongwei Wu worked together to generate the data of HCG; Victor Olman designed the
hierarchical clustering program; Ying Xu coordinated the whole procedure and provided
the financial support.
Acknowledgement
This work was supported in part by the National Science Foundation (NSF/DBI-0354771,
NSF/ITR-IIS-0407204, NSF/DBI-0542119) and by a “Distinguished Scholar” grant from
the Georgia Cancer Coalition.
References
1. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631-637.
2. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-251.
3. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L et al: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database issue):D201-205.
4. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277-280.
5. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33(Database issue):D334-337.
6. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691-5702.
7. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database issue):D258-261.
8. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262-266.
9. Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I et al: The integrated microbial genomes (IMG) system. Nucleic Acids Res 2006, 34(Database issue):D344-348.
10. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I et al: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33(Database issue):D297-302.
11. Fulton DL, Li YY, Laird MR, Horsman BG, Roche FM, Brinkman FS: Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics 2006, 7:270.
13. Mao F, Su Z, Olman V, Dam P, Liu Z, Xu Y: Mapping of orthologous genes in the context of biological pathways: An application of integer programming. Proc Natl Acad Sci U S A 2006, 103(1):129-134.
14. Wu H, Mao F, Olman V, Xu Y: Hierarchical Classification of Functionally Equivalent Genes of Prokaryotes. Nucleic Acids Res 2007, accepted.
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
16. Smith TF, Waterman MS: Comparison of biosequences. Advances in Applied Mathematics 1981, 2(4):482-489.
17. Price MN, Huang KH, Alm EJ, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 2005, 33(3):880-892.
18. Chen X, Su Z, Dam P, Palenik B, Xu Y, Jiang T: Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res 2004, 32(7):2147-2157.
19. Chen X, Su Z, Xu Y, Jiang T: Computational Prediction of Operons in Synechococcus sp WH8102. Proceedings of 15th International Conference on Genome Informatics 2004:211-222.
20. Zheng Y, Anton BP, Roberts RJ, Kasif S: Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005, 6:243.
21. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28(1):33-36.
Figure 1: HCG database table relationship
Figure 2: A screenshot of the HCG browser
Figure 3: The tree structure of cluster HCG-21, consisting of a group of two-component sensors. A circle represents a cluster which cannot be further divided; a rectangle represents a cluster containing only genes from the same genome; a triangle represents a cluster that does not have genes from the same genome. Colors do not have any particular meaning here.