
Contents lists available at ScienceDirect

Tuberculosis

journal homepage: http://intl.elsevierhealth.com/journals/tube

ARTICLE IN PRESS

Tuberculosis xxx (2010) 1-11

REVIEW

TB database 2010: Overview and update

James E. Galagan a,b,c,*, Peter Sisk a, Christian Stolte a, Brian Weiner a, Michael Koehrsen a, Farrell Wymore d, T.B.K. Reddy d, Reinhard Engels a, Marcel Gellesch a, Jeremy Hubble e, Heng Jin d, Lisa Larson a, Maria Mao e, Michael Nitzberg d, Jared White a, Zachariah K. Zachariah d, Gavin Sherlock e, Catherine A. Ball d, Gary K. Schoolnik f

a Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
b Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
c National Emerging Infectious Diseases Lab, Boston University, Boston MA 02118, USA
d Department of Biochemistry, Stanford University School of Medicine, Stanford, CA 94305, USA
e Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120, USA
f Department of Microbiology & Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA

article info

Article history:
Received 15 March 2010
Accepted 31 March 2010

Keywords:
Tuberculosis
Database
Genome
Microarray
Diversity

* Corresponding author. Department of Biomedical Engineering, Boston University, 24 Cummington Street, Boston, MA 02215, USA. Tel.: +1 617 875 9874.

E-mail address: [email protected] (J.E. Galagan).

1472-9792/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.tube.2010.03.010

Please cite this article in press as: Galagan JE, et al., TB database 2010: Overview and update, Tuberculosis (2010), doi:10.1016/j.tube.2010.03.010

summary

The Tuberculosis Database (TBDB) is an online database providing integrated access to genome sequence, expression data and literature curation for TB. TBDB currently houses genome assemblies for numerous strains of Mycobacterium tuberculosis (MTB) as well as assemblies for over 20 strains related to MTB and useful for comparative analysis. TBDB stores pre- and post-publication gene-expression data from M. tuberculosis and its close relatives, including over 3000 MTB microarrays, 95 RT-PCR datasets, 2700 microarrays for human and mouse TB related experiments, and 260 arrays for Streptomyces coelicolor. To enable wide use of these data, TBDB provides a suite of tools for searching, browsing, analyzing, and downloading the data. We provide here an overview of TBDB focusing on recent data releases and enhancements. In particular, we describe the recent release of a Global Genetic Diversity dataset for TB, support for short-read re-sequencing data, new tools for exploring gene expression data in the context of gene regulation, and the integration of a metabolic network reconstruction and BioCyc with TBDB. By integrating a wide range of genomic data with tools for their use, TBDB is a unique platform for both basic science research in TB and research into the discovery and development of TB drugs, vaccines and biomarkers.

© 2010 Elsevier Ltd. All rights reserved.

1. Overview

TBDB (tbdb.org) is an online database that provides integrated access through a single portal to sequence data and annotation, expression data, literature curation, and analysis tools for tuberculosis. Data integrated in TBDB include:

• Genome sequences for publicly available strains of Mycobacterium tuberculosis (MTB).

• Genome sequences for over 20 strains related to MTB including M. africanum, M. bovis, M. avium, M. leprae, and M. smegmatis.

• Global sequence polymorphism data for M. tuberculosis.

• Sequence annotations for all genomes including genes, proteins, RNAs, and epitopes.

• Protein structure information.

• Functional annotations including enzyme function, metabolic reactions and pathways, and GO terms.

• Gene expression data for M. tuberculosis, including raw and processed data, from over 3000 microarrays and 95 RT-PCR datasets.

• Gene expression data from over 2700 microarrays for human and mouse TB related experiments, and also 260 arrays for Streptomyces coelicolor.

• Curated literature for over 2656 genes and 45 gene expression datasets.

TBDB provides access to these data through an integrated search engine and a suite of tools for visualization, analysis, and download.

We present here a survey of the capabilities and uses of TBDB. We focus particularly on features and data in TBDB that have been added to the system since the first published description [1]. These new data and capabilities include:

• Illumina sequence and polymorphism data from a global survey of TB Genetic Diversity carried out by Dr. Gagneux and colleagues

• Support for visualization of short-read alignment data using a fast interactive browser called GenomeView

• New tools for the visualization and exploration of expression data, particularly in a gene regulatory context

• New interfaces for performing gene set enrichment analyses and downloading batch gene annotation data

• Support for a metabolic network reconstruction for M. tuberculosis through the integration of BioCyc with TBDB

• Updated tutorials for first-time users

1.1. Quick search

The primary entry point for TBDB is the Quick Search, an integrated search engine that allows users to search all data in TBDB simultaneously. The Quick Search interface is available from the TBDB home page (Figure 1) and also at the top right corner of every TBDB page. Data within TBDB can be searched using a gene name, sequence name, author name, title, or any other keyword. A search returns a page with a count of all data of each type that corresponds in some way to the search term (Figure 2A). Selecting the data type (e.g. “H37Rv Coding Genes”) provides a list of results for that type, ranked by relevance.

Figure 1. TB Database Homepage. The site is organized into Publications, Expression Data, and Genomic Data. At the center of the home page is a Quick Search form (see text for more details) that provides access to all data. Also available from the front page are tutorials for first time users.

For example, if a user searches for “dosR” and then selects “H37Rv Coding Genes” from the Quick Search results, a page of genes related to the dosR term is provided (Figure 2B). The most relevant result is described at the top of the table, in this case the dosR (also known as devR) gene itself. Subsequent entries are related to dosR, as indicated by the column “Matching Field.” In this example, the second result is devS, the sensor kinase for the dosR response regulator, which is related to dosR in TBDB by shared BLASTX hits and shared publications, as indicated in the last column. Subsequent gene results are related to dosR through shared publications.

Figure 2. Quick Search results pages. (A) Searches return a page with a count of all data of each type that matches the search term. (B) Selecting one data type (in this example, H37Rv Coding Genes) returns a list of data objects of that type. In this example, a search for “DosR” returns Rv3133c (the dosR gene) as the top result and related genes as subsequent results.

Users can also easily search for a particular gene with a gene name or identifier (ID). Although Quick Search is a powerful way of accessing all data in TBDB, often a user is only interested in accessing data for a single gene whose ID is known. Entering any gene ID (a gene name or Rv number for H37Rv) and clicking on the “Search for Gene” button on the Quick Search will bring the user directly to the gene information page for that gene.

1.2. Downloading data

The Quick Search feature (and other advanced search and browsing tools) provides the ability to find and view subsets of data in TBDB. Certain users, however, may prefer to download data from TBDB en masse, for analysis offline or import into other systems. TBDB provides the ability to download raw sequence, annotation, and expression data. Expression data can be downloaded by selecting Expression Data -> Download from the top level menu on any page. Raw microarray data and tab-delimited processed gene expression data (in pcl format) associated with any TBDB publication can then be downloaded. Similarly, all expression data can be downloaded as a function of the organism from which it was derived. Genome sequence and annotation data can be downloaded by selecting Genomic Data -> Download from the top level menu. Sequence and annotation files in many different formats, including standard formats such as FASTA and GFF, can then be accessed. Users may also choose to upload unpublished data to TBDB, where it will be archived in a non-public, password-protected site.
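As a rough illustration of working with the downloaded tab-delimited expression files offline, the sketch below parses PCL-style text. It assumes the common PCL layout (ID, NAME and GWEIGHT columns followed by one column per sample, plus an optional EWEIGHT row); the exact columns in TBDB files may differ, and the function name, sample names and values are invented for illustration.

```python
import csv
import io


def parse_pcl(text):
    """Parse PCL-style text into (sample_names, {gene_id: [values]})."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]
    # Assumed layout: ID, NAME, GWEIGHT, then one column per sample.
    samples = header[3:]
    profiles = {}
    for row in rows[1:]:
        if not row or row[0] == "EWEIGHT":
            continue  # skip blank lines and the experiment-weight row
        # Empty cells mark missing measurements; keep them as None.
        profiles[row[0]] = [float(v) if v.strip() else None for v in row[3:]]
    return samples, profiles


# Hypothetical two-sample file; the numbers are made up.
EXAMPLE = (
    "ID\tNAME\tGWEIGHT\thypoxia_4h\thypoxia_24h\n"
    "EWEIGHT\t\t\t1\t1\n"
    "Rv3133c\tdosR\t1\t2.5\t3.1\n"
    "Rv3132c\tdevS\t1\t1.8\t\n"
)

samples, profiles = parse_pcl(EXAMPLE)
print(samples)
print(profiles)
```

Keeping missing values as None, rather than zero, avoids treating an absent measurement as "no change" in later analysis.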

2. Genome sequence and annotation

2.1. Organisms in TBDB

TBDB houses genome sequence data for a range of species relevant to tuberculosis. Primary among these data is the sequence for M. tuberculosis strain H37Rv, the standard lab strain long used for experimental and animal infection studies. Also available are other publicly available M. tuberculosis strain assemblies including those for strains CDC1551, F11, C, Haarlem, and H37Ra. In addition, as described below, new to TBDB are short-read re-sequencing data for 30 M. tuberculosis strains representing TB global genetic diversity.

TBDB also includes data for other sequenced Mycobacteria. These include M. africanum GM041182, a strain commonly found in West African countries; M. bovis AF2122/97, which causes tuberculosis mainly in cattle; M. bovis BCG str. Pasteur 1173P2, related to the TB vaccine strain; M. leprae TN, the causative agent of leprosy; M. marinum, a pathogen of fish and amphibians; M. ulcerans Agy99, the causative agent of Buruli ulcer; M. avium k10, an obligate pathogen which causes Johne’s disease in cattle and other ruminants; M. avium 104, an opportunistic pathogen isolated from an adult AIDS patient in Southern California; and M. smegmatis MC2155, initially isolated from human smegma and a frequent experimental model system for MTB.

To facilitate comparative sequence analyses across a wider phylogenetic range, TBDB provides sequence data for bacteria from related taxa, focusing on members of the Actinomycetes family of high G+C content, Gram-positive organisms of which M. tuberculosis is a member. These include representatives of the Corynebacteria, Streptomyces, and Rhodococcus.

All genome sequences have been annotated with a variety of genomic features including genes, operons, sequence similarity to GenBank sequences using BLAST [2], transfer RNAs using tRNAScan [3], protein domains and families using PFAM [4] and non-coding RNAs based on RFAM [5]. Known immune epitopes have also been mapped through collaboration with BioHealthBase, now FluDB (http://www.fludb.org/brc/home.do?decorator=influenza). A suite of analytical tools is also provided to allow comparative genomic analysis of M. tuberculosis. Access to the annotated genome sequences and comparative data is provided through several search interfaces, some of which are described in subsequent sections.
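Because the comparative set described above emphasizes high G+C organisms, one simple check on a downloaded FASTA file is its G+C fraction. The following is a minimal sketch assuming plain FASTA text; the helper names and the toy sequences are hypothetical and are not taken from TBDB.

```python
def read_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Toy sequences for illustration only.
TOY_FASTA = ">seqA toy record\nGGCC\nGCAT\n>seqB\nATAT\n"

sequences = read_fasta(TOY_FASTA)
for record, seq in sequences.items():
    print(record, round(gc_fraction(seq), 2))
```

Ambiguous bases such as N are excluded from the denominator here; that is a design choice for the sketch, not a statement about how TBDB reports composition.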

2.2. Gene details page

All information about annotated features on any sequence is available through Feature Detail pages. The core Feature Detail page is the Gene Details page (Figure 3). This page provides a single site for organizing all information about a gene. The information is organized into the following sections, which can be collapsed or expanded as desired:

• Gene Info
• Diversity Graph
• Gene Expression
• Functional Annotation
• Transcript Info
• Tools
• Curated Publications
• Publications Awaiting Curation

In these sections, users can retrieve information about gene nomenclature, gene products, and protein domains; information about functional annotations including GO terms, enzyme function, KEGG pathway, and COG term; and TraSH essentiality information specific to H37Rv genes [6,7,8]. Users can also see a list of publications that have been manually curated for the corresponding genes and access each publication on PubMed directly from the page. Publications that have been computationally associated with the gene, but not manually curated, are also provided. External links are provided to related databases including TubercuList (http://genolist.pasteur.fr/TubercuList/), TB Structural Genomics Consortium (http://www.doe-mbi.ucla.edu/TB/) Protein Structure Information, the Proteome 2D-PAGE Database (http://web.mpiib-berlin.mpg.de/cgi-bin/pdbs/2d-page/extern/index.cgi), and Google Scholar. In addition, users can access information about genetic diversity and gene expression, as described below.

Figure 3. Gene Details Page. These pages provide a single site to access all information available for any gene in any organism in TBDB.

2.3. Tool bar

The Gene Details page is also an access point for a wide range of tools that TBDB provides for data visualization and analysis. The most commonly used tools are available using the Tool Bar at the top of each Gene Details page (Figure 3). Data analysis and visualization options in the Tool Bar are organized by biological topic: expression data, genome sequence data, and sequence and expression data in the context of gene regulation. Mousing over each Tool Bar element displays more information about that analysis or visualization option.

2.4. Sequence analysis tools

We summarize a number of the tools provided for sequence analysis here. As illustrated in Figure 4A, the Argo Genome Browser applet provides a linear view of a genome sequence with different features (e.g. genes) displayed as arrows. The Argo applet is fully dynamic, allowing users to scroll and to zoom from the nucleotide level up to the entire loaded region, without needing to reload the page. Individual features can be double clicked to open the corresponding feature details page. A region of up to 100 kb can be loaded into the Argo applet browser. Sequences larger than 100 kb, including entire genomes, can be viewed in the Argo application version of the browser (also available through the Tool Bar).

Originally developed by BioHealthBase, and now integrated into TBDB, the Protein Structural Viewer (Figure 4B) provides a dynamic visualization of structures for H37Rv from the TB Structural Genomics Consortium (http://www.doe-mbi.ucla.edu/TB/). This viewer allows annotated features such as epitopes and single nucleotide polymorphisms to be visualized in the context of protein structure (see below).

To take fullest advantage of the sequence data available in TBDB, a range of comparative sequence analysis tools is also available. To enable the analysis of gene evolution and to facilitate finding the corresponding genes in different organisms, we have generated automated predictions of gene families for all genes in TBDB. If a gene is a member of a gene family, an entry under Gene Family is provided in the Gene Info section of the gene details page. Clicking on this entry opens a page showing an interactive view (developed using JalView [9]) of the alignment and membership of that gene family (Figure 4C). Users may choose to view the alignment of the coding sequence, protein sequence, and upstream and downstream coding sequences.

To provide a view of evolution at the genome scale, TBDB provides a dynamic genome Dot Plot viewer, available from the Tool Bar (Figure 4D). This viewer displays genome synteny between any reference genome on the x-axis and one or more query genomes on the y-axis. Users may search for genes by entering keywords in the search box. Matching genes are then displayed as colored arrows on the individual axes. Selecting a gene highlights the syntenic region on the other genomes.
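To make the dot plot idea concrete, the sketch below turns a list of corresponding gene pairs into (x, y) points, with the reference coordinate on the x-axis and the query coordinate on the y-axis. How TBDB actually computes synteny is not described here, so the use of ortholog pairs, the function name, and all gene identifiers and coordinates are assumptions made for illustration.

```python
def dot_plot_points(ref_genes, query_genes, orthologs):
    """Return sorted (ref_pos, query_pos, same_strand) tuples.

    ref_genes and query_genes map gene id -> (start position, strand).
    orthologs is a list of (ref_id, query_id) pairs.
    """
    points = []
    for ref_id, query_id in orthologs:
        if ref_id not in ref_genes or query_id not in query_genes:
            continue  # ignore pairs without coordinates in both genomes
        ref_pos, ref_strand = ref_genes[ref_id]
        query_pos, query_strand = query_genes[query_id]
        points.append((ref_pos, query_pos, ref_strand == query_strand))
    return sorted(points)


# Hypothetical coordinates; r2/r3 map to a rearranged, inverted region.
REF = {"r1": (1000, "+"), "r2": (5000, "+"), "r3": (9000, "-")}
QUERY = {"q1": (1200, "+"), "q2": (8800, "-"), "q3": (5100, "+")}
PAIRS = [("r1", "q1"), ("r2", "q2"), ("r3", "q3"), ("r4", "q9")]

points = dot_plot_points(REF, QUERY, PAIRS)
for x, y, same_strand in points:
    print(x, y, "same strand" if same_strand else "inverted")
```

Points that fall along a rising diagonal suggest conserved gene order, while runs with a falling slope and flipped strand suggest an inversion.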

Figure 4. Sequence Analysis Tools. (A) The Argo Genome Browser applet provides a fully interactive and dynamic view of genome sequences and annotations, (B) the Protein Structure Viewer provides dynamic visualization of structures for H37Rv along with annotated features, (C) the Gene Family Page provides an alignment of predicted orthologs across all organisms within TBDB, and (D) the Dot Plot provides a visualization of genome synteny between different organisms.

3. TB genetic diversity

Data from the M. tuberculosis Phylogeographic Diversity Sequencing Project are a recent addition to TBDB. Led by Sebastien Gagneux and Peter Small in collaboration with the NIAID-funded Broad Genomic Sequencing Center for Infectious Disease (http://www.broadinstitute.org/science/projects/gscid/genomic-sequencing-center-infectious-diseases), this project builds on existing models of TB global population structure [10] by re-sequencing 31 TB strains that were carefully selected as representatives of the global diversity of TB. Sequence polymorphisms between these strains were detected by alignment to the H37Rv genome sequence. All detected polymorphisms, as well as all read alignments, are now available through TBDB. These data are available under Genomic Data -> Diversity Sequencing and through the Gene Details pages for H37Rv.

The Gene Diversity Graph, available via the Gene Details page (Figure 3), provides a graphical view of the degree of nucleotide polymorphism at each position of a gene. At each position along a gene, the number of strains that have a polymorphism at this site relative to H37Rv is shown. Polymorphism counts are also color coded by whether they are synonymous, non-synonymous, intergenic, or indels. The bars are links to lists of the polymorphisms at that site (Figure 5B), and users can view a multiple alignment of the gene in all M. tuberculosis genomes in TBDB (Figure 5C).
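The per-position counts behind a graph of this kind can be sketched as follows. The record layout (strain, position, class), the strain labels, and the positions are hypothetical; only the idea of counting distinct strains per site and per polymorphism class comes from the description above.

```python
from collections import defaultdict


def diversity_counts(polymorphisms):
    """Map position -> {polymorphism class: number of distinct strains}."""
    strains = defaultdict(lambda: defaultdict(set))
    for strain, position, kind in polymorphisms:
        # A set ensures each strain is counted once per site and class.
        strains[position][kind].add(strain)
    return {
        position: {kind: len(names) for kind, names in kinds.items()}
        for position, kinds in strains.items()
    }


# Invented records relative to a reference gene; not real TBDB data.
RECORDS = [
    ("strain_A", 10, "synonymous"),
    ("strain_B", 10, "synonymous"),
    ("strain_B", 10, "synonymous"),  # duplicate report, counted once
    ("strain_C", 42, "non-synonymous"),
    ("strain_A", 42, "indel"),
]

counts = diversity_counts(RECORDS)
for position in sorted(counts):
    for kind, n in sorted(counts[position].items()):
        print(position, kind, "#" * n)
```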

Figure 5. TB Genetic Polymorphisms Tools. Re-sequencing data from over 30 strains representing TB global diversity are available through TBDB. Strains have been aligned to H37Rv and polymorphisms detected. Users can search for polymorphisms by strain (A) or by position (not shown). (B) Searches return a results page with a list of polymorphisms and summary data for each. (C) For any gene, users may view an alignment of all strains with polymorphisms displayed. (D) The Polymorphism Details page provides detailed information about each polymorphic locus in H37Rv.

Users can also search individual polymorphisms by either strain (Figure 5A) or position. The position search is particularly useful for users who have identified a polymorphism in their own strain and wish to see if this polymorphism has been seen in other strains. Searches return a list with summary information, and links are also provided to the alignment view for the corresponding gene (Figure 5C) and the Polymorphism Details page (Figure 5D).

The Polymorphism Details pages (Figure 5D) provide detailed information about each polymorphic locus in H37Rv. Using this page, users can access information about the location of the polymorphic locus, the reference nucleotide and amino acid in H37Rv, all alleles that differ from the reference strain, and the strains with these alternate alleles.

The polymorphisms presented in TBDB represent the analysis of an underlying set of aligned sequencing reads. Users may directly access these underlying alignments from the Polymorphism Details page. Selecting a strain name from one of the alleles on a Polymorphism Details page launches an application called GenomeView (http://www.broadinstitute.org/software/genomeview/) that displays the aligned reads for this strain centered on the polymorphic locus (Figure 6). GenomeView provides a dynamic and interactive genome browser-style visualization of the reference genome, features on the genome (e.g. genes) and aligned reads. With GenomeView, TBDB users may zoom from a full genome view down to a single nucleotide. Aligned reads show mismatches to the reference in yellow; called polymorphisms are positions with mismatches in a majority of reads and thus appear as a yellow vertical stripe. By providing access to the underlying read alignments, TBDB enables users to verify reported polymorphisms, look for possible missed polymorphisms, and visualize regions with low coverage where possible polymorphisms cannot be identified.

Figure 6. GenomeView display of short read alignments from a strain of TB sequenced as part of the TB Diversity Project aligned to H37Rv as the reference. Reads are displayed as green (forward reads) or blue (reverse reads) and mismatches between reads and the reference genome are indicated by a yellow square. Polymorphic positions display as a vertical strip of yellow. GenomeView is fully dynamic and interactive. Users may pan and zoom from the full genome to the nucleotide level.
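The majority-of-reads idea described above, together with the caveat about low coverage, can be sketched for a single site as follows. This is an illustrative simplification and not the project's actual variant-calling pipeline; the function name and the min_coverage threshold of 5 are assumptions.

```python
from collections import Counter


def call_site(ref_base, read_bases, min_coverage=5):
    """Classify one aligned position from the bases observed in reads.

    Returns (status, allele), where status is "low_coverage",
    "reference", or "polymorphism".
    """
    depth = len(read_bases)
    if depth < min_coverage:
        # Too few reads to say anything; the threshold is hypothetical.
        return ("low_coverage", None)
    mismatches = [base for base in read_bases if base != ref_base]
    if len(mismatches) > depth / 2:
        # A majority of reads disagree with the reference at this site.
        allele = Counter(mismatches).most_common(1)[0][0]
        return ("polymorphism", allele)
    return ("reference", ref_base)


print(call_site("A", list("TTTTAT")))
print(call_site("A", list("AAAATA")))
print(call_site("A", list("TT")))
```

Returning an explicit "low_coverage" status, instead of defaulting to the reference base, mirrors the point that absence of a call in a poorly covered region is not evidence of absence of a polymorphism.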

4. TB metabolic network reconstruction

The metabolic pathways and reactions of each organism in TBDB are now represented as a BioCyc Pathway/Genome database (http://biocyc.org/) within TBDB (and directly at http://tbcyc.tbdb.org) (Figure 7). Originally, these databases were created as a collaboration between SRI International and Stanford University, and subsequent updating of the dataset was assumed in 2006 by BioHealthBase BRC, now FluDB. In 2009, TBDB adopted the TB pathways collection and reconstructed the metabolic network for each organism by integrating gene annotations from TBDB with enzyme predictions from EFICAz, and subsequent curation based on recent genome-scale metabolic models [11,12]. Links from the gene details page open the corresponding pathway for a given enzyme, and links from genes within the pathway views open the corresponding gene details page. Metabolic map reconstructions may also be compared across organisms.

Figure 7. TB Metabolic Map and Integration with BioCyc. TBDB now supports a BioCyc instance that provides access to a genome scale metabolic network reconstruction for each organism in TBDB.
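A cross-organism comparison of reconstructions can be thought of, at its simplest, as a set comparison of the reactions each organism is predicted to catalyze. The sketch below assumes each organism is summarized as a mapping from gene to EC-style reaction identifiers; the gene names and identifier assignments are placeholders chosen for illustration and are not taken from TBDB or BioCyc.

```python
def reaction_set(gene_to_ec):
    """Collect every reaction identifier assigned to any gene."""
    reactions = set()
    for ec_numbers in gene_to_ec.values():
        reactions.update(ec_numbers)
    return reactions


def compare_networks(org_a, org_b):
    """Return reactions shared by both organisms and unique to each."""
    a, b = reaction_set(org_a), reaction_set(org_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Placeholder annotations for two hypothetical organisms.
ORG_A = {"geneA1": {"1.1.1.1"}, "geneA2": {"2.7.1.1", "4.1.1.1"}}
ORG_B = {"geneB1": {"1.1.1.1"}, "geneB2": {"6.3.1.2"}}

comparison = compare_networks(ORG_A, ORG_B)
print(comparison)
```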

Figure 8. Gene Expression Samples and Conditions displays the expression of a single gene, Rv2429, over all microarrays in TBDB. The significance histogram has been selected in the illustration.

5. Expression data

TBDB provides researchers a suite of tools to explore, visualize and analyze publicly available gene expression data generated from both the bacterial pathogen itself and its human and mouse hosts. Most gene expression data in TBDB were generated using microarrays, but TBDB also houses data generated using quantitative RT-PCR (and the sequences of the probes and primers of the validated TaqMan sets used to obtain the RT-PCR data), and is actively working to incorporate tools to explore RNA-seq data. In addition to making public data available, TBDB allows researchers to load their pre-publication gene expression data. Pre-publication data entry enables researchers to analyze and share data with their colleagues and collaborators via password-protected access. Such data remain private until the researchers publish them or decide to make them public. At that point, TBDB can export the data to public repositories such as the NCBI’s Gene Expression Omnibus (GEO) [13] or the EBI’s ArrayExpress [14] as required by various journals, in addition to making the data publicly available through TBDB. TBDB curators also import gene expression data from other resources like GEO and ArrayExpress or obtain data directly from the researchers following their publication. As of March 2010, TBDB has publicly available data for M. tuberculosis from more than 2100 microarrays derived from over 30 publications and several unpublished experiments. TBDB also hosts data from more than 500 microarrays from human and mouse TB related experiments and from the study of Streptomyces grown under different in vitro conditions. The latter allows knowledge about Streptomyces physiology and metabolic and biosynthetic pathways to be applied to M. tuberculosis, especially for orthologous genes and gene clusters.

In addition to access to the raw gene expression data and analysis tools, TBDB provides access to pre-analyzed data so researchers can quickly get answers to questions such as: What genes have expression patterns similar to that of my gene? Under what conditions does my gene show significant expression changes? Are the genes I am interested in co-expressed under a particular condition? Some of these tools are more fully described below.

5.1. Pre-analyzed data

Although TBDB provides a rich and powerful suite of data selection and analysis tools, novice users can find it daunting to deal with the raw data. We have thus developed tools that provide access to pre-analyzed data and present an easy means to rapidly access summary information about the expression pattern of a given gene or key signatures that exist in published datasets. After the gene expression data from a publication are obtained, loaded into TBDB and annotated, TBDB curators then filter, normalize, transform and cluster the data. These pre-clustered datasets are associated with the relevant publication, and are available to download or visualize. The Gene Profiles tool allows users to visualize the expression pattern for a gene of interest in published datasets and determine which genes show the most similar and dissimilar expression patterns in those datasets. In addition to gene expression profiling, data from each published set are subjected to a statistical analysis that provides information about which genes are significantly up or down-regulated in every sample or experimental condition. These data are the basis of the “Samples and Conditions” tool described below (Figure 8).
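One common way to find the "most similar and dissimilar" expression patterns is to rank genes by their correlation with the gene of interest. The text does not say which similarity measure TBDB uses, so the Pearson correlation below is an assumption, and the gene names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def rank_by_similarity(profiles, query):
    """Return (gene, correlation) pairs, most similar first."""
    target = profiles[query]
    scored = [
        (pearson(target, values), gene)
        for gene, values in profiles.items()
        if gene != query
    ]
    return [(gene, r) for r, gene in sorted(scored, reverse=True)]


# Hypothetical log-ratio profiles across four samples.
PROFILES = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 2, 4],
}

ranking = rank_by_similarity(PROFILES, "geneA")
for gene, r in ranking:
    print(gene, round(r, 2))
```

The top of the list holds the most similar profiles and the bottom holds the most dissimilar (anti-correlated) ones.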

5.2. Samples and conditions

The Samples and Conditions tool displays a histogram of significance values calculated for a given gene in all public expression data in TBDB (Figure 8). The experimental conditions are analyzed to determine if any condition is over-represented in the extreme values for that gene, so that those with a small p-value (and hence high significance) are highlighted. The histogram can also be applied to view the signal intensity of the original microarray data or the expression value of the gene, but here we cover the significance calculations. Samples with highly significant expression values for a given gene are on the right-hand side of the histogram. Users can manipulate sliders to reduce or expand the stringency of selected significance values. A table describing all the experimental conditions under which the gene’s expression meets the stringency criteria is provided to the right of the histogram (not shown). Data from the table can be downloaded, as can tab-delimited gene expression data from the microarrays that meet the selection criteria.
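One common way to ask whether a condition is over-represented among the extreme samples for a gene is a one-sided hypergeometric test. The text does not state which test TBDB applies, so the sketch below is an assumption, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(total_samples, condition_samples,
                       selected_samples, selected_with_condition):
    """Probability of seeing at least `selected_with_condition` samples from a
    condition among `selected_samples` drawn at random without replacement.

    total_samples:            all samples with data for the gene
    condition_samples:        samples annotated with the condition of interest
    selected_samples:         samples in the extreme tail chosen by the user
    selected_with_condition:  selected samples that carry the condition
    """
    upper = min(condition_samples, selected_samples)
    denominator = comb(total_samples, selected_samples)
    tail = sum(
        comb(condition_samples, k)
        * comb(total_samples - condition_samples, selected_samples - k)
        for k in range(selected_with_condition, upper + 1)
    )
    return tail / denominator


# Hypothetical numbers: 20 samples in total, 5 of them hypoxia samples;
# 4 samples fall in the extreme tail and 3 of those are hypoxia samples.
p = enrichment_p_value(20, 5, 4, 3)
```

A small value of `p` would indicate that the condition appears in the tail more often than expected by chance; in practice a correction for testing many conditions would also be needed.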

5.3. Cluster my genes

Using the Cluster My Genes tool, a user can explore gene expression profiles for a list of genes they are interested in from a publication, or from sets of samples chosen by specific annotated conditions like hypoxia, isoniazid, oleic acid, starvation, etc. This allows a user to pose the simple question: “are the genes on my list of interest co-expressed under a certain condition?” As shown in Figure 9, a user can choose samples based on experimental annotations, a publication, a mutation present in the background strain, or the specific strain from which the data were generated. In addition, a user can provide a list of genes they are interested in, or opt to see all genes, and then cluster the resulting data. The resulting cluster is then viewable using the Gene Profiles tool, described below.

5.4. Gene profiles

The Gene Profiles tool (Figure 10) allows users to explore pre-processed clustered data from each gene expression publication in TBDB without going through any data processing or analysis pipeline. In addition to providing a way to explore data from a single publication, Gene Profiles provides users with a method to explore the expression pattern for a given gene in each publication.

When a user clicks on a gene expression row in the Gene Profiles heat-map, the genes that have the most and least similar expression patterns are displayed. The gene locus’ accession number may be clicked to view the gene detail page for the selected gene.

5.5. Advanced expression data analysis tools

TBDB hosts a powerful suite of data selection, analysis and visualization tools commonly known as the analysis pipeline. The pipeline consists of a series of web pages that allows the user to customize microarray data selection, gene and microarray sample annotations, data filters and transformations as well as various microarray analysis tools. The data analysis pipeline is based on the tools available via the Stanford Microarray Database15 and therefore includes all tools in the powerful GenePattern microarray data analysis software package.16

Figure 9. Genes and Conditions. This interface allows users to select data for multiple genes across multiple conditions, based on various criteria, such as the experiments’ annotations, or the publications from where those experiments derive.

6. Publications

TBDB has associated several thousand publications with M. tuberculosis genes, and includes dozens of publications associated with gene expression data. Most of the gene expression data in TBDB are associated with a publication, and are thus well annotated with the experimental details. Data associated with a gene expression publication may be downloaded and are available for viewing and manipulation using the aforementioned tools. In addition to standard searches such as keyword, author name, abstract text or title, TBDB can execute gene name searches to find papers that provide biological insight or gene expression data for a given gene. Links to PubMed17 and full-text versions of a publication are provided so TBDB users can pursue the primary data and experiments described in a publication. TBDB publications can be found using the navigation menu called “Publications.” Cross-references between gene expression publications and the curated gene publications are provided.

7. Gene regulation

The integration of genome sequence data and expression data provides the opportunity to view both in the context of the TB gene regulatory network. TBDB provides a growing set of tools for analyzing gene regulation (Figure 11), all of which are accessible through the Tool Bar on each gene details page.

The most fundamental unit of gene regulation in bacteria is the operon. Conservation of gene order and orientation between adjacent genes has proven to be a strong indicator of operon structure.18,19 In addition, because genes in an operon are co-transcribed, a significant correlation in gene expression between genes in an operon is expected.20–22 The Operon Browser in TBDB provides an integrated view of both types of evidence (Figure 11A). The top half of the browser displays the correlation in expression between a set of neighboring genes: red diamonds indicate correlated expression and thus a triangle of red indicates a set of adjacent genes with correlated expression. The bottom half of the browser displays gene order across a range of species, centered on a particular gene, with orthologous genes displayed in identical colors. A set of colored genes in the same order and orientation suggests possible conserved operon structure.

Figure 10. Gene Profiles display shows a selected gene with the most correlated and anti-correlated genes.

Figure 11. Gene Regulation Tools. (A) The Operon Browser. The top half of the browser displays the correlation in expression between genes. The bottom half displays orthologous genes and gene order. (B) The Correlation Catalog tool provides a list of the genes that are most positively or negatively correlated with a target gene of interest within a set of expression data within TBDB. Selecting one gene opens a Gene Expression Scatter Plot showing the differential expression of the target and correlated genes under different experimental conditions. The Gene Expression Scatter Plot is also available directly from the gene details page tool bar.
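The gene-order evidence described above can be sketched as a count of how many other genomes keep two neighboring genes adjacent, on the same strand, and in the same order relative to the direction of transcription. This is an illustrative simplification, not the Operon Browser's method: the data layout, the function names, and the choice to ignore chromosome circularity are assumptions.

```python
def adjacent_pairs(genome):
    """Return same-strand neighbor pairs, written in transcription order.

    genome is a list of (ortholog_id, strand) tuples in chromosomal order,
    where strand is "+" or "-". Writing each pair in the direction of
    transcription makes an inverted copy of a region compare as equal.
    """
    pairs = set()
    for (g1, s1), (g2, s2) in zip(genome, genome[1:]):
        if s1 != s2:
            continue  # genes on opposite strands cannot share one transcript
        pairs.add((g1, g2) if s1 == "+" else (g2, g1))
    return pairs


def conservation_counts(reference, others):
    """For each candidate pair in the reference, count supporting genomes."""
    other_pairs = [adjacent_pairs(genome) for genome in others]
    return {
        pair: sum(pair in pairs for pairs in other_pairs)
        for pair in adjacent_pairs(reference)
    }


# Hypothetical ortholog labels A-D and X in three genomes.
reference = [("A", "+"), ("B", "+"), ("C", "+"), ("D", "-")]
inverted = [("D", "+"), ("C", "-"), ("B", "-"), ("A", "-")]  # same region, flipped
disrupted = [("A", "+"), ("B", "+"), ("X", "+"), ("C", "+")]  # X inserted before C
counts = conservation_counts(reference, [inverted, disrupted])
```

Pairs supported by many genomes would be stronger operon candidates, and such counts could be read alongside expression correlation between the same neighbors.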

Correlation of expression of genes that are not adjacent suggests possible co-regulation by the same regulatory factor(s). To facilitate the identification of co-regulated genes, each gene details page provides a link to the Correlation Catalog. This tool provides a list of the genes that are most positively or negatively correlated with a target gene of interest within a set of expression data within TBDB. For each correlated gene, a link is provided that displays a Gene Expression Scatter Plot of the differential expression of the target and correlated genes. In this view, users can also select subsets of the available expression data to plot to identify those conditions in which the pair of genes is most correlated in expression. The Gene Expression Scatter Plot is also available directly from the gene details page tool bar to allow users to select any pair of genes for visualization.
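A ranking of this kind can be sketched as follows. The text does not specify the correlation measure, so Pearson correlation is an assumption here, and the gene names, profile values and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_catalog(target, profiles, top_n=1):
    """Return the top_n most positively and most negatively correlated genes."""
    scored = [
        (gene, pearson(profiles[target], values))
        for gene, values in profiles.items()
        if gene != target
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    most_positive = scored[:top_n]
    most_negative = sorted(scored, key=lambda item: item[1])[:top_n]
    return most_positive, most_negative


# Hypothetical expression values for five genes across four samples.
profiles = {
    "geneT": [1.0, 2.0, 3.0, 4.0],
    "geneP": [2.0, 4.0, 6.0, 8.0],
    "geneN": [4.0, 3.0, 2.0, 1.0],
    "geneF": [1.0, 1.0, 1.0, 1.0],
    "geneM": [1.0, 3.0, 2.0, 4.0],
}
positive, negative = correlation_catalog("geneT", profiles)
```

Restricting each profile to a subset of samples before calling `correlation_catalog` would mimic the selection of conditions described for the scatter plot view.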

8. Future plans

In the next two years, TBDB plans to consolidate and strengthen its current suite of databases and tools and to expand into four additional areas of vital interest to the TB Research community:

• Enhanced user interface and training. Data from Google Analytics show that TBDB is accessed by more than 1,400 unique users each week. To further increase the utility of the site for the TB research community we have solicited and received written critiques of the site from several independent reviewers. In response to their comments and recommendations, major user interface enhancements have been implemented including the provision of additional tutorials. This process will continue and in addition will be enhanced by involvement of users in a community annotation project, the initiation of virtual lab meetings between TBDB staff and the research community and access to online individual assistance from a TBDB curator via [email protected]

• Next generation sequencing database capacity and tool development. TBDB will expand into two emerging areas of functional genomics made possible by the advent of Next Gen sequencing. Not only will we increase our capacity to host Next Gen sequencing data, but we will expand our suite of analytical and visualization tools focused on two applications: (1) the use of RNA-seq for expression profiling, re-annotation of operons and the identification of small RNAs that may play essential roles in gene regulation; and (2) ChIP-Seq for the identification of promoters bound by transcription factors.

• Immuno-profiling database and tool development. Recognizing the role of the host immune system in the control and pathogenesis of tuberculosis, we will enhance our capacity to host and analyze RNA expression data of M. tuberculosis-infected host tissues and develop a suite of tools for the analysis of data from the immuno-profiling assays, including: cell phenotyping by flow cytometry; phospho-flow data; and T-cell intracellular cytokine staining. To these datasets, we will add the capacity to add proteomics, glycomics and lipidomics data, resulting in a multi-dimensional portrait of host and pathogen from the same tissue.

• Tracking molecular epidemiology and drug resistance data on a spatial-temporal global map. TBDB will explore the interface between TB epidemiology/public health and functional and comparative genomics by providing data and tools to map in real time the emergence and geographical spread of drug resistance mutants, including MDR and XDR strains, and their molecular fingerprints.


Acknowledgements

Support for TBDB was provided by the Bill & Melinda Gates Foundation. The TB metabolic maps were originally created as a collaboration between SRI International and Stanford University and were funded by DARPA under contract N66001-01-C-8011 and by the NIH NIAID under grant AI44826. Additional enhancements were provided in 2006 by BioHealthBase BRC under contract from the NIH NIAID. We are grateful to the research community for their valuable input and suggestions in building and maintaining this database.

Funding: None.

Competing interests: None declared.

Ethical approval: Not required.

References

1. Reddy TB, Riley R, Wymore F, Montgomery P, DeCaprio D, Engels R, et al. TB database: an integrated platform for tuberculosis research. Nucleic Acids Res 2009;37:D499–D508.

2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215:403–10.

3. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997;25:955–64.

4. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, et al. Pfam: clans, web tools and services. Nucleic Acids Res 2006;34:D247–D251.

5. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, et al. Rfam: updates to the RNA families database. Nucleic Acids Res 2009;37:D136–D140.

6. Murry JP, Sassetti CM, Lane JM, Xie Z, Rubin EJ. Transposon site hybridization in Mycobacterium tuberculosis. Methods Mol Biol 2008;416:45–59.

7. Sassetti CM, Boyd DH, Rubin EJ. Genes required for mycobacterial growth defined by high density mutagenesis. Mol Microbiol 2003;48:77–84.

8. Sassetti CM, Boyd DH, Rubin EJ. Comprehensive identification of conditionally essential genes in mycobacteria. Proc Natl Acad Sci U S A 2001;98:12712–7.

9. Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009;25:1189–91.

10. Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, Homolka S, et al. High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol 2008;6:e311.

11. Beste DJ, Hooper T, Stewart G, Bonde B, Avignone-Rossa C, Bushell ME, et al. GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism. Genome Biol 2007;8:R89.

12. Jamshidi N, Palsson BO. Investigating the metabolic capabilities of Mycobacterium tuberculosis H37Rv using the in silico strain iNJ661 and proposing alternative drug targets. BMC Syst Biol 2007;1:26.

13. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 2009;37:D885–D890.

14. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, et al. ArrayExpress update - from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 2009;37:D868–D872.

15. Hubble J, Demeter J, Jin H, Mao M, Nitzberg M, Reddy TB, et al. Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 2009;37:D898–D901.

16. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP. GenePattern 2.0. Nat Genet 2006;38:500–1.

17. Giglia E. New year, new PubMed. Eur J Phys Rehabil Med 2009;45:155–9.

18. Edwards MT, Rison SC, Stoker NG, Wernisch L. A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucleic Acids Res 2005;33:3253–62.

19. Westover BP, Buhler JD, Sonnenburg JL, Gordon JI. Operon prediction without a training set. Bioinformatics 2005;21:880–8.

20. Craven M, Page D, Shavlik J, Bockhorst J, Glasner J. A probabilistic learning approach to whole-genome operon prediction. Proc Int Conf Intell Syst Mol Biol 2000;8:116–27.

21. Sabatti C, Rohlin L, Oh MK, Liao JC. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res 2002;30:2886–93.

22. Tjaden B, Haynor DR, Stolyar S, Rosenow C, Kolker E. Identifying operons and untranslated regions of transcripts using Escherichia coli RNA expression analysis. Bioinformatics 2002;18(Suppl. 1):S337–S344.

Please cite this article in press as: Galagan JE, et al., TB database 2010: Overview and update, Tuberculosis (2010), doi:10.1016/j.tube.2010.03.010