Fungal Genetics and Biology 41 (2004) 285–292
www.elsevier.com/locate/yfgbi
Commentary
Comparative sequence analysis of Sordaria macrospora and Neurospora crassa
as a means to improve genome annotation
Minou Nowrousian, Christian Würtz, Stefanie Pöggeler, and Ulrich Kück *
Lehrstuhl für Allgemeine und Molekulare Botanik, Ruhr-Universität Bochum, 44780 Bochum, Germany
Received 15 August 2003; accepted 22 October 2003
Abstract
One of the most challenging parts of large-scale sequencing projects is the identification
of functional elements encoded in a genome. Recently, studies of genomes of up to six
different Saccharomyces species have demonstrated that a comparative analysis of genome
sequences from closely related species is a powerful approach for identifying open reading
frames and other functional regions within genomes [Science 301 (2003) 71, Nature 423
(2003) 241]. Here, we present a comparison of selected sequences from Sordaria macrospora
to their corresponding Neurospora crassa orthologous regions. Our analysis indicates that,
due to the high degree of sequence similarity and conservation of overall genomic
organization, S. macrospora sequence information can be used to simplify the annotation of
the N. crassa genome.
© 2003 Elsevier Inc. All rights reserved.
Keywords: Sordaria macrospora; Neurospora crassa; Filamentous fungi; Genome annotation;
Exon–intron boundaries; Synteny; Comparative genomics
1. Introduction
Over the last few years, several fungal genomes have been fully sequenced, and even more
will be sequenced in the near future (Galagan et al., 2003; Goffeau et al., 1996,
http://www-genome.wi.mit.edu/seq/fgi/candidates.html,
http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map00?taxid=5085). However, the most challenging
part of genome analysis today is usually not generating the sequence but annotating it.
Features like open reading frames, exon–intron boundaries and regulatory elements within a
genome are often difficult to predict correctly from genomic sequence information alone.
Identification of transcribed regions can be improved if EST¹ sequences are available, but
EST databases tend not to contain rarely transcribed genes, so often no cDNA sequence
information is available. Comparison with public databases is an additional means of
identifying open reading frames, but for many genes, no putative homologs are available in
the databases; e.g., many fungal EST sequencing projects have found that only 50% or fewer
of the ESTs generated have already characterized homologs among other organisms (e.g.,
Nelson et al., 1997; Prade et al., 2001; Zhu et al., 2001). Microarray data or other
large-scale transcriptome data can help to discover genes that are regulated similarly at
the transcriptional level and therefore might contain common promoter elements (e.g.,
Hughes et al., 2000; Ren et al., 2000; Roth et al., 1998), but this approach to identifying
regulatory elements is labor-intensive and requires an EST library or annotation of open
reading frames as a prerequisite for generating microarrays.
Information from within the genome as well as comparisons to databases and experimental
data have been used to annotate the genome of Saccharomyces cerevisiae, the first eukaryote
to be sequenced (Goffeau et al., 1996) and the one for which annotation has progressed
furthest (http://www.yeastgenome.org/). Nevertheless, the gene count obtained with
different methods of analysis has varied considerably (e.g., Malpertuy et al., 2000).
Recently, it was shown for several
Supplementary data associated with this article can be found, in the online version, at
doi: 10.1016/j.fgb.2003.10.005.
* Corresponding author. Fax: +49-234-321-4184. E-mail address: [email protected] (U. Kück).
¹ Abbreviations used: EST, expressed sequence tag; ORF, open reading frame; indel, site
corresponding to an insertion or deletion.
1087-1845/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.fgb.2003.10.005
Fig. 1. Comparison of intron 3 and adjacent regions of the S. macrospora (S.m.) and N. crassa
(N.c.) pro11 genes. Intron sequences are indicated in lower case and are underlined. A region
that is annotated as part of intron 3 in N. crassa, but most likely is part of exon 4, is
shaded in gray. For further information see text.
[Fig. 2 plot: x-axis, % identity exons (75.0–100.0); y-axis, % identity introns (30.0–90.0)]
Fig. 2. Conservation of introns is not linked to conservation of exonic sequences. For 57
genes, sequence identity of exons was calculated separately from that of the introns for each
gene. A table with the genes used in this comparison can be found in the supplementary
material (Table 2). The graph depicts the nucleic acid identity of intronic sequences (y-axis)
versus exonic sequences (x-axis) for the 57 genes. Black triangles indicate genes for which
only partial S. macrospora sequence information was available for comparison. Open squares
indicate complete genes used for this comparison.
[...] sequences for intron donor, acceptor, and branch site are almost identical in
S. macrospora and N. crassa (data not shown). Intron positions within open reading frames in
S. macrospora were highly conserved when compared to N. crassa in all cases investigated. In
some cases, there were slight variations of the exact intron start or end, but these might
indicate annotation errors rather than true biological differences. An example of this can be
seen at the 3′ end of the third intron of the pro11 gene (Fig. 1; Pöggeler and Kück, 2003).
The flanking exons are highly similar whereas within the intron, the homology is
significantly lower and several indels are present (Fig. 1). In the N. crassa homolog
(NCU08741.1), the intron 3′ end was annotated another 24 nt downstream of the predicted 3′ end
within S. macrospora, but these 24 nt are 100% identical to the corresponding S. macrospora
sequence. This indicates that they might constitute exonic sequence rather than intron
sequence. Sequencing of the S. macrospora pro11 cDNA confirmed the exon/intron boundary at the
position indicated in Fig. 1.
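The reasoning applied to the 3′ end of pro11 intron 3 can be illustrated with a small sketch: the identity of the disputed segment is computed separately from that of the remaining annotated intron, and the two values are compared. This is only a minimal Python illustration, not the procedure used by the authors. The function names, the segment coordinates, and the toy aligned sequences are hypothetical; in particular, the toy sequences are invented and are not the pro11 sequences of Fig. 1.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent of identical columns in two equal-length aligned sequences.

    Columns in which either sequence carries a gap ('-') count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    if not aln_a:
        raise ValueError("empty alignment")
    matches = sum(
        1 for x, y in zip(aln_a.upper(), aln_b.upper()) if x == y and x != "-"
    )
    return 100.0 * matches / len(aln_a)


def compare_segment(aln_a: str, aln_b: str, start: int, end: int):
    """Return (identity of columns [start, end), identity of all other columns)."""
    if not 0 <= start < end <= len(aln_a):
        raise ValueError("segment coordinates out of range")
    segment = percent_identity(aln_a[start:end], aln_b[start:end])
    remainder = percent_identity(
        aln_a[:start] + aln_a[end:], aln_b[:start] + aln_b[end:]
    )
    return segment, remainder


# Invented toy alignment (NOT real pro11 data): 21 intron-like columns with
# indels, followed by a 24-column segment whose annotation is in question.
toy_sm = "gtaagt--cactaacttgcag" + "ATGGCTTCCAAGGTCGACCTGAAG"
toy_nc = "gtaagtcacattaa--tgcag" + "ATGGCTTCCAAGGTCGACCTGAAG"

segment, remainder = compare_segment(toy_sm, toy_nc, 21, 45)
print(f"disputed segment: {segment:.1f}% identical; remainder: {remainder:.1f}%")
```

A disputed segment that is fully conserved while the remaining intron columns contain mismatches and indels would, as argued for the 24 nt in N. crassa, point towards exonic rather than intronic sequence; confirmation still requires cDNA data, as in the text.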
3. Conservation of introns is not linked to conservation of exon sequences
As described before, introns and exons of a given gene are colinear and highly similar in
S. macrospora and N. crassa. An interesting question that can be asked in a case like this is
whether intron and exon conservation are correlated. In other words, would a high degree of
exon similarity between two orthologous genes also mean a high degree of intron similarity?
To answer this question, average exon and intron similarities were calculated separately for
57 genes that have introns (Table 2 in the supplementary material). Fig. 2 shows a graph of
exon identities and corresponding intron identities. If there were a simple correlation
between exon and intron identity, one would expect a linear distribution. However, as
indicated in Fig. 2, this is not the case. Nor could we identify any other statistically
significant correlation between intron and exon identities (data not shown). Thus, for the
genes included in our study, there is no apparent correlation between intron and exon
sequence identity within a given gene. Usually, exons are under strong selective pressure to
preserve their coding capacities, whereas introns simply have to retain their splicing signal
sequences; thus, this finding might not be unexpected. However, comparing intron sequence
identities might be a means of identifying regions within a genome that accumulate mutations
more readily than others. This might help to identify regions which are more susceptible to
mutagenic influences or are less efficiently repaired by DNA repair mechanisms. Analyses like
this might be even more revealing in genomes with higher intron content, e.g., the human
genome.
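The test for a "simple correlation" between exon and intron identity can be sketched with a Pearson correlation coefficient computed over per-gene identity pairs. The text does not state which statistic the authors used, so Pearson's r is an assumption made here for illustration only. The function name and the six value pairs below are hypothetical and invented; they are not taken from Table 2 of the supplementary material.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Invented per-gene identities in percent (NOT data from the paper).
toy_exon_identity = [88.0, 90.5, 92.0, 94.5, 96.0, 97.5]
toy_intron_identity = [62.0, 45.0, 71.0, 50.0, 66.0, 48.0]

r = pearson_r(toy_exon_identity, toy_intron_identity)
print(f"Pearson r between exon and intron identity: {r:.2f}")
```

An r close to 0 would correspond to the scattered distribution described for Fig. 2, whereas a value close to 1 would correspond to the linear distribution expected under a simple correlation. Judging significance would additionally require a test appropriate to the sample size.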
4. S. macrospora and N. crassa share a high degree of synteny that simplifies the identification of open reading frames
Comparison of larger regions of S. macrospora and N. crassa DNA revealed that sequence
identity between the two species is present even outside of open reading frames. The regions
show nearly complete synteny and can be readily aligned at the nucleotide level (Figs. 3A and
4A). Within the 15 kb region shown in Fig. 3A, five genes (pho88, rad14, pro11, trnN, and etp)
can be identified in the same orientation in both organisms. A sixth predicted N. crassa gene,
NCU08740.1, has no homolog in S. macrospora; in fact, the corresponding ORF and adjoining
sequences are absent from this region of the S. macrospora genome (Fig. 3B). There are several
possible explanations for this apparent absence of the gene from the region. It might, for
example, be located elsewhere in the S. macrospora genome, or it might not be present in
S. macrospora at all. The latter possibility would indicate that it is a gene that is not
necessary for S. macrospora, but might be useful for N. crassa. Alternatively, NCU08740.1
might not be a true open reading frame. It is a rather short ORF of 200 nt, 125 nt of which
comprise a predicted intron sequence, and the derived polypeptide sequence is 24 amino acids
long (the remaining 75 nt are consistent with 24 codons plus a stop codon). Such short open
reading frames are often difficult to predict from DNA sequence information alone. Further
information about whether this is a true gene might be gained by comparing sequences from
other closely related species.
Fig. 3. Synteny between S. macrospora (S.m.) and N. crassa (N.c.) in a 15 kb genomic region.
Exons are given as black, introns as white, and intergenic regions as gray boxes. (A) Nucleic
acid identity between the two genome regions. A region which is shown in detail in (B) is
indicated above the S. macrospora DNA. (B) The N. crassa open reading frame NCU08740.1 and
flanking regions are not present at their corresponding site in the S. macrospora genome.
Sequence identity between the adjoining genes pro11 and trnN as well as intergenic sequences
are indicated between the S. macrospora and N. crassa DNA. For more information see text.
Fig. 4. Comparison of the acl-gene containing regions from S. macrospora (S.m.) and N. crassa
(N.c.). Exons are given as black, introns as white, and intergenic regions as light gray
boxes. Exons of the open reading frame NCU06784.1 for which no S. macrospora homologue can be
identified are given in dark gray. (A) A syntenic region of about 10 kb contains the acl1 and
acl2 genes (Nowrousian et al., 2000). Nucleic acid identity is indicated between the two
sequences. A part which is shown in detail in (B) is indicated above the S. macrospora DNA.
(B) Intergenic region between acl1 and acl2. The upper part of (B) shows sequence identities
determined separately for putative exons, the intron, and upstream and downstream regions of
N. crassa ORF NCU06784.1 to their S. macrospora counterparts. The lower part of (B) gives
sequence identity between S. macrospora and N. crassa in various parts of the intergenic
regions as determined by local alignment using LALIGN (Huang and Miller, 1991). (C) Sequence
alignment of N. crassa ORF NCU06784.1 and 100 nt of upstream and downstream regions to its
corresponding S. macrospora counterpart. Putative translation start and stop codons are given
in bold, intronic sequences in lower case. Indels within the putative ORF that do not contain
a multiple of three nucleotides, and therefore would result in frame-shift mutations, are
shaded in gray.
Another predicted Neurospora ORF which cannot be verified in S. macrospora is NCU06784.1
(Fig. 4). NCU06784.1 is part of a larger region of ~10 kb which is strongly homologous in both
organisms. Upstream and downstream from NCU06784.1 are the acl2 and acl1 genes, respectively.
In this case, a region with high homology to NCU06784.1 is present in S. macrospora, but no
bona fide open reading frame can be identified. At the position of the ATG in Neurospora is a
GTG in Sordaria. GTG as a start codon has been reported within filamentous fungi (e.g.,
Gutiérrez et al., 1991), but several indels that are not multiples of three interrupt the
S. macrospora open reading frame (Fig. 4C), which makes it unlikely that this is a real gene
in S. macrospora. Additional hints that NCU06784.1 is not a true ORF come from the fact that
the presumptive coding and non-coding regions in this case do not differ significantly in
their degree of homology, in contrast to the other genes compared. In fact, within the 3 kb
intergenic region between acl1 and acl2, there are several regions of equally high or higher
nucleic acid identity than the predicted ORF NCU06784.1 (Fig. 4B). As acl1 and acl2 are
divergently transcribed, the 3 kb intergenic region most likely contains promoter sequences
which regulate the expression of both genes. The high degree of overall conservation of this
region might indicate regulatory mechanisms common to S. macrospora and N. crassa instead of
marking an open reading frame.
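The indel argument used for NCU06784.1 can be sketched as a scan of a pairwise alignment for gap runs whose length is not a multiple of three. This is a minimal Python illustration, not the authors' procedure. It assumes that the aligned sequences passed in cover only the putative coding region with intron columns already removed. The function names and the toy alignment are hypothetical and invented; they do not reproduce the alignment of Fig. 4C.

```python
def gap_runs(aligned: str):
    """Yield (start_column, length) for every run of '-' in an aligned sequence."""
    i, n = 0, len(aligned)
    while i < n:
        if aligned[i] == "-":
            j = i
            while j < n and aligned[j] == "-":
                j += 1
            yield i, j - i
            i = j
        else:
            i += 1


def frameshift_indels(aln_first: str, aln_second: str):
    """List indels whose length is not a multiple of three.

    Each hit is (start_column, length, sequence carrying the gap).
    Assumes both strings are the coding part of one pairwise alignment.
    """
    if len(aln_first) != len(aln_second):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for label, seq in (("first", aln_first), ("second", aln_second)):
        for start, length in gap_runs(seq):
            if length % 3 != 0:
                hits.append((start, length, label))
    return sorted(hits)


# Invented toy alignment (NOT the NCU06784.1 alignment): the second sequence
# has a GTG at the start position, a 2-nt gap and a 3-nt gap.
toy_nc = "ATGGCTAAGGTCGACTTCTGA"
toy_sm = "GTGGCT--GGTCGAC---TGA"

for start, length, label in frameshift_indels(toy_nc, toy_sm):
    print(f"frame-shifting gap of {length} nt at column {start} in {label} sequence")
```

In the toy example only the 2-nt gap is reported, because the 3-nt gap removes a whole codon and leaves the reading frame intact. Several such frame-shifting indels within a putative ORF, as described for the S. macrospora region, argue against a conserved coding sequence.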
One way to shed light on questions like this would be to include sequence information from
further close relatives of S. macrospora and N. crassa in the analysis. Especially
information about the absence or presence of indels as well as the degree of conservation of
exons within a less conserved sequence environment might help to identify the most likely
open reading frames from genomic DNA sequence information alone.
5. Which additional fungal genomes might be sequenced for a comparative genomics approach?
Sequencing and annotation of the N. crassa genome has already greatly advanced our knowledge
of fungal genome organization (Galagan et al., 2003). Annotation of genome sequences from
closely related species will be much easier with the N. crassa genome available, and the
N. crassa annotation itself will become much more reliable with the possibility of comparing
two or more genomes. A prerequisite for this is that the compared sequences are similar
enough to show a sufficient degree of synteny. The comparisons presented here indicate that
the S. macrospora genome is eminently suitable for this purpose, because it is similar enough
to be readily aligned at the nucleotide level even outside of coding regions, but has
acquired a sufficient degree of dissimilarity, especially in non-coding regions, to provide
an adequate signal-to-noise enrichment for distinguishing functional from non-functional
sites. Another point of interest might be the fact that in S. macrospora, no indication of
RIP (repeat-induced point mutation) has been found yet (Le Chevanton et al., 1989). RIP was
originally discovered in N. crassa, where it inactivates duplicated sequences during the
sexual phase of the life cycle (Selker et al., 1987). It was also shown to exist in a milder
form in Podospora anserina (Graia et al., 2001; Hamann et al., 2000) and Magnaporthe grisea
(Ikeda et al., 2002). RIP is thought to be responsible for the surprisingly low number of
multigene families and duplicated sequences observed in N. crassa (Galagan et al., 2003).
Therefore, a comparison of the N. crassa genome with that of S. macrospora will be most
interesting with respect to the divergent evolution of two closely related genomes, one of
which displays a very active form of RIP while the other does not.
However, as has been demonstrated for several Saccharomyces species, comparative genomics
gains power with the number of species investigated (Cliften et al., 2003; Kellis et al.,
2003). Which additional filamentous fungi might be suitable candidates for such an approach?
The genomes of the pyrenomycetes P. anserina and M. grisea are at present being sequenced and
annotated. Both species are much more distant relatives of N. crassa than is S. macrospora,
and previous analyses have shown that synteny between N. crassa and P. anserina or M. grisea,
respectively, is limited (Hamer et al., 2001; Silar et al., 2003). In both P. anserina and
M. grisea, intergenic regions are not conserved even within syntenic regions. Thus, the
genomes of P. anserina and M. grisea will certainly advance our knowledge of the biology of
filamentous fungi, but they are less suited for a comparative genomics approach with
N. crassa. In a white paper describing the
aims of the fungal genome initiative, one of the organisms
included in a list of fungi for initial sequencing is Neu-