Rogozin et al. Biology Direct 2012, 7:11
http://www.biology-direct.com/content/7/1/11
REVIEW Open Access
Origin and evolution of spliceosomal introns
Igor B Rogozin1, Liran Carmel2, Miklos Csuros3 and Eugene V Koonin1*
Abstract
Evolution of exon-intron structure of eukaryotic genes has been
a matter of long-standing, intensive debate. The introns-early
concept, later rebranded ‘introns first’, held that protein-coding
genes were interrupted by numerous introns even at the earliest
stages of life's evolution and that introns played a major role in
the origin of proteins by facilitating recombination of sequences
coding for small protein/peptide modules. The introns-late concept
held that introns emerged only in eukaryotes and new introns have
been accumulating continuously throughout eukaryotic evolution.
Analysis of orthologous genes from completely sequenced eukaryotic
genomes revealed numerous shared intron positions in orthologous
genes from animals and plants and even between animals, plants and
protists, suggesting that many ancestral introns have persisted
since the last eukaryotic common ancestor (LECA). Reconstructions of
intron gain and loss using the growing collection of genomes of
diverse eukaryotes and increasingly advanced probabilistic models
convincingly show that the LECA and the ancestors of each
eukaryotic supergroup had intron-rich genes, with intron densities
comparable to those in the most intron-rich modern genomes such as
those of vertebrates. The subsequent evolution in most lineages of
eukaryotes involved primarily loss of introns, with only a few
episodes of substantial intron gain that might have accompanied
major evolutionary innovations such as the origin of metazoa. The
original invasion of self-splicing Group II introns, presumably
originating from the mitochondrial endosymbiont, into the genome of
the emerging eukaryote might have been a key factor of
eukaryogenesis that in particular triggered the origin of
endomembranes and the nucleus. Conversely, splicing errors gave rise
to alternative splicing, a major contribution to the biological
complexity of multicellular eukaryotes. There is no indication that
any prokaryote has ever possessed a spliceosome or introns in
protein-coding genes, other than relatively rare mobile
self-splicing introns. Thus, the introns-first scenario is not
supported by any evidence but exon-intron structure of
protein-coding genes appears to have evolved concomitantly with the
eukaryotic cell, and introns were a major factor of evolution
throughout the history of eukaryotes. This article was reviewed by
I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias
Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the
complete reports, see the Reviewers’ Reports section.
Keywords: Intron sliding, Intron gain, Intron loss, Spliceosome,
Splicing signals, Evolution of exon/intron structure, Alternative
splicing, Phylogenetic trees, Mobile domains, Eukaryotic ancestor
Genes in pieces: exon-intron structure of eukaryotic genes and
the two spliceosomes
In a memorable phrase of Walter Gilbert, eukaryotes possess “genes
in pieces” in which protein-coding sequences are interrupted by
non-coding sequences denoted introns [1]. The introns are excised at
the donor and acceptor splice sites such that the flanking coding
regions, exons, are spliced by an extremely complex
ribonucleoprotein molecular machine, the spliceosome [2,3]. Multiple
introns interrupt the coding sequences in the great majority of
genes in animals and plants, whereas intron densities in fungi and
unicellular eukaryotes are highly variable: many of the unicellular
forms contain only a few introns in the entire genome whereas in
others the intron density approaches that in animals and plants
[4-6]. Remarkably, however, there is no sequenced genome of a
full-fledged eukaryote without introns at all; only one intronless
genome of a highly degraded remnant of a eukaryotic organism, a
nucleomorph that has also lost the genes for the spliceosome
subunits, has been reported [7].
* Correspondence: [email protected] Center for
Biotechnology Information NLM/NIH, 8600 Rockville Pike, Bldg. 38A,
Bethesda, MD 20894, USA
Full list of author information is available at the end of the
article
© 2012 Rogozin et al.; licensee BioMed Central Ltd. This is an Open
Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
The ubiquity of introns in eukaryotes is complemented by the
conservation of the spliceosome. The spliceosome consists of five
snRNPs (small nuclear ribonucleoprotein particles), together with
numerous less stably associated proteins; the core of the
spliceosome is conserved in all well-characterized eukaryotes
[2,3,8]. The spliceosome interacts with specific sites in the intron
and the flanking exons to ensure accurate and efficient splicing.
The nucleotides at the intron termini and the adjacent nucleotides
in the exons are involved in these interactions and comprise the
splicing signals. The (A/C)AG|GU(A/G)AGU sequence (the splice site
is shown by the vertical streak and the first two nucleotides of the
intron are underlined) at the donor splice signal is complementary
to the 5’ end of the U1 snRNA, and this interaction appears to be
the major requirement for splicing [9-11]. The (C,U)AG|G sequence
(the last two nucleotides of the intron are underlined) preceded by
a polypyrimidine tract is typical of the acceptor splice signal
(Figure 1) and is recognized by the U5 snRNA [12,13]. A short branch
point signal is located in the intron sequence upstream of the
acceptor splice signals and contains the reactive adenosine that is
involved in the formation of the lariat-like structure in the
splicing intermediate [12,13]. The functionally important
(A/C)AG||G exon sequences flanking introns have been dubbed
protosplice sites with the implication that new introns insert into
sites of this structure [14,15]. Some lineage-specific deviations
from the canonical variants of splice signals are known to exist.
For example, some unicellular eukaryotes lack recognizable polyT
tracts between the branch point signal and the 3’ splice signal
[16,17]. Some extremely intron-poor species such as yeast possess an
unusual, strictly constrained donor splice signal
|GTA(T,A,C)G(T,A,C) with a substantial excess of T at position +4
[16-18].
Figure 1 Consensus motifs for donor and acceptor splicing signals.
The Y axis indicates the strength of splicing signals (base
composition bias based on information content). The data is from
[19].
The vast majority of spliceosomal introns contain |GT at the donor
splice site and AG| at the acceptor splice site. However, a distinct
class of rare introns has been recognized on the basis of their
unusual terminal dinucleotides: these introns contain |AT at the
donor splice site and AC| at the acceptor splice site [20,21]. A
closer examination of the sequences of these atypical introns
revealed several properties that distinguish them from the majority
of the introns including conservation of unusual signals at the
donor splice signal (|ATATCCTT) and immediately upstream of the
acceptor splice signal (TCCTTAAC 10-15 bases from the splice
junction) [20,21]. Introns of this class are excised by a distinct,
so-called minor or U12 spliceosome, which contains several specific,
low-abundance snRNPs. It has been subsequently shown that some
|GT-AG| introns are also removed by the U12 spliceosome [22]. The
U12 introns and the associated minor spliceosome, unlike the major
U2 spliceosome, are not universally conserved but are nonetheless
widespread in eukaryotes, being represented in vertebrates, insects,
plants, and some protists [23-26].
Phylogenomic reconstructions for the small RNA and protein subunits
of the U2 and U12 spliceosomes suggest that both spliceosomes were
already present in the last common ancestor of the extant eukaryotes
(LECA, Last Eukaryotic Common Ancestor) as a result of ancient
duplication of the genes for the respective components [24]. Taking
into account a potentially important role of U12 introns in
regulation of gene expression [27-29], it might be tempting to
speculate that the ancestral introns were of the U12 type (for
example, see discussion by the reviewer #3 below) but have been
subsequently converted to U2 introns. However, comparison of
protosplice sites (exonic sequences surrounding introns) of ancient
U2 and U12 introns in human and Arabidopsis revealed close
similarity of ancestral introns to U2 but not to U12. Thus, the
primordial spliceosomal introns were most likely of the U2-type
[30].
The two principal mechanisms of splicing signal recognition are
known as exon definition and intron definition [31-34]. Evidence of
these two mechanisms has come from analyses of interactions between
pre-mRNAs and various splicing factors [32,33,35]. The exon
definition mechanism involves SR proteins binding to exonic splicing
enhancers (ESE) and recruiting U1 to the downstream donor splicing
signal and the splicing factor U2AF to the upstream acceptor
splicing signal. The U2AF factor then recruits U2 to the branch
site. Therefore, when the SR proteins bind the ESEs, they promote
formation of a
“cross-exon” recognition complex by placing the basal splicing
machinery at the splice sites flanking the same exon. The intron
definition mechanism requires binding of U1 to the upstream donor
splice site and binding of U2AF/U2 to the downstream acceptor splice
signal and branch site, respectively, of the same intron. Therefore,
intron definition selects pairs of splice sites located on both ends
of the same intron, and SR proteins can also mediate this process
[32,36]. The efficiency of splicing under the exon definition
depends on the length of exons but is not affected by the length of
introns; conversely, under the intron definition, the efficiency of
splicing depends on the length of introns, but not that of exons
[31-35,37].
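The splice signals quoted in this section can be turned into a small
illustrative check. The following Python sketch is not a method from
the cited studies: the function names are hypothetical, sequences
are written in the DNA alphabet (U replaced by T), and the labels
rest only on the terminal dinucleotides and the |ATATCCTT donor
motif mentioned above. Because some |GT-AG| introns are removed by
the U12 spliceosome, the "U2-like" label should be read as a first
guess rather than an assignment.

```python
import re

# Patterns transcribed from the consensus sequences quoted in the text,
# written in the DNA alphabet (U -> T). All names here are hypothetical.
DONOR_CONSENSUS = re.compile(r"[AC]AGGT[AG]AGT")  # (A/C)AG|GU(A/G)AGU
U12_DONOR_MOTIF = "ATATCCTT"                      # |ATATCCTT


def _dna(seq: str) -> str:
    """Upper-case a sequence and convert RNA to the DNA alphabet."""
    return seq.upper().replace("U", "T")


def terminal_dinucleotides(intron: str) -> str:
    """Return the terminal dinucleotides of an intron, e.g. 'GT-AG'."""
    seq = _dna(intron)
    if len(seq) < 4:
        raise ValueError("intron sequence is too short")
    return f"{seq[:2]}-{seq[-2:]}"


def classify_intron(intron: str) -> str:
    """Crude label based only on intron termini and the U12 donor motif."""
    seq = _dna(intron)
    ends = terminal_dinucleotides(seq)
    if ends == "AT-AC" or seq.startswith(U12_DONOR_MOTIF):
        return "U12-like"
    if ends == "GT-AG":
        return "U2-like"
    return "noncanonical"


def donor_matches_consensus(upstream_exon: str, intron: str) -> bool:
    """Compare the last 3 exon bases plus the first 6 intron bases
    with the full donor consensus."""
    window = _dna(upstream_exon)[-3:] + _dna(intron)[:6]
    return DONOR_CONSENSUS.fullmatch(window) is not None
```

Real donor sites rarely match the full nine-base consensus, so
`donor_matches_consensus` is a strict test; a position weight matrix
would be the usual choice for scoring partial matches.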
Introns-early, introns-late, introns-first: the competing
scenarios of intron origin and evolution
Evolution of exon-intron structure of eukaryotic genes and
evolutionary properties of introns had been long considered in the
context of the “introns-early” vs. “introns-late” debate [38-42].
The original, “strong” introns-early hypothesis held that eukaryotic
genes inherited (nearly) all introns from prokaryotic ancestors and
that the differences in gene structure among homologous eukaryotic
genes were due mostly to differential intron loss [39]. Under this
scenario, the extant prokaryotes have lost all the primordial
introns and the spliceosome in the process of ‘genome streamlining’.
The later adaptations of the introns-early hypothesis assumed an
intermediate position by allowing emergence of some new introns, in
addition to the ancient ones [40]. The introns-late concept
countered that introns were a eukaryotic novelty and new introns
have been emerging continuously throughout eukaryotic evolution; in
this scenario, bacteria and archaea never possessed introns or the
spliceosome [41,43,44]. These hypotheses have been later merged into
a synthetic concept that can be denoted ‘many introns early in
eukaryotic evolution’ [45,46] and that we discuss in greater detail
below. In addition, there has been an attempt to revitalize the
introns early idea in the ‘introns first’ scenario according to
which exons of protein-coding genes emerged from the primordial
introns, i.e. non-coding regions that are presumed to have been
interspersed between functional RNA sequences in the genes that
existed in the RNA world and antedated proteins [47,48].
Figure 2 Intron density and intron length in 100 eukaryotes. The
data is from [53].
[Plot not reproduced. X axis: introns per kbp coding sequence; Y
axis: intron length (ticks at 100 bp and 1 kbp); groups shown:
Fungi, Holozoa, Amoebozoa, Green plants & algae, Naegleria,
Emiliana, Alveolates & heterokonts; trend line labeled
ln y = 0.558x + 2.7089.]
Intron density, size and distribution in protein-coding genes
across the eukaryote domain
Genes of eukaryotes from different groups dramatically differ in
intron density and size distribution, from only a few introns in the
entire genome (that is, near zero density per gene or per kilobase)
in many unicellular organisms to approximately 6 introns per
kilobase (kb) of coding sequence in mammals (Figure 2). With respect
to intron content, eukaryotic genomes are often crudely classified
into intron-poor ones (most unicellular forms) and intron-rich ones
including animals, plants, some fungi, and a few unicellular
organisms such as Chlamydomonas or some diatoms (Figure 2)
[42,49-52]. Although this division is appealing in its simplicity
and may be convenient for the purpose of various comparative
analyses, examination of intron densities in 100 sequenced
eukaryotic genomes does not present an obvious bimodal distribution
(Figure 2). Actually, it appears that all intron densities between 0
and 6 introns per kilobase are observed in some eukaryote genomes.
However, when intron density is plotted against intron length,
partitioning of eukaryote genomes into two classes becomes apparent.
While up to a density of approximately 3 introns per kilobase, all
introns are short, with no significant correlation between the
density and length of introns, for more intron-rich genomes, a
strong positive correlation is observed (linear correlation
coefficient = 0.16, P = 0.003, Figure 2). Even among intron-rich
organisms, vertebrates are outstanding in having a substantial
fraction of extremely long introns (Figure 2). This strong
correlation notwithstanding, there are exceptions to the general
trend: intron-rich basidiomycete fungi (3-4 introns/kbp) have only
short introns whereas some insects show broad intron length
distributions with multiple long introns despite relatively low
intron density (2-3 introns/kbp) (Figure 2). We return to the
dependencies between intron density, intron length and structure of
splice signals later, in the discussion of the selection pressures
affecting the evolution of eukaryote gene architecture and the
underlying population-genetic factors.
As pointed out above, despite the existence of numerous, diverse
intron-poor genomes, eukaryotes do not lose the “last” intron or the
spliceosome although degradation of the spliceosome including loss
of many components does occur, e.g. in yeast. The only firmly
established exception is the tiny genome of a nucleomorph (an
extremely degraded intracellular symbiont of algae) that has lost
both all the introns and the spliceosome [7]; preliminary genomic
data indicate that all introns might have been lost also in a
microsporidium, a highly degraded intracellular parasite distantly
related to Fungi [54]. In general, it remains unclear whether there
are any selective factors or functional constraint underpinning this
surprising preservation of at least a few introns in eukaryote
genomes [55]. However, in many cases, the few introns that are
retained in highly reduced genomes are present in 5’-portions of
genes encoding ribosomal proteins [16,56]. The introns in these
genes are important for regulation of expression and ribosomal
biogenesis, and their deletion leads to significant fitness
reduction in yeast [57]. Thus, the extreme rarity of complete loss
of introns in eukaryotes at least in part is likely to be due to
deleterious effect of the loss of specific, functionally important
introns.
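The intron density used throughout this section (introns per
kilobase of coding sequence) is simple arithmetic, and a short
sketch may make the unit concrete. The Python below is illustrative
only: the input format (exons as half-open coordinate pairs covering
coding sequence only) and all function names are assumptions, not
the procedure used to produce Figure 2.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: each exon is a half-open (start, end) pair
# in genomic coordinates and is assumed to contain coding sequence only.
Exon = Tuple[int, int]


def gene_structure_summary(exons: List[Exon]) -> Dict[str, object]:
    """Return coding length and intron lengths for one gene."""
    ordered = sorted(exons)
    coding_bp = sum(end - start for start, end in ordered)
    # An intron is the gap between two consecutive exons.
    intron_lengths = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return {"coding_bp": coding_bp, "intron_lengths": intron_lengths}


def intron_density_per_kb(n_introns: int, coding_bp: int) -> float:
    """Introns per kilobase of coding sequence."""
    if coding_bp <= 0:
        raise ValueError("coding length must be positive")
    return 1000.0 * n_introns / coding_bp


def genome_intron_density(genes: List[List[Exon]]) -> float:
    """Pool all genes: total introns divided by total coding kilobases."""
    summaries = [gene_structure_summary(g) for g in genes]
    total_introns = sum(len(s["intron_lengths"]) for s in summaries)
    total_coding = sum(s["coding_bp"] for s in summaries)
    return intron_density_per_kb(total_introns, total_coding)
```

For a toy gene with exons `[(0, 300), (400, 700), (1700, 2100)]`,
the coding length is 1000 bp with two introns of 100 and 1000 bp,
giving a density of 2 introns per kilobase. Pooling across genes
rather than averaging per-gene densities is a design choice made
here for simplicity.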
Evolutionary conservation of intron positions and routes of gene
architecture evolution of eukaryotes
The realization that (nearly) all eukaryotes possess ‘genes in
pieces’ but that the intron densities and sizes vary widely
triggered intense, ongoing discussion of possible evolutionary
scenarios behind these patterns. Several mechanisms of intron
evolution have been suggested including intron loss, gain, and
sliding [44,58-61]. Intron loss and gain are the major phenomena in
the evolution of eukaryotic gene architecture. The relative
contributions of these two processes have been a matter of
considerable debate and controversy. Systematic comparative analyses
of exon-intron structures of orthologous genes from animals, fungi
and plants have shown that approximately 25% to 30% of the intron
positions are shared (that is, located in the exact same position in
orthologous genes) by at least two of these three lineages of
complex eukaryotes with intron-rich genomes [45,62]. The prevailing
interpretation of these fundamental observations is that most, if
not all, introns that occupy the same positions in orthologous genes
are conserved, i.e. were already present in the equivalent position
of the corresponding ancestral gene. However, the alternative view,
i.e., that a substantial fraction or even most of the shared introns
have been independently inserted in the same position in orthologous
genes in different lines of descent, cannot be automatically
dismissed (see discussion below).
The apparent conservation of many intron positions in distant
eukaryotes notwithstanding, intron densities in eukaryotic genomes
differ widely (see above), and the location of introns in
orthologous genes does not always coincide even in closely related
species [63-65]. Likely cases of intron insertion and the more
common intron loss have been described (e.g., [59,63,66-82]), and
indications of high intron turnover rate in some eukaryotic lineages
have been obtained [42]. Furthermore, parsimony and maximum
likelihood (ML) reconstructions of evolution of exon-intron
structure of highly conserved eukaryotic genes (see details below)
suggest that both loss and gain of introns have been extensive
during evolution of eukaryotic genes [45,83-88]. Together the
results of these analyses indicate that the rates of intron gain and
loss differ widely between eukaryotic lineages. Some eukaryotes,
such as yeast Saccharomyces cerevisiae, seem to have lost nearly all
of the ancestral introns, whereas others, e.g., nematodes, have
experienced highly dynamic evolution, with both loss and acquisition
of numerous introns [45,83-88]. However, intron gain is not easy to
detect: comparative analysis of intron positions in orthologous
genes from vertebrates revealed only a few losses but no apparent
gain of introns in mammalian genes [89,90], and similar results have
been obtained in an analysis of evolution of exon-intron structure
of paralogous genes in several eukaryotic lineages [91]. These
findings imply that intron loss dominates at short evolutionary
distances, whereas bursts of intron insertion might accompany major
evolutionary transitions. However, intron gain could be an ongoing
process in nematodes: Coghlan and Wolfe [66] identified 81 introns
in Caenorhabditis elegans and 41 introns in C. briggsae that appear
to have been recently inserted. However, the validity of these
recent intron gains has been questioned as it has been shown that
after including additional genomes in the analysis many of the
reported intron gains could be parsimoniously explained by intron
loss [92]. A high rate of recent intron gain has been reported for
paralogous gene pairs in Arabidopsis thaliana that were
duplicated simultaneously 20-60 million years ago via
tetraploidization [93]. A low rate of recent intron gains has been
identified in plastid-derived genes in plants [94]. Similarly,
spliceosomal introns have been detected in some genes horizontally
transferred from bacteria to bdelloid rotifers [95]. Probably, the
most striking known case of apparent recent intron gains has been
found in populations of Daphnia pulex endemic to Oregon where two
polymorphic introns have been identified [70]. These new introns do
not have an obvious source and are not represented in any studied D.
pulex populations outside Oregon, other species of Daphnia or any
other organism for which sequence data are available. Furthermore,
the new introns are both found in the same gene that encodes a Rab
GTPase (rab4), and are inserted within one base pair from each
other. These findings put into doubt the rarity of intron gain
considering that two intron gain events have been discovered in an
initial survey of only 6 nuclear loci in 36 Daphnia individuals
[70]. This result was further supported by the analysis of 24
discordant intron/exon boundaries between the whole-genome sequences
of two Daphnia pulex isolates. Sequencing of intron presence/absence
loci across a collection of D. pulex isolates and outgroup Daphnia
species has shown that most polymorphisms result from recent gains,
with parallel gains often occurring at the same location in
independent allelic lineages [96].
The great majority of studies aimed at reconstruction of evolution
of gene architecture in eukaryotes have focused on introns in
conserved portions of protein-coding regions. For example, the
conclusion that there was no appreciable intron gain in mammals [89]
is based on this type of data. However, evolution of poorly
conserved segments of protein-coding sequences, untranslated regions
of protein-coding genes, alternatively spliced regions, and genes
originated from transposable elements appears to be much faster and
more dynamic, with numerous intron gains in mammals [97-101]. A case
of such intron acquisition has been reported for the RNF113B
retrogene that encodes a RING finger protein (a predicted E3 subunit
of ubiquitin ligase of unknown specificity) and is present in the
genomes of all primates (Figure 3) [101]. This primate-specific gene
underwent rapid evolution that included an intron gain. The presence
of the intron is supported by several human mRNA sequences and
comparisons with multiple primate genomes (marmoset, macaque,
orangutan, and chimpanzee). Sequence alignment analysis shows that
the intron of RNF113B is not a de novo insertion but rather a
derivative of an exonic sequence (a tandem mutation AG > GT
generated the donor site). The new intron contains 59 nucleotides
from former coding sequence and 46 nucleotides from the 3’ UTR. This
finding was further supported by sequencing of the human RNF113B
cDNAs which revealed two alternative RNF113B isoforms (Figure 3)
[101]. In general, due to the lack of evolutionary conservation in
such genes and gene regions, reconstruction of intron gain and loss
events in their evolution is difficult and sometimes inaccurate
(especially without experimental verification). Accordingly,
evolutionary studies tend to concentrate on highly conserved genes.
Thus, the conclusions on intron stasis in some groups of eukaryotes,
such as mammals, in part appear to stem from sampling biases whereas
the overall intron turnover might be much more extensive than is
currently appreciated.
The same problem pertains to non-coding RNA genes. For example,
mammalian genomes contain numerous (>10,000) genes for long
non-coding RNAs (lncRNAs) that encompass numerous introns [102]. In
a recent detailed study, over 8,000 lncRNA genes have been
identified, with a mean intron density of ~1.9 per kilobase, and
extensive alternative splicing of these non-coding RNAs has been
detected, with ~2.3 RNA isoforms per gene [103]. One of the best
studied lncRNAs is Xist which is involved in X-chromosome
inactivation in females of eutherian mammals [104]. The Xist RNA
appears to have evolved as a result of pseudogenization of the Lnx3
protein-coding gene in early eutherians followed by integration of
mobile elements [105]. Analysis of Xist in several mammalian species
revealed an overall conservation of the Xist gene structure
(Figure 4). Four of the 10 Xist exons found in eutherians show
significant sequence similarity to exons of the Lnx3 gene (Figure 4)
whereas the remaining 6 Xist exons are similar to different
transposable elements. Thus, some Xist introns were inherited from
the Lnx3 gene but some appear to have been gained in the course of
evolution of the Xist gene [105]. Comparative analysis of >3,000
mouse lncRNA genes suggested that conservation of the exon/intron
structure might be a general lncRNA property [106]. It was found
that 65% and 40% of mouse lncRNA |GT-AG| splice sites are conserved
in human and rat, respectively. These numbers are significantly
greater than the number of conserved intronic GT and AG
dinucleotides that are not involved in splicing, indicating
evolutionary conservation of splice signals in lncRNAs [106].
Detailed reconstruction of the origin and evolution of introns in
lncRNAs awaits further comparative genomic studies.
The distributions of intron positions over the length of coding
regions differ substantially between eukaryotic taxa. In intron-poor
genes of single-cell eukaryotes, introns are strongly
over-represented in the 5’-portions whereas in intron-rich
multicellular organisms, the distribution is closer to uniformity
[64,65]. A mechanistic explanation for these patterns suggests that
introns are preferentially lost from the 3’-portion of a gene,
conceivably due to the over-representation of the respective
sequences among the cDNAs that are produced by
Figure 3 An example of a recent intron acquisition in a
retrotransposon-derived gene: structure of two splice variants of
RNF113B. The new intron of RNF113B is not a de novo insertion but
rather a derivative of exonic sequences (this intron contains 59
nucleotides from the former coding sequence and 46 nucleotides from
the 3’ UTR). A partial alignment of three RNF113B sequences and
three RNF113A sequences is shown above the spliced RNF113B isoform.
The donor splice site is marked in yellow, the predicted branch
point signal is marked in blue, and the acceptor splice site is
marked in gray. The data is from [101].
[Diagram not reproduced. Panels: RNF113B mRNA splice variant 1
(intron spliced) and splice variant 2 (intron retained); legend:
protein coding sequence, untranslated region, new intron; scale
marks at 500 bp and 1000 bp.]
reverse transcription and are thought to mediate intron loss via
homologous recombination [65,107-109]. However, a complementary,
selectionist interpretation of the observed distributions of
introns, to the effect that introns located in the 5’-portion of a
gene are more often involved in one or more intron-mediated
functions (see below), has been proposed as well [65]. Analysis of
distributions of intron positions over the length of the coding
region suggested that both loss and insertion of introns occurred
preferentially in the 3’-regions of genes, which suggested
reverse-transcription-mediated mechanisms for both processes [110].
This hypothesis appears to be compatible with the positive
association that has been shown to exist between the rates of intron
gain and loss in individual genes [111]. However, a more recent
probabilistic analysis of intron gain and loss indicates that the
mechanisms of loss and gain are most likely to be different, with
reverse transcription involved only in intron loss [112].
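The 5’ over-representation of introns discussed above can be
quantified in a very simple way. The sketch below is an illustration
under stated assumptions, not the analysis of the cited studies:
intron positions are given as offsets in the coding sequence, the
function names are hypothetical, and the null model (each intron
equally likely to fall in either half of the gene) is a deliberate
simplification.

```python
from math import comb
from typing import List


def relative_positions(intron_positions: List[int], cds_length: int) -> List[float]:
    """Scale intron positions to the 0..1 range along the coding sequence."""
    if cds_length <= 0:
        raise ValueError("coding sequence length must be positive")
    return [pos / cds_length for pos in intron_positions]


def five_prime_fraction(rel_positions: List[float]) -> float:
    """Fraction of introns located in the 5' half of the coding sequence."""
    if not rel_positions:
        raise ValueError("no intron positions given")
    return sum(1 for r in rel_positions if r < 0.5) / len(rel_positions)


def binomial_upper_tail(k: int, n: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 0.5): the chance of seeing at least
    k of n introns in the 5' half if both halves were equally likely."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
```

With introns at coding positions 50, 100 and 900 of a 1000-bp coding
sequence, two of three introns fall in the 5’ half; such a small
sample is of course uninformative, and real analyses pool many
genes.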
Figure 4 The Xist gene evolved from a protein-coding gene and a
set … from Lnx3; red boxes indicate exons originating from
transposable elements. The data is from [105].
[Diagram not reproduced. Rows: Lnx3, primates, rodents, dog, cow;
columns: exon number.]
Intron sliding (also called slippage or migration; hereinafter IS)
can be defined as relocation of intron/exon boundaries over short
distances (1-60 bases) in the course of evolution. Intron sliding
has been postulated by advocates of the introns-early hypothesis to
explain the surprising finding that the positions of apparently
orthologous introns sometimes varied among lineages [60]. However,
the introns-late camp maintained that IS, if it occurs at all, has
contributed little to the diversity of intron positions [44,59]. The
reality of IS had been debated for a long time. A Monte Carlo
statistical analysis of broadly sampled data on intron positions has
strongly suggested that one-base-pair IS, although a relatively rare
event occurring in
the emergence and fixation of IS during evolution [114]. Given the
near ubiquity of alternative splicing (AS) in many groups of animals
and possibly plants [48], Tarrio et al. proposed that AS could be an
intermediate stage in the evolution of IS. Under this scenario,
emergence of a new splicing signal adjacent to a pre-existing one
results in AS but, if and when the original splicing signal
subsequently deteriorates, the net result is IS [114]. The proposed
route of IS evolution via AS is likely to be common in poorly
conserved regions of protein-coding genes with frequent AS events
(e.g. 5’- and 3’-regions of many genes) but rare in conserved
portions of protein-coding genes. Comparative analysis of closely
located introns among 12 Drosophila genomes has suggested that IS is
a relatively frequent cause of novel intron positions in Drosophila
[115]. All things considered, there is currently no doubt that IS is
real and can yield new intron positions but the actual impact of IS
in the evolution of eukaryotic genes will be accurately determined
only when multiple sets of closely related genomes become available
and rigorous methods for statistical analysis are developed.
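The working definition of IS given above (relocation of an
intron/exon boundary by 1-60 bases) suggests a simple way to flag
candidate cases when comparing two orthologous genes. The following
Python sketch is only an illustration: it assumes that intron
positions have already been mapped onto a common alignment
coordinate system, the function name and the greedy nearest-neighbor
pairing are hypothetical choices, and a flagged pair could equally
reflect an independent loss and gain rather than sliding.

```python
from typing import Dict, List


def compare_intron_positions(pos_a: List[int], pos_b: List[int],
                             max_shift: int = 60) -> Dict[str, list]:
    """Split intron positions of two orthologs into shared positions,
    candidate slides (offset of 1..max_shift bases) and unmatched ones."""
    set_a, set_b = set(pos_a), set(pos_b)
    shared = sorted(set_a & set_b)
    rest_a = sorted(set_a - set_b)
    rest_b = sorted(set_b - set_a)

    slides = []
    used_b = set()
    for a in rest_a:
        # Unused positions in gene B that lie within the allowed shift.
        candidates = [b for b in rest_b
                      if b not in used_b and 1 <= abs(a - b) <= max_shift]
        if candidates:
            best = min(candidates, key=lambda b: abs(a - b))
            slides.append((a, best))
            used_b.add(best)

    slid_a = {a for a, _ in slides}
    return {
        "shared": shared,
        "candidate_slides": slides,
        "only_a": [a for a in rest_a if a not in slid_a],
        "only_b": [b for b in rest_b if b not in used_b],
    }
```

For positions `[100, 250, 400]` in one gene and `[100, 251, 900]` in
its ortholog, position 100 is shared, the pair (250, 251) is flagged
as a candidate one-base slide, and 400 and 900 remain
lineage-specific. Distinguishing genuine sliding from coincidence
would require the kind of statistical analysis the text refers to.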
Evolution of splicing signals, protosplice sites, and intron
phase distribution
As pointed out above, the densities of spliceosomal introns vary
dramatically among eukaryotes (Figure 2), and so does the strength
of splicing signals [18,45,51,116]. There is a striking
correspondence between low intron density and high information
content of donor splice signals across eukaryotic genomes [51].
Intron-poor genes (genomes) with strong donor sites are found in
several groups of eukaryotes (e.g. fungi) that also include
intron-rich genomes with weaker donor sites. Evolutionary
reconstruction suggests that ancestral donor signals had low
information content but that many lineages have independently
undergone concomitant major intron loss and donor signal
strengthening [51]. This evolutionary trend receives a
straightforward explanation within the framework of the
population-genetic concept of evolution of gene architecture (see
below).
However, the acceptor splice signal shows a different trend: it is
weak in most fungi, intermediate in plants and some unicellular
eukaryotes, and strongest in metazoans where it gradually
strengthens from nematodes to mammals [116]. This observation can be
interpreted in the light of the results of a large-scale analysis of
splicing signals in 61 eukaryotic species which revealed a
significant negative correlation between the strength of the branch
point signal and the strength of the acceptor splice site (Figure 5;
R = -0.52, P = 0.000025) [117]. Although the correlation between the
strength of the donor splice signal and the combined strength of the
branch point signal and the acceptor splice signal was not
significant (R = 0.19, P = 0.15), the positive sign of this
correlation still could reflect congruent evolution of splicing
signals. In general, a complex interplay exists between intron
density, intron size, the strength of splice signals and the
strength of splicing enhancers/silencers. For example, splice
signals in long and short introns in Drosophila show only minor
differences [118]. Several weak but statistically significant
correlations have been observed between vertebrate intron length,
splice site strength, and potential exonic splicing signals that
attest to a compensatory relationship between splice sites and
exonic splicing signals, depending on vertebrate intron length
[119].
It has been proposed that the functionally important (A/C)AG||G
exon sequences flanking introns are relics of recognition signals
for the insertion of introns that have been dubbed protosplice sites
[14,15]. Protosplice sites became an important staple of the
introns-late hypothesis of intron evolution because, if intron
insertion was limited to strictly defined protosplice sites,
parallel gain of introns would be likely and could account for the
large number of shared introns among orthologs from distant
eukaryotic lineages [41,63,83]. Support for the protosplice site
hypothesis has been harnessed from experiments demonstrating that
elimination of the regular splice sites in actin genes resulted in
activation of cryptic splice sites, most of which coincided with
exon junctions in orthologous genes from other species [120].
Nevertheless, it remained unclear whether the consensus nucleotides
flanking the splice junctions were remnants of the original
protosplice sites or evolved convergently after intron insertion.
The existence of protosplice sites was directly addressed by
examining the context of introns inserted within codons which encode
amino acids conserved in all eukaryotes and, accordingly, are not
subject to selection for splicing efficiency. According to the
parsimony principle, these codons (e.g., GGN for conserved glycines
or CCN for conserved prolines) can be inferred to have been present
already in the common ancestor of all extant eukaryotes, so the
ancient protosplice sites (if such existed) should have survived and
could be examined directly. This analysis has shown that introns,
indeed, predominantly insert into and/or are preferentially fixed in
specific (protosplice) sites with the consensus sequence
(A/C)AG||Gt [121].
Recently, correlation between positions of cryptic splicing signals
(sequences that are similar to splicing signals but normally do not
function in splicing) and introns has been found: cryptic splicing
signals within exons of one species frequently match the exact
position of introns in orthologous genes from another species.
This observation suggests that in the course
of evolutionmany introns were inserted into cryptic splicing
signalsthat had been in place prior to intron insertion
[122]. However, this conclusion is contradicted by another
observation, that recently gained introns in animal alpha-amylase
genes were not associated with specific sequence motifs
(protosplice sites) [123]. In the same gene family, old introns
were embedded within strong protosplice motifs which were found to
be much weaker in homologous genes lacking the intron in the given
position [123]. These findings are consistent with the hypothesis
that sites of de novo intron insertion are effectively random and
that selection drives the emergence of protosplice-like sequences
following intron insertion. The presence of much stronger
protosplice sites around old introns compared to young introns
[123] seems to suggest that evolution of protosplice sites
subsequent to intron insertion is a slow process [123,124].

Figure 5 Correlation between the strength of the branch point
signal and the strength of the acceptor splice site. [Scatter plot;
x-axis: branch point signal (ticks 4.00 to 8.00); y-axis: acceptor
signal (ticks 4.00 to 10.00).] The linear correlation coefficient
is R = -0.52 (P = 0.000025) after exclusion of the obvious outlier
Aureococcus anophagefferens [117]. The information content of
splicing signals in 61 eukaryotic species is from [117]. Species
names: B. taurus, C. familiaris, E. caballus, H. sapiens, M.
domestica, M. musculus, O. anatinus, R. norvegicus, S. scrofa, B.
floridae, C. intestinalis, C. savignyi, D. rerio, G. gallus, O.
latipes, P. marinus, T. guttata, X. tropicalis, A. gambiae, A.
mellifera, C. elegans, D. pulex, D. melanogaster, H.
magnipapillata, L. gigantea, M. brevicollis, N. vectensis, S.
purpuratus, T. castaneum, B. dendrobatidis, C. heterostrophus, C.
neoformans, M. grisea, N. haematococca, P. chrysosporium, P.
blakesleeanus, P. infestans, P. placenta, S. cerevisiae, S.
commune, T. virens, A. anophagefferens, D. discoideum, D.
purpureum, N. gruberi, O. lucimarinus, P. tricornutum, T.
pseudonana, T. adhaerens, A. thaliana, Chlorella NC64A, C.
reinhardtii, M. pusilla, Micromonas RCC299, O. sativa, P. patens,
P. trichocarpa, S. moellendorffii, S. bicolor, V. vinifera, V.
carteri.

The hypothesis that
selection acts on the exonic parts of splice signals was supported
by comparison of the nucleotide sequences around the splice
junctions that flank old (shared by two or more major lineages of
eukaryotes) and new (lineage-specific) introns in eukaryotic genes.
The distributions of information content between the intronic and
exonic parts of the splice signals have been found to be
substantially different in old and new introns [125]. Old introns
have lower information content in the exonic regions adjacent to
the splice sites than new introns but, conversely, have higher
information content in the intron itself. These findings imply that
introns insert into protosplice sites but, during the evolution of
an intron after insertion, the splice signal shifts from the
flanking exonic regions to the ends of the intron itself.
Accumulation of information inside the intron during evolution is
best compatible with the view that new introns, largely, emerge de
novo and not via propagation and migration of pre-existing introns
[125].

The contradictory findings on the involvement of protosplice sites
versus the evolution of these motifs after intron gain (in which
case ‘protosplice site’ becomes a misnomer) might reflect genuine
differences in the evolution of gene architecture among gene
families, in particular between highly conserved and more dynamic
families. The definitive assessment of the validity of the
protosplice site hypothesis requires further, comprehensive
comparative genomic studies.

Introns occur in three phases (0, 1, and 2), which are defined by
the position of the intron within or between codons: introns of
phase 0, 1, and 2 are located, respectively, between two codons,
after the first position in a codon, and after the second position.
In (nearly) all analyzed genomes, there is a significant excess of
phase 0 introns over those in the other two phases [125-130]. The
only known remarkable exception is the rapidly evolving tunicate
Oikopleura that shows a uniform distribution of introns among the
three phases [131].
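
The definition of intron phase amounts to simple modular
arithmetic on the number of coding nucleotides that precede the
intron. The following is a minimal sketch of that arithmetic, not
a procedure taken from the cited studies; the function names and
the exon lengths are hypothetical and serve only as an
illustration.

```python
from itertools import accumulate


def intron_phase(coding_bases_before: int) -> int:
    """Return the phase (0, 1 or 2) of an intron.

    coding_bases_before counts the coding nucleotides, starting at the
    first base of the start codon, that lie upstream of the intron.
    """
    if coding_bases_before < 0:
        raise ValueError("number of coding bases must be non-negative")
    # 0: between two codons; 1: after the first codon position;
    # 2: after the second codon position.
    return coding_bases_before % 3


def intron_phases(coding_exon_lengths):
    """Phases of the introns that separate consecutive coding exons."""
    # Cumulative coding length at the end of every exon except the last.
    boundaries = list(accumulate(coding_exon_lengths))[:-1]
    return [intron_phase(b) for b in boundaries]


# Hypothetical gene with four coding exons (lengths in nucleotides);
# the three introns fall in phases 0, 1 and 2, respectively.
example_phases = intron_phases([120, 85, 46, 60])
```

Only coding nucleotides are counted here, so untranslated regions
upstream of the start codon have to be excluded before the exon
lengths are passed in.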
An excess of protosplice sites in phase 0 was noticeable in some
species (Figure 6) [132]; however, the protosplice site model
cannot fully explain the observed over-representation of phase 0
introns under the assumption that introns randomly insert into
protosplice sites (Figure 6) [125,127,128]. Furthermore, it has
been shown that phase 0 introns were, on average, located in more
highly conserved portions of genes than phase 1 and 2 introns [45].
This observation suggests that phase 1 and phase 2 introns
experience a greater deleterious-mutation-driven loss and could
reconcile the observed phase distribution with the protosplice site
hypothesis [125,127,128,130].
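
The phase distribution of potential protosplice sites in a coding
sequence can be tabulated with a short scan for the (A/C)AG||G
consensus. The sketch below is an illustration only: it assumes an
in-frame coding sequence that starts at the first base of the
start codon, the function name and the test sequence are
hypothetical, and the procedures actually used in the cited
analyses may differ.

```python
import re

# (A/C)AG||G: the putative insertion point lies between "AG" and "G".
# The lookahead keeps the trailing G out of the match, so match.end()
# is the number of coding bases upstream of the insertion point.
PROTOSPLICE = re.compile(r"[AC]AG(?=G)")


def protosplice_phase_counts(cds: str) -> dict:
    """Count potential protosplice sites in each phase of a coding sequence."""
    counts = {0: 0, 1: 0, 2: 0}
    for match in PROTOSPLICE.finditer(cds.upper()):
        counts[match.end() % 3] += 1
    return counts


def phase_fractions(counts: dict) -> dict:
    """Convert per-phase counts into fractions (all zeros if no sites)."""
    total = sum(counts.values())
    return {phase: (n / total if total else 0.0) for phase, n in counts.items()}


# Hypothetical coding sequence: ATG CAG GTT AAG GCC GAC AGG TAA
example_counts = protosplice_phase_counts("ATGCAGGTTAAGGCCGACAGGTAA")
```

Comparing such per-phase fractions of potential sites with the
per-phase fractions of actual introns is the kind of contrast that
Figure 6 displays.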
Conservation versus parallel gains of introns
As discussed above, comparative analyses revealed numerous introns
that occupy the same position in orthologous genes from distant
species [45,62]. In particular, orthologous genes from humans and
the green plant A. thaliana share ~25% intron positions [45]. The
straightforward interpretation of these observations is that the
shared introns were inherited from the common ancestor of the
respective species whereas lineage-specific introns were inserted
into genes at later stages of evolution [45,62]. Under this
premise, parsimonious reconstructions indicate that even early
eukaryotes already had a relatively high intron density, perhaps
comparable (at least within an order of magnitude) to that in
modern plant and animal genes. However, the inference that shared
intron positions reflect evolutionary conservation is challenged by
the potential non-randomness of intron insertion: introns are
inserted or fixed mostly in distinct protosplice sites as discussed
in the preceding section. In principle, if there were few
protosplice sites in each gene, the presence of introns in the same
positions of orthologous genes in distantly related species could
be completely or at least to a large extent explained by parallel
gains. At least two cases of apparent parallel gain of introns in
orthologous genes from plants and animals have been reported
[133,134]. Moreover, probabilistic modeling of intron evolution
discussed above suggested that many, if not most, introns shared by
phylogenetically distant species were likely to originate by
parallel gain of introns in protosplice sites [83]. The implication
is that intron distribution in extant organisms is largely
determined by relatively recent insertions and cannot be used to
infer exon-intron structure of ancestral genes. However, the
dataset of 10 gene superfamilies by Qiu and co-workers [83]
contained numerous ancient duplications combined with frequent
lineage-specific losses of genes, because of which analysis of
intron conservation and intron gains is likely to be confounded by
problems of phylogenetic reconstructions.

Figure 6 Fractions of protosplice sites and actual introns in the
three phases. [Bar chart; x-axis: Phase 0, Phase 1, Phase 2;
y-axis: fraction, ticks 0 to 0.7; series: At_introns,
At_protosplice sites, Hs_introns, Hs_protosplice sites.] Species
abbreviations: (At) green plant Arabidopsis thaliana, (Hs) human
Homo sapiens. An excess of protosplice sites in phase 0 is
noticeable; however, the ‘protosplice site’ hypothesis, which
posits that introns are randomly inserted into protosplice sites,
is unable to fully explain the observed over-representation of
phase 0 introns. The data is from [125,132].

The extent of independent insertion of introns in the same sites
(parallel gain) in orthologous genes from phylogenetically distant
eukaryotes was assessed within the framework of the protosplice
site model [132]. It was shown that protosplice sites are no more
conserved during evolution of eukaryotic gene sequences than random
sites. Simulation of intron insertion into protosplice sites with
the observed protosplice site frequencies and intron densities has
shown that parallel gain could account for only 5 to 10% of shared
intron positions in distantly related species [132]. The results of
this simulation suggest that the presence of numerous introns in
the same positions in orthologous genes from diverse eukaryotes,
such as animals, fungi, and plants, reflects mostly bona fide
evolutionary conservation [132].

Analysis of intron gain and loss rates over branches of the
phylogenetic tree for 19 eukaryotic species allowed for the
estimation of the probability of parallel gain for each intron
position that is shared by more than one species [111]. The
resulting estimates indicated that parallel gain, on average,
accounts for only ~8% of the shared intron positions, in agreement
with the simulation results discussed above [111]. However, the
distribution of parallel gains over the phylogenetic tree of
eukaryotes is highly non-uniform. There are almost no parallel
gains in closely related lineages, whereas for distant lineages,
such as animals and plants, parallel gains could contribute up to
20% of the shared intron positions. Taken together, the results of
these analyses indicate that, although parallel gain of introns is
non-negligible, the substantial majority of introns that occupy the
same positions in orthologous genes from distantly related
eukaryotes are ancestral, including many inherited from LECA
[111].
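
The logic of the simulation-based estimate of parallel gain can be
illustrated with a toy Monte Carlo model. This is a deliberately
simplified sketch and not the published procedure: it assumes that
two lineages insert introns independently and uniformly into the
same set of protosplice sites of one gene, and all names and
numbers (200 sites, 20 and 30 introns) are hypothetical.

```python
import random


def simulate_shared_positions(n_sites, introns_a, introns_b,
                              trials=2000, seed=0):
    """Mean number of coinciding intron positions in two lineages.

    Each lineage independently places its introns, without
    replacement, into the same n_sites candidate (protosplice) sites.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sites = range(n_sites)
    total_shared = 0
    for _ in range(trials):
        a = set(rng.sample(sites, introns_a))
        b = set(rng.sample(sites, introns_b))
        total_shared += len(a & b)
    return total_shared / trials


def expected_shared_positions(n_sites, introns_a, introns_b):
    """Analytical expectation (mean of the hypergeometric distribution)."""
    return introns_a * introns_b / n_sites


# Hypothetical gene: 200 protosplice sites, 20 and 30 introns inserted.
simulated = simulate_shared_positions(200, 20, 30)
expected = expected_shared_positions(200, 20, 30)
```

Dividing the number of coincidences expected by chance by the
number of shared positions actually observed gives a rough upper
bound on the contribution of parallel gain; fewer candidate sites
per gene make chance coincidences more frequent.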
Reconstruction of evolution of exon-intron structure of eukaryote
genes
The patterns of conservation and variation of intron positions in
orthologous and paralogous genes can be employed to reconstruct
evolutionary scenarios for the exon-intron structure of eukaryotic
genes using evolutionary parsimony or maximum likelihood
approaches. With the sequencing of multiple eukaryotic genomes,
such genome-wide evolutionary reconstruction has become realistic.
The comparative data on intron positions can be naturally
represented as a matrix of intron presence/absence (usually encoded
as 1/0), and to these matrices, various reconstruction methods can
be applied. The first of such studies employed orthologous gene
sets from 8 eukaryotic species and took the most straightforward
approach to evolutionary reconstruction by applying the parsimony
principle in the specific form of Dollo parsimony [45]. Given a
species tree topology and a pattern of intron presence/absence, the
Dollo algorithm constructs the most parsimonious (simplest)
scenario for the evolution of gene structure, i.e., the
distribution of inferred intron gain and loss events over the tree
branches. The main underlying assumption is that a character
(intron in a given position) once lost cannot be regained whereas
as many parallel intron losses in different branches of the tree
are allowed as needed to account for the observed pattern of
states. By analyzing more than 7,000 intron positions in highly
conserved genes of eukaryotes, the Dollo parsimony approach
produced an evolutionary scenario under which the common ancestor
of modern eukaryotes possessed intron-rich genes, with intron
density only a few fold lower than that in the most intron-rich
modern forms (vertebrates and plants). Massive intron losses were
inferred for several groups, especially yeasts, nematodes and
insects, whereas in vertebrates and plants intron gain was inferred
to be the main evolutionary trend [45].

The parsimony approach is obviously oversimplified
given that all lineage-specific introns are automatically treated
as newly gained, whereas some of these introns could be ancestral,
having been lost in multiple lines of descent. Furthermore,
parsimony does not provide confidence estimates for the inferred
ancestral states. These limitations of parsimony potentially could
grossly distort the results of evolutionary reconstruction,
especially if the number of analyzed species is small.
Probabilistic approaches such as maximum likelihood models can
address these problems, at least in principle. However, the first
two statistical studies into intron evolution produced opposite
results: Qiu et al. concluded that ancient introns (if they ever
existed) have not survived in extant genes [83] whereas Roy and
Gilbert came to the conclusion that the great majority of introns,
even lineage-specific ones, were ancient [84]. The first conclusion
implies that intron gain was dominant over intron loss in the
evolution of eukaryotic genes, whereas the second one suggests that
intron loss is the principal evolutionary process. This major
discrepancy between the results of the two studies has indicated
that optimal parameters for maximum likelihood modeling of intron
evolution remained to be determined [135].

The next generation of increasingly sophisticated ML
reconstructions of intron gain and loss during eukaryotic evolution
suggested that the protein-coding genes of ancient eukaryotic
ancestors, including the Last Eukaryotic Common Ancestor (LECA),
already possessed intron density comparable to that found in
modern, moderately intron-rich genomes [85-88,136]. Accordingly,
the history of eukaryotic genes, with respect to the dynamics of
introns, appears to be to a large extent dominated by losses,
perhaps punctuated by a few episodes of major gain [87,88,91,137].
This conclusion is based on analyses of considerably larger data
sets than those used in earlier studies; for example, Carmel and
co-workers [87] analyzed 391 sets of orthologous genes from 19
eukaryotic species. This extended data set not only allowed for a
more definitive reconstruction of the gene structure evolution, but
also permitted zooming in on specific portions of the eukaryotic
tree [87]. A comprehensive probabilistic model of intron evolution
was developed that incorporated heterogeneity of intron gain and
intron loss rates with respect to both lineages and genes as well
as variability among sites within a gene [87]. It was demonstrated
that ancestral eukaryotic forms were intron-rich and that evolution
of eukaryotic genes involved numerous gains and losses of introns,
with losses being somewhat more common. Three distinct modalities
of intron gain and loss during eukaryotic evolution were
identified. The ‘balanced’ mode appears
to operate in all eukaryotic lineages, and is characterized by
proportional intron gain and loss rates [87]. In addition to this
apparently universal process, many lineages exhibit an elevated
loss rate, whereas a few others exhibit an elevated gain rate.
Moreover, the rates of intron gain and loss are highly non-uniform
over the time course of the evolution of eukaryotes: both rates
seem to have been decreasing with time over the last several
hundred million years. The decrease in gains was faster than the
decrease in losses, resulting in many lineages with very limited
intron gain over the last several hundred million years [87].

Figure 7 illustrates the latest reconstruction of intron gain and
loss for 99 eukaryotic species that was performed using a Markov
Chain Monte Carlo (MCMC) approach [53]. In this study, so far the
most extensive, the Malin software package [138] was used to infer
ancestral states from a matrix of shared introns which comprised
8403 intron presence-absence profiles from 245 sets of orthologous
genes. The MCMC method infers ancestral intron density by using a
probabilistic intron gain-loss model, taking into account rate
heterogeneity across lineages and across sites within genes. This
reconstruction provides a thorough view of the evolution of gene
structure in three eukaryotic supergroups and reveals several
general trends (Figure 7) [53]. Most lineages show net intron loss
that can be substantial, as in alveolates, some lineages of fungi,
green algae or insects, or offset by concomitant intron gains, as
in land plants, most animal lineages, and some fungi. Massive
intron gains were inferred only for several deep branches, most
conspicuously the stem of the Metazoa, and to a lesser extent, the
stems of Mamiellales (a branch of green algae), Viridiplantae,
Opisthokonta, and Metazoa together with Choanoflagellata (Figure
7). These findings vindicate, on a much larger data set and with
greater confidence, the previous conclusions that, compared to the
common and substantial intron loss, extensive intron gain was rare
during the evolution of eukaryotes. Episodes of substantial intron
gain seem to coincide with the emergence of major new groups of
organisms with novel biological characteristics such as the Metazoa
(Figure 7) [53]. Several previous studies, performed on much
smaller data sets and with less robust reconstruction methods, have
suggested that at least some eukaryotic ancestral forms could have
possessed intron-rich genes [84,85,136], and observations on gene
structures in primitive animals such as the sea anemone
Nematostella [139] and the annelid Platynereis [140] were
compatible with these inferences. A particularly striking
conclusion has been reached in the reconstruction of the evolution
of gene architecture in Chromalveolata: although all sequenced
genomes in this supergroup of eukaryotes are intron-poor,
intron-rich last common ancestors have been inferred for
Chromalveolata and particularly Alveolata [141]. Clearly, the
reconstruction led to this conclusion only because, although very
few intron positions are conserved among the intron-poor
orthologous genes of different chromalveolates, many of these genes
share a large fraction of intron positions with intron-rich
orthologs from plants and/or animals.

The latest MCMC reconstruction reinforced these
conclusions by inferring high intron densities for the ancestors of
each major group of eukaryotes within each of the three supergroups
(Figure 7) [53]. The implication is that, whenever an extant
eukaryotic genome shows a low intron density, this intron-poor
state is a result of extensive, lineage-specific intron loss.
Remarkably, so many intron positions are shared between eukaryotes
that, with the large and apparently representative set of compared
genomes, Dollo parsimony reconstruction infers similarly
intron-rich ancestral genomes as the MCMC and maximum likelihood
methods [53]. The results of this reconstruction indicate in
particular that the entire line of descent from LECA to mammals was
a continuous intron-rich state (Figure 7) that would provide for
uninterrupted evolution of the growing repertoire of functional
alternatively spliced forms (see below). The unprecedented intron
gain at the onset of animal evolution (Figure 7) could further
contribute to the expansion of alternative forms. This spurt of
intron gain conceivably resulted from a combination of a population
bottleneck that led to weak purifying selection with increased
transposon activity (see below).
Evolution of exon-intron structure in paralogous gene families
The reconstructions of the evolution of gene architecture in
eukaryotes were performed on sets of orthologous genes with a
single representative (or a single most conserved representative)
in each of the analyzed genomes. Obviously, this type of
reconstruction reflects only one facet of evolution of gene
structure given that all eukaryotic genomes encompass numerous
families of paralogous genes with broad distributions of the number
of members. Reconstruction of parsimonious scenarios of gene
structure evolution in paralogous gene families in animals, plants
and malaria parasites revealed numerous apparent gains and losses
of introns [91,142]. In all analyzed lineages, the number of
acquired new introns was substantially greater than the number of
lost ancestral introns. This trend held even for lineages in which
vertical evolution of genes involved many more intron losses than
gains, suggesting that gene duplication boosts intron insertion.
However, dating gene duplications and the associated intron gains
and losses based on the molecular clock assumption showed that very
few, if any, introns were gained during the last approximately 100
million years of animal and plant evolution, in agreement with
previous conclusions reached through
analysis of orthologous gene sets. These results are generally
compatible with the emerging notion of intensive insertion and loss
of introns during transitional epochs in contrast to the relative
quiet (stasis) of the intervening evolutionary spans [91,137,143].
The prevalence of intron gain over intron loss in evolving families
of paralogs remains a somewhat controversial issue. It has been
suggested that the Dollo parsimony approach used by Babenko and
co-workers [91] could significantly underestimate the rate of
intron losses [144]. However, even should that be the case, the
independently estimated number of intron gains in the same data set
that was used in the original work of Babenko and co-workers [91]
still exceeded the number of intron losses [144]. Furthermore,
numerous anecdotal observations (e.g., [93,145-147]) have suggested
that at least some paralogous gene families have gained more
introns than they have lost.

Figure 7 Reconstruction of intron gains and losses in the evolution
of eukaryotes and intron density in ancestral eukaryote forms.
[Phylogenetic tree; the terminal-taxon labels and per-taxon density
values are not recoverable from the extracted text. Labelled
ancestral intron densities: LECA 4.3, Bilateria 7.7, Opisthokonts
5.1, Metazoa 8.8, Apicomplexa 4.2, Embryophyta 6.4, Dikarya 4.2.
Labelled groups: Heterokonts, Alveolates, Chromalveolata, Fungi,
Ascomycetes, Basidiomycetes, Filozoa, Spiralia, Nematodes, Insects,
Vertebrates, Amoebozoa, Archaeplastida, Green algae, Land plants.]
The data is from [53]. Branch widths are proportional to intron
density, which is shown next to terminal taxa and some deep
ancestors, in units of intron count per 1 kbp of coding sequence.
Human (Hsap) is marked by a blue dot. Horizontal bars show
ancestral (top) and current (bottom) intron content; gain and loss
(in the lineage from the respective ancestor) are shown by red and
green, respectively. The bars are aligned so that the pale yellow
part shows the retained introns from the ancestor. Species names
and abbreviations: Aureococcus anophagefferens (Aano), Aedes
aegypti (Aaeg), Agaricus bisporus (Abis), Anopheles gambiae (Agam),
Allomyces macrogynus (Amac), Apis mellifera (Amel), Aspergillus
nidulans (Anid), Acyrthosiphon pisum (Apis), Arabidopsis thaliana
(Atha), Babesia bovis (Bbov), Batrachochytrium dendrobatidis
(Bden), Branchiostoma floridae (Bflo), Botryotinia fuckeliana
(Bfuc), Brugia malayi (Bmal), Bombyx mori (Bmor), Coccomyxa sp.
C-169 (C169), Chlorella sp. NC64a (C64a), Caenorhabditis briggsae
(Cbri), Caenorhabditis elegans (Cele), Coprinopsis cinerea okayama
(Ccin), Cochliobolus heterostrophus C5 (Chet), Coccidioides immitis
(Cimm), Ciona intestinalis (Cint), Cryptococcus neoformans var.
neoformans (Cneo), Chlamydomonas reinhardtii (Crei), Capitella
teleta (Ctel), Capsaspora owczarzaki (Cowc), Dictyostelium
discoideum (Ddis), Dictyostelium purpureum (Dpur), Drosophila
melanogaster (Dmel), Drosophila mojavensis (Dmoj), Daphnia pulex
(Dpul), Danio rerio (Drer), Entamoeba dispar (Edis), Entamoeba
histolytica (Ehis), Emiliania huxleyi (Ehux), Fragilariopsis
cylindrus (Fcyl), Phanerochaete chrysosporium (Fchr), Phaeodactylum
tricornutum (Ftri), Gallus gallus (Ggal), Gibberella zeae (Gzea),
Hydra magnipapillata (Hmag), Helobdella robusta (Hrob), Homo
sapiens (Hsap), Ixodes scapularis (Isca), Laccaria bicolor (Lbic),
Lottia gigantea (Lgig), Micromonas sp. RCC299 (M299), Monosiga
brevicollis (Mbre), Mucor circinelloides (Mcir), Mycosphaerella
fijiensis (Mfij), Mycosphaerella graminicola (Mgra), Magnaporthe
grisea (Mgri), Melampsora laricis-populina (Mlar), Micromonas
pusilla (Mpus), Neurospora crassa (Ncra), Nematostella vectensis
(Nvec), Nasonia vitripennis (Nvit), Ostreococcus sp. RCC809 (O809),
Ostreococcus lucimarinus (Oluc), Oryza sativa japonica (Osat),
Ostreococcus tauri (Otau), Phytophthora capsici (Pcap), Plasmodium
falciparum (Pfal), Puccinia graminis (Pgra), Pediculus humanus
(Phum), Phaeosphaeria nodorum (Pnod), Physcomitrella patens subsp.
patens (Ppat), Phytophthora ramorum (Pram), Pyrenophora
tritici-repentis (Prep), Proterospongia sp. (Prsp), Phytophthora
sojae (Psoj), Paramecium tetraurelia (Ptet), Plasmodium vivax
(Pviv), Plasmodium yoelii yoelii (Pyoe), Rhizopus oryzae (Rory),
Sorghum bicolor (Sbic), Saccharomyces cerevisiae (Scer),
Schizosaccharomyces japonicus (Sjap), Schistosoma mansoni (Sman),
Selaginella moellendorffii (Smoe), Schizosaccharomyces pombe
(Spom), Spizellomyces punctatus (Spun), Strongylocentrotus
purpuratus (Spur), Sporobolomyces roseus (Sros), Sclerotinia
sclerotiorum (Sscl), Trichoplax adhaerens (Tadh), Theileria
annulata (Tann), Tribolium castaneum (Tcas), Toxoplasma gondii
(Tgon), Taeniopygia guttata (Tgut), Theileria parva (Tpar),
Thalassiosira pseudonana (Tpse), Tetrahymena thermophila (Tthe),
Ustilago maydis (Umay), Uncinocarpus reesii (Uree), Volvox carteri
(Vcar), Vitis vinifera (Vvin).

In contrast, comparison of the exon–intron structures
of ancient eukaryotic paralogs reveals the absence of
conserved intron positions in these genes (Figure 8) [148]. This is
in contrast to the conservation of intron positions in orthologous
genes from even the most evolutionarily distant eukaryotes and in
more recent paralogs (Figure 8) [91]. The lack of conserved intron
positions in ancient eukaryotic paralogs probably reflects the
origin of these genes during the earliest phase of eukaryotic
evolution that was characterized by concomitant invasion of genes
by group II self-splicing elements (which were to become
spliceosomal introns subsequently; see below) (Figure 9) and
extensive duplication of genes [148,149]. Similar estimates were
obtained for parallel intron gains in ‘pseudo-paralogous’ genes
encoding cytosolic and mitochondrial ribosomal proteins that by
definition have acquired their introns independently: approximately
2.3% of the intron positions were found in homologous positions
[150]. The lack of conserved introns in ancient eukaryotic paralogs
[148,150] is consistent with the results of an earlier analysis of
intron distribution in the 20 most ancient (duplicated before the
divergence of bacteria and archaea) paralogous families which
appear to have accumulated introns independently [151]. Along with
other lines of evidence, these observations do not seem to be
compatible with the introns-early hypothesis.
Figure 8 Conservation of intron positions in ancient and recent
eukaryotic paralogs. [Bar chart; recoverable labels: Recent,
Ancient; 3631 total, 2282 shared, 126 shared, 7584 total.]
Conservation of introns was assessed by analysis of multiple
alignments of paralogous sequences from 6 species (H. sapiens, C.
elegans, D. melanogaster, S. pombe, S. cerevisiae, A. thaliana). An
intron position was considered to be conserved if it was shared by
any pair of paralogs [148].

Figure 9 A hypothetical scenario of early history of spliceosomal
introns. [Schematic; labels: α-proteobacterial ancestor of
mitochondria; Archaea-like ancestor of eukaryotes; Group II
introns/retroelements; Group II introns/retroelements – restricted
spread; Emerging proto-eukaryotic cell; α-proteobacterial
endosymbiont; Massive invasion of Group II introns into host
genome; Origin of spliceosomal introns/genes in pieces.] The scheme
shows the inferred sequence of events from putative ancestors of
eukaryotes to the origin of spliceosomal introns from group II
introns invading the host genome upon mitochondrial endosymbiosis
[46].

Evolution of exon-intron structure in connection with other
features of eukaryote genes
The combined advances of comparative genomics and systems biology
provide means to characterize genes by many features, for example
expression level and connectivity in protein-protein interaction or
regulatory networks. Various significant correlations have been
demonstrated to exist between these variables; in particular, one
of the most prominent, recurrent links is that the sequence of
highly expressed genes tends, on average, to be more conserved
[152-154]. Connections between various features of introns and
other characteristics of genes also have emerged. Here, we discuss
the links between two key features of introns, the rates of gain
and loss and intron length, and other aspects of gene evolution,
expression and function.

Probabilistic evolutionary reconstruction of gene structure yields
gene-specific rates of intron gain and loss and thus provides for
analysis of possible relationships between these rates and other
characteristics of the respective genes [87]. It has been shown
that intron gain rate was negatively and significantly correlated
with the sequence evolutionary rate; conversely, intron loss rate
was positively and significantly correlated with the rate of
sequence evolution. Thus, perhaps somewhat counter-intuitively,
highly conserved genes appear to accumulate introns in the course
of evolution, even if slowly. Also significant, although of a
lesser magnitude, were the positive correlation between gene
expression level and intron gain rate and the converse negative
correlation between expression level and intron loss rate. This
finding suggests that introns might contribute to optimal gene
expression [155] although this effect is confounded by the stronger
connection between expression and evolution rate.

Although expression may be enhanced by the mere presence of
multiple introns in a gene, highly expressed genes in humans and
Drosophila have, on average, shorter introns than genes expressed
at a lower level [156]. This finding has been subsequently
validated and expanded by several independent research groups on
other model eukaryotes, for exon lengths as well, and for a variety
of methods used to measure expression level [157-165]. Two
competing (although not necessarily mutually exclusive) hypotheses
have been proposed to explain the apparent
link between gene compactness and expression. The selec-tion
hypothesis holds that evolution of highly expressedgenes is driven
by selection for minimization of the timeof transcription and/or
energy expenditure resulting inshrinking of these genes, especially
introns [156]. Thealternative view, known as the genomic design
hypothesis,holds that genes that are expressed under tight tissue-
anddevelopmental stage-dependent control require complexregulation
and therefore need long introns to accommo-date additional
regulatory elements. Under the genomicdesign view, due to the
positive association between thebreadth and rate of gene
expression, genes that are consti-tutively expressed at a high
level and do not require com-plex regulation possess shorter
introns [160].Surprisingly, however, the opposite trend has
been
reported to exist in plants, with highly expressed genes containing
longer introns [166]. This discrepancy was resolved by examining the
relationship between gene length and expression level at a finer
resolution: the relationship between intron length (as well as other
measures of gene compactness such as the length of exons or entire
genes) and expression level is universal across all eukaryotes (for
which a sufficient amount of data on expression was available) but is
non-monotonic [167]. Genes that are highly expressed indeed tend to
have shorter introns but genes expressed at low to medium levels
show a positive correlation between intron length and expression;
hence a roughly bell-shaped dependency between expression level and
intron length (Figure 10) [167]. The phenomena that underlie this
non-monotonic dependency are not quite clear but might involve
competition between two opposing trends. Selective pressure to
maximize expression rate and minimize energy expenditure could be
dominant in highly expressed genes as originally suggested [156]. In
contrast, the requirement for more complex regulation in moderately
expressed genes that gain additional functions with increased
expression might result in the positive correlation between intron
length and expression [167].
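The non-monotonic dependency described above can be checked on any
table of per-gene expression levels and total intron lengths. The
following is a minimal sketch, not the procedure of [167]: it assumes
equal-size rank bins (the actual binning scheme is not specified
here), and the function names and the toy data are hypothetical. It
reports the mean intron length and the standard error of the mean per
expression category, as plotted in Figure 10.

```python
from math import sqrt
from statistics import mean, stdev


def bin_by_expression(genes, n_bins=30):
    """Summarize total intron length per expression category.

    genes: iterable of (expression_level, total_intron_length) pairs.
    Returns one (mean_length, standard_error) tuple per non-empty bin,
    ordered from the lowest to the highest expression category.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    n = len(ranked)
    summary = []
    for b in range(n_bins):
        # Equal-size rank bins: an assumption made for this sketch.
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        lengths = [length for _, length in chunk]
        if not lengths:
            continue
        sem = stdev(lengths) / sqrt(len(lengths)) if len(lengths) > 1 else 0.0
        summary.append((mean(lengths), sem))
    return summary


def peak_category(summary):
    """Index of the category with the largest mean intron length."""
    means = [m for m, _ in summary]
    return means.index(max(means))


# Synthetic toy data (not from the paper): intron length peaks at
# intermediate expression, mimicking a bell-shaped dependency.
toy_genes = [(x, 2000 - (x - 25) ** 2) for x in range(60)]
toy_summary = bin_by_expression(toy_genes, n_bins=6)
print(peak_category(toy_summary))  # prints 2
```

A peak in an interior category, rather than in the first or last one,
is what distinguishes a bell-shaped dependency from a monotonic trend.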
A population-genetic perspective on evolution of introns and eukaryotic gene architecture
The question famously posed by Walter Gilbert in the seminal note on
the origin of splicing [1] - Why Genes in Pieces? - certainly remains
pertinent to this day. To further sharpen the question: Why are some
genomes, in particular those of multicellular eukaryotes (plants and
animals), intron-rich whereas others, i.e. those of the great
majority of unicellular eukaryotes, are intron-poor? In principle,
accumulation of introns in genes of multicellular organisms could be
considered as an adaptation that ensures evolution of organismal
complexity, especially via AS. This is indeed the position taken by
the proponents of the genomic design hypothesis discussed in the
preceding section. However, a simpler explanation that appears to be
better compatible with the data has been proposed by Lynch as part of
the non-adaptive theory for the evolution of complexity
[42,49,50,168,169]. A simple estimate based on the number of
nucleotide sites required for accurate intron excision during
splicing (that is, the donor and acceptor sites and the branching
point motif) shows that the power of purifying selection is
sufficient to eliminate the majority of introns only in populations
with a large effective population size (Ne) such as found in many
unicellular eukaryotes (Ne ~ 10^7 - 10^8) [50,170] but not in the
relatively small populations of vascular plants and vertebrates (Ne ~
10^5-10^6 and 10^4-10^5, respectively) [50,170,171]. Numerical
simulations based on this estimate reveal a phase transition-like
shift from intron-rich to intron-poor genomes [50,168,169] which
roughly matches the observed
distribution of intron densities (see Figure 2).

Figure 10 Total intron length as a function of expression level
category. Intron length is measured in nucleotides. Expression levels
are binned into 30 categories, with higher categories matching higher
expression levels, as described previously [167]. Each point is the
mean value for all genes in the given expression category, and the
error bar indicates the standard deviation of the mean.
[Axes: expression level category (x); intron length (y)]

This non-adaptive population genetic perspective on the evolution of
introns and eukaryotic gene architecture is compatible with the
results of empirical reconstruction according to which the general
(perhaps counter-intuitive) trend is evolution of intron-poor genomes
in multiple lineages from intron-rich ancestors (see Figure 2). This
evolutionary trend appears to be a form of ‘genomic streamlining’
occurring in evolutionarily successful lineages that reach high
effective population sizes which prevent effects of genetic drift
and eliminate even slightly deleterious features such as introns.
Conversely, the apparent bursts of intron gain linked to the origin
of major groups of eukaryotes such as the Metazoa would coincide with
population bottlenecks which are typical of such transitional epochs
[42,49,50,172]. The non-adaptive population genetic concept is also
compatible with the finding that intron-rich organisms possess much
weaker donor splice signals than intron-poor organisms: the pressure
of purifying selection in intron-rich lineages is insufficient to
strictly maintain the consensus nucleotides at the donor sites [51].
A more direct analysis that compared the rates of
consensus-to-variant and variant-to-consensus substitutions in the
donor sites of three intron-rich lineages supported the existence of
purifying selection against variants that, however, is too weak to
maintain the consensus in most of the introns [52].

A major
consequence of the inability of purifying selection in small
populations to eliminate introns or to maintain strong donor splice
signals is the accumulation of aberrant splice variants. Such
error-prone splicing could eventually give rise to functional
alternative splicing. Notably, the latest scenario of intron gain and
loss in widespread eukaryotic genes includes only intron-rich
intermediates on the path of evolution from the LECA to mammals (see
above; Figure 7), with the implication that this line of descent
never went through a stage of strong purifying selection, allowing
continuous evolution of alternative splice variants [53].

Although the non-adaptive population genetic
theory appears to be the best available conceptual framework for the
evolution of eukaryotic gene architecture, splicing and introns, at
least two notable problems remain outstanding. First, it is unclear
why the acceptor splice signal does not follow the same trend as the
donor site and is stronger in intron-rich multicellular eukaryotes
than it is in intron-poor unicellular forms although the observed
positive correlation between the strength of the donor splicing
signal and the combined strength of the branch point signal +
acceptor splice signal [117] might explain this incongruence. Second,
the preservation of at least a few introns even in the most
intron-poor organisms remains enigmatic because at face value the
non-adaptive scenario would predict complete loss of introns and
accordingly the spliceosome in multiple lineages.
Evolution of alternative splicing in coding and non-coding regions of eukaryote genes
In multicellular organisms, particularly animals, AS is a major
mechanism for regulating gene expression and function [173-176].
Large-scale studies based on mapping of expressed sequence data on
genomic sequences and RNAseq surveys have shown that more than 90% of
human and over 40% of Arabidopsis thaliana and rice genes are capable
of producing multiple diverse mRNA molecules through alternative
splicing [177-183].

Alternative splicing has been identified in many
eukaryotic groups; however, it remains unclear whether frequent
alternative splicing emerged early in eukaryotic evolution [176,184]
because ancestral splice signals were weak and failed to provide for
highly accurate splicing, or has evolved more recently and
independently in multiple lineages via mutation of strong ancestral
splice signals in multi-intron genes [33]. As pointed out in the
previous section, the non-adaptive population genetic model that is
in excellent agreement with the empirical reconstructions of
eukaryote gene architecture evolution implies that AS evolved
already at the earliest stages of eukaryote evolution through
accumulation of aberrant splice variants under conditions of weak
purifying selection. A further implication of this scenario is that
initially all alternative transcripts were non-functional whereas
functional AS evolved gradually and independently in multiple
lineages, primarily those that have never gone through population
bottlenecks leading to extensive loss of introns and tightening of
splice signals.

The impact of alternative splicing on protein function
has been studied in great detail and is generally recognized as a
major source of protein diversity that greatly expands the
repertoire of protein function [173-175]. A systematic comparison of
9 animal genomes from nematodes to mammals revealed that
intron-flanking domains expanded faster than other protein domains
[185]. Intriguingly, such mobile domains exhibited a strong
preference for phase 1 introns [185-188] in contrast to the general
excess of phase 0 among introns (Figure 6). This finding suggests
that evolution of introns flanking mobile domains is fundamentally
different from the evolution of introns in conserved portions of
genes but the nature of these differences remains to be elucidated
[185,187,188].

Evolutionary conservation of alternative splicing is a
controversial matter. Only limited conservation of alternatively
spliced (cassette) exons within mammals and within dipterans has
been detected [189-193]. However, a strikingly different pattern has
been reported for Caenorhabditis nematodes: more than 92% of cassette
exons found in C. elegans are conserved in C. briggsae and/or C.
remanei [194]. The differences in conservation between lineages might
reflect differences in the fractions of functional alternative
transcripts but possibly also differences in intron length and the
strength of splicing signals [194].

Evolution of alternative splicing has been also analyzed
in the context of splicing signals [195]. The GT dinucleotide in the
first two intron positions is the most conserved element of the U2
donor splice signal. However, in a small fraction of donor signals (
nested within introns. In addition, the possibility has been
discussed that introns might act as ‘catalysts’ of evolution by
facilitating intergenic recombination (this could be considered a
variation on the theme of generic non-coding DNA functions).
Experimentally demonstrated and potential functions of introns have
been reviewed in detail [214,215]. Here we do not attempt a
comprehensive coverage of this subject but rather briefly discuss
several aspects that appear directly relevant for understanding
evolution of introns and eukaryote gene structure.
Functions of introns associated with splicing
Splicing occurs before mature mRNAs are transported from the nucleus
to the cytosol by the nuclear export system. Numerous studies
indicate that splicing and mRNA export are directly coupled (see
reviews [32,35]). Evidence of such coupling initially came from the
observation that mRNAs generated by splicing are more efficiently
exported than their identical counterparts transcribed from a
complementary DNA [216]. This effect of splicing on export was
explained by the finding that spliced mRNAs (but not cDNA
transcripts) are assembled into a distinct mRNP complex that promotes
efficient export [32,35,216]. This complex, or at least some of its
components, has been subsequently shown to assemble adjacent to newly
formed exon–exon junctions [217]. The increased export efficiency of
the spliced mRNP is thought to be due to recruitment of the mRNA
export factor ALY to the mRNA during the splicing reaction [218,219].
The splicing factor UAP56, which interacts directly with ALY, plays a
role in recruiting ALY to the spliced mRNA [220-222]. In a subsequent
step, a hand-off occurs in which the ALY/TAP interaction is
established, thus delivering the mRNP to the nuclear pore for export
[221]. The numerous eukaryotes that
possess only a few introns in the entire genome nevertheless retain
a full-fledged or partially degraded spliceosome machinery
[8,65,223], suggesting the possibility that the spliceosome might
have functions other than splicing as such, perhaps primarily
nucleocytoplasmic trafficking. However, the transport mechanisms for
numerous intronless transcripts are not well characterized, and the
possibility remains that intronless RNAs are recruited to the export
machinery via a spliceosome-independent route [32,35]. Compatible
with this hypothesis, UAP56 is required for export of both spliced
and intronless mRNAs [220-222,224]. In metazoan intronless mRNAs,
specific mRNA sequence elements are required for export, and some of
these elements associate with members of the SR family of splicing
factors which are thought to mediate export of the intronless mRNA
[225,226]. The SR proteins could either recruit the conserved export
machinery or play a direct role in export [226]. In both yeast and
metazoans the export of intronless mRNAs also could be coupled to
polyadenylation [32,35,226,227]. It has been shown that in mammalian
neurons some retained introns are coupled with targeting of mRNA
sequences to dendrites, apparently via so-called ID sequences that
represent a distinct class of SINE retrotransposons resident in the
retained introns [228]. Thus, functionally relevant retention of
intronic sequence might be a more general phenomenon than previously
suspected.

The speed of splicing could be another important
mechanism of gene expression regulation [27,28]. Analysis of minor,
U12 introns (see above) suggested that their positions are conserved
in orthologous genes from human and Arabidopsis to an even greater
extent than the positions of the major, U2 introns [29]. The U12
introns, especially conserved ones, are concentrated in 5'-portions
of plant and animal genes, whereas the U12 to U2 conversion occurs
preferentially in the 3'-portions of genes. These results are
compatible with the hypothesis that the high level of conservation
of U12 intron positions and their persistence in genomes, despite
the unidirectional U12 to U2 conversion, have to do with the role of
the slowly excised U12 introns in down-regulation of gene expression
[27-29,229].

As already pointed out above, introns in yeast ribosomal protein
genes substantially affect the expression of these genes and
contribute to the organismal fitness and stress response via
mechanisms that are not yet well understood [57]. These seminal
findings indicate that in many cases the regulatory functions of
introns could be specific to a class of genes or even an individual
gene. This conclusion is compatible with the results of an earlier
study which has shown that the yeast spliceosome can distinguish
between different transcripts including related ones, such as
paralogous ribosomal protein genes, thus providing a distinct
regulation mode for expression of specific proteins [230].
Introns as functionally important non-coding DNA sequences
Compared to prokaryotes, eukaryotes possess a much greater number of
multidomain proteins that substantially contribute to the functional
complexity of the eukaryotic cell [187,188,231-234]. Moreover, a
striking feature of eukaryotic protein architectures is the wide
spread of the so-called promiscuous domains that combine with other
domains much more often than expected by chance [234,235]. The ‘exon
theory’ posits that exon shuffling via recombination within introns
is an important route of evolution that in particular is responsible
for the diversity of the domain architectures of multidomain
proteins [39,40,236]. In the specific case of vertebrate membrane
receptor proteins, this hypothesis
seems to be compatible with empirical observations: these proteins
consist of multiple modules each of which typically is encoded by an
individual exon [185,187,188]. However, in other classes of proteins,
there is no strong preference for intron location between domains,
so exon shuffling is unlikely to be a major, general mechanism of
multidomain protein evolution [43,135,185,187,188,234].

Introns have the potential to serve as “enhancers” of meiotic
crossing-over occurring within protein-coding genes because the
probability of crossing over between segments of a coding sequence
(exons) separated by long introns greatly increases compared to the
same coding sequences in the absence of an intron [214,237]. This
meiotic recombination between exons of two alleles of the same gene
is likely to be a major factor of protein evolution through
combining mutations from different alleles, “trying out” different
combinations and avoiding accumulation of deleterious mutations
within the same allele [1,214,237].

Trans-splicing is a special form of RNA
processing whereby exons from two different primary RNA transcripts
are joined end-to-end and ligated. The most common form of
trans-splicing is spliced-leader (SL) trans-splicing where the
leader is donated by a short SL RNA. The SL trans-splicing is
widespread among some unicellular eukaryotes, in particular
trypanosomes [238]. Other than trypanosomes, the only organisms known
to heavily rely on SL trans-splicing for gene expression are
nematodes. More than half of the pre-mRNAs in the Caenorhabditis
nematodes are trans-spliced to one of two short leader RNAs, SL1 or
SL2. This process occurs at the 5' ends of pre-mRNAs, and it is
essential for the efficient processing of polycistronic pre-mRNAs
[35,239-242]. The patchy distribution of trans-splicing suggests that
SL trans-splicing has evolved repeatedly among eukaryotic lineages
and SL precursor RNAs have readily evolved from ubiquitous small
nuclear RNAs that are involved in conventional splicing [243].
Several cases of trans-splicing between different pre-mRNAs (no SL
RNAs are involved) have been identified in tunicates, mammals, flies
and plants (reviewed in [214,242,244,245]).
Functional elements and genes within introns
Some introns contain various regulatory elements as well as
sequences involved in chromatin structure formation such as
scaffold-matrix attachment regions, although it remains uncertain
whether intron sequences show any substantial enrichment for
regulatory and structural elements compared to other non-coding DNA
[214,246]. Some long introns, especially those in 5'-terminal regions
of coding sequences, might be enriched for various regulatory
elements, and consequently, could be subject to purifying selection
[160,247-253]. Long introns in several genes of Oikopleura have been
shown to contain key developmental regulators [131], and similar
observations have been reported for genes involved in development of
diverse metazoans [254-257] as well as associated “bystander” genes
that are not known to be directly involved in development [258-261].

Many introns contain within their sequences various
non-coding RNA genes, especially numerous genes for snoRNAs
[262,263] and precursors of microRNAs [264,265]. Specifically, some
short animal introns with hairpin formation potential, known as
mirtrons, can be spliced and debranched into pre-miRNAs [266-268].
These pre-miRNAs are then cleaved by the RNase III enzyme Dicer and
incorporated into typical miRNA silencing complexes [268,269].

A small fraction of introns contain nested protein-coding genes
[270]. Comparative analysis of these nested genes in vertebrates,
fruit flies and nematodes revealed substantially higher rates of
gain of intron-embedded genes compared to loss [271]. However, the
accumulation of nested gene structures is likely to represent an
increase of organizational complexity of animal genomes via a
neutral process given that there seem to be no functional links
between the intron-contained genes and the ‘host’ genes [271].
Effectively, it seems that introns serve as a neutral substrate that
can be randomly colonized by various genes.
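The notion of a nested gene used above can be made concrete with a
small sketch. It is not the method of the cited analyses: the Gene
class, the nested_pairs function and the coordinates are
hypothetical, all genes are assumed to lie on the same chromosome,
and strand is ignored. A gene is counted as nested when its entire
span falls inside a single intron of another ('host') gene.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


@dataclass(frozen=True)
class Gene:
    name: str
    exons: Tuple[Interval, ...]  # sorted, non-overlapping exons

    @property
    def start(self) -> int:
        return self.exons[0][0]

    @property
    def end(self) -> int:
        return self.exons[-1][1]

    def introns(self) -> List[Interval]:
        """Gaps between consecutive exons."""
        return [
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(self.exons, self.exons[1:])
        ]


def nested_pairs(genes: List[Gene]) -> List[Tuple[str, str]]:
    """Return (host, nested) name pairs where the nested gene lies
    entirely inside one intron of the host gene."""
    pairs = []
    for host in genes:
        for intron_start, intron_end in host.introns():
            for guest in genes:
                if guest is host:
                    continue
                if intron_start <= guest.start and guest.end <= intron_end:
                    pairs.append((host.name, guest.name))
    return pairs


# Hypothetical coordinates for illustration only.
host = Gene("host", ((100, 200), (1000, 1100)))
guest = Gene("guest", ((400, 600),))
overlap = Gene("overlap", ((150, 500),))  # overlaps a host exon: not nested
print(nested_pairs([host, guest, overlap]))  # prints [('host', 'guest')]
```

Counting such pairs in related genomes and comparing the lists is one
simple way to tally gains and losses of intron-embedded genes.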
Molecular mechanisms of intron loss and gain
Mechanisms of intron loss and gain remain poorly understood. A
plausible, common mechanism for intron loss could be homologous
recombination between cDNAs that are produced by reverse
transcription and the genomic copies of the respective genes
[65,67,107-110,112]. Intron gain/loss events must be associated with
a transient phase of segregating alleles either carrying or lacking
the intron within natural populations [49]. Until now, only 25
transient intraspecific intron presence-absence polymorphisms have
been reported, one in Drosophila teissieri [272] and 24 in Daphnia
pulex [70,96]. In Daphnia, recently gained intron sequences were
frequently associated with short repeats, suggesting a role for
double-strand break repair in intron gain [96]. Analysis of several
closely-related fungi revealed 74 presence-absence polymorphisms of
introns [273]. Examination of the positions of these introns has
suggested that extensive intron trans