ELUCIDATING the MECHANISMS of TRANSPOSABLE ELEMENTS using EXPERIMENTAL and
BIOINFORMATIC APPROACHES: the hAT SUPERFAMILY of TRANSPOSABLE ELEMENTS in the
GENOME of AEDES AEGYPTI and TE DISPLAYER
by
Rebecca Rooke
A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Cell and Systems Biology University of Toronto
Master of Science
Cell and Systems Biology University of Toronto
2011
Abstract
Transposable elements (TEs) are found in nearly all eukaryotic genomes and are a
major driving force of genome evolution. The hAT superfamily of TEs is found in a
variety of organisms, including plants, fungi, insects and other animals. To date, only 14 hAT
TEs in the Aedes aegypti genome have been annotated as having a hAT transposase
coding sequence. In this study, extensive bioinformatic approaches have been
employed to find hAT TEs that encode transposases in the A. aegypti genome. A total
of six newly identified TEs belonging to the hAT superfamily were discovered in the A.
aegypti genome. Furthermore, a computer program called TE Displayer was developed
to analyze TEs in genome sequences. TE Displayer detects TE-derived polymorphisms
in genome datasets and presents the results on a virtual gel image. TE Displayer
enables researchers to compare TE profiles in silico and provides a reference profile for
experimental analyses.
Acknowledgments
First and foremost, I would like to thank my supervisor, Dr. Guojun Yang, for introducing
me to and guiding me through the exciting world of transposable elements. Your
constant enthusiasm about your research was nothing short of contagious. I appreciate
all the time and effort you gave me throughout these past two years to help me become
a better biologist.
I would also like to thank the members of my committee, Dr. George Espie and
Dr. Marla Sokolowski, for their valuable guidance and suggestions.
I could not have successfully completed my MSc without the academic, mental,
and emotional support of Amy Wong and Matt Janicki. You are both phenomenal people
who were always there to encourage and motivate me, laugh and joke with me, and you
provided me with a necessary fun and whacky world outside of the lab.
Lastly, I would like to thank my family for their support, motivation, and
encouragement. Thank you, Angela, for editing my thesis. You are my role model and
inspiration, not only in the world of academia, but in life as well. Thank you Mom and
Dad, for allowing me to choose my own path and for supporting me with every step I
took.
Funding: Natural Sciences and Engineering Research Council (RGPIN371565 to G.Y.); Canada Foundation for Innovation (24456 to G.Y.); Ontario Research Fund; University of Toronto.
Table of Contents
Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Appendices
Publications
Glossary
Chapter 1 Introduction to Transposable Elements
1 Transposable Elements (TEs)
1.1 TE Classification
1.2 Miniature Inverted Repeat Transposable Elements (MITEs)
1.3 Recently and Currently Active MITEs
1.4 Elucidating how MITEs Achieve High Copy Numbers
1.5 Significance of TEs
Chapter 2 Elucidating the Transposase Sources for the Transposition of hAT MITEs
2 Introduction to hAT TEs
3.1 Determining and Cloning hAT MITEs
3.2 Finding TEs using a Top-Down Approach
3.3 Determining Candidate Transposases for the Transposition of hAT MITEs
3.3.1 Retrieving All Putative hAT Transposases
3.3.2 Identifying Recently Active Putative Transposases
3.3.3 Linking hAT MITEs with Putative Transposases
3.3.4 Identifying Coding Sequences of Putative hAT Transposases
3.3.5 Phylogenetic and Conserved Domain Analysis of Known and Putative hAT Transposases
3.4 Synthesizing and Cloning of Transposases
4.1.1 Finding MITE Members Belonging to the hAT Superfamily of TEs
4.1.2 Finding TEs Encoding Putative hAT Transposases
4.1.3 Analysis of hATTPases and their copies in the A. aegypti genome
4.1.4 The Buster and Ac families of the hAT Superfamily
4.1.5 Conserved Domains in Known and Putative Transposase Sequences in the A. aegypti genome
4.1.6 Linking MITEs to Putative hAT Transposases
4.1.7 Finding TEs using a Top-Down Approach
Publications
Rooke, R. & G. Yang (2010) TE displayer for post genomic analysis of transposable elements. Bioinformatics, 27(2): 286-287.
My contributions to this publication include: troubleshooting glitches in the
software; making the computer program more aesthetically pleasing and easier to use;
inserting user-controlled options into the software, such as changing background and
font color; testing the program with numerous different databases; inspecting all output
to ensure the software is generating expected results. Furthermore, I wrote and
submitted the publication (with editing from Dr. Guojun Yang) and generated all figures
for the manuscript. Compared to the publication, the thesis contains an expanded
introduction.
Janicki, M., Rooke, R. & G. Yang. In press. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Research. DOI 10.1007/s10577-011-9230-7
My contributions to this publication include thoroughly editing the manuscript prior
to submission. Following submission, the first author and I were responsible for
addressing the reviewer’s comments and suggestions and editing the manuscript
accordingly.
Glossary
hAT: named after the hobo, Activator, and Tam3 transposable elements
MAK: MITE Analysis Kit
MITE: Miniature inverted-repeat transposable element
TD: Transposon display
TE: Transposable element
TIR: Terminal inverted repeat
TSD: Target site duplication
Chapter 1 Introduction to Transposable Elements
1 Transposable Elements (TEs)
Barbara McClintock first described transposable elements (TEs) in the maize (Zea mays)
genome in the 1940s (1, 2). Since their discovery, TEs have been found in nearly every
eukaryotic and prokaryotic organism studied to date, with only a few exceptions
(Plasmodium falciparum and Bacillus subtilis) (3, 4). TEs are so abundant in some
genomes that they can comprise over 85% of the DNA (5). Furthermore, TEs are
estimated to have increased the size of the maize genome two- to five-fold (5, 6), where a single
class of TEs comprises approximately 50% of the total genome (7). Although the effect
of TEs on genome structure and function is continually being investigated, it is well
accepted that TEs shape the size and structure of genomes and are significant players
in genome evolution (8). Therefore, understanding TEs—their transposition activity,
structure, and replication—is essential to elucidating how genomes have evolved both
structurally and functionally.
1.1 TE Classification
TEs can be divided into two major classes: class I (or retrotransposable elements) and
class II (or DNA transposable elements). The two classes of TEs differ with respect to
their mode of transposition. Class I TEs transpose via an RNA intermediate using a
mechanism commonly referred to as “copy-and-paste”. In comparison, class II TEs
transpose using a “cut-and-paste” mechanism with only DNA intermediates (9, 10).
Due to their different modes of transposition, class I elements are commonly found in
high copy numbers in their host genomes, whereas class II elements are often found in
low copy numbers (11).
Class I TEs contribute to the major repetitive portions of large genomes (12-14).
For example, a single family of class I TEs comprises nearly 35% of the human genome
(15). The transposition mechanism of class I elements begins with the synthesis of RNA
transcripts using the genomic TE copy as a template. The RNA transcripts are
subsequently reverse transcribed into DNA by a TE-encoded reverse transcriptase and
inserted into the genome at a different location (Figure 1). As a result of this “copy-and-paste”
transposition mechanism, each transposition event produces one additional copy
of the TE (10).
Figure 1: Graphical representation of the transposition of Class I TEs. The TE is transcribed into RNA and then reverse-transcribed into cDNA. The cDNA is inserted into the genome at a different location than the original element.
Class I elements are divided into five orders, based on their insertion mechanism
and overall organization and enzymology: LTR (long terminal repeats), DIRS
(Dictyostelium intermediate repeat sequence), PLE (Penelope-like elements), LINE
(long interspersed nuclear element), and SINE (short interspersed nuclear element).
These orders are further divided into superfamilies based on the sizes of their target site
duplications (TSDs)—a short direct repeat sequence generated upon TE insertion—and
their protein coding domains (10).
Class II TEs are found in most eukaryotes and are the major class of TEs in
prokaryotes. Most TEs belonging to this class have terminal inverted repeats (TIRs) that
range in size from 11 base pairs to several hundred base pairs (11). Many class II TEs
encode a transposase enzyme that recognizes and binds to the TIRs, excises the
original TE from its existing location, and inserts it elsewhere in the genome (Figure 2). It
is estimated that sequences derived from class II TEs constitute at least 1% of the
human genome (16).
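As a rough illustration of how a TIR can be recognized computationally, the following Python sketch compares the 5' end of an element with the reverse complement of its 3' end. The function names and the toy sequence are hypothetical, and only perfect matches are counted, whereas real TIRs often tolerate mismatches.

```python
# Illustrative sketch only: names and the toy sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_length(element: str, max_len: int = 500) -> int:
    """Length of the longest perfect terminal inverted repeat.

    The 5' end is compared base by base with the reverse complement
    of the 3' end; counting stops at the first mismatch.
    """
    seq = element.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    n = 0
    while n < limit and seq[n] == rc[n]:
        n += 1
    return n


# Toy element: an 11 bp TIR flanking an unrelated internal region.
left = "CAGGGTTCAAT"
toy = left + "AGATCCAGTC" + reverse_complement(left)
print(tir_length(toy))  # 11
```

A practical tool would also allow a small number of mismatches and would search for the flanking TSDs, which this sketch does not attempt.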
Due to the nonreplicative transposition mechanism of class II TEs, an increase in
copy number is achieved by utilizing the host machinery. In one instance, a class II TE
can be duplicated if a transposition event occurs during DNA replication. In this case, if
the class II TE transposes from a replicated chromatid to an unreplicated site, the
element will have duplicated itself in the genome. In another instance, a class II TE can
be duplicated by gap repair through homologous recombination if the TE is present on
the homologous chromosome or a sister chromatid. This results in the restoration of the
TE at its original site (17).
Class II elements can be divided into two subclasses based on the number of
DNA strands that are cut during transposition. Subclass I elements cut both DNA
strands, while elements belonging to subclass II only cut one of the DNA strands.
Subclass I elements are further divided into two orders: TIR and Crypton. Elements
belonging to the TIR order are characterized by their TIRs which vary in length. This
order is separated into nine superfamilies based on the size of their TSDs and the
sequence of their TIRs: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-
Harbinger, and CACTA. The Crypton order only contains one superfamily of the same
name which contains elements that lack TIRs but generate TSDs upon insertion (10).
Subclass II elements are also divided into two orders: Helitron and
Maverick/Polintron (10, 18). Both orders contain a single superfamily of the same
name. Elements in the superfamily Helitron are proposed to replicate via a rolling-circle
mechanism and do not generate TSDs (10). In contrast, elements in the superfamily
Maverick/Polintron bear long TIRs and generate TSDs that are 6 bp in length (17-19).
Figure 2: Graphical representation of the transposition of Class II TEs. The TE is excised from its location and re-inserted elsewhere in the genome.
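The classification described in this section can be summarized as a small lookup structure. The sketch below is only a convenience for the reader: the variable and function names are hypothetical, and superfamilies are listed only for the orders whose superfamilies are named in the text.

```python
# Orders mapped to (class, subclass) as described in Section 1.1.
# Class I has no subclasses in this scheme, hence None.
ORDERS = {
    "LTR": ("Class I", None),
    "DIRS": ("Class I", None),
    "PLE": ("Class I", None),
    "LINE": ("Class I", None),
    "SINE": ("Class I", None),
    "TIR": ("Class II", "Subclass I"),
    "Crypton": ("Class II", "Subclass I"),
    "Helitron": ("Class II", "Subclass II"),
    "Maverick/Polintron": ("Class II", "Subclass II"),
}

# Superfamilies named in the text, keyed by order.
SUPERFAMILIES = {
    "TIR": ["Tc1-Mariner", "hAT", "Mutator", "Merlin", "Transib",
            "P", "PiggyBac", "PIF-Harbinger", "CACTA"],
    "Crypton": ["Crypton"],
    "Helitron": ["Helitron"],
    "Maverick/Polintron": ["Maverick/Polintron"],
}


def lineage(superfamily: str) -> tuple:
    """Return (class, subclass, order) for a superfamily named in the text."""
    for order, members in SUPERFAMILIES.items():
        if superfamily in members:
            te_class, subclass = ORDERS[order]
            return (te_class, subclass, order)
    raise KeyError(f"superfamily not listed: {superfamily}")


print(lineage("hAT"))  # ('Class II', 'Subclass I', 'TIR')
```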
1.2 Miniature Inverted Repeat Transposable Elements (MITEs)
Both class I and class II TEs contain autonomous and nonautonomous elements.
Autonomous elements are elements that encode the enzyme(s) necessary for their
transposition, while nonautonomous elements do not. Despite their differences,
autonomous and nonautonomous elements within the same superfamily may have
strong sequence similarity and often contain the same crucial characteristics required
for transposition (i.e., TIRs) (10). Some nonautonomous elements, such as the Ds
element, are generated by point mutations or deletions from the autonomous element,
rendering their transposase gene inactive, but maintaining enough sequence similarity
to be recognized by the transposase produced by the autonomous element(s) (20).
Therefore, nonautonomous elements rely on transposases from autonomous TEs for
their transposition.
Miniature inverted repeat transposable elements (MITEs) are a type of
nonautonomous element that has TIRs and generates TSDs upon insertion. The first
MITE was discovered in maize while analyzing insertions in the waxy gene (21). The
MITE did not share sequence similarity with any known TE at the time and was present
in over 10 000 copies in the maize genome (22). MITEs are typically short (usually <500
bps in length), often located in or near genes (23-25) and are often found in high copy
numbers in the genomes in which they reside, despite lacking a transposase coding
sequence. Unlike other nonautonomous elements, the majority of MITEs are not
deletion derivatives of autonomous elements (26, 27). Two hypotheses exist to explain
the origin of MITEs: (a) a MITE arises from the fortuitous placement of TIR-like
sequences or solo TIRs that are recognized by an autonomous TE (28, 29) or (b) MITEs
are relics of past TEs whose autonomous elements have been degraded in the genome
or have not reached fixation within the population (27).
To date, MITEs have been found in organisms spanning all five kingdoms. They
are found in a diverse range of species, including Arabidopsis thaliana (30), Xenopus
laevis (31), Caenorhabditis elegans (32), Aedes aegypti (33), teleost fish (34), archaea
species (35) and humans (16, 36). In some species, MITEs make up a significant
portion of the genome. For example, rice (Oryza sativa) has a genome composed of
approximately 4% MITEs and MITE-derived sequences (37) and MITEs constitute 1-2%
of the C. elegans genome (38). Furthermore, approximately 16% of the yellow fever
mosquito’s (Aedes aegypti) total genome is composed of MITEs, the highest genome
percentage known so far (39).
1.3 Recently and Currently Active MITEs
In 2003, the first active MITE, named mPing, was identified in natural rice plants (40),
tissue culture (24), and plants derived from anther calli (23). It was later discovered that
mPing is active in plants derived from seeds treated with hydrostatic pressure (41) and
in recombinant inbred lines (42). In transgenic Arabidopsis plants and introgressed rice
plants, the transposases from Ping and Pong were demonstrated to mobilize mPing (42,
43). Although mPing is a deletion derivative of Ping, Pong encodes similar proteins to
Ping and is able to transpose mPing elements via cross-mobilization (24).
Since the discovery of mPing’s transposition activity, other active MITEs have
been identified. The MITEs dTstu1 and dTstu1-2 were shown to be active in potato
when a somaclonal variant, called Java kids purple (JKP), was generated from leaf
protoplast of the potato cultivar 72218. It was shown that dTstu1 excised from the
flavonoid 3’,5’-hydroxylase gene, thereby restoring the gene’s function and producing a
differently coloured tuber. Further investigation revealed that a dTstu1-like MITE,
dTstu1-2, was present in an allele in JKP, but was absent in every allele of the locus in
72218, indicating a new insertion event (44).
Similarly, the Arachis hypogaea MITE (AhMITE1) in the VL1 peanut mutant also
showed activity following stressful conditions to its host. When VL1 peanut mutants
were subjected to mutagenesis, the resulting plants differed phenotypically from VL1
mutants, in that they became resistant to late leaf spot (LLS) and susceptible to rust.
Molecular analysis showed that the phenotypic changes were due to the excision of
AhMITE1 from a pre-determined site. MITEs can be activated by mutagenesis (45) and
tissue culture stresses (23) and AhMITE1 follows this pattern in VL1 peanut mutant
plants.
Another known active MITE family, called mimp, was characterized in the
genome of the ascomycete fungus Fusarium oxysporum (46). The two subclasses of
mimp, referred to as mimp1 and mimp2, have 27 bp TIRs that share sequence similarity
to the autonomous element impala. Furthermore, both mimp and impala generate a
“TA” TSD upon insertion (47). Phenotypic assays that were performed to test the
functional link between impala and mimp showed that impala is responsible for mimp1
excision in different strains of F. oxysporum. Although the origin of mimp1 is still
unknown, it is speculated to either be a deletion derivative of impala or to have been
formed de novo (47).
Tc7 is a 921 bp MITE found in the genome of C. elegans (32). The terminal 38
bps of Tc7 have high sequence similarity to the terminal 38 bps of the autonomous
element Tc1. Like mimp and impala, Tc1 and Tc7 have the same TSDs (“TA”). Using
Southern blotting, it was determined that Tc7 actively transposes in the germline of
mutator strains. Further analyses revealed that Tc1 is responsible for the transposition
of Tc7 and that Tc7 is not a deletion derivative of any known Tc1 element in C. elegans.
It was determined that Tc1 and Tc7 have similar transposition efficiencies and it is still
unclear why Tc7 copy numbers have not increased in mutator lines when Tc1 copy
numbers have (48).
In addition to MITEs that have been shown to be active, there are also MITEs
that are presumed to be currently or recently active. Most of these MITEs were
discovered using computational means and are predicted to be recently or currently
active based mostly on length and sequence conservation amongst members in the
genome. Recently active MITEs are highly homogeneous in length and sequence,
especially in the TIRs and TSDs, as they have not yet accumulated mutations (49, 50).
For example, Nehza is thought to have recently transposed in the genomes of
Anabaena variabilis and Nostoc sp. Nehza is a MITE that is 132-171 bps in length, has
18 bp TIRs, and generates 10 bp TSDs upon insertion. A total of eight copies of Nehza
in A. variabilis and two copies in Nostoc sp. are thought to have been recently active,
due to the highly conserved lengths and TIR sequences. Nehza is speculated to have
been cross-mobilized by the transposase ISNpu3 due to the fact that they share almost
identical TIR sequences (51).
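The notion of homogeneity in length and sequence can be made concrete with a small calculation. The following sketch is not the method used in the cited studies; it assumes the copies of a family have already been aligned to equal length (gaps written as "-") and reports the length range and the mean pairwise identity. All names and the toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of aligned columns with identical characters.

    Columns where both sequences have a gap are skipped; a gap
    opposite a base counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    cols = [(x, y) for x, y in zip(a.upper(), b.upper())
            if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def family_homogeneity(aligned: list) -> dict:
    """Summarize length range and mean pairwise identity of aligned copies."""
    lengths = [len(s.replace("-", "")) for s in aligned]
    identities = [pairwise_identity(a, b)
                  for a, b in combinations(aligned, 2)]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_identity": mean(identities),
    }


# Toy family: three aligned copies, one with a single substitution.
copies = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"]
print(family_homogeneity(copies))
```

A family whose copies show a narrow length range and a mean identity close to 1 would be a candidate for recent activity under the reasoning above, although any cut-off would have to be chosen by the analyst.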
Another family of MITEs, T2-MITEs, is speculated to be currently or recently
active in Xenopus tropicalis. TS clustering is a novel strategy that involves analyzing the
differences in short terminal sequences and can identify MITEs with weak TIR base-
matching. Using TS clustering, a total of 19 242 T2-MITEs were classified into 16 major
subfamilies. Analyses of subfamilies A1, B3 and C showed that they contained
members with highly conserved TSD sequences and contained completely identical
copies. Therefore, it was postulated that these subfamilies may be currently active or
recently active. However, no transposase source has been identified as being
potentially responsible for the transposition of T2-MITEs (52).
1.4 Elucidating how MITEs Achieve High Copy Numbers
Although MITEs do not encode a transposase enzyme, they are often found in high
copy numbers in the genomes in which they reside. For example, in some rice strains
mPing can be up to 1000-fold more abundant than its autonomous partner Ping (25). It is
well-known that the DNA structure of MITEs plays a key role in their transposition.
Studies have shown that TIRs are extremely important in transposition, as they are
recognized and bound by transposase enzymes (53–59). However, the mechanism
through which MITEs achieve such high copy numbers, despite lacking a transposase
coding region, was unknown until recently.
In 2009, a breakthrough study by Yang and colleagues suggested mechanisms
that may explain why MITEs are so successful in achieving high copy numbers in
genomes. Rice Mariner-like transposons, called Osmars, were predicted to be the
transposase source of Stowaway MITEs (called Ost5, Ost8, etc.) in rice due to similar
TIR sequences and the same TSD sequence. To test this, a yeast assay was performed
in which two plasmids were co-transformed into yeast cells. One plasmid contained the
transposase source, while the other plasmid contained an ade2 gene interrupted by a
MITE. Transposition of the MITE was detected based on the recovery of the ADE2 gene
when yeast cells were plated on media lacking adenine (60).
In this study, six of the seven Osmar transposases showed activity, with the
highest excision frequency occurring between the Osmar14 transposase (Osm14) and
the Stowaway MITE Ost35. Site-directed mutagenesis of the elements revealed that the
Ost35 MITE contains multiple motifs throughout its internal region that promote
excision by transposase. Surprisingly, the Osm14 3’ subterminal region contains a
repressive motif that dramatically decreases transposition efficiency (60).
It has been postulated that class II elements persist in genomes across
generations via the relaxation of transposase-DNA binding specificity, thereby softening
the effect of detrimental mutations (27, 61, 62). This theory is supported by the fact that
Osmar transposases are able to cross-mobilize distantly related elements and have
weak DNA-transposase binding specificity (60, 63). Therefore, MITEs may parasitize
these transposases and increase their copy numbers through internal enhancement
motifs, thereby ensuring their persistence in the genome.
1.5 Significance of TEs
In the past, TEs were considered to be “parasitic” DNA that invaded genomes through
transposition (64). However, continual analyses of genomes began to shed light on the
prevalence of TEs across multiple organisms and their influence in these genomes.
Despite the improved understanding of TEs since their discovery, it still remains unclear
to what extent they contribute to genome diversity, evolution, and complexity.
The fact that TEs were once considered parasitic is not surprising, considering
that TE proliferation and transposition have the potential to cause harmful effects on
genomes. TEs are capable of causing mutations either by inserting themselves into
genes, or by their imprecise excision from genic regions, leaving what is known as a TE
“footprint”. For example, the insertion of a P element and copia element into the white
locus in D. melanogaster resulted in a white eye phenotype, reflecting a lack of
pigmentation (65). Furthermore, TE transposition can affect the host at a genome-wide
level. For example, in D. melanogaster larvae, the excision of P elements can cause
massive chromosome breakage, thought to result in temperature-dependent lethality
and sterility (66). However, although mutations induced by TEs can be harmful, it has
also been suggested that these TE-induced mutations can benefit populations through
increased mutation rates, thereby enhancing adaptation to different environments (67).
Despite the harmful effects that TEs may have on their host, there are also
examples of TEs providing direct benefits to their hosts. In D. melanogaster, for
example, certain class I elements have adopted a role similar to that of telomerases.
The transposition of these class I elements, such as HET-A and TART, replaces
damaged chromosome ends thereby maintaining constant chromosome size (68–70). It
has also been suggested that endogenous class I elements may play a role in repairing
double-strand chromosome breaks through reverse transcriptase-mediated events (67,
71, 72).
In shaping the biological properties of the organisms that carry them, TEs can be
useful tools for biotechnological applications such as insertional mutagenesis,
transgenesis, and phylogenetic markers (6, 73–75). Even though TEs were discovered
over 60 years ago in the maize genome, active TEs are continually being discovered
and characterized. Active TEs are at the core of TE-derived genome evolution and can
result in an increase in genome size (76), chromosomal rearrangements (66, 77, 78),
and disrupted or altered gene expression (65, 79–86). Therefore, in-depth
investigations of TEs that are potentially and currently active at genome-wide scales
and the consequences of their activity are critical to understanding genome evolution.
Chapter 2 Elucidating the Transposase Sources for the Transposition of hAT MITEs
2 Introduction to hAT TEs
The first TE ever discovered was the Ac element in maize, which belongs to the hAT
superfamily of TEs (87). The class II hAT superfamily is so named after the hobo
elements in Drosophila melanogaster, Activator (Ac) elements in maize, and Tam3
elements in Antirrhinum majus (88–90). hAT TEs are present in the genomes of a
variety of organisms including plants, mammals, fungi, amphibians, nematodes and fish
[see (91) for review]. Furthermore, hAT TEs are also found in humans, where they are
the most abundant class II TE, comprising approximately 195 Mb of the human genome
(92).
hAT TEs have also undergone molecular domestication, a process in which a
TE-derived coding sequence gives rise to a functional host protein (93). For example, a
gene in A. thaliana is derived from the transposase sequence of the hAT TE,
Daysleeper, and is speculated to act as a transcriptional regulator that is necessary for
plant development (94). Similarly, the DREF gene in D. melanogaster is a chimeric
gene that recruited a transposase DNA-binding domain from a hAT TE. The DREF gene
is involved in multiple cellular activities in D. melanogaster including DNA replication,
cell growth and differentiation (95, 96).
The elements in the hAT superfamily are characterized by generating 8 bp TSDs
upon insertion and having 5-27 bp TIRs, with limited interfamily sequence similarity (97).
Furthermore, both autonomous and nonautonomous elements are found in the hAT
superfamily. For autonomous hAT elements, the transposases have four amino acid
motifs: a zinc finger domain near the N-terminus; a DNA-binding domain; a catalytic
domain; and an insertion domain (88, 98–100). The end region of the catalytic domain is
often referred to as the hAT dimerization domain, as it is commonly conserved in hAT
transposases and plays a role in oligomerization. However, crystal structure analyses of
a hAT transposase suggest that multiple regions may be involved in oligomerization
(100).
Recent evidence suggests that the hAT superfamily can be divided into two
families of TEs based on transposase sequences and target-site selection: the Ac family
and the Buster family. The majority of members in the Ac family have a consensus TSD
sequence of 5’-nTnnnnAn-3’. In contrast, members of the Buster family have a TSD
consensus sequence of 5’-nnnTAnnn-3’. The most significant amino acid variation
between the two families lies in the DNA-binding and insertion domains (101).
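The two consensus sequences above can be expressed as simple patterns. The sketch below, with hypothetical names, reports which family consensus an 8 bp TSD matches. It assumes that n stands for any of A, C, G or T, and it is only an illustration, since family assignment also relies on the transposase sequence.

```python
import re

# Consensus TSDs from the text: Ac 5'-nTnnnnAn-3', Buster 5'-nnnTAnnn-3'.
# "n" is assumed to mean any of A, C, G or T.
TSD_PATTERNS = {
    "Ac": re.compile(r"[ACGT]T[ACGT]{4}A[ACGT]"),
    "Buster": re.compile(r"[ACGT]{3}TA[ACGT]{3}"),
}


def matching_families(tsd: str) -> list:
    """Return the families whose TSD consensus the sequence matches.

    Some 8 bp sequences satisfy both patterns, so a list is returned
    rather than a single label.
    """
    tsd = tsd.upper()
    return [name for name, pattern in TSD_PATTERNS.items()
            if pattern.fullmatch(tsd)]


print(matching_families("GTCCGGAC"))  # ['Ac']
print(matching_families("CCGTAGGC"))  # ['Buster']
print(matching_families("ATATATAT"))  # ['Ac', 'Buster']
```

The third call shows why the TSD alone cannot always separate the two families: a sequence can satisfy both consensus patterns at once.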
TEs are a major contributing factor to the variability and biodiversity of insect
populations [see (102) for review]. The yellow fever mosquito, A. aegypti, is commonly
found in close proximity to human populations and is a major vector of yellow fever,
dengue fever, and chikungunya fever (103–105). Approximately 30,000 people die
every year in Africa and South America as a result of yellow fever (104). In 2007, the A.
aegypti genome was sequenced, revealing that approximately 47% of the genome is
composed of TEs (39). A total of 21 MITE families related to the hAT superfamily
are present in the A. aegypti genome (39). Uncovering which transposases are involved in the
activity of hAT MITE members could elucidate the mechanisms involved in the evolution
and biodiversification of the A. aegypti genome.
3 Methods
3.1 Determining and Cloning hAT MITEs
In order to identify hAT MITEs present in the A. aegypti genome, the bioinformatics tool
MAK was used (106). The Member function of MAK was run using the consensus
sequences of all 21 hAT MITE families from TEfam (http://tefam.biochem.vt.edu) as a
query database (39) (Supplementary Table 1). The output of the Member function
consisted of the nucleotide sequences of every member of each MITE family present in
the A. aegypti genome. A ClustalW alignment was performed for all the members of
each MITE family and a 90% consensus sequence was generated (107). Primers were
designed using the consensus sequence to amplify MITEs for each family
(Supplementary Table 2). Due to mutations in TIR sequences amongst members of
certain MITE families, more than one set of primers was often needed.
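One possible reading of a 90% consensus is sketched below: at each alignment column the most common character is kept if it occurs in at least 90% of the sequences, and an N is written otherwise. This threshold rule and the function name are assumptions for illustration; the exact procedure used to derive the consensus is not specified here.

```python
from collections import Counter


def threshold_consensus(aligned: list, threshold: float = 0.9,
                        ambiguous: str = "N") -> str:
    """Column-wise consensus of equal-length aligned sequences.

    Assumed rule: keep the most common character in a column if its
    frequency is at least `threshold`, otherwise emit `ambiguous`.
    Gap characters are treated like any other character.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        char, count = Counter(c.upper() for c in column).most_common(1)[0]
        consensus.append(char if count / len(aligned) >= threshold else ambiguous)
    return "".join(consensus)


# Toy alignments: nine or eight identical members plus variants.
print(threshold_consensus(["ACGT"] * 9 + ["ACGA"]))      # ACGT
print(threshold_consensus(["ACGT"] * 8 + ["ACGA"] * 2))  # ACGN
```

Positions that come out as N, particularly within the TIRs, are where a single primer pair may fail to match every member of a family.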
The A. aegypti genomic DNA was extracted using the protocol described in
Rivero et al. (2004) with the following modifications: a fresh pupa was used instead of
an adult mosquito; samples were incubated at 55 °C for 4 hours after protease addition;
the suspensions were extracted with a single phenol-chloroform step; and no RNase
was added (108). PCR was carried out using Pfu DNA polymerase (Fermentas Life
Sciences, Burlington, ON), each primer set, and A. aegypti genomic DNA as a template
Once the full coding sequence of the transposase was synthesized, the
fragment was gel-extracted (Qiagen). The ends of the transposase, as well as the
transpososase source plasmid, were digested using the appropriate restriction enzymes
for 3-4 hours at 37ºC (Figure 5). After digestion, the products were column-purified
(Sigma). Once the transposase coding sequence was cloned into the plasmid,
transformation, ligation and verification were performed as described above (Section
3.1: Determining and Cloning hAT MITEs). If the transposase sequence had single
nucleotide point mutations, it was repaired with PCR-based gene synthesis using
primers bearing the correct sequence.
Figure 4: An illustration of the primers designed for a hypothetical hAT transposase with two exons and one intron. Green arrows, primers corresponding to exon #1; orange arrows, primers corresponding to exon #2; TGATCA, SpeI site; GTCGAC, SalI site.
Figure 5: Illustration of transposase source plasmid. Amp, ampicillin resistance gene; ARS H4, autonomous replication sequence of H4 gene; CEN6, centromere of yeast chromosome 6; cyc1 ter, terminator of yeast cytochrome c gene cyc1; OriEC, E. coli replication origin; Pgal1, yeast gal1 promoter. Illustration adapted from Yang et al. (2009).
3.5 Yeast Excision Assays
To prepare competent yeast cells (strain DG2523) for transformation, a yeast colony was
inoculated in 5 mL of YPD broth and grown at 30°C with shaking overnight until a 1:10 dilution of
the culture reached an OD600 of 0.2-0.4. The culture was transferred to 50 mL of YPD
with a starting OD600 of 0.2 and incubated with shaking at 30°C until an OD600 of 0.5-0.8
was reached. Cells were pelleted by centrifugation for 5 min at 4000 rpm. The
supernatant was discarded and the pellet was resuspended in 25 mL of sterile water.
Cells were pelleted again by centrifugation for 5 min at 4000 rpm. The supernatant was
discarded and the pellet was resuspended in 1 mL of 100 mM lithium acetate. Cells were
pelleted by centrifugation for 2 min at 7000 rpm. The supernatant was discarded and
the cells were resuspended in 450 μL of 100 mM lithium acetate [adapted from (121)].
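The volume of overnight culture needed to start the 50 mL culture at an OD600 of 0.2 follows from C1V1 = C2V2. The sketch below shows that arithmetic under the assumption that OD600 scales linearly with cell density over this range; the function names are illustrative and are not part of the protocol.

```python
def undiluted_od600(measured_od600, dilution_factor=10):
    """Estimate the OD600 of the undiluted culture from a diluted reading."""
    return measured_od600 * dilution_factor


def inoculum_volume_ml(stock_od600, target_od600=0.2, final_volume_ml=50.0):
    """Volume of stock culture (mL) so the final culture starts at target_od600.

    Uses C1 * V1 = C2 * V2; assumes OD600 is proportional to cell density.
    """
    if stock_od600 <= target_od600:
        raise ValueError("stock culture is not denser than the target")
    return target_od600 * final_volume_ml / stock_od600
```

For example, a 1:10 dilution reading of 0.2-0.4 corresponds to an undiluted OD600 of roughly 2-4, so between about 5 mL and 2.5 mL of overnight culture would be brought to 50 mL.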
The co-transformation of yeast cells with a transposase plasmid and MITE plasmids was
carried out as follows: 25 μL of competent yeast cells were mixed with 2.9 μL of carrier
DNA (salmon sperm, 5 mg/mL), 60 ng of transposase vector, 60 ng of pooled MITE
vectors, and 200 μL of PEG buffer (40% PEG, 100 mM LiAc, 10 mM Tris pH 8.0, 1 mM
EDTA). Tubes were incubated at 42°C for 45 min. The cells were pelleted by
centrifugation at 7000 rpm for 20 sec and the supernatant was discarded. Cells were
resuspended in 50 μL of sterile water and plated on media lacking histidine and
uracil [adapted from (121)]. Plates were incubated at 30°C until colonies formed and
then placed at room temperature. After approximately 2 weeks, the colonies were
either (i) streaked on media lacking adenine or (ii) inoculated in 2 mL of media lacking
histidine and uracil, grown at 24°C or 30°C for approximately a week, and plated on media
lacking adenine. The plates with media lacking adenine were incubated at 30°C and
inspected regularly for colony formation.
For a positive control, yeast cells were co-transformed with two plasmids
(abbreviated pOst35 and pOsm14Tp) which contain the MITE Ost35 and the
transposase Osm14 from rice. These elements were previously shown to undergo
transposition in the same yeast assay (60). For the negative control, yeast cells were
co-transformed with pOst35 and an empty transposase source plasmid.
4 Results
4.1 Computational Analyses
4.1.1 Finding MITE Members Belonging to the hAT Superfamily of TEs
A total of 5,026 members were retrieved from the MAK Member function. A summary of
each hAT MITE family is given in Table 1. The hAT MITE family TF000576 has the
most MITE members, with a total of 526 retrieved; the hAT MITE family TF000722 has
the fewest, with only five complete members retrieved. TF000715 has 39 clades with
identical members, the most of any hAT MITE family. The highest number of identical
sequences in a single clade ranges from two to six for most hAT MITE families.
However, the hAT MITE family TF000708 has a single clade with 25 identical members.
Table 1: Summary of output retrieved from MAK’s Member function.
MITE Family   # Members   # Clades with Identical Members   Highest # of Identical Sequences in a Single Clade
TF000722      5           1                                 2
TF000576      526         23                                3
TF000700      253         19                                6
TF000703      197         12                                6
TF000706      243         3                                 2
TF000708      287         23                                25
TF000714      256         2                                 2
TF000715      275         39                                4
TF000717      230         6                                 3
TF000718      280         16                                3
TF000719      234         8                                 3
TF000720      295         7                                 2
TF000724      250         3                                 3
TF000725      272         7                                 2
TF001258      240         2                                 2
TF001274      175         1                                 2
TF001275      249         4                                 4
TF001302      248         16                                5
TF001310      187         1                                 2
TF001312      68          3                                 2
TF001332      256         3                                 6
4.1.2 Finding TEs Encoding Putative hAT Transposases
Collectively, the output from both the Anchor and TP_TE functions consisted
of approximately 5,000 DNA sequences of putative TEs encoding transposases (Figure 6A).
After the output from Anchor and TP_TE was
processed by removing redundant sequences and sequences that had high difference
values (as described in Methods: Section 3.2.1), approximately 400 putative
transposase sequences remained (Figure 6B). These
results were further narrowed down by isolating the sequences with ends matching best
to known hAT MITE ends (Figure 6, i & ii) and by isolating
representative sequences with identical copies (Figure 6, 1 & 2).
To select the best candidate TEs for encoding a hAT transposase, a BLASTX
search was performed against known hAT transposases (Figure 6C).
The sequences that encoded amino acid sequences similar to known hAT
transposases were isolated, resulting in a total of 56 putative transposase sequences
(Figure 6D). The TSDs on the 5’ and 3’ ends of a
sequence are typically identical. To refine the search, sequences with non-identical
predicted TSDs were removed, resulting in a total of 23 sequences (Figure 6E),
referred to hereafter as hATTPases. A summary of
the 23 hATTPases is shown in Table 2. Based on the
manual annotation of the hATTPases and their similarity to hAT MITE ends, 14
hATTPase sequences were selected as candidates for experimental analyses (Figure 6F).
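The TSD filtering step was done by manual inspection in this study. As a hedged illustration of the same criterion, the sketch below keeps only candidates whose flanking duplications are identical. It assumes each candidate comes with the genomic sequence immediately upstream and downstream of the element and that the TSD is 8 bp long; the tuple layout and function names are hypothetical.

```python
def tsd_pair(left_flank, right_flank, tsd_len=8):
    """Return the putative 5' and 3' TSDs that directly abut the element."""
    if len(left_flank) < tsd_len or len(right_flank) < tsd_len:
        raise ValueError("flanking sequences are shorter than the TSD length")
    # The 5' TSD is the last tsd_len bases before the element;
    # the 3' TSD is the first tsd_len bases after it.
    return left_flank[-tsd_len:].upper(), right_flank[:tsd_len].upper()


def keep_identical_tsd(candidates, tsd_len=8):
    """Keep (name, tsd) for candidates whose two TSDs are 100% identical.

    `candidates` is an iterable of (name, left_flank, right_flank) tuples.
    """
    kept = []
    for name, left_flank, right_flank in candidates:
        five_prime, three_prime = tsd_pair(left_flank, right_flank, tsd_len)
        if five_prime == three_prime:
            kept.append((name, five_prime))
    return kept
```

A stricter or looser filter (for example, allowing one mismatch) would change only the comparison inside `keep_identical_tsd`.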
Figure 6: A schematic representation of how the best candidate hAT transposase sequences were selected. (A) Anchor and TP_TE output, ~5,000 sequences; redundant sequences and sequences with high flanking sequence difference values are removed. (B) ~400 sequences. (i) The 3' and 5' ends of the sequences and of the hAT MITEs are isolated and aligned; (ii) putative transposases whose ends had the highest similarity to MITEs are isolated. (1) Sequences are aligned and a tree is made; (2) candidates are isolated from clades with highly conserved sequences. (C) BLASTX of all sequences against known hAT transposases. (D) 56 sequences; TSDs are manually inspected and all sequences whose TSDs are not 100% identical are removed. (E) 23 sequences (hATTPases), which are annotated. (F) 14 sequences with the best annotations and highest similarity to MITE ends, chosen for experimental analyses.
Table 2: A summary of the 23 hATTPases. Their accession and position in the A. aegypti genome are shown, along with their size in bp and their TSD and TIR sequences.
Name Accession Position Size (bps) TSD Sequence TIR Sequence
4.1.3 Analysis of hATTPases and their copies in the A. aegypti genome
To better understand the relationships between the hATTPases, their DNA sequence
copies were aligned and a neighbor-joining tree was generated (Figure 7). hATTPase10
has the most copies in the A. aegypti genome of an element that encodes a full or
partial hAT transposase, with a total of five. Some hATTPases are present in a single
copy in the A. aegypti genome, such as hATTPase5, hATTPase3, and hATTPase22.
Furthermore, a maximum-likelihood tree was generated using the amino acid
sequences of every annotated hATTPase (Figure 8). Interestingly, four distinct clades
are evident in the tree. Clades I and IV are highly supported, with node values of 95 and
96, respectively; while clades II and III have weaker support with node values of 73 and
69, respectively.
Figure 7: A neighbor-joining tree of the DNA sequences of hATTPases and their copies.
Figure 8: A maximum likelihood phylogenetic tree of the 23 hATTPase transposase amino acid sequences (50% majority rule consensus). Numbers next to the nodes show quartet puzzling reliability based on 10,000 puzzling steps, a measure of nodal support similar to bootstrapping that is produced by TREE-PUZZLE. Clades are labeled I to IV.
4.1.4 The Buster and Ac families of the hAT Superfamily
To examine which hATTPases belong to the Buster and Ac families of hAT TEs, a
maximum-likelihood tree was generated using all annotated hATTPase amino acid
sequences and all available TE protein sequences described in Arensburger et al.
(2011) (Figure 9). The amino acid alignment is shown in Supplementary Figure 1.
According to Arensburger et al. (2011), the hAT superfamily is divided into two families:
Buster and Ac; Tip TEs could not be placed in either family, nor into a third family, due
to the small sample size used in the study (101).
Based on Figure 9, hATTPase1, hATTPase7, hATTPase8, hATTPase10, and
hATTPase14 belong to the Ac family, while hATTPase2-4, hATTPase13, hATTPase16-19,
and hATTPase23 belong to the Buster family. The Tip proteins are clustered into a
separate clade with hATTPase5, hATTPase6, hATTPase11, hATTPase12,
hATTPase15, and hATTPase20, potentially indicating that these transposase
sequences represent a third, separate family in the hAT superfamily of TEs. Lastly,
hATTPase9, hATTPase21, and hATTPase22 cluster into a fourth, highly-supported
clade with no other known hAT transposase sequences.
Furthermore, some hATTPases are distinctly different from, albeit
related to, any known hAT TE in the A. aegypti genome. For example, although
hATTPase5, hATTPase6 and hATTPase12 cluster with the Tip TEs, none shows strong
sequence similarity to the Tip transposase in A. aegypti (only 12%, 10%, and 10%
sequence identity to AeTip2, respectively). In addition, hATTPase9, hATTPase21, and
hATTPase22 are not clustered within the Ac or Buster family, nor are they clustered
with the Tip TEs (Figure 9).
The Buster and Ac families of hAT TEs have TSD consensus sequences (101).
To determine whether the TSDs of the hATTPases and their copies share the same
consensus sequences as their respective families, sequence frequency logos were
generated (119). The hATTPases were separated into Buster and Ac families based on
the clustering shown in Figure 9. As seen in the sequence frequency logos, the majority
of the TSD sequences belonging to the Buster family have a “T” at position 4 and “A” at
position 5, as expected. Furthermore, the majority of the TSD sequences belonging to
the Ac family have a “T” at position 2. However, although the majority of TSDs also
have an “A” at position 7, “G” also occurs frequently at that position (Figure 10).
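The quantity a sequence frequency logo displays is the per-position base frequency across a set of equal-length TSDs. The sketch below computes those frequencies directly; it is not the logo software cited above, and the function names are illustrative.

```python
from collections import Counter


def position_frequencies(tsds):
    """Return one {base: frequency} dict per position for equal-length TSDs."""
    if not tsds:
        raise ValueError("no TSD sequences given")
    length = len(tsds[0])
    if any(len(tsd) != length for tsd in tsds):
        raise ValueError("TSDs must all have the same length")

    total = len(tsds)
    frequencies = []
    for column in zip(*tsds):
        counts = Counter(base.upper() for base in column)
        frequencies.append({base: n / total for base, n in counts.items()})
    return frequencies


def majority_bases(tsds):
    """Return the most frequent base at each position as a string."""
    return "".join(max(freq, key=freq.get) for freq in position_frequencies(tsds))
```

Positions are zero-based in the returned list, so "position 2" in the text corresponds to index 1.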
Figure 9: A maximum likelihood phylogenetic tree of amino acid transposase sequences from Arensburger et al. (2011) and amino acid sequences of annotated hATTPases (50% majority rule consensus). Numbers next to most nodes show quartet puzzling reliability based on 10,000 puzzling steps, a measure of nodal support similar to bootstrapping produced by TREE-PUZZLE. Clades are labeled Ac, Tip and Buster.
Figure 10: Sequence frequency logos of the TSD sequences for hATTPases and their copies belonging to the Buster and Ac families.
4.1.5 Conserved Domains in Known and Putative Transposase Sequences in the A. aegypti genome
There are currently 14 known hAT TEs in the A. aegypti genome that encode a
hAT transposase sequence, 10 of which encode intact transposase proteins
(http://tefam.biochem.vt.edu). The 10 intact sequences were analyzed for conserved
domains, to see which domains are common across known hAT transposase
sequences in the A. aegypti genome (Figure 11). Only 6 of the intact transposases have
hAT family dimerization domains and 4 have zinc finger domains. Two of the intact
transposases, AeHerves2 and AeHerves3, have a transposase domain of unknown
function called the DUF659 domain. This domain is also found in the harrow TE in
Drosophila (122).
The same conserved domain search was performed for all 23 annotated
hATTPase sequences. hATTPase1 has three domains: zinc finger, hAT family
dimerization and DUF659. hATTPase4, hATTPase7, hATTPase11, hATTPase12,
hATTPase18, and hATTPase23 have the hAT family dimerization domain while
hATTPase8, hATTPase10, hATTPase15 and hATTPase20 have the zinc finger domain.
The rest of the hATTPases (hATTPase2, hATTPase3, hATTPase5, hATTPase6,
and hATTPase21) do not have any apparent conserved domains.
Figure 11: A schematic representation of known intact hAT transposase sequences in A. aegypti (from TEfam) and annotated hATTPases that have conserved sequence domains. Grey lines, transposase sequence; blue, hAT family dimerization domain; red, zinc finger domain; green, DUF659 domain of unknown function.
4.1.6 Linking MITEs to Putative hAT Transposases
The 23 hATTPases were manually annotated, and the DNA sequences with the fewest
mutations in the coding regions and/or those with terminal sequences similar
to hAT MITEs were chosen for experimental analyses. These include
hATTPase10, hATTPase13, hATTPase16, hATTPase18, hATTPase19, and
hATTPase23. Figure 12 illustrates which hATTPases chosen for experimental analyses
have DNA terminal sequences that best match the hAT MITE families; Figure 13
shows the alignments of the end sequences.
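Matching a transposase-coding element to a MITE family by its termini can be sketched as a simple end-identity score. The example below is a simplified stand-in for the sequence alignments used in this study: it compares the first and last n bases without gaps and averages the two identities. The window size, the scoring rule, and the function names are assumptions made for illustration.

```python
def end_identity(a, b, n=30):
    """Fraction of identical positions over the first n bases (ungapped)."""
    a, b = a[:n].upper(), b[:n].upper()
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length


def best_matching_family(element, family_consensus, n=30):
    """Return (family_name, score) for the best terminal match.

    `family_consensus` maps a family name to its consensus sequence.
    The score is the mean of the 5' and 3' end identities; the 3' ends
    are compared by reversing both sequences so they align from the end.
    """
    best = None
    for name, consensus in family_consensus.items():
        five_prime = end_identity(element, consensus, n)
        three_prime = end_identity(element[::-1], consensus[::-1], n)
        score = (five_prime + three_prime) / 2
        if best is None or score > best[1]:
            best = (name, score)
    return best
```

Because insertions and deletions near the termini are not modelled, a gapped alignment would be the safer choice for real data.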
Some MITE families have multiple identical copies in the A.
aegypti genome. For example, the MITE family TF000708 has 25 identical copies
(Table 1); however, no element encoding a hATTPase bears end sequences similar to those of
the TF000708 MITE family. In contrast to rice, where almost every
Stowaway MITE family has TIR sequence similarity to the autonomous Osmar TEs,
there are 14 MITE families in A. aegypti that do not have ends similar to those of any TE
encoding a hATTPase or of any autonomous hAT TE in the genome (123).
Figure 12: Illustration of which hATTPase DNA sequences have ends that are similar in sequence to the ends of each MITE family. Red lines, match MITE family TF000722; blue line, match MITE family TF000576; green lines, match MITE family TF000718; yellow lines, match MITE family TF000706; purple lines, match MITE family TF001275; grey lines, match MITE family TF000715.
Figure 13: Alignment of the end sequences of hAT MITE families that match best with the end sequences of the hATTPase DNA sequences.
4.1.7 Finding TEs using a Top-Down Approach
To find nonautonomous TEs that have not yet been identified and that are potentially
cross-mobilized by the hATTPases, the Topdown function of MAK was run using all
hATTPase DNA sequences as query sequences. Extensive sequence similarity
analysis revealed the existence of TEs that had not previously been recognized or identified in the
A. aegypti genome. Supplementary Table 4 shows the consensus sequence for each
new TE family. A total of three new TE families were found, all of which generate 8 bp
TSDs.
The new TE families were named according to the hATTPase-coding elements
that were used as the query sequences to find them. An alignment of the end sequences
of each new TE family with their respective hATTPase-coding elements shows that the
hATTPase-coding elements have highly similar sequences to the MITE families
(Figure 14). The hATTE1 family has the fewest members, with only 11. The hATTE2
family has 121 total members and is separated into 10 subfamilies. Translated
sequence searches revealed that all 10 subfamilies have amino acid sequence
similarity to the autonomous AeBuster1 TE. Specifically, all subfamilies of hATTE2
show amino acid sequence similarity to the end regions of the AeBuster1 transposase,
with E-values lower than 2e-11. The hATTE2G subfamily showed the most sequence
similarity, with an overall query coverage of 34%. Lastly, the hATTE8 family is separated
into three subfamilies and has a total of 62 members.
4.2 Experimental Analyses
4.2.1 Cloning MITEs
A total of 41 MITE primer sets were designed for 21 hAT MITE families. Cloning was
attempted for members belonging to each MITE family; however, due to ligation
reaction difficulties, only certain MITEs were successfully cloned. Table 3 summarizes
how many individual MITE sequences were successfully cloned into the MITE plasmid.
Figure 14: Alignment of the 5’ and 3’ ends of the three TE families found from TopDown and the hATTPases-coding elements used to find them.
Table 3: The number of individual hAT MITE sequences that were cloned into the donor plasmid for each hAT MITE family.
MITE Family   # MITEs Cloned
TF000722      2
TF000576      6
TF000700      1
TF000703      8
TF000706      2
TF000708      2
TF000714      1
TF000715      1
TF000717      2
TF000718      1
TF000719      0
TF000720      1
TF000724      0
TF000725      0
TF001258      0
TF001274      0
TF001275      0
TF001302      1
TF001310      0
TF001312      0
TF001332      1
4.2.2 Candidate hAT Transposase Analysis and Cloning
Only one transposase coding sequence was successfully cloned and repaired:
hATTPase16. The hATTPase2 transposase was successfully cloned but could not
be repaired and joined. To repair the hATTPase16 transposase, two substitutions and
two insertions were fixed using PCR. Furthermore, the native transposase-coding
sequence had a single intron, flanked by the splice sites “GT” and “AG”. The 2572 bp
long transposon has perfect 12 bp TIRs, composed of “CCAGTGTTTCCC”.
The TE encodes a transposase that is 595 amino acids
long and belongs to the Buster family of TEs.
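A perfect TIR means that the 3' end of the element is the reverse complement of its 5' end. The sketch below measures the length of the perfect TIR of a sequence by comparing it with its own reverse complement, base by base from the termini inward. It is a minimal check for uninterrupted TIRs only; the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def perfect_tir_length(element, max_len=None):
    """Length of the perfect terminal inverted repeat of `element`.

    Position i of the reverse complement is the complement of the base
    i positions in from the 3' end, so matching prefixes of the element
    and its reverse complement correspond to a perfect TIR.
    """
    seq = element.upper()
    rc = reverse_complement(seq)
    limit = len(seq) // 2
    if max_len is not None:
        limit = min(limit, max_len)
    n = 0
    while n < limit and seq[n] == rc[n]:
        n += 1
    return n
```

For a 12 bp TIR beginning with CCAGTGTTTCCC, the element would be expected to end with GGGAAACACTGG.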
4.2.3 Yeast Excision Assays with the Putative hAT Transposase hATTPase16
All cloned hAT MITEs were pooled at equimolar concentrations for yeast excision
assays with the candidate transposase, hATTPase16. The plasmid used for cloning
MITEs contains a Ura3 gene, while the plasmid used for cloning transposases contains
a His3 gene. This enables yeast cells that contain both plasmids to grow on media
lacking histidine and uracil. An example of how plates containing media lacking histidine
and uracil appear after yeast cells are plated is shown in Figure 15. As expected, the
positive, negative and experimental conditions resulted in colony formation on media
lacking histidine and uracil. All transformation reactions that resulted in colony formation
on media lacking histidine and uracil for the positive, negative, and experimental
conditions (as seen in Figure 15) were streaked on media lacking adenine. When
colonies are streaked on media lacking adenine, only cells that have an intact ade2
gene (referred to as ade2 revertants) can grow.
The colonies that were incubated at 30ºC on media lacking histidine and uracil
were streaked on media lacking adenine. As expected, each yeast colony streaked from
the positive control produced ade2 revertants and no colonies grew on the negative
control. Furthermore, no colonies carrying the hATTPase16 and a MITE yielded any
ade2 revertants (Figure 16).
When colonies that were incubated in liquid media lacking histidine and uracil at 25°C and
at 30°C were subsequently plated on media lacking adenine (Figure 17 & Figure 18,
respectively), each colony from the positive control produced ade2 revertants
and no colonies grew on the negative control. No colonies carrying hATTPase16
and a MITE yielded any ade2 revertants.
Figure 15: Example of yeast colonies growing on media lacking histidine and uracil. All transformation reactions that resulted in colony formation for all three conditions, as shown above, were plated on media lacking adenine.
Figure 16: Yeast on media lacking adenine. Plates were streaked with colonies incubated at 30°C on media lacking histidine and uracil. Sections on plates are representative of a single streaked colony. Red arrow, colony.
Figure 17: Yeast on media lacking adenine. Plates were spread with yeast cells from colonies incubated at 25°C in liquid media lacking histidine and uracil. Red arrow, colony.
Figure 18: Yeast on media lacking adenine. Plates were spread with yeast cells from colonies incubated at 30°C in liquid media lacking histidine and uracil. Red arrow, colony.
5 Discussion
Using bioinformatic approaches, TEs encoding hAT transposases can be predicted.
Furthermore, TEs that were recently active can be predicted based on the presence of
multiple conserved copies of that TE in the genome. Since transposases recognize and
bind to the terminal regions of TEs during transposition (111), a prediction of which
transposase protein is responsible for the transposition of which TE(s) can be made
based on the sequence similarity of the terminal regions between two elements.
In this study, a total of 23 TEs were selected as candidates to encode full or
partial transposases belonging to the hAT superfamily. Extensive computational
analysis of these transposases was performed and the copies of each candidate hAT
transposase were retrieved. It was discovered that the candidate hATTPase10 has the
highest copy number of any known hAT transposase in the A. aegypti genome (39).
Amino acid sequence analysis grouped the candidate transposases into four
distinct clades. To determine if any candidates belonged to the Ac and Buster families,
phylogenetic analysis was performed using the amino acid sequences of the annotated
hATTPases as well as hAT transposase sequences from multiple organisms [described
in (101)]. Similarly, four distinct clades were formed. As expected, the Buster and Ac
families formed two separate clades. However, the Tip TEs formed a separate third
clade with high support. This was also seen in the analyses done by Arensburger et al.
(2011), where the three Tip TEs used in their study remained separate from the Ac and
Buster families. In their study, it was concluded that there was an insufficient sample
size for the placement of the Tip TEs into either of the two families or into a separate
third family (101).
In this study, six hATTPases were also placed in the same clade as the Tip TEs.
Although the sample size is still small, these results indicate that the Tip TEs may form
a third family in the hAT superfamily of TEs. Another clade, formed by three
hATTPases, is separate from the Buster and Ac families, as well as the Tip TEs. These
three hATTPases cannot be placed into either the Ac or Buster family, into the Tip
clade, or into a separate clade due to the small sample size. Furthermore, a total of six
hATTPases do not have strong sequence similarity to any known hAT transposase in
A. aegypti, making these sequences newly described as encoding full or partial
for the separation and visualization of specific TEs in a genome. The process starts with
the extraction and digestion of genomic DNA with a restriction enzyme to generate DNA
fragments of different sizes. Adapters are then ligated to the ends of the digested
genomic DNA fragments (Figure 19A-B). A pre-amplification
PCR is performed using a primer that is complementary to the adapter sequence and
another primer that is complementary to the TE sequence (Figure 19C).
A second, selective PCR is performed using the pre-amplification
products as templates with a nested primer set (Figure 19D).
Selective amplification PCR products are analyzed by polyacrylamide gel
or capillary electrophoresis. DNA fragments can be extracted and sequenced if desired.
When the TE family being analyzed has a high copy number in a genome, one or more
selective nucleotide(s) can be added to the 3’ end of the adapter primer used in the
selective amplification reaction to reduce the number of bands per lane (125). The
resulting products consist of DNA fragments containing part of the TE and a flanking
genomic region outside of the TE. These fragments are then resolved on a
polyacrylamide gel, where each band indicates a transposable element at a specific
location in the genome (Figure 19E). The copy number of
the TE family in a genome can be determined and an active TE can be revealed
through the detection of an insertion event within a genome.
Although TD is often used to study TEs experimentally, discovering and
analyzing TEs computationally has become common in TE research, and is made
possible by the abundance of genome sequencing efforts being performed on a variety
of different organisms. Furthermore, most TEs have recognizable structural signatures,
making their identification and annotation possible. In light of this, multiple computer
programs have been developed to find and analyze TEs in different genomes.
A novel bioinformatics tool has been developed that transforms TD into a
computational program. The program, called TE Displayer, was written in
Perl (Practical Extraction and Report Language) with a graphical user interface and
runs on the Windows and Linux operating systems. Using TE Displayer, a user can
choose genome databases and define parameters including an adapter oligo length
(bp), a restriction enzyme recognition site sequence for genomic DNA digestion,
selective base(s), the sequence of the pre-amplification TE primer and the sequence of
the selective amplification TE primer. In addition, a user can specify the allowed number
of mismatches for the pre-amplification PCR primer to anneal to its targets and choose
a DNA size ladder and color(s) for the virtual gel image. The output of TE Displayer
includes a detailed description of each fragment in text format and a graphical
representation of the fragments on a virtual gel image (Figure 20). TE Displayer was
tested using TEs in the Aedes aegypti, Drosophila melanogaster, Caenorhabditis
elegans, Arabidopsis thaliana, and Oryza sativa genomes, and all of the output from
these analyses was consistent with manual inspection.
Figure 19: A schematic representation of Transposon Display. (A) Genomic DNA is extracted; (B) DNA is digested with MseI and adapters are ligated to the ends; (C) Pre-amplification PCR is performed; (D) Selective PCR is performed; (E) Products are run on a polyacrylamide gel. Blue boxes, adapters; grey arrows, pre-amplification primers; black arrows, selective amplification primers.
7 Methods
7.1 Algorithm
The algorithm was implemented in Perl. BLAST searches are performed with the
standalone BLAST program package (version 2.2.22) with an E-value of 10,000. The graphical interface
is implemented with Perl/Tk modules. The GD-2.43 module was used for the generation
of the virtual gel images. BioPerl modules such as Bio::Tools::Run::StandAloneBlast
and Bio::SearchIO are used to perform BLAST searches and parse the output. TE
Displayer has been tested on Linux and Windows (XP, Vista, Windows 7) standalone
systems and the SciNet high performance system (University of Toronto).
Figure 20: Screenshot of the bioinformatics program, TE Displayer.
7.2 Implementation
The parameters required to run TE Displayer are: a restriction enzyme site, a
pre-amplification primer sequence, a selective amplification primer sequence, and an
adaptor size. Optional parameters include a
selective base (A, T, C, or G) and the number of nucleotide mismatches allowed (up to a total of
five) between the pre-amplification primer and the genomic sequence.
When TE Displayer is run, a BLAST search is performed using the pre-amplification
primer as the query sequence and the genomic sequence as the subject
database (Figure 21, i). For each match, a 5 kb sequence flanking the pre-amplification
primer is retrieved and searched for the restriction enzyme
site (as specified by the user) nearest to the pre-amplification primer (Figure 21, ii & iii).
Following this, the region between the pre-amplification primer and the closest
restriction enzyme site is scanned for the selective-amplification primer sequence
(Figure 21, iv). If the selective-amplification primer sequence is found (in the correct
orientation) in this region, the size of the conceptual amplicon is calculated as the sum of the size
of the selective-amplification primer, the size of the adaptor, and the length of the region between
them (Figure 21, v). Every location that contains the target TE sequence is processed in
this manner to produce a conceptual amplicon.
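The steps above can be sketched in a few lines. This is a simplified illustration and not the TE Displayer implementation: it uses exact string matching in place of BLAST, searches the forward strand only, and assumes the region between the selective primer and the adaptor ends where the restriction site begins. The function name and return format are hypothetical.

```python
def virtual_amplicons(genome, pre_primer, sel_primer, site="TTAA",
                      adapter_len=10, flank=5000):
    """Return (pre_primer_position, amplicon_size) for forward-strand hits."""
    genome = genome.upper()
    pre_primer, sel_primer, site = pre_primer.upper(), sel_primer.upper(), site.upper()
    results = []

    start = genome.find(pre_primer)
    while start != -1:
        pre_end = start + len(pre_primer)
        window_end = min(len(genome), pre_end + flank)
        # Nearest restriction site within the flank downstream of the primer.
        site_pos = genome.find(site, pre_end, window_end)
        if site_pos != -1:
            # Selective primer must lie between the pre-amplification
            # primer and the restriction site.
            sel_pos = genome.find(sel_primer, pre_end, site_pos)
            if sel_pos != -1:
                between = site_pos - (sel_pos + len(sel_primer))
                results.append((start, len(sel_primer) + between + adapter_len))
        start = genome.find(pre_primer, start + 1)
    return results
```

Each returned size corresponds to one virtual band; handling the reverse strand and mismatched primer sites would require extending the search step.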
Figure 21: Diagram of the TE Displayer algorithm, steps i to v (see Methods: Implementation). Red arrowhead, pre-amplification primer; black arrowhead, selective-amplification primer. Adapted from Rooke & Yang (2010).
7.3 Output
The output of TE Displayer includes a text output that contains the genomic location,
amplicon size, selective base, and number of mismatches for each amplicon.
Furthermore, the amplicons are displayed as “bands” in a lane on a virtual gel image.
The migration of the virtual bands is calculated using the formula D1/D2 = S2/S1, where
D is migration distance and S is fragment size, so that migration distance is inversely
proportional to size. All output from TE Displayer is consistent with the output
from manual inspection.
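Rearranging D1/D2 = S2/S1 gives the migration distance of any fragment relative to a reference band: D = D_ref × S_ref / S. The sketch below applies that relation; the choice of a single ladder band as the reference and the use of pixels as the distance unit are assumptions made for illustration.

```python
def migration_distance(size_bp, ref_size_bp, ref_distance):
    """Distance migrated by a fragment, from D1/D2 = S2/S1.

    `ref_size_bp` and `ref_distance` describe a reference band
    (for example a ladder band) whose position on the gel is known.
    """
    if size_bp <= 0 or ref_size_bp <= 0:
        raise ValueError("fragment sizes must be positive")
    return ref_distance * ref_size_bp / size_bp


def band_positions(sizes_bp, ref_size_bp, ref_distance):
    """Map each amplicon size to its migration distance on the virtual gel."""
    return {size: migration_distance(size, ref_size_bp, ref_distance)
            for size in sizes_bp}
```

With a 100 bp reference band drawn at 400 pixels, a 200 bp amplicon would be placed at 200 pixels and a 400 bp amplicon at 100 pixels.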
7.4 Parameters Used for Testing TE Displayer
The primer sequences used to find hAT TEs in different genomes are outlined in Table
4. Primers were generated from consensus sequences corresponding to each element
ID found on TEfam (http://tefam.biochem.vt.edu/tefam/) or Repbase
(http://www.girinst.org/repbase/index.html). The adapter size was 10 bp, the primer
mismatch value was 5, and the restriction enzyme used was MseI, with a recognition site
of TTAA.
Table 4: hAT primer sequences and genomes used to generate output for hAT elements.
TE Displayer was run on five genomes: A. thaliana, C. elegans, Oryza sativa,
A. aegypti, and D. melanogaster. To compare TE profiles across these genomes,
primer sequences were developed for different hAT TEs found in each organism.
As expected, the banding pattern for each hAT family differs between organisms,
reflecting the different sizes and copy numbers of the TEs in each genome
(Figure 22A). The number and sizes of the bands were consistent with manual
inspection of the genomic sequences.
Since TE Displayer can be used to look at the same TE family in different
genomes, primer sequences were designed for the MITE mPing. TE Displayer
was run using mPing primer sequences and two different rice genome
sequences: O. sativa var. japonica and O. sativa var. indica. As shown on the virtual gel
image, O. sativa var. japonica has 33 virtual gel bands and O. sativa var. indica has 9
(Figure 22B). This is consistent with previous experimental data (25), which show
significantly more mPing elements in the japonica variety than in the indica variety.
Figure 22: TE Displayer virtual gels. (A) hAT families in different species. Lane 1: A. thaliana; lane 2: C. elegans; lane 3: rice; lane 4: A. aegypti; lane 5: D. melanogaster. (B) mPing elements in rice. Lane 1: O. sativa var. indica; lane 2: O. sativa var. japonica. (C) TF000720 family in A. aegypti with different allowed primer mismatches. Lane 1: no mismatches; lane 2: 1 mismatch; lane 3: 2 mismatches. (D) TF000700 family in A. aegypti with different selective bases. Lane 1: no selective base; lane 2: A; lane 3: C; lane 4: T; lane 5: G. Adapted from Rooke & Yang (2010).
To illustrate TE Displayer's ability to reduce the specificity of the pre-amplification
primer by permitting mismatched nucleotides, a virtual gel image was generated
for the TF000720 TE family using the A. aegypti genome as the database.
When no mismatches were permitted, only four bands appeared on the virtual gel
image. With one and two mismatches permitted, 33 and 37 bands appeared,
respectively (Figure 22C). As expected, the more mismatches are allowed,
the less specific the pre-amplification primer search becomes and the more bands
appear on the virtual gel.
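The effect of the mismatch parameter can be illustrated with a simple sliding-window search that counts mismatched positions (Hamming distance) between the primer and each window of the sequence. This is a sketch under stated assumptions, not the search routine used by TE Displayer: the function name is hypothetical, only substitutions are considered (no insertions or deletions), and only the forward strand is scanned.

```python
def find_with_mismatches(sequence, primer, max_mismatches=0):
    """Return (position, mismatches) for every window within the limit.

    Hypothetical helper: compares the primer to each same-length window
    and keeps windows with at most max_mismatches substitutions.
    """
    hits = []
    n = len(primer)
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        mismatches = sum(1 for a, b in zip(window, primer) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


if __name__ == "__main__":
    seq = "ACGTTTACGATTACCT"
    for limit in (0, 1):
        print(limit, find_with_mismatches(seq, "ACGT", limit))
```

In this toy sequence a single exact site is found when no mismatches are allowed, and three sites are found when one mismatch is allowed, mirroring the increase in band number seen across the lanes of Figure 22C.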
Selective base(s) are a valuable tool to reduce the number of bands per lane on
the virtual gel image. This is often necessary to resolve bands from genomes that have
a high copy number of the TE family of interest. When no selective bases are used to
search the A. aegypti genome for the TE family TF000700, a total of 78 bands appear
on the virtual gel image. When A, C, T, or G is used as the selective base, a total of 23,
14, 22, and 19 bands appear, respectively (Figure 22D).
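The four single-base lanes sum to the unselected total (23 + 14 + 22 + 19 = 78), which is what would be expected if each selective base simply retains the subset of sites carrying that nucleotide at one fixed position. The sketch below illustrates this partitioning. It assumes that the selective base is the nucleotide immediately following a matched anchor sequence; which sequence TE Displayer anchors on is not stated in this section, and the function name is hypothetical.

```python
from collections import Counter


def count_by_selective_base(sequence, anchor):
    """Count anchor matches grouped by the base that follows the anchor.

    Hypothetical helper: exact matches on the forward strand only; a
    match at the very end of the sequence has no following base and is
    skipped.
    """
    counts = Counter()
    start = sequence.find(anchor)
    while start != -1:
        next_pos = start + len(anchor)
        if next_pos < len(sequence):
            counts[sequence[next_pos]] += 1
        start = sequence.find(anchor, start + 1)
    return counts


if __name__ == "__main__":
    toy = "ACGTAACGTCACGTAACGT"
    print(count_by_selective_base(toy, "ACGT"))
```

Choosing one selective base then corresponds to keeping only one entry of the returned counts, which is why each selective lane contains a fraction of the bands in the unselected lane.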
9 Discussion
With the ever-increasing number of whole-genome sequences becoming available in
public databases, there is an increased need for bioinformatics tools that are capable of
processing and analyzing the large amount of data. In TE research, bioinformatics tools
capable of identifying, annotating, and analyzing TEs in genomic databases are
advantageous, if not necessary.
The output of TE Displayer may differ from the banding pattern seen on a TD gel.
For example, the experimental and in silico TE profiles of mPing are similar, but
not exactly the same [see (24)]. Discrepancies may be a result of: (i) transposition
activity of the TE of interest; (ii) incomplete genome sequences, resulting in fewer bands
seen on the virtual gel image; (iii) sequencing and genome-assembly errors, resulting in
incorrect band sizes; and (iv) non-specific amplification during experimental analysis,
resulting in the appearance of non-target sequences.
TE Displayer enables a researcher to create a virtual gel image that mimics the
experimental outcome of TD, and provides detailed text output about band sizes
and genomic coordinates. Currently, TE transposition in an individual can be inferred
from the appearance of novel bands on a TD gel (24, 127). TE Displayer
can similarly be used to detect transposition events by comparing TE profiles across
different individuals, tissues, or generations. In addition, TE Displayer allows
researchers to compare computational TE profiles with experimental TE profiles,
enabling them to detect genome assembly and sequencing errors, and provides
researchers with an initial idea of what to expect on a TD gel.
Chapter 4 Concluding Remarks
Often considered "parasitic", TEs are now known to have a beneficial role in
some instances for the genomes in which they reside. Multiple examples of molecular
domestication illustrate how TEs and TE-derived sequences can become essential
components of genomes, regulating gene expression and becoming crucial for proper
host development (93). Furthermore, TEs have been proposed as major drivers of
vertebrate diversity, and may play an important role in speciation (128). Although TE
movement throughout a genome can cause mutations either through direct
insertion or from TE footprints left at the location of excision, these mutations have been
credited with increasing genetic variation in populations (129).
TEs are a major driving force of genome evolution, despite the fact that the
majority of TEs are not active. In studying genome sequences and sizes, researchers
have revealed some interesting findings about eukaryotic genomes, including the fact
that an organism's morphological complexity and genome size are not correlated and
that eukaryotic DNA consists mostly of non-coding regions [see (130) for
review]. Moreover, it is becoming increasingly evident that TEs are the major contributor
to eukaryotic genome size, with total TE content and genome size having shown a
strong positive correlation (76, 131). Even though TEs were discovered over 60 years
ago in the maize genome, TEs continue to be discovered in a diversity of genomes.
Therefore, revealing which TEs are potentially and currently active on
a genome-wide scale, and what consequences arise from their transposition, is important
for understanding genome evolution. Identifying the full complement of active TEs
will provide a better understanding of genome structure and evolution.
References
1. McClintock B (1948) Mutable loci in maize. Carnegie Institute Washington Year Book
47:155-169.
2. McClintock B (1947) Cytogenetic studies of maize and Neurospora. Carnegie Institute
Washington Year Book 46:146-152.
3. Gardner MJ et al. (2002) Genome sequence of the human malaria parasite Plasmodium
falciparum. Nature 419:498-511.
4. Kunst F et al. (1997) The complete genome sequence of the gram-positive bacterium
Bacillus subtilis. Nature 390:249-256.
5. Schnable PS et al. (2009) The B73 maize genome: complexity, diversity, and dynamics.
Science 326:1112-1115.
6. SanMiguel P, Bennetzen JL (1998) Evidence that a recent increase in maize genome size
was caused by the massive amplification of intergene retrotransposons. Annals of Botany
82:37-44.
7. SanMiguel P et al. (1996) Nested retrotransposons in the intergenic regions of the maize
genome. Science 274:765-768.
8. Biemont C (2010) A brief history of the status of transposable elements: from junk DNA
to major players in evolution. Genetics 186:1085-1093.
9. Finnegan DJ (1989) Eukaryotic transposable elements and genome evolution. Trends in
Genetics 5:103-107.
10. Wicker T et al. (2007) A unified classification system for eukaryotic transposable
elements. Nature Reviews Genetics 8:973-982.
11. Bennetzen JL (2000) Transposable element contributions to plant gene and genome
evolution. Plant Molecular Biology 42:251-269.
12. Kumar A, Bennetzen JL (1999) Plant retrotransposons. Annual Review of Genetics
33:479-532.
13. Han JS, Boeke JD (2005) LINE-1 retrotransposons: modulators of quantity and quality of
mammalian gene expression? Bioessays 27:775-784.
14. Sabot F, Schulman AH (2006) Parasitism and the retrotransposon life cycle in plants: a
hitchhiker’s guide to the genome. Heredity 97:381-388.
15. Cordaux R, Batzer MA (2009) The impact of retrotransposons on human genome
evolution. Nature Reviews Genetics 10:691-703.
16. Smit AF, Riggs AD (1996) Tiggers and DNA transposon fossils in the human genome.
Proceedings of the National Academy of Sciences 93:1443-1448.
17. Feschotte C, Pritham EJ (2005) Non-mammalian c-integrases are encoded by giant
transposable elements. Trends in Genetics 21:551-552.
18. Kapitonov VV, Jurka J (2006) Self-synthesizing DNA transposons in eukaryotes.
Proceedings of the National Academy of Sciences 103:4540-4545.
19. Pritham EJ, Putliwala T, Feschotte C (2007) Mavericks, a novel class of giant transposable
elements widespread in eukaryotes and related to DNA viruses. Gene 390:3-17.
20. display.cgi?uids=1334917 Available at:
http://www.hubmed.org/display.cgi?uids=1334917 [Accessed August 16, 2011].
21. Bureau TE, Wessler SR (1992) Tourist: a large family of small inverted repeat elements
frequently associated with maize genes. Plant Cell 4:1283-1294.
22. Bureau TE, Wessler SR (1994) Mobile inverted-repeat elements of the Tourist family are
associated with the genes of many cereal grasses. Proceedings of the National Academy of
Sciences 91:1411-1415.
23. Kikuchi K, Terauchi K, Wada M, Hirano HY (2003) The plant MITE mPing is mobilized
in anther culture. Nature 421:167-170.
24. Jiang N et al. (2003) An active DNA transposon family in rice. Nature 421:163-167.
25. Naito K et al. (2006) Dramatic amplification of a rice transposable element during recent
domestication. Proceedings of the National Academy of Sciences 103:17620-17625.
26. Jiang N, Feschotte C, Zhang X, Wessler SR (2004) Using rice to understand the origin and
amplification of miniature inverted repeat transposable elements (MITEs). Current
Opinion in Plant Biology 7:115-119.
27. Feschotte C, Swamy L, Wessler SR (2003) Genome-wide analysis of mariner-like
transposable elements in rice reveals complex relationships with stowaway miniature
inverted repeat transposable elements (MITEs). Genetics 163:747-758.
28. MacRae AF, Clegg MT (1992) Evolution of Ac and Ds1 elements in select grasses
(Poaceae). Genetica 86:55-66.
29. Tsubota SI, Huong DV (1991) Capture of flanking DNA by a P element in Drosophila
melanogaster: creation of a transposable element. Proceedings of the National Academy of
Sciences 88:693-697.
30. Surzycki SA, Belknap WR (1999) Characterization of Repetitive DNA Elements in
Arabidopsis. Journal of Molecular Evolution 48:684-691.
31. Unsal K, Morgan GT (1995) A novel group of families of short interspersed repetitive
elements (SINEs) in Xenopus: evidence of a specific target site for DNA-mediated
transposition of inverted-repeat SINEs. Journal of Molecular Biology 248:812-823.
Supplementary Figure 1: The amino acid alignment of annotated hATTPases and transposase protein sequences from Arensburger et al. (2011). Alignments were generated with M-COFFEE.
Supplementary Figure 2: Sequence of the putative hAT transposase, hATTP16. Underlined sequence is the coding region. Insertion locations that were repaired are denoted by asterisks (*). Substitutions that were repaired are denoted by red residues. Grey background, intron; yellow background, TIRs.
TE Family Name Clades # Members Consensus Sequence Size (bp)