Automated analysis of ARISA data using ADAPT system
Robert Schmieder1,2, Matthew Haynes3, Elizabeth Dinsdale3, Forest Rohwer3, and Robert Edwards1,3,4§
1Department of Computer Science, San Diego State University, San Diego, CA, USA
2Computational Science Research Center, San Diego State University, San Diego, CA, USA
3Department of Biology, San Diego State University, San Diego, CA, USA
4Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
§Corresponding author
BS69, and Syntrophus aciditrophicus SB). Additionally, the analysis identified a range of
pathogenic bacterial species present on the two islands (Table 3). The microbial
communities of the coral reefs near the two Northern Line Islands showed a high
percentage of heterotrophic organisms: 72.95% for Kiritimati and 70.32% for Tabuaeran
(Figure 4A). In addition, the Kiritimati coral reef water sample showed a higher
abundance of pathogens (53.02% pathogenic versus 32.65% non-pathogenic species) than the
Tabuaeran water sample (20.13% pathogenic versus 66.51% non-pathogenic species)
(Figure 4B).
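The percentages above are fractions of the organisms identified in each sample. A minimal sketch of such a tally follows. It assumes each matched organism is counted once and carries a single label; the function name, the labels, and the `unknown` category for organisms without an annotation are hypothetical, and whether ADAPT weights organisms by peak height or area is not stated here.

```python
from collections import Counter


def category_percentages(labels):
    """Return {category: percent of matched organisms} for one sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical pathogenicity labels for the organisms matched in one sample.
sample_labels = ["pathogen", "pathogen", "non-pathogen", "unknown"]
percentages = category_percentages(sample_labels)
print(percentages)
```

With an explicit `unknown` category, the pathogenic and non-pathogenic fractions need not sum to 100%.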
Discussion
ARISA is fast, cheap, reproducible, and accurate. The sensitivity of ARISA, in this case the
ability to detect a single-nucleotide difference in sequence length, is high compared to
other fingerprinting techniques. The ARISA technique allows the detection of different
fragment lengths in organisms displaying 99% 16S rRNA gene similarity, suggesting a
higher level of resolution than other techniques [5, 26]. ARISA has previously been used
to analyze the genetic structure of several bacterial communities including samples from
freshwater, bacterioplankton, and different soils. In addition to community and
environmental studies, the technique has been used for clinical applications. The ease of use
and the rapid detection and identification of pathogens present in a sample may be
promising for molecular diagnostics of infectious diseases. Rapid detection is particularly
useful because it would allow quick initiation of the appropriate antimicrobial therapy.
Although ARISA has many advantages over similar techniques, it also has drawbacks
that may limit its widespread application. Some of the drawbacks are unique to ARISA,
but most are common to all environmental sampling techniques. First, the most abundant
bacteria within a sample produce the strongest signal (highest peaks). Therefore, the
abundance of bacteria present in lesser numbers or near the detection limit may not be
measured accurately. Consequently, peaks representing fragments of organisms with low
abundance may be classified as noise or not detected at all. Second, environmental
samples are random samples of the entire microbial community. Hence, taxonomic
profiles obtained for different samples from the same environment may show some
degree of variation. Third, PCR amplification is a stochastic process that is susceptible to
biases. PCR-based techniques such as ARISA may be biased by uneven amplification
efficiencies of the PCR fragments due to sequence heterogeneity [27, 28], differential
extractability of cells in communities [29], or differences in the amount of DNA per cell.
These errors are common to all methods that use PCR amplification, such as T-RFLP and
16S rRNA sequence analysis. As shown in Table 2, some Archaea and Bacteria contain
multiple ITS regions, which can either be exact copies or vary slightly within one
organism. Multiple exact copies of the same ITS region may increase the intensity of a
PCR fragment of a certain length. In the analysis of the ARISA data sets, this may distort
the results (e.g. higher peaks).
An alternative to ARISA for analyzing the taxonomic composition of environmental
samples is 16S rRNA sequence analysis. Recently, DNA pyrosequencing techniques
have been used to sequence hypervariable regions of the 16S rRNA gene to investigate
the microbial diversity in different samples [30]. Pyrosequencing overcomes the low
throughput problem of conventional capillary sequencing, but is expensive both for
sequencing and data analysis.
ARISA is more reproducible than conventional gel-based techniques like T-RFLP. For
example, the instrument automation of ARISA helps ensure reproducibility among
different laboratories or operators. Additionally, ADAPT requires the PCR primers and the
size standard as input in order to perform the analysis. This allows a higher level of comparability
between laboratories using different primer sets and size standards.
Previously established ITS region resources are no longer maintained (e.g. RISSC
collection [31] and IWoCS database [32]), so the information they contain is outdated
and/or inaccurate. The development of the ADAPTdb database was driven by the need
for a maintained ITS region data resource that can be used for the analysis of ARISA data
sets and provides metadata for additional analysis. The ADAPTdb database contains 16S-
ITS-23S regions and metadata of their source organisms, including taxonomy, trophic
classification, and pathogenicity. ADAPT uses the ADAPTdb database to automatically
analyze ARISA data sets.
The usability of ADAPT has been demonstrated in several practical applications.
Although all the samples showed more heterotrophic than autotrophic organisms, the
composition of the reference ITS region database used by ADAPT skews these results:
approximately 92% of the database entries are classified as heterotrophic while only 8%
are classified as autotrophic. Therefore, the results may show a higher proportion of
heterotrophs than the samples actually contain. However, in the case of the Line Islands
analysis, the ARISA analysis supported the previous metagenomic analyses that showed
proportionally higher numbers of heterotrophs on degraded reefs near Kiritimati and
Tabuaeran than on the healthy reef near Kingman [25]. The results also agree with the
previous results from Dinsdale et al. (2008), which showed more potential
pathogens at Kiritimati than on the other islands and the lowest
proportion of potential pathogens at Kingman. In that study, the authors suggested that the increase in
human disturbance coincides with an increase in pathogenic species, some of which are
pathogenic for corals and fish. A higher number of pathogenic species may therefore
damage coral health and reduce the number of small fish.
Fragment length binning, which allows inexact matches when mapping input
data to the database, is necessary to account for variability in fragment lengths caused by
the different sequencing machines and methods used to generate the ARISA profiles.
However, it is not readily apparent which bin values are appropriate. For example, prior studies
used different bin values in different fragment or ITS length intervals. Hewson and
Fuhrman (2004) used a bin value of three for lengths up to 500 bp and seven for lengths
longer than 500 bp [33]. Two years later, Fuhrman et al. used a bin value of three for lengths up to
700 bp, five for 700 bp to 1,000 bp, and ten for lengths longer than 1,000 bp [6]. Hewson
and Fuhrman (2006) used bin values of three for lengths from 400 bp to 700 bp, five from 700 bp
to 1,000 bp, and eleven for lengths longer than 1,000 bp [34]. Ruan et al.
(2006) suggested a binning strategy with variable bin values less than a maximal bin
value. The maximal bin values they used are three for lengths up to 600 bp, five for
lengths up to 698 bp, and seven for lengths longer than 698 bp [35]. The ADAPT web
interface allows the user to set different bin values for three length intervals: < 600 bp,
600 bp to 899 bp, and > 899 bp. The default bin values in the web
interface are five (equivalent to a window of ±2) for lengths smaller than 600 bp and seven
(equivalent to a window of ±3) for lengths greater than or equal to 600 bp. For analysis
without length binning, the length window values can be set to 0 for each interval,
so that only exact length matches against the database entries of ADAPTdb are reported.
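The window-based matching described above can be sketched as follows. This is an illustration under stated assumptions, not the ADAPT implementation: the function names are hypothetical, the window is chosen from the observed fragment length, and a bin value b is treated as a window of ±(b - 1)/2, as in the defaults quoted above.

```python
def window_for_length(length_bp, windows=(2, 3, 3)):
    """Return the +/- window (bp) that applies to a fragment length.

    Intervals follow the text: < 600 bp, 600-899 bp, > 899 bp.
    The default windows 2, 3, 3 correspond to bin values 5, 7, 7.
    """
    if length_bp < 600:
        return windows[0]
    if length_bp <= 899:
        return windows[1]
    return windows[2]


def match_fragment(observed_bp, db_lengths, windows=(2, 3, 3)):
    """Return the database fragment lengths within the window of the observed length."""
    w = window_for_length(observed_bp, windows)
    return sorted(l for l in db_lengths if abs(l - observed_bp) <= w)


# Hypothetical database fragment lengths (bp).
db = [498, 500, 503, 650, 654, 905]
print(match_fragment(500, db))             # window of +/-2
print(match_fragment(652, db))             # window of +/-3
print(match_fragment(500, db, (0, 0, 0)))  # exact length matches only
```

Setting every window to 0 reduces the comparison to exact length matching, which corresponds to the analysis without length binning.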
Accurate identification of ITS regions depends on the correct annotation of flanking 16S
and 23S rRNA gene regions. Commonly, ITS regions are mapped to a database
using the length of the ITS region calculated from the stop of the 16S to the start of
the 23S rRNA gene. This approach requires a conserved length of the flanking regions
where the primers bind (inside the 16S and 23S rRNA genes) and accurate annotation of
the ends of the genes. Using the fragment length of the PCR product (determined by the
primers) instead avoids errors in fragment length
that result from wrongly annotated gene positions. Because the PCR primer set used to
generate the ARISA data set is included in the analysis, misannotations of the flanking
rRNA regions do not change the outcome of the ADAPT analysis.
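To illustrate why the primer-defined fragment length is independent of gene annotation, the following sketch computes the PCR product length directly from the primer binding sites in a 16S-ITS-23S region. It is a simplified example, not the ADAPT implementation: the function names and toy sequences are hypothetical, only exact matches are considered, and degenerate bases are not handled.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def pcr_fragment_length(region, forward_primer, reverse_primer):
    """Length of the product amplified from `region`.

    Returns None if either primer has no exact binding site.
    """
    region = region.upper()
    start = region.find(forward_primer.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its binding site
    # appears in `region` as the reverse complement of the primer.
    reverse_site = reverse_complement(reverse_primer)
    reverse_start = region.find(reverse_site, start + len(forward_primer))
    if reverse_start == -1:
        return None
    return reverse_start + len(reverse_site) - start


# Toy region: 2 bp flank, forward site, 10 bp spacer, reverse site, 2 bp flank.
toy_region = "GG" + "ACGTAC" + "TTTTTTTTTT" + "GATTACA" + "CC"
length = pcr_fragment_length(toy_region, "ACGTAC", "TGTAATC")
print(length)
```

No gene start or stop coordinate enters the calculation, so a shifted annotation of the 16S or 23S gene boundary would leave the computed length unchanged.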
The commonly used primer sets for ARISA [36, 37] might not allow amplification of all
16S-ITS-23S regions in ADAPTdb. The primer sets are designed to recognize the highly
conserved sequences in the flanking regions of the 16S and 23S rRNA genes. The
numbers of exact matches of each primer set to the 16S and 23S regions in the database
were compared (Table 4). The primer set 1406f/23Sr showed the best result with exact
matches against 87% of the regions, followed by the ITSF/ITSReub primer set with 36%
exact matches. We found no exact matches of the S-D-Bact-1522-b-S-20/L-D-Bact-132-a-A-18
primer set against the 16S-ITS-23S regions in the database. As expected, allowing
mismatches in the primer sequences, as may occur during PCR amplification, increased
the number of matching regions for all three primer sets (data not shown).
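A comparison of this kind can be sketched as a count over all regions. The example below is an assumption-laden illustration: it treats a region as matched only when both primers of a set have exact binding sites in the expected order, which may differ from how the values in Table 4 were derived, and all names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def has_exact_primer_sites(region, forward_primer, reverse_primer):
    """True if both primers match `region` exactly, forward site first."""
    region = region.upper()
    forward_at = region.find(forward_primer.upper())
    if forward_at == -1:
        return False
    reverse_site = reverse_complement(reverse_primer)
    return region.find(reverse_site, forward_at + len(forward_primer)) != -1


def exact_match_percentage(regions, forward_primer, reverse_primer):
    """Percentage of regions in which the primer set matches exactly."""
    if not regions:
        return 0.0
    hits = sum(has_exact_primer_sites(r, forward_primer, reverse_primer)
               for r in regions)
    return 100.0 * hits / len(regions)


# Toy regions: two match both primers, one has a mismatch in the reverse
# site, and one lacks the forward site.
toy_regions = [
    "AACCGGTTTTTGATTACA",
    "AACCGGTTTTTGATTTCA",
    "TTTTGATTACA",
    "AACCGGAAGATTACAGG",
]
percent = exact_match_percentage(toy_regions, "AACCGG", "TGTAATC")
print(percent)
```

Tolerating mismatches would require replacing the exact `find` calls with an approximate comparison, which is outside the scope of this sketch.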
Jensen et al. (1993) reported that around 85% of the species they analyzed produced
more than one peak in the ARISA profile [38]. Most of the genomes contain one or two
sets of 16S and 23S rRNA genes. We found that five of the bacterial genomes in the
NCBI Genome database contain 12 copies, and another five contain 13 copies, of the 16S and 23S rRNA
genes. Only a few genomes were annotated with different numbers of 16S and 23S rRNA
genes. Since the 16S and 23S rRNA gene products are components of the same ribosome, it is
unlikely that the number of copies differs between the two genes; such differences are more likely due to
incorrect or incomplete annotations. Because of interoperonic length variation, a
single organism can contribute more than one peak to a sample. This can cause problems
if the analysis does not account for multiple different ITS regions in one organism and
identifies the different ITS regions as individual organisms. Therefore, using only
completely sequenced microbial genomes for the analysis can significantly reduce this
limitation of ARISA. All (annotated) 16S and 23S rRNA gene regions from complete
genomes (including the ITS regions) in ADAPTdb were retrieved from the external
databases and are used to compensate for multiple peaks in a single organism. The
taxonomic composition of the ARISA sample is calculated in ADAPT based on the
fragment lengths that are present in the complete genome for a specific organism.
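The effect of grouping peaks by organism can be sketched as follows. This is a simplified illustration rather than the ADAPT algorithm: the function name and data are hypothetical, lengths are matched exactly (no binning), and the rule for deciding when an organism is reported (for example, whether all of its expected fragments must be observed) is left open because it is not specified here.

```python
def organisms_from_peaks(observed_lengths, genome_fragments):
    """Map each organism to its expected fragment lengths found among the peaks.

    genome_fragments: {organism: set of expected fragment lengths (bp)},
    derived from all rRNA operons of a complete genome.
    Organisms with no observed fragment are omitted.
    """
    observed = set(observed_lengths)
    result = {}
    for organism, expected in genome_fragments.items():
        found = expected & observed
        if found:
            result[organism] = sorted(found)
    return result


# Hypothetical fragment lengths per complete genome.
genome_fragments = {
    "organism_A": {512, 540},  # two ITS length variants, hence two peaks
    "organism_B": {610},
    "organism_C": {720, 735},
}
peaks = [512, 540, 610]
hits = organisms_from_peaks(peaks, genome_fragments)
print(hits)
print(len(peaks), "peaks,", len(hits), "organisms")
```

In this toy example three peaks are attributed to two organisms, whereas treating each peak as a separate organism would report three.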
Only organisms that are represented in the reference database used for the analysis
can be detected in ARISA data sets. Therefore, it is important to
provide as many regions as possible in the database that is used for the analysis. As the
number of sequenced organisms is continuously increasing, it becomes possible to fill in
the gaps of missing regions in the ADAPTdb database. Keeping ADAPTdb up to date
with information on newly sequenced organisms requires regular updates. The newly
developed ADAPTdb is equipped with an automated update function that performs weekly
updates from data resources such as the NCBI Genome database and the SEED database.
In the future, including data resources with ITS regions from organisms whose genomes
are not fully sequenced may prove advantageous.
Conclusions
ADAPT provides a free, platform-independent tool to the research community that
enables the users to automatically analyze ARISA data sets via an easy-to-use web
interface. Environmental samples from the Line Islands were analyzed using these tools
to recapitulate the trophic differences seen in other studies and attributed to human
impact. ADAPT and ADAPTdb enable rapid assessment of environmental communities
by ARISA.
Availability and requirements
• Project name: ADAPT
• Project home page: http://edwards.sdsu.edu/adapt
• Operating system(s): Web service, platform independent
• Programming language: Perl
• Restrictions to use by non-academics: None
Authors' contributions
RS designed and implemented the database and web application, tested the program,
and drafted the initial manuscript. ED collected the water samples. MH extracted the
DNA, processed the samples to generate the ARISA data sets, and participated in the
GUI component design. FR and RE conceived of the study, and participated in its design
and coordination. All authors read and approved the final version of the manuscript.
Acknowledgements
We thank Lutz Krause, Ramy K. Aziz, and Peter Salamon for helpful discussions. This
work was supported by grants from the National Science Foundation (DBI 0850356,
Advances in Bioinformatics), the Gordon and Betty Moore Foundation, and the Canadian
Institute for Advanced Research.
References
1. Fisher MM, Triplett EW: Automated approach for ribosomal intergenic spacer analysis of microbial diversity and its application to freshwater bacterial communities. Appl Environ Microbiol 1999, 65(10):4630-4636.
2. Yannarell AC, Triplett EW: Geographic and environmental sources of variation in lake bacterial community composition. Appl Environ Microbiol 2005, 71(1):227-239.
3. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA: Coupling 16S-ITS rDNA clone libraries and automated ribosomal intergenic spacer analysis to show marine microbial diversity: development and application to a time series. Environ Microbiol 2005, 7(9):1466-1479.
4. Brown MV, Fuhrman JA: Marine bacterial microdiversity as revealed by internal transcribed spacer analysis. Aquat Microb Ecol 2005, 41(1):15-23.
5. Danovaro R, Luna GM, Dell'anno A, Pietrangeli B: Comparison of two fingerprinting techniques, terminal restriction fragment length polymorphism and automated ribosomal intergenic spacer analysis, for determination of bacterial diversity in aquatic environments. Appl Environ Microbiol 2006, 72(9):5982-5989.
6. Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, Naeem S: Annually reoccurring bacterial communities are predictable from ocean conditions. Proc Natl Acad Sci U S A 2006, 103(35):13104-13109.
7. Hewson I, Capone DG, Steele JA, Fuhrman JA: Influence of Amazon and Orinoco offshore surface water plumes on oligotrophic bacterioplankton diversity in the west tropical Atlantic. Aquat Microb Ecol 2006, 43(1):11-22.
8. Leuko S, Goh F, Allen MA, Burns BP, Walter MR, Neilan BA: Analysis of intergenic spacer region length polymorphisms to investigate the halophilic archaeal diversity of stromatolites and microbial mats. Extremophiles 2007, 11(1):203-210.
9. Fuhrman JA, Steele JA, Hewson I, Schwalbach MS, Brown MV, Green JL, Brown JH: A latitudinal diversity gradient in planktonic marine bacteria. Proc Natl Acad Sci U S A 2008, 105(22):7774-7778.
10. Ranjard L, Poly F, Lata JC, Mougel C, Thioulouse J, Nazaret S: Characterization of bacterial and fungal soil communities by automated ribosomal intergenic spacer analysis fingerprints: biological and methodological variability. Appl Environ Microbiol 2001, 67(10):4479-4487.
11. Kirk JL, Beaudette LA, Hart M, Moutoglis P, Klironomos JN, Lee H, Trevors JT: Methods of studying soil microbial diversity. J Microbiol Methods 2004, 58(2):169-188.
12. Jones CM, Thies JE: Soil microbial community analysis using two-dimensional polyacrylamide gel electrophoresis of the bacterial ribosomal internal transcribed spacer regions. J Microbiol Methods 2007, 69(2):256-267.
13. Baudart J, Lemarchand K, Brisabois A, Lebaron P: Diversity of Salmonella strains isolated from the aquatic environment as determined by serotyping and amplification of the ribosomal DNA spacer regions. Appl Environ Microbiol 2000, 66(4):1544-1552.
14. Clementino MM, de Filippis I, Nascimento CR, Branquinho R, Rocha CL, Martins OB: PCR analyses of tRNA intergenic spacer, 16S-23S internal transcribed spacer, and randomly amplified polymorphic DNA reveal inter- and intraspecific relationships of Enterobacter cloacae strains. J Clin Microbiol 2001, 39(11):3865-3870.
15. Gonzalez-Escalona N, Jaykus LA, DePaola A: Typing of Vibrio vulnificus strains by variability in their 16S-23S rRNA intergenic spacer regions. Foodborne Pathog Dis 2007, 4(3):327-337.
16. Boyer SL, Flechtner VR, Johansen JR: Is the 16S-23S rRNA internal transcribed spacer region a good tool for use in molecular systematics and population genetics? A case study in cyanobacteria. Mol Biol Evol 2001, 18(6):1057-1069.
17. Xu D, Cote JC: Phylogenetic relationships between Bacillus species and related genera inferred from comparison of 3' end 16S rDNA and 5' end 16S-23S ITS nucleotide sequences. Int J Syst Evol Microbiol 2003, 53(Pt 3):695-704.
19. Overbeek R, Disz T, Stevens R: The SEED: a peer-to-peer environment for genome annotation. Commun ACM 2004, 47(11):46-51.
20. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009, 37(Database issue):D5-15.
23. The SEED Viewer [http://seed-viewer.theseed.org/]
24. Lproks.txt file with pathogenesis information [ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_0.txt]
25. Dinsdale EA, Pantos O, Smriga S, Edwards RA, Angly F, Wegley L, Hatay M, Hall D, Brown E, Haynes M et al: Microbial ecology of four coral atolls in the Northern Line Islands. PLoS ONE 2008, 3(2):e1584.
26. Jaspers E, Overmann J: Ecological significance of microdiversity: identical 16S rRNA gene sequences can be found in bacteria with highly divergent genomes and ecophysiologies. Appl Environ Microbiol 2004, 70(8):4831-4839.
27. Suzuki M, Rappe MS, Giovannoni SJ: Kinetic bias in estimates of coastal picoplankton community structure obtained by measurements of small-subunit rRNA gene PCR amplicon length heterogeneity. Appl Environ Microbiol 1998, 64(11):4522-4529.
28. Polz MF, Cavanaugh CM: Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol 1998, 64(10):3724-3730.
29. Polz MF, Harbison C, Cavanaugh CM: Diversity and heterogeneity of epibiotic bacterial communities on the marine nematode Eubostrichus dianae. Appl Environ Microbiol 1999, 65(9):4271-4275.
30. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A 2006, 103(32):12115-12120.
31. Garcia-Martinez J, Bescos I, Rodriguez-Sala JJ, Rodriguez-Valera F: RISSC: a novel database for ribosomal 16S-23S RNA genes spacer regions. Nucleic Acids Res 2001, 29(1):178-180.
33. Hewson I, Fuhrman JA: Richness and diversity of bacterioplankton species along an estuarine gradient in Moreton Bay, Australia. Appl Environ Microbiol 2004, 70(6):3425-3433.
34. Hewson I, Fuhrman JA: Improved strategy for comparing microbial assemblage fingerprints. Microb Ecol 2006, 51(2):147-153.
35. Ruan Q, Steele JA, Schwalbach MS, Fuhrman JA, Sun F: A dynamic programming algorithm for binning microbial community profiles. Bioinformatics 2006, 22(12):1508-1514.
36. Cardinale M, Brusetti L, Quatrini P, Borin S, Puglia AM, Rizzi A, Zanardini E, Sorlini C, Corselli C, Daffonchio D: Comparison of different primer sets for use in automated ribosomal intergenic spacer analysis of complex bacterial communities. Appl Environ Microbiol 2004, 70(10):6147-6156.
37. Jones SE, Shade AL, McMahon KD, Kent AD: Comparison of primer sets for use in automated ribosomal intergenic spacer analysis of aquatic bacterial communities: an ecological perspective. Appl Environ Microbiol 2007, 73(2):659-662.
38. Jensen MA, Webster JA, Straus N: Rapid identification of bacteria on the basis of polymerase chain reaction-amplified ribosomal DNA spacer polymorphisms. Appl Environ Microbiol 1993, 59(4):945-952.
Figures
Figure 1 - Basic approach of ARISA
After extracting the DNA from the sample (A1), the ITS region between the 16S and 23S
rRNA genes is amplified using PCR with fluorescently labeled primers (A2). The
resulting PCR products vary in length, since the length of the ITS region can
differ between species. The amplified DNA is then run on a sequencing machine
and the length of the ITS region is measured from its trace output (A3). The
prokaryotic 16S, 23S, and 5S rDNA are arranged in a tandem repetitive cluster (B). NTS
stands for non-transcribed spacer, ETS for external transcribed spacer, and ITS for
internal transcribed spacer.
Figure 2 - ADAPT web interface
The interactive web interface facilitates data input and parameter setup as shown in the
screenshot, as well as navigation through the output of the analysis results.
Figure 3 - Trophic (A) and pathogenic (B) composition of organism entries in the ADAPTdb database
Figure 4 - ADAPT chart output of the trophic (A) and pathogenic (B) fractions for each ARISA sample
B05_KIRB3_003 represents samples taken near Kiritimati and A05_TABB1_001 the
samples taken near Tabuaeran.
Tables
Table 1 - Trophic type classification for phyla in ADAPTdb