Next Generation Sequencing, Tiling Arrays and Predictive Sequence Analysis for Transcriptome Analysis
Gunnar Rätsch
Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany
9th Course in Bioinformatics and Systems Biology for Molecular Biologists, Bertinoro (March 24, 2009)
Discovery of Nuclein (Friedrich Miescher, 1869)
Discovery of Nuclein:
  from lymphocyte & salmon
  “multi-basic acid” (≥ 4)
  Tübingen, around 1869
“If one ... wants to assume that a single substance ... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein” (Miescher, 1874)
Whole-genome Tiling Arrays: Technology and Limitations
Tiling Array Analysis Challenges (III)
Transcript normalization assumes constant transcript intensities $\bar{y}_i$ (median estimate) [Zeller et al., 2008c]
Learns intensity deviation from transcript intensity: $\delta_i := y_i - \bar{y}_i$
Takes probe sequence $x_i$ (positional information on mono-, di- and tri-mer occurrence) as input for regression.
Models probe sequence effect depending on $y_i$: $f(x_i, y_i) \approx \delta_i$
[Figure: log-intensity of the probes along a transcript, showing the transcript intensity and the fold difference δ between observed and transcript intensity.]
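[Figure: the log-intensity axis divided into quantile bins, each with its own function $f_1(x), \ldots, f_q(x), \ldots, f_Q(x)$.]
Discretize y into Q = 20 quantiles and estimate Q independent functions $f_1(x), \ldots, f_Q(x)$.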
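The normalization scheme above can be sketched in a few lines. This is a minimal illustration rather than the method of Zeller et al.: it assumes ridge regression as the regressor, uses only position-specific mononucleotide features (the slide also mentions di- and tri-mers), and bins probes by the transcript intensity estimate. The names `one_hot`, `fit_normalizer`, `normalize` and the `ridge` parameter are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Position-specific mononucleotide indicators (4 features per position)."""
    x = np.zeros(4 * len(seq))
    for pos, base in enumerate(seq):
        x[4 * pos + BASES.index(base)] = 1.0
    return x

def fit_normalizer(seqs, y, y_bar, Q=20, ridge=1.0):
    """Fit one ridge regressor f_q per quantile bin (assumed binning on y_bar)."""
    X = np.array([one_hot(s) for s in seqs])
    delta = y - y_bar                                   # delta_i = y_i - ybar_i
    edges = np.quantile(y_bar, np.linspace(0, 1, Q + 1)[1:-1])
    bins = np.searchsorted(edges, y_bar)                # bin index in 0..Q-1
    d = X.shape[1]
    weights = np.zeros((Q, d))
    for q in range(Q):
        mask = bins == q
        if mask.any():
            A = X[mask].T @ X[mask] + ridge * np.eye(d)
            weights[q] = np.linalg.solve(A, X[mask].T @ delta[mask])
    return edges, weights

def normalize(seqs, y, y_bar, edges, weights):
    """Subtract the predicted probe-sequence effect f_q(x_i) from y_i."""
    X = np.array([one_hot(s) for s in seqs])
    bins = np.searchsorted(edges, y_bar)
    predicted = np.einsum("ij,ij->i", X, weights[bins])
    return y - predicted
```

Fitting Q separate functions lets the sequence effect differ between weakly and strongly expressed transcripts, which a single global regressor could not capture.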
Whole-genome Tiling Arrays: Identification of Expression Differences
Identification of Expression Changes
1. Map tiling probes to annotated transcripts (define probe sets)
2. Use standard microarray tools to analyze gene expression
   Gene expression values are typically computed using robust “summarization methods” that account for probe noise [e.g. Irizarry et al., 2003]
   Significant expression changes are typically identified with a statistical test. Results have to be corrected for multiple testing [e.g. Storey and Tibshirani, 2003]
Advantages of tiling arrays:
  Annotations change; only remapping is needed to obtain expression measurements for the latest annotation.
  Expression can be measured per exon, not only per gene.
  Expression can be measured for introns (⇒ detect retention).
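To make the multiple-testing step in point 2 concrete, here is a small sketch. Note the substitution: the slide cites the q-value approach of Storey and Tibshirani, whereas this example uses the simpler Benjamini-Hochberg adjustment as a stand-in. The function name `benjamini_hochberg` is chosen for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# One p-value per probe set (gene, exon or intron); made-up numbers.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = adjusted <= 0.05
```

Probe sets whose adjusted p-value falls below the chosen level are reported as differentially expressed at that false discovery rate.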
Whole-genome Tiling Arrays: De Novo Transcript Discovery
Transfrag Method / Affymetrix TARs
1. Identify “positive probes” in a local neighborhood. Smooth data locally, across replicates (pseudomedian [Royce et al., 2007a]). Two approaches:
   define an ad hoc threshold on smoothed signal intensity (e.g. 90th signal percentile) [Kampa et al., 2004]
   estimate a threshold from negative bacterial control probes to adjust an empirical false discovery rate [He et al., 2007]
2. Combine positive probes into “transfrags” in case of a run of consecutive positive probes (minRun) interrupted by a limited number of negative probes (maxGap) [Bertone et al., 2004, Kampa et al., 2004]
Apply a statistical test for significant expression change to the signal from transcriptionally active regions (TARs) defined by the previous segmentation [Zeller et al., 2009].
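The smoothing and merging steps of the transfrag method can be sketched as follows. This is a simplified illustration, not the published implementation: the pseudomedian is taken as the Hodges-Lehmann estimator (median of all pairwise averages), and `min_run` and `max_gap` are counted in probes here, which is an assumption (the cited papers may define them in base pairs). All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import median

def pseudomedian(values):
    """Hodges-Lehmann estimator: median of all pairwise averages (i <= j)."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(values, 2))

def smooth(replicates, half_window):
    """Pseudomedian over a sliding probe window, pooling all replicates."""
    n = len(replicates[0])
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = [v for rep in replicates for v in rep[lo:hi]]
        smoothed.append(pseudomedian(window))
    return smoothed

def call_transfrags(signal, threshold, min_run, max_gap):
    """Merge positive probes into (first, last) probe-index intervals.

    Positive probes separated by at most max_gap negative probes are joined;
    intervals spanning fewer than min_run probes are discarded.
    """
    frags, start, last = [], None, None
    for i, value in enumerate(signal):
        if value < threshold:
            continue
        if start is None:
            start = i
        elif i - last - 1 > max_gap:
            if last - start + 1 >= min_run:
                frags.append((start, last))
            start = i
        last = i
    if start is not None and last - start + 1 >= min_run:
        frags.append((start, last))
    return frags
```

With `signal = [0, 5, 5, 0, 5, 5, 0, 0, 0, 5, 0]`, `threshold=1`, `min_run=3` and `max_gap=1`, the single negative probe at index 3 is bridged, while the isolated positive probe at index 9 is dropped for being too short.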
Illumina Sequencing
Solexa released a sequencing machine in 2006
Fragment sizes from 28 to 75 bp
Probes are fixed to a glass plate (“flow cell”)
Reagents are directed through the flow cell
SOLiD Sequencing
Sequencing by ligation: fragments are ligated to “beads”
PCR; beads are enriched with fragments; the ends of the templates are modified to allow attachment to the slide
Beads are deposited onto a glass slide
Di-base probes compete for ligation to the sequencing primer
Short Reads Assembly
Read assembly problem: For a set of reads stemming from a reference genome, find maximally overlapping parts in order to reconstruct the genomic sequence.
Classical assembly (⇒ too inefficient for short reads):
1. Overlap phase: Every read is compared with every other read and the overlap graph is computed
2. Layout phase: Pairs are determined that position every read in the assembly
3. Consensus phase: A multiple alignment of all the placed reads is produced to obtain the final sequence
New techniques: Plethora of tools available (EULER, VELVET, SHARCGS, SSAKE/VCAKE, ...)
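A naive version of the overlap phase shows why the classical approach scales poorly: every ordered pair of reads is compared, so the work grows quadratically with the number of reads. The sketch below assumes error-free reads and exact suffix-prefix matches; `overlap`, `overlap_graph` and `min_len` are illustrative names, not taken from any assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """All-against-all comparison: O(n^2) read pairs."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k   # read i is followed by read j
    return edges

reads = ["ATGGC", "GGCAT", "CATTA"]
graph = overlap_graph(reads)
```

For millions of short reads the number of pairs becomes prohibitive, which motivates the k-mer based graph representations used by the newer tools.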
Nodes represent k-mers smaller than the read length
A k-mer can refer to thousands of reads containing it
Read errors or ambiguities lead to branching of paths
Each node also stores the reverse complement
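The graph properties listed above can be illustrated with a toy k-mer graph. This is a simplified sketch, not Velvet's data structure: instead of pairing each node with a twin node for the reverse complement, it simply adds both strands of every read. The names `revcomp`, `build_graph` and `branching_nodes` are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def build_graph(reads, k):
    """k-mer graph: nodes are k-mers, edges link k-mers overlapping by k-1."""
    counts = defaultdict(int)   # k-mer -> number of occurrences in the reads
    edges = defaultdict(set)    # k-mer -> set of successor k-mers
    for read in reads:
        for strand in (read, revcomp(read)):
            kmers = [strand[i:i + k] for i in range(len(strand) - k + 1)]
            for kmer in kmers:
                counts[kmer] += 1
            for a, b in zip(kmers, kmers[1:]):
                edges[a].add(b)
    return counts, edges

def branching_nodes(edges):
    """Nodes with more than one successor (read errors, variants, repeats)."""
    return sorted(node for node, succ in edges.items() if len(succ) > 1)

# Two reads that differ in their last base create a branch after "CGT".
counts, edges = build_graph(["ACGTT", "ACGTA"], k=3)
```

Because many reads share the same k-mer, the number of nodes depends on the genome rather than on the sequencing depth, unlike the overlap graph.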
[...] Velvet uses slightly more memory, it is significantly faster and produces larger contigs, without mis-assembly. Furthermore, it covers a large area of the genome with high precision.
We also tried using SHARCGS (Dohm et al. 2007) and EULER (Pevzner et al. 2001) but were not able to make these programs work with our data sets. This is probably due to differences in the expected input, particularly in terms of coverage depth and read length.
Discussion
We have developed Velvet, a novel set of de Bruijn graph-based sequence assembly methods for very short reads that can both remove errors and, in the presence of read pair information, resolve a large number of repeats. With unpaired reads, the assembly is broken when there is a repeat longer than the k-mer length. With the addition of short reads in read pair format, many of these repeats can be resolved, leading to assemblies similar to draft status in bacteria and reasonably long (∼5 kb) SCSCs in eukaryotic genomes.
For the latter genomes, the short read contigs will probably have to be combined with long reads or other sequencing strategies such as BAC or fosmid pooling. Simulations of Breadcrumb produced virtually identical N50 lengths on both a continuous 5-Mb region and a discontinuous 5-Mb region made up of random 150-kb BACs, with twofold variation in BAC concentration (data not shown). This approach would then require merging local assemblies.
Sequence connected supercontigs have considerably more information than gapped supercontigs, in that the sequence content separating the definitive contigs is an unresolved graph. One can easily imagine methods that can exclude the presence of a novel sequence in the SCSC completely by considering the potential paths in the unresolved sequence regions, in contrast to traditional supercontigs, where one can never make such a claim. In addition, the unresolved regions will often be dispersed repeats, and as such the classification of such regions as repeats is more important than their sequence content for many applications.
It is important to emphasize that assembly is not a solved problem, in particular with very short reads, and there will continue to be considerable algorithmic improvements. Velvet can already convert high-coverage very short reads into reasonably sized contigs with no additional information. With additional paired read information to resolve small repeats, almost complete genomes can be assembled. We believe the Velvet framework will provide a rich set of different algorithmic options tailored to different tasks and thus provide a platform for cheap de novo sequence assemblies, eventually for all genomes.
Methods
Velvet parameters
Velvet was implemented in C and tested on a 64-bit Linux machine.
The results of Velvet are very sensitive to the parameter k as mentioned previously. The optimum depends on the genome, the coverage, the quality, and the length of the reads. One approach consists in testing several alternatives in parallel and picking the best.
Another method consists in estimating the expected number X of times a unique k-mer in a genome of length G is observed in a set of n reads of length l. We can link this number to the traditional value of coverage, noted C, with the relations:
$$E(X) = \frac{n\,(l - k + 1)}{G - k + 1} \approx \frac{n}{G}\,(l - k + 1) = C\,\frac{l - k + 1}{l}$$
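As a worked example of the relation between read coverage C and expected k-mer coverage E(X), the following sketch evaluates both the exact expression and the approximation. It uses C = n l / G, which is the standard definition of coverage; the function name and the example numbers are illustrative only.

```python
def kmer_coverage(n, l, k, G):
    """Expected number of times a unique k-mer is observed.

    n: number of reads, l: read length, k: k-mer length, G: genome length.
    Returns (exact, approximate) values of E(X).
    """
    exact = n * (l - k + 1) / (G - k + 1)
    C = n * l / G                       # read coverage
    approx = C * (l - k + 1) / l
    return exact, approx

# One million 35-bp reads on a 5-Mb genome (C = 7) with k = 21.
exact, approx = kmer_coverage(n=1_000_000, l=35, k=21, G=5_000_000)
```

Here a read coverage of 7 corresponds to a k-mer coverage of about 3, which shows how quickly the effective coverage drops as k approaches the read length.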
Figure 6. Breadcrumb performance on simulated data sets. As in Figure 3, we sampled 5-Mb DNA sequences from four different species (E. coli, S. cerevisiae, C. elegans, and H. sapiens, respectively) and generated 50× read sets. The horizontal lines represent the N50 reached at the end of Tour Bus (see Fig. 3) (broken black line) and after applying a 4× coverage cutoff (broken red line). Note how the difference in N50 between the graph of perfect reads and that of erroneous reads is significantly reduced by this last cutoff. (Black curves) The results after the basic Breadcrumb algorithm; (red curves) the results after super-contigging.
Table 3. Comparison of short read assemblers on experimental Streptococcus suis Solexa reads
Assembler    No. of contigs   N50       Average error rate   Memory   Time           Seq. Cov.
Velvet 0.3   470              8661 bp   0.02%                2.0G     2 min 57 sec   97%
SSAKE 2.0    265              1727 bp   0.20%                1.7G     1 h 47 min     16%
VCAKE 1.0    7675             1137 bp   0.64%                1.8G     4 h 25 min     134%
Excerpt from: Short read de novo assembly using de Bruijn graphs. Genome Research, p. 827 [Zerbino and Birney, 2008].
[Figure: Transcript identification with artificially generated reads from two isoforms. The first isoform’s average read coverage is constant at 10, while the second one’s is varied (x-axis). The system accurately determines the transcripts including their abundance (y-axis), shown in blue and green.]
Compute a test statistic for each normalized, log-transformed probe intensity $X_{ijk}$, the hybridization intensity for probe i under condition j in replicate k:
$X_{ijk} \mid \mu_{ij}, \sigma_i^2 \sim N(\mu_{ij}, \sigma_i^2)$
Estimate every $\sigma_i^2$ to approximate the posterior distribution of $\mu_{ij}$.
Use a formula akin to a t-statistic, which estimates the standard deviation not only from probe i but pools information from all probes for higher sensitivity.
Combine information from neighboring probes (moving average/sliding window or an HMM).
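The probe-level test described above can be sketched as a variance-shrinkage t-like statistic followed by a moving average. This is a deliberately simplified illustration: the fixed shrinkage weight `shrink` is an assumption made here, whereas methods of this kind typically estimate the amount of pooling from the data. The function names are hypothetical.

```python
import numpy as np

def moderated_t(x_a, x_b, shrink=0.5):
    """t-like statistic per probe for two conditions.

    x_a, x_b: arrays of shape (probes, replicates), log-transformed.
    shrink: assumed fixed weight pulling each probe variance toward the
            mean variance over all probes.
    """
    k_a, k_b = x_a.shape[1], x_b.shape[1]
    diff = x_a.mean(axis=1) - x_b.mean(axis=1)
    ss_a = ((x_a - x_a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    ss_b = ((x_b - x_b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    var_i = (ss_a + ss_b) / (k_a + k_b - 2)        # per-probe pooled variance
    var_shrunk = (1 - shrink) * var_i + shrink * var_i.mean()
    return diff / np.sqrt(var_shrunk * (1 / k_a + 1 / k_b))

def moving_average(t, half_window):
    """Average each probe's statistic with its genomic neighbors.

    The array is zero-padded, so values at the ends are attenuated.
    """
    width = 2 * half_window + 1
    return np.convolve(t, np.ones(width) / width, mode="same")
```

Pooling matters most with few replicates: a probe whose replicates happen to agree exactly would otherwise have zero estimated variance and an infinite statistic.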
Summary & Conclusions
Methods for characterizing transcriptomes
  in different organisms
  under different conditions (development/environment)
Gene finding methods
  Improved accuracy due to novel inference methods
  Limitations: no alternative transcripts or expression information
Analysis of tiling array data & short reads
  Identification of alternative and differential splicing
  Segmentation of tiling array data to identify transcribed regions
  Read alignments difficult, but very promising data
Combination of predictions and transcriptome measurements
  Leads to improved gene finding
  Condition-specific transcriptome predictions
  Helps to uncover the full complexity of transcriptomes
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
M. Bergkessel, G. Wilmes, and C. Guthrie. SnapShot: Formation of mRNPs. Cell, 136, January 2009.
Paul Bertone, Viktor Stolc, Thomas E Royce, Joel S Rozowsky, Alexander E Urban, Xiaowei Zhu, John L Rinn, Waraporn Tongprasit, Manoj Samanta, Sherman Weissman, Mark Gerstein, and Michael Snyder. Global identification of human transcribed sequences with genome tiling arrays. Science, 306(5705):2242–6, Dec 2004. doi: 10.1126/science.1103388.
RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
A. Coghlan, T.J. Fiedler, S.J. McKay, P. Flicek, T.W. Harris, D. Blasiar, The nGASP Consortium, and L.D. Stein. nGASP: the nematode genome annotation assessment project. BMC Bioinformatics, 2008. submitted.
Lior David, Wolfgang Huber, Marina Granovskaia, Joern Toedling, Curtis J Palm, Lee Bofkin, Ted Jones, Ronald W Davis, and Lars M Steinmetz. A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci USA, 103(14):5320–5, Apr 2006. doi: 10.1073/pnas.0601091103.
F. De Bona, S. Ossowski, K. Schneeberger, and G. Rätsch. QPALMA: Optimal spliced alignments of short sequence reads. Bioinformatics, 24:i174–i180, 2008.
Jiang Du, Joel S Rozowsky, Jan O Korbel, Zhengdong D Zhang, Thomas E Royce, Martin H Schultz, Michael Snyder, and Mark Gerstein. A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics, 22(24):3016–24, Dec 2006. doi: 10.1093/bioinformatics/btl515.
R Durbin, S Eddy, A Krogh, and G Mitchison. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
J. Eichner. Analysis of alternative transcripts in Arabidopsis thaliana with whole genome arrays. Master’s thesis, University of Tübingen, Sand 13, 72076 Tübingen, Germany, June 2008.
J. Eichner, G. Zeller, S. Laubinger, D. Weigel, and G. Rätsch. Analysis of alternative transcripts in Arabidopsis thaliana with whole genome arrays. forthcoming, March 2009.
Housheng He, Jie Wang, Tao Liu, X Shirley Liu, Tiantian Li, Yunfei Wang, Zuwei Qian, Haixia Zheng, Xiaopeng Zhu, Tao Wu, Baochen Shi, Wei Deng, Wei Zhou, Geir Skogerbø, and Runsheng Chen. Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. Genome Research, 17(10):1471–7, Oct 2007. doi: 10.1101/gr.6611807.
Wolfgang Huber, Joern Toedling, and Lars M Steinmetz. Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics, 22(16):1963–70, Aug 2006. doi: 10.1093/bioinformatics/btl289. URL http://bioinformatics.oxfordjournals.org/cgi/content/full/22/16/1963.
Rafael A Irizarry, Benjamin M Bolstad, Francois Collin, Leslie M Cope, Bridget Hobbs, and Terence P Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15, Feb 2003.
Hongkai Ji and Wing Hung Wong. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics, 21(18):3629–36, Sep 2005a. doi: 10.1093/bioinformatics/bti593. URL http://bioinformatics.oxfordjournals.org/cgi/content/full/21/18/3629.
Hongkai Ji and Wing Hung Wong. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics, 21(18):3629–3636, Sep 2005b. ISSN 1367-4803 (Print). doi: 10.1093/bioinformatics/bti593.
W Evan Johnson, Wei Li, Clifford A Meyer, Raphael Gottardo, Jason S Carroll, Myles Brown, and X Shirley Liu. Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci USA, 103(33):12457–12462, Aug 2006. ISSN 0027-8424 (Print). doi: 10.1073/pnas.0601180103.
Dione Kampa, Jill Cheng, Philipp Kapranov, Mark Yamanaka, Shane Brubaker, Simon Cawley, Jorg Drenkow, Antonio Piccolboni, Stefan Bekiranov, Gregg Helt, Hari Tammana, and Thomas R Gingeras. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Research, 14(3):331–42, Mar 2004. doi: 10.1101/gr.2094104. URL http://genome.cshlp.org/cgi/content/full/14/3/331.
Todd C Mockler, Simon Chan, Ambika Sundaresan, Huaming Chen, Steven E Jacobsen, and Joseph R Ecker. Applications of DNA tiling arrays for whole-genome analysis. Genomics, 85(1):1–15, Jan 2005. doi: 10.1016/j.ygeno.2004.10.005.
Kasper Munch, Paul P Gardner, Peter Arctander, and Anders Krogh. A hidden Markov model approach for determining expression from genomic tiling micro arrays. BMC Bioinformatics, 7:239, Jan 2006. doi: 10.1186/1471-2105-7-239.
Ossowski. Next generation sequencing. Oral presentation at the PhD Symposium in Tübingen, Germany, November 2007.
S. Ossowski, K. Schneeberger, R. Clark, C. Lanz, N. Warthmann, and D. Weigel. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Research, 18:2024–2033, 2008.
E Purdom, K M Simpson, M D Robinson, J G Conboy, A V Lapuk, and T P Speed. FIRMA: a method for detection of alternative splicing from exon array data. Bioinformatics, 24(15):1707–14, Aug 2008. doi: 10.1093/bioinformatics/btn284. URL http://bioinformatics.oxfordjournals.org/cgi/content/full/24/15/1707.
G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
Thomas E Royce, Nicholas J Carriero, and Mark B Gerstein. An efficient pseudomedian filter for tiling microarrays. BMC Bioinformatics, 8:186, Jan 2007a. doi: 10.1186/1471-2105-8-186.
Thomas E Royce, Joel S Rozowsky, and Mark B Gerstein. Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics, 23(8):988–97, Apr 2007b. doi: 10.1093/bioinformatics/btm052. URL http://bioinformatics.oxfordjournals.org/cgi/content/full/23/8/988.
Manoj Pratim Samanta, Waraporn Tongprasit, Himanshu Sethi, Chen-Shan Chin, and Viktor Stolc. Global identification of noncoding RNAs in Saccharomyces cerevisiae by modulating an essential RNA processing pathway. Proc Natl Acad Sci USA, 103(11):4192–7, Mar 2006. doi: 10.1073/pnas.0507669103.
G. Schweikert, G. Zeller, A. Zien, J. Behr, C.S. Ong, P. Philips, A. Bohlen, R. Bohnert, F. De Bona, S. Sonnenburg, and G. Rätsch. mGene: Accurate computational gene finding with application to nematode genomes. under revision for Genome Research, March 2009.
S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
S Sonnenburg, G Schweikert, P Philips, J Behr, and G Rätsch. Accurate splice site prediction using support vector machines. BMC Bioinformatics, 8 Suppl 10:S7, 2007. ISSN 1471-2105 (Electronic). doi: 10.1186/1471-2105-8-S10-S7.
Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proc Natl Acad Sci USA, 100(16):9440–5, Aug 2003. doi: 10.1073/pnas.1530509100.
Charles W Sugnet, Karpagam Srinivasan, Tyson A Clark, Georgeann O’Brien, Melissa S Cline, Hui Wang, Alan Williams, David Kulp, John E Blume, David Haussler, and Manuel Ares. Unusual intron conservation near tissue-regulated exons found by splicing microarrays. PLoS Comput Biol, 2(1):e4, Jan 2006. doi: 10.1371/journal.pcbi.0020004.
Marc Sultan, Marcel H Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff, Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov, Dmitri Parkhomchuk, Dominic Schmidt, Sean O’Keeffe, Stefan Haas, Martin Vingron, Hans Lehrach, and Marie-Laure Yaspo. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321(5891):956–960, 2008. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1160342.
I. Sutskever. Arachne: A whole genome shotgun assembler. Oral presentation, 2008.
E.T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S.F. Kingsmore, G.P. Schroth, and C.B. Burge. Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221):470–476, 2008. ISSN 1476-4687 (Electronic). doi: 10.1038/nature07509.
Junshi Yazaki, Brian D Gregory, and Joseph R Ecker. Mapping the genome landscape using tiling array technology. Current Opinion in Plant Biology, 10(5):534–42, Oct 2007. doi: 10.1016/j.pbi.2007.07.006. URL http:
G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008a. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
G. Zeller, S. Henz, S. Laubinger, D. Weigel, and G. Rätsch. Transcript normalization and segmentation of tiling array data. In Proc. PSB 2008. World Scientific, 2008b.
G Zeller, S Henz, C Widmer, T Sachsenberg, G Rätsch, D Weigel, and S Laubinger. Stress-induced changes in the Arabidopsis thaliana transcriptome analyzed using whole genome tiling arrays. Plant J, Feb 2009. doi: 10.1111/j.1365-313X.2009.03835.x.
Georg Zeller, Stefan R Henz, Sascha Laubinger, Detlef Weigel, and Gunnar Rätsch. Transcript normalization and segmentation of tiling array data. Pacific Symposium on Biocomputing, pages 527–38, Jan 2008c.
D.R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18:821–829, 2008.
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.