Next-Gen Sequencing Bioinformatics Support GPCL-BAC Rick Jordan, Programmer/Analyst J. Lyons-Weiler, Sci. Director September 26, 2008


Apr 01, 2015


Drake Cable

Next-Gen Sequencing Bioinformatics Support

GPCL-BAC

Rick Jordan, Programmer/Analyst J. Lyons-Weiler, Sci. Director

September 26, 2008


Process

• GPCL-BAC Director & Analyst meet w/PI
– Discuss Data Analysis Needs & Study Design

• PI Decides on Use of BAC or “Go It Alone”
– “Go It Alone” -> data (.sff files)
– “Use the BAC”
    • data analysis $ estimate
    • annotation, assembly, & analysis + data

• PI reviews Preliminary Research Report w/Analyst

• After final analysis, PI receives Report & Data

• Often the Analysis will be tailored to the application


de novo Analysis Flowchart

[Flowchart; elements recovered from the slide:]
• 454 GS FLX: image files, sequences
• GS FLX System: image processing, signal processing, sequence processing
• Parameter files: dataRunParams.parse, analysisParams.parse
• Output files: .sff files, 454RuntimeMetrics.csv, 454QualityFilterMetrics.csv, 454BaseCallerMetrics.csv
• Data/reads exported to data rig
• Assembler (GS or Lasergene)
• Analysis & Annotation


Image processing


Lasergene SeqBuilder

• Reference sequence: E. coli K12


Signal Processing


de novo Genome Assembly

• Two software packages currently used:

– GS FLX Assembler (Newbler algorithm): can be used for all experiments

– Lasergene (SeqMan Pro): single-end experiments only


GS de novo Assembler

• Input: .sff files and per-base quality scores

• Output: Consensus sequence, assembled de novo

• Main processing steps:
– Identify pairwise overlaps between reads
– Construct multiple alignments of contigs
– Generate consensus basecalls of contigs
– Output contig consensus sequences and quality scores, along with ACE file of multiple alignments and assembly metrics files

From 454 Sequencing GS-FLX Data Analysis Software Manual, Dec 2007
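
The first processing step, finding pairwise overlaps between reads, can be illustrated with a toy sketch. This is not the Newbler algorithm: it is a minimal greedy suffix/prefix merge on error-free reads, and it omits the multiple-alignment and consensus-calling steps entirely. The function names and the example reads are made up for illustration.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_merge(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Toy assumption: reads are error-free and all on the same strand.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                k = suffix_prefix_overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads that tile a 13-base sequence.
contigs = greedy_merge(["ATGGCGT", "GCGTACC", "TACCTTA"])
print(contigs)
```

A production assembler must also tolerate sequencing errors and use the per-base quality scores when calling the consensus, which this sketch does not attempt.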


e.g. Graphic Figure of the Assembly (Lasergene 7.2)


GS Reference Mapper

• Generates the consensus DNA sequence by mapping, or alignment, of the reads to a reference sequence

• Provides a list of high-confidence mutations (individual bases or blocks of bases that differ between the consensus DNA sequence of the sample and the reference sequence)

From 454 Sequencing GS-FLX Data Analysis Software Manual, Dec 2007
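
The idea of a "difference between the consensus and the reference" can be sketched as follows. This assumes the two sequences have already been aligned to the same length (with '-' marking gaps) and simply lists the mismatching positions. The confidence filtering that the GS Reference Mapper applies is not reproduced here, and the function name and sequences are hypothetical.

```python
def list_differences(reference: str, consensus: str):
    """Return (1-based position, reference base, consensus base) per mismatch.

    Assumes both strings are pre-aligned and of equal length.
    """
    if len(reference) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, cons_base)
        for pos, (ref_base, cons_base) in enumerate(zip(reference, consensus), start=1)
        if ref_base != cons_base
    ]


# Made-up example: a single substitution at position 4.
diffs = list_differences("ACGTACGT", "ACGAACGT")
print(diffs)
```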


Genome Annotation (sequence functional classes)

Zuber et al. (2007)


Gene annotation with SeqMan Pro Project


e.g. Diagrams

Smith et al. (2007)


Impacted Pathways

Rank | Pathway Name | Impact Score | # Genes in Pathway | # Input Genes on Chip | # Pathway Genes on Chip | % Pathway Genes in Input | p-value | Corrected p-value
1 | Phosphatidylinositol signaling system | 10.508 | 55 | 4 | 46 | 7.273 | 0.007995 | 0.007995
2 | ECM-receptor interaction | 6.746 | 62 | 4 | 57 | 6.452 | 0.016721 | 0.016721
3 | Wnt signaling pathway | 6.731 | 113 | 6 | 92 | 5.31 | 0.005133 | 0.005133
4 | B cell receptor signaling pathway | 6 | 59 | 4 | 47 | 6.78 | 0.008623 | 0.008623
5 | Melanogenesis | 5.765 | 86 | 4 | 63 | 4.651 | 0.023291 | 0.023291
6 | Gap junction | 5.422 | 76 | 4 | 63 | 5.263 | 0.023291 | 0.023291
7 | GnRH signaling pathway | 5.026 | 84 | 4 | 68 | 4.762 | 0.029813 | 0.029813
8 | Focal adhesion | 4.954 | 163 | 5 | 140 | 3.067 | 0.095078 | 0.095078
9 | Long-term potentiation | 4.868 | 62 | 3 | 49 | 4.839 | 0.052339 | 0.052339
10 | Olfactory transduction | 4.673 | 27 | 2 | 21 | 7.407 | 0.050233 | 0.050233
11 | Calcium signaling pathway | 4.644 | 164 | 5 | 131 | 3.049 | 0.076528 | 0.076528
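
In the rows above, the percentage column (% pathway genes in input) appears to equal the number of input genes divided by the number of genes in the pathway. The slide does not state this definition, so it is inferred from the numbers. A quick check on the first three rows, with a hypothetical helper name:

```python
def percent_in_input(genes_in_pathway: int, input_genes: int) -> float:
    """Input genes as a percentage of all genes in the pathway (3 decimals)."""
    return round(100.0 * input_genes / genes_in_pathway, 3)


# (pathway name, genes in pathway, input genes, percentage shown on the slide)
rows = [
    ("Phosphatidylinositol signaling system", 55, 4, 7.273),
    ("ECM-receptor interaction", 62, 4, 6.452),
    ("Wnt signaling pathway", 113, 6, 5.31),
]

for name, in_pathway, in_input, shown in rows:
    print(name, percent_in_input(in_pathway, in_input), shown)
```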


e.g. Pathway view


e.g. COGS table

Smith et al. (2007)


e.g. Sequencing statistics table

Marcy et al. (2007)


Base Caller Metrics


Quality Filter Metrics


Runtime Metrics


Quality measures by region

[Figure: eight histograms of number of bases (y-axis) against Q score (x-axis), one each for TCA and ATG in Regions 1 to 4.]


Read lengths by region

[Figure: eight histograms of number of reads (y-axis) against read length (x-axis), one each for TCA and ATG in Regions 1 to 4.]


e.g. Blast results


e.g. Predicted nucleotide and protein alignment

Raymond et al. (2007)


e.g. Predicted protein alignment

Raymond et al. (2007)


Grant Text

Next Generation Sequence Bioinformatics Analysis

The Bioinformatics Analysis Core has the software and staff needed to analyze data from resequencing and de novo sequencing studies. Software acquisitions include the default Genome Sequencer modules and the recently acquired specialized Lasergene 7.2 software from DNASTAR. One BAC staff member is dedicated to the analysis of long-read NextGen sequencing data and is responsible for generating research reports for each project.

Genome Sequencer FLX System Software

The FLX System Software includes modules for each stage in the analysis. All raw data are accessible, and the system also offers a variety of third-party software packages for niche applications.

Data QA/QC

The Core uses a variety of data quality control measures, including consensus accuracy and quality scores, both per base (Q20+) and per genome (%Bases Q20+; the proportion of an assembled genome with base call accuracy of at least 99%).
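
The link between a quality score and base call accuracy follows the standard Phred definition, Q = -10 log10(P_error), so Q20 corresponds to a 1-in-100 error rate. A minimal sketch of that conversion and of a %Bases Q20+ calculation follows. The function names and the example scores are illustrative, not taken from the Genome Sequencer software.

```python
def phred_to_accuracy(q: float) -> float:
    """Base call accuracy implied by a Phred-scaled quality score."""
    return 1.0 - 10.0 ** (-q / 10.0)


def percent_bases_at_or_above(quals, threshold: int = 20) -> float:
    """Percentage of bases whose quality score is >= threshold."""
    quals = list(quals)
    if not quals:
        return 0.0
    return 100.0 * sum(q >= threshold for q in quals) / len(quals)


# Made-up per-base quality scores for a short consensus stretch.
example_quals = [10, 20, 30, 40]
print(phred_to_accuracy(20))
print(percent_bases_at_or_above(example_quals, threshold=20))
```

Here three of the four example bases meet the Q20 threshold.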

The Core has also acquired the licenses required to run the full suite of Lasergene applications, rounding out the Core’s genome annotation capabilities. In addition to the sequence assembler/SNP discovery algorithms in SeqMan Pro and the visualization and sequence editing modules (SeqBuilder), the Lasergene suite adds the capacity for gene finding (GeneQuest) and protein structure analysis & prediction (Protean).

Handling the variety of file types the Core is expected to receive is greatly aided by Lasergene’s EditSeq and by the much-improved interoperability of SeqMan Pro (which can import .sff, .fna, .fas and .qual files).
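
To show how two of these file types relate, here is a small sketch that pairs sequences from a .fna file with scores from a .qual file. It assumes the common layout in which .fna is FASTA text and .qual uses the same ">name" headers followed by space-separated integer quality scores. The function name, read names and records are invented for the example, and binary .sff files are not handled.

```python
def parse_fasta_like(text: str, numeric: bool = False) -> dict:
    """Parse FASTA-style text into {record name: sequence or list of scores}.

    With numeric=True, record bodies are read as whitespace-separated integers
    (the assumed .qual layout); otherwise lines are joined into one string.
    """
    records = {}
    name, chunks = None, []

    def flush():
        if name is None:
            return
        if numeric:
            records[name] = [int(tok) for chunk in chunks for tok in chunk.split()]
        else:
            records[name] = "".join(chunks)

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            name = line[1:].split()[0]  # keep only the identifier
            chunks = []
        else:
            chunks.append(line)
    flush()
    return records


# Hypothetical file contents.
fna_text = ">read1 length=4\nACGT\n>read2\nGG\nTT\n"
qual_text = ">read1 length=4\n40 38 20 12\n>read2\n30 30\n25 25\n"

seqs = parse_fasta_like(fna_text)
quals = parse_fasta_like(qual_text, numeric=True)
for read_name, seq in seqs.items():
    # Each base should have exactly one quality score.
    print(read_name, seq, quals[read_name], len(seq) == len(quals[read_name]))
```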


Research Report Components

• Tables
– Base Call Metrics
– Quality Filter Metrics
– Run Time Metric Tables
– Quality Score
    • per base (Q40+)
    • per genome (%Bases Q40+; the proportion of an assembled genome with base call accuracy of at least 99.99%)
– Quality Measure Distributions (By region)
– Read Length Measure Distributions
– Overall Sequence Statistics Tables
– Blast tables
– COGs Table

• Figures
– Assembly Figures
– Alignment Diagrams
– Gene Functional Categories Diagrams
– Genome View Diagrams
– Nucleotide Alignment Diagrams
– Predicted Protein Alignment Diagrams
– Gene Ontology Functional Class Diagrams/Charts
– Pathway Views
– COGs Figures

• Methods Text
– Manuscripts
– Proposals

• Letter of Support


Application Areas

• Ancient DNA
• ChIP-seq/Methylation/Epigenetics
• Eukaryotic Whole Genome Sequencing
• Expression tags
• Genetic variation detection
• HIV sequencing
• Metagenomics and Microbial Diversity
• Mitochondria/viruses/plastids/plasmids
• Prokaryotic Whole Genome Sequencing
• Sequence Capture/Target Region Resequencing
• Small RNAs
• Somatic variation detection
• Transcriptome Sequencing

Roche 454/GS-FLX Web Site


de novo Analysis Flowchart

[Flowchart; elements recovered from the slide:]
• 454 GS FLX: image files, sequences
• GS FLX System: image processing, signal processing, sequence processing
• Parameter files: dataRunParams.parse, analysisParams.parse
• Output files: .sff files, 454RuntimeMetrics.csv, 454QualityFilterMetrics.csv, 454BaseCallerMetrics.csv
• Data/reads exported to data rig
• Assembler (GS or Lasergene)
• Analysis & Annotation


Final Service Product

• Pre-analysis output files
– dataRunParams.parse
– 454BaseCallerMetrics.csv
– 454QualityFilterMetrics.csv
– 454RuntimeMetricsAll.csv

• Post-analysis output files
– .sff files (for each region)
– Research report (.ppt)
– Additional text editing