10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics

Jan 04, 2016

Sheena Thomas
Transcript
Page 1: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

10 Billion Piece Jigsaw Puzzles

John Cleary

Real Time Genomics

Page 2: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 3: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 4: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 5: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Genome

Exome

Transcriptome

Metagenome

Page 6: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Differences between …

• Individuals in populations

• Child and parents

• Cancer and host genome

• Large pedigrees of animals

• Bacterial populations inside individuals

• Bacterial populations in the world

Page 7: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Real world problems …

• What is wrong with this newborn child?

• Why are these cells cancerous and what should we do about it?

• We have 6,000 individuals in 1,500 families with cleft-palate – what causes this?

Page 8: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Real world problems …

• There is a hard-to-treat infectious disease in a hospital ward – where did it come from and is it the same as the one at another hospital?

• Is this water safe to drink?

• …

Page 9: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Human Genome

3 billion

nucleotides

Exome

30 million

nucleotides

Page 10: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Shapes of the Jigsaw Pieces

Page 11: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Differences between human genomes - SNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C G T G A

A C G T T G G T G A

~ 1 / 1,000

3,000,000 nt
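
The four rows on this slide differ in a single column. As a minimal sketch (the function name `snp_columns` is hypothetical, and the sequences are assumed to be already aligned with no indels), the differing column can be found by comparing the rows position by position. The last line reproduces the slide's arithmetic: one SNP per 1,000 nt across 3 billion nt.

```python
def snp_columns(sequences):
    """Return (position, bases) for every column where aligned sequences disagree."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be the same length (no indels)")
    return [(i, "".join(s[i] for s in sequences))
            for i in range(length)
            if len({s[i] for s in sequences}) > 1]

# The four rows shown on the slide, with spaces removed.
genomes = ["ACGTTAGTGA", "ACGTTAGTGA", "ACGTTCGTGA", "ACGTTGGTGA"]
snps = snp_columns(genomes)

# About 1 SNP per 1,000 nt over a 3 billion nt genome.
expected_snps = 3_000_000_000 // 1_000
```

Positions are 0-based, so the SNP sits at index 5 with the bases A, A, C, G.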

Page 12: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Differences between human genomes - MNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C A G A

A C G T T G T G A

Page 13: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Differences between human genomes - indels

A C G T T A G T G A

A C G T T A G T G A

A C G T T G T G A

A C G T T G G T G A

~ 1 / 10,000

300,000

Page 14: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Differences between human genomes - inserts

A C G T T A G T G A

A C G T T A G T G A

Up to 1,000,000 nt

total 3,000,000 nt

T T A G G A C C C A

Page 15: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Page 16: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Solving the Jigsaw

• Indexing

• Alignment

• SNP/MNP/Indel calling

Mapping

Page 17: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Indexing

A C G T T A G T G A A G

A C G T T C G T G A A G

A C G T   T C G T   G A A G

A C G T   T A G T   G A A G

4.5 billion
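
The slide shows the reference and a read being cut into 4-letter words. One common way to use such words, sketched here under assumptions, is a hash index from each word to its positions in the reference. The read's non-overlapping words then vote for a start position. The function names `build_index` and `candidate_starts` are hypothetical, and this is not claimed to be the index described in the talk.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map each k-letter word of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_starts(read, index, k):
    """Use non-overlapping words of the read to vote for reference start positions."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)

# Sequences from the slide: the read differs from the reference at one base.
reference = "ACGTTAGTGAAG"
read = "ACGTTCGTGAAG"
hits = candidate_starts(read, build_index(reference, 4), 4)
```

The middle word of the read (TCGT) contains the differing base and finds nothing, but the two outer words still agree on start position 0.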

Page 18: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Aligning

A C G T T A G T G A A G

A C G T T C G T G A A G

1.6 billion
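
Once indexing has proposed a position, the read is compared base by base with the reference at that position. A minimal sketch of that comparison, assuming a plain edit distance in which substitutions, insertions and deletions each cost 1. Production aligners typically use weighted scores, and `edit_distance` is a hypothetical name.

```python
def edit_distance(read, window):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    previous = list(range(len(window) + 1))
    for i, r in enumerate(read, start=1):
        current = [i]
        for j, w in enumerate(window, start=1):
            current.append(min(previous[j] + 1,              # base present only in the read
                               current[j - 1] + 1,           # base present only in the reference
                               previous[j - 1] + (r != w)))  # match or substitution
        previous = current
    return previous[-1]

# The pair from the slide: one substituted base.
snp_cost = edit_distance("ACGTTCGTGAAG", "ACGTTAGTGAAG")
# A read with one base deleted relative to the reference.
deletion_cost = edit_distance("ACGTTGTGAAG", "ACGTTAGTGAAG")
```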

Page 19: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Cutting Edge Run

• Human genome (3 billion nt)

• 1 billion reads of 100 nt, coverage of 30

• Indexing + Aligning in 27 minutes
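
The figures on this slide can be checked with a few lines of arithmetic. Coverage is total sequenced bases divided by genome length, so the slide's "30" is a rounded value. The throughput figure is simply derived from the stated 27 minutes.

```python
genome_nt = 3_000_000_000
reads = 1_000_000_000
read_length_nt = 100
minutes = 27

# Average number of reads covering each position of the genome.
coverage = reads * read_length_nt / genome_nt

# Reads indexed and aligned per second over the whole run.
reads_per_second = reads / (minutes * 60)
```

This gives a coverage of about 33 and roughly 617,000 reads per second.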

Page 20: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

i7 Quad Core

Page 21: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

2 sockets X 4 cores X 2 hyperthreads = 16

48 GB RAM

10 computers

1 TB disk/genome ≈ 500 GB + 200 GB + 200 GB + 0.3 GB

X thousands of genomes

Page 22: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Shapes of the Jigsaw Pieces

Page 23: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Paired End Reads

100 nt

100 nt

100 - 1,000 nt

Index

Align

Index

Align

Match
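
The "Match" step can be sketched as follows: each end of a pair is indexed and aligned separately, then only combinations whose spacing fits the expected layout are kept. The sketch assumes the slide's "100 - 1,000 nt" is the gap between the two 100 nt reads, and it ignores strand and orientation. The function name and the example positions are hypothetical.

```python
def consistent_pairs(left_hits, right_hits, read_length=100, min_gap=100, max_gap=1000):
    """Pair up candidate positions whose spacing fits the expected fragment layout."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            gap = right - (left + read_length)  # unsequenced bases between the two reads
            if min_gap <= gap <= max_gap:
                pairs.append((left, right))
    return pairs

# Invented candidate positions: each end maps to two places in the genome.
pairs = consistent_pairs([1_000, 50_000], [1_400, 90_000])
```

Only one combination survives, which is how the second read can resolve an ambiguous placement of the first.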

Page 24: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Solving the Jigsaw without the picture

• Indexing

• Alignment

Assembly

Page 25: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Assembly

T A G T G A A G A A T T

A C G T T C G T G A A G

A C G T   T C G T   G A A G

T A G T   G A A G   A A T T

A C G T T ? G T G A A G A A T T
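
The merge on this slide can be reproduced with a short sketch: find a 4-letter word the two reads share, use it to fix their relative offset, then overlay them and write "?" where they disagree. The function names are hypothetical and a real assembler handles many reads, errors and repeats. This only illustrates the two-read case shown.

```python
def shared_word_offset(a, b, k):
    """Find where b starts relative to a by looking for a k-letter word they share."""
    words_in_a = {a[i:i + k]: i for i in range(len(a) - k + 1)}
    for j in range(len(b) - k + 1):
        i = words_in_a.get(b[j:j + k])
        if i is not None and i - j >= 0:
            return i - j
    return None

def merge(a, b, offset):
    """Overlay b on a at the given offset, writing '?' where the reads disagree."""
    merged = list(a) + [None] * max(0, offset + len(b) - len(a))
    for j, base in enumerate(b):
        slot = offset + j
        if merged[slot] is None:
            merged[slot] = base
        elif merged[slot] != base:
            merged[slot] = "?"
    return "".join(merged)

# The two reads from the slide.
read_a = "ACGTTCGTGAAG"
read_b = "TAGTGAAGAATT"
offset = shared_word_offset(read_a, read_b, 4)
contig = merge(read_a, read_b, offset)
```

The result is ACGTT?GTGAAGAATT, the same 16 nt sequence as the last row of the slide.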

Page 26: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

SNP calling

15A 13C AC heterozygous SNP

15A 4C

5A 2C

1A 2C

Bayesian statistics (SNPs 1/1,000)

31A 42C Throw it out
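
A minimal sketch of how a Bayesian call might weigh such counts, for a site with a reference allele and one alternative allele. The 1/1,000 prior comes from the slide. The 1% error rate, the way the prior is split between heterozygous and homozygous variants, and the depth cut-off of 60 (twice a coverage of 30) are all assumptions made for illustration. Reading "Throw it out" as a depth filter is likewise only one plausible interpretation.

```python
import math

def call_site(ref_count, alt_count, error=0.01, snp_prior=1 / 1000, max_depth=60):
    """Return (call, log posteriors) for one site; call is None when the site is rejected."""
    if ref_count + alt_count > max_depth:
        return None, {}  # far deeper than expected: possibly a repeat or mapping artefact
    # Assumed priors: heterozygous variants at snp_prior, homozygous variants at half that.
    priors = {"hom_ref": 1 - 1.5 * snp_prior, "het": snp_prior, "hom_alt": 0.5 * snp_prior}
    # Probability that a single read shows the alternative allele under each genotype.
    p_alt = {"hom_ref": error, "het": 0.5, "hom_alt": 1 - error}
    scores = {}
    for genotype, p in p_alt.items():
        scores[genotype] = (math.log(priors[genotype])
                            + ref_count * math.log(1 - p)
                            + alt_count * math.log(p))
    # Normalise so the values are log posterior probabilities.
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    posteriors = {g: s - norm for g, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

balanced_call, _ = call_site(15, 13)   # first row of the slide
shallow_call, _ = call_site(5, 2)
deep_call, _ = call_site(31, 42)       # last row of the slide
```

With these assumed parameters the 15A 13C site is called heterozygous, while 5A 2C does not overcome the 1/1,000 prior.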

Page 27: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

(Read pileup repeated from Page 15: REF, SIM, CALL and READ rows.)

Page 28: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Lane

Multiple technologies and read lengths

SAM

Calibration

Mapping

SNP calling

VCF

SNPs, MNPs, indels

Filtering

Complex regions

Page 29: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

SNP calling - Diploid Bayesian

SAM

Genome statistics

Calibration

Error model Priors

Bayesian Model

A C G T A:C A:G A:T C:G C:T G:T

23.1 43.2 …

log posteriors

Counts filter Ambiguity filter

VCF

Simple isolated SNP

insert Adjacent SNPs, inserts

Complex region calling

SNPs, indels, MNPs
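
The ten hypotheses listed on this slide (four homozygous, six heterozygous) can be scored with a small diploid model. The following is a sketch under stated assumptions, not the model from the talk: a flat 1% error rate spread evenly over the three wrong bases, a 1/1,000 prior shared equally by the nine non-reference genotypes, a counts filter of at least 5 reads, and an ambiguity filter that requires the best genotype to lead by 3 log units. All names and thresholds are hypothetical.

```python
import math
from itertools import combinations

BASES = "ACGT"
# 4 homozygous + 6 heterozygous genotypes, in the order shown on the slide.
GENOTYPES = [(b, b) for b in BASES] + list(combinations(BASES, 2))

def log_posteriors(counts, ref_base, error=0.01, snp_prior=1 / 1000):
    """counts maps base -> number of reads showing it; returns genotype -> log posterior."""
    scores = {}
    for g in GENOTYPES:
        prior = 1 - snp_prior if g == (ref_base, ref_base) else snp_prior / (len(GENOTYPES) - 1)
        score = math.log(prior)
        for base in BASES:
            # Chance a read shows `base`: average over the two alleles, allowing for error.
            p = sum((1 - error) if allele == base else error / 3 for allele in g) / 2
            score += counts.get(base, 0) * math.log(p)
        scores[g] = score
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {g: s - norm for g, s in scores.items()}

def call(counts, ref_base, min_depth=5, min_margin=3.0):
    """Apply a counts filter and an ambiguity filter; return a genotype string or None."""
    if sum(counts.values()) < min_depth:
        return None  # counts filter: too few reads
    ranked = sorted(log_posteriors(counts, ref_base).items(), key=lambda kv: kv[1], reverse=True)
    (best, best_lp), (_, second_lp) = ranked[0], ranked[1]
    if best_lp - second_lp < min_margin:
        return None  # ambiguity filter: runner-up is too close
    return ":".join(best) if best[0] != best[1] else best[0]

het_call = call({"A": 15, "C": 13}, "A")
ref_call = call({"A": 20}, "A")
thin_call = call({"A": 1, "C": 2}, "A")
```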

Page 30: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Complex Region Calling

Genome

Aligned Reads

Modified Genome

Probabilistic realignment through all paths for each read against each modified genome
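
"Through all paths" suggests summing the probability of every possible alignment rather than keeping only the best one. A minimal sketch of such a sum using a forward-style dynamic programme follows. The single-state model, the mismatch and gap probabilities, and the name `read_likelihood` are assumptions for illustration. The example sequences are taken from the pileup on Page 15 with the gap characters removed.

```python
def read_likelihood(read, haplotype, mismatch=0.01, gap=0.001):
    """Sum the probability of every alignment path of `read` against `haplotype`.

    forward[i][j] is the total probability of all paths that consume the first i
    read bases and the first j haplotype bases.
    """
    rows, cols = len(read) + 1, len(haplotype) + 1
    forward = [[0.0] * cols for _ in range(rows)]
    forward[0][0] = 1.0
    step = 1 - 2 * gap  # probability of pairing a read base with a haplotype base
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                emit = 1 - mismatch if read[i - 1] == haplotype[j - 1] else mismatch / 3
                total += forward[i - 1][j - 1] * step * emit
            if i > 0:
                total += forward[i - 1][j] * gap * 0.25  # extra base in the read
            if j > 0:
                total += forward[i][j - 1] * gap         # base missing from the read
            forward[i][j] = total
    return forward[-1][-1]

read = "GCTGCAAGAATAATCTGCA"
reference = "GCTGCAATCTGCA"            # unmodified genome
modified = "GCTGCAAGAATAATCTGCA"       # genome with the candidate insert applied
support_reference = read_likelihood(read, reference)
support_modified = read_likelihood(read, modified)
```

Comparing the two likelihoods for every read gives the evidence for or against each modified genome.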

Page 31: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Comparing twins

3,000,000 SNPs

Do any of them differ between the twins?

15A 4C 3A 10C 3G
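
One generic way to ask whether two sets of base counts could come from the same underlying genotype is a likelihood-ratio (G) test on the two rows of counts. This is a standard statistical test offered as a sketch, not the method used in the talk. With counts this small the chi-square threshold is only approximate, and testing 3,000,000 sites would need a much stricter cut-off.

```python
import math

def g_statistic(counts_a, counts_b):
    """Likelihood-ratio (G) statistic for 'both samples share one base distribution'."""
    bases = sorted(set(counts_a) | set(counts_b))
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    grand_total = total_a + total_b
    g = 0.0
    for base in bases:
        pooled = counts_a.get(base, 0) + counts_b.get(base, 0)
        for observed, sample_total in ((counts_a.get(base, 0), total_a),
                                       (counts_b.get(base, 0), total_b)):
            if observed:
                expected = pooled * sample_total / grand_total
                g += 2 * observed * math.log(observed / expected)
    return g

# The counts from the slide.
twin_1 = {"A": 15, "C": 4}
twin_2 = {"A": 3, "C": 10, "G": 3}
g = g_statistic(twin_1, twin_2)

# 1% critical value of a chi-square distribution with 2 degrees of freedom.
threshold = -2 * math.log(0.01)
```

Here the statistic is a little over 15 against a threshold of about 9.2, so this site would be flagged for a closer look.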

Page 32: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 33: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

DNA

mRNA

protein

Gene

Page 34: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 35: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Cancer comparison

Page 36: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Copy Number Variants

• Varying levels of extraction of reads across genome (use differences)

• Locate boundaries (as accurately as possible)

• Extract number of variants

• Use in combination with calling SNPs
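
The first three steps can be sketched with read depth counted in fixed windows: compare the depth in the sample with a baseline, convert the ratio to a copy number, and merge neighbouring windows that agree. The sketch assumes that "use differences" refers to comparing against a baseline sample. The depth values below are invented for illustration and the function name is hypothetical.

```python
def copy_number_segments(sample_depth, baseline_depth, normal_copies=2):
    """Turn per-window read depths into (start_window, end_window, copies) segments."""
    segments = []
    for window, (sample, baseline) in enumerate(zip(sample_depth, baseline_depth)):
        if baseline == 0:
            continue  # no information in this window
        copies = round(normal_copies * sample / baseline)
        if segments and segments[-1][2] == copies and segments[-1][1] == window - 1:
            # Same copy number as the previous window: extend that segment.
            segments[-1] = (segments[-1][0], window, copies)
        else:
            segments.append((window, window, copies))
    return segments

# Invented read depths per window for a tumour sample and its matched normal.
tumour = [30, 31, 29, 46, 44, 45, 30, 15, 14]
normal = [30, 30, 30, 30, 30, 30, 30, 30, 30]
segments = copy_number_segments(tumour, normal)
```

Segment edges give the boundaries and the third field gives the estimated number of copies. Window size controls how accurately the boundaries can be located.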

Page 37: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Large pedigrees

Page 38: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 39: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Chlorocebus pygerythrus

Page 40: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 41: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 42: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 43: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 44: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Metagenomics or what is living on you

• Mapping reads back onto a database of known bacteria/viruses

• Many are ambiguous

• Many don’t map at all

• Estimate frequency of each species

• Remove human “contamination”

Page 45: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

TS1
0.389 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183 gi|187734516|ref|NC_010655.1| Akkermansia muciniphila ATCC BAA-835
0.145 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.037 gi|119025018|ref|NC_008618.1| Bifidobacterium adolescentis ATCC 15703

TS4
0.428 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.149 gi|60650141|ref|NC_006873.1| Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036 gi|238922432|ref|NC_012781.1| Eubacterium rectale ATCC 33656

TS25
0.752 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.041 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020 gi|58036264|ref|NC_004307.2| Bifidobacterium longum NCC2705
0.018 gi|189438863|ref|NC_010816.1| Bifidobacterium longum DJO10A

Page 46: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

Metagenomics

• Map reads to database

• Estimate most likely frequencies: a hill climbing estimation problem

• Can anything be done about unmapped reads?
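
One standard way to climb towards the most likely frequencies is expectation-maximisation: share each ambiguous read among its candidate species in proportion to the current estimates, re-estimate, and repeat. The sketch below is a generic illustration, not the estimator from the talk. It ignores genome length and unmapped reads, the read sets are invented, and `estimate_frequencies` is a hypothetical name.

```python
def estimate_frequencies(read_hits, iterations=50):
    """read_hits: one non-empty set of candidate species per mapped read.

    Returns species -> estimated fraction of reads.
    """
    species = sorted(set().union(*read_hits))
    freq = {s: 1 / len(species) for s in species}
    for _ in range(iterations):
        totals = dict.fromkeys(species, 0.0)
        for hits in read_hits:
            weight = sum(freq[s] for s in hits)
            for s in hits:
                totals[s] += freq[s] / weight  # share the read by current estimates
        freq = {s: totals[s] / len(read_hits) for s in species}
    return freq

theta = "Bacteroides thetaiotaomicron"
vulgatus = "Bacteroides vulgatus"
# Invented example: 6 reads unique to one species, 2 unique to the other, 4 ambiguous.
read_hits = [{theta}] * 6 + [{vulgatus}] * 2 + [{theta, vulgatus}] * 4
frequencies = estimate_frequencies(read_hits)
```

Each iteration never decreases the likelihood, which is what makes this a hill climbing procedure. In this example the estimates settle at 0.75 and 0.25.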

Page 47: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

How do we get there?

• Software engineering (500,000 lines of code)

• Algorithms

• Bayesian statistics

• Testing: calibration/simulation/analysis

Page 48: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Page 49: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

How do we get there?

• Performance optimization: algorithms, disk I/O and compression, parallel execution, optimization for memory size, optimization for cache size, targeted code optimization

Page 50: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.