Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data
Zrinka Puljiz and Haris Vikalo
Electrical and Computer Engineering Department, The University of Texas at Austin
8th International Symposium on Turbo Codes & Iterative Information Processing
Bremen, Germany, August 18-22, 2014
Overview of the Talk
Motivation and background
DNA sequencing and studies of genetic variations
Haplotype assembly
data structure and problem formulation
graphical representation of the problem
existing methods
Communication systems analogy and belief propagation
haplotype assembly as a decoding problem
belief propagation algorithm
performance analysis, comparison with existing methods
Conclusions and future work
DNA Sequencing: Discovering Genetic Blueprint
Determine the order of nucleotides in a DNA sequence
Human Genome Project: mapping the genetic blueprint
followed by sequencing more individuals, studies of genetic variations
Study of Genetic Variations in Humans
Humans are diploid organisms with 23 pairs of chromosomes
chromosomes in a pair of autosomes are homologous
the most common type of variation is the single nucleotide polymorphism (SNP)
Study of Genetic Variations in Humans Cont’d
Describing variations
SNP calling determines locations and type of polymorphisms
based on the detected SNPs, perform genotype calling
example: A/T, A/C, G/T
Genotypes provide only the list of unordered pairs of alleles
no association of alleles with one of the chromosomes in a pair
The complete information is provided by haplotypes
the list of alleles at contiguous sites in a region of a chromosome
example: (A,C,G) and (T,A,T)
fundamental for many applications (personalized medicine!)
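The difference between genotypes and haplotypes can be sketched in a few lines of Python. This is only an illustration built on the slide's example alleles; the function names `genotypes` and `consistent_haplotype_pairs` are hypothetical, and every site is assumed to be heterozygous.

```python
from itertools import product

def genotypes(hap1, hap2):
    """Return the unordered allele pair at each site (phase is lost)."""
    return [frozenset((a, b)) for a, b in zip(hap1, hap2)]

def consistent_haplotype_pairs(genos):
    """Enumerate all unordered haplotype pairs compatible with the genotypes.

    Assumes every site is heterozygous (exactly two distinct alleles).
    """
    sites = [sorted(g) for g in genos]
    pairs = set()
    for choice in product((0, 1), repeat=len(sites)):
        h1 = tuple(s[c] for s, c in zip(sites, choice))
        h2 = tuple(s[1 - c] for s, c in zip(sites, choice))
        pairs.add(frozenset((h1, h2)))
    return pairs

g = genotypes(("A", "C", "G"), ("T", "A", "T"))   # A/T, A/C, G/T
candidates = consistent_haplotype_pairs(g)
print(len(candidates))   # number of haplotype pairs the genotypes cannot tell apart
```

With three heterozygous sites the genotypes are compatible with several distinct haplotype pairs, which is why genotype calling alone does not recover the haplotypes.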
Single Individual Haplotyping
Determine a haplotype of an individual using DNA sequencing
The SNP rate is low, typically estimated to be $10^{-3}$
high-throughput DNA sequencing provides reads that are too short
get pairs of fragments at opposite ends of a strand of known length
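A rough calculation shows why short reads are a problem: a read helps link alleles only if it covers at least two SNP sites. The sketch below assumes SNPs occur independently at each base with probability $10^{-3}$ and uses a hypothetical read length of 100 bases; neither the independence model nor the read length comes from the slides.

```python
def prob_at_least_two_snps(read_length, snp_rate=1e-3):
    """P(read covers >= 2 SNP sites) under an independent-site binomial model."""
    p0 = (1 - snp_rate) ** read_length                                 # no SNP covered
    p1 = read_length * snp_rate * (1 - snp_rate) ** (read_length - 1)  # exactly one
    return 1 - p0 - p1

print(prob_at_least_two_snps(100))    # short read (assumed length)
print(prob_at_least_two_snps(1000))   # longer span, e.g. covered by a fragment pair
```

Under these assumptions only a small fraction of short reads span two or more SNPs, which motivates sequencing pairs of fragments separated by a known distance.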
A Fragment Conflict Graph Interpretation
Represent reads by nodes, conflicts by edges
fragments are in conflict if they cover a common SNP location but
have different nucleotides there (so, different chromosomes)
If data is error-free, conflict graph is bipartite
otherwise, the graph may contain odd cycles (and is then not bipartite)
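The conflict-graph view can be sketched as follows. Reads are represented as dictionaries from SNP index to allele, and bipartiteness is checked by breadth-first 2-coloring. The read sets `clean` and `noisy` are hypothetical toy data, and the function names are illustrative only.

```python
from collections import deque

def conflict_graph(reads):
    """reads: list of dicts mapping SNP index -> allele (0 or 1)."""
    adj = {i: set() for i in range(len(reads))}
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            shared = reads[i].keys() & reads[j].keys()
            # An edge means the two reads disagree at some common SNP site.
            if any(reads[i][k] != reads[j][k] for k in shared):
                adj[i].add(j)
                adj[j].add(i)
    return adj

def two_color(adj):
    """Return a 2-coloring {node: 0/1} if the graph is bipartite, else None."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None          # odd cycle found
    return color

# Toy reads drawn from haplotype [0, 1, 0] and its complement (assumed data).
clean = [{0: 0, 1: 1}, {1: 1, 2: 0}, {0: 1, 1: 0}, {1: 0, 2: 1}]
noisy = [{0: 0, 1: 1}, {1: 0, 2: 0}, {0: 1, 1: 0}, {1: 0, 2: 1}]  # read 1 has one flipped base

print(two_color(conflict_graph(clean)))   # a valid partition of reads into two chromosomes
print(two_color(conflict_graph(noisy)))   # None: the flipped base creates an odd cycle
```

When a coloring exists, the two color classes give the assignment of reads to the two chromosomes.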
Various Formulations of the Haplotype Assembly Problem
If the conflict graph is not bipartite, assembly is non-trivial
Approach: minimize the number of transformation steps needed to alter the graph so that it becomes bipartite
minimum edge removal (MER), minimum fragment removal (MFR),
minimum SNP removal
Minimum error correction (MEC): find the smallest number of nucleotides in reads whose flipping to a different value resolves conflicts among the fragments from the same chromosome
essentially, remove cycles in the conflict graph by assuming the
fewest possible sequencing errors
NP-hard, various methods: HapCUT [Bansal & Bafna, 2008],
HapCompass [Aguiar & Istrail, 2013], HapTree [Berger et al., 2014]
Minimum Error Correction Formulation
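Label bases in heterozygous sites as $h^1_i, h^2_i \in \{1, 0\}$
define $h = h^1 = \bar{h}^2 = [h^1_1 \; h^1_2 \; \dots \; h^1_n]$
Each read is a ternary string with entries 0, 1 and $\times$
organize reads into a matrix $R$, row $r_i$ is the $i$-th read
$$
R = \begin{bmatrix}
\times & \times & 0 & \times & \times & 1 \\
\times & 1 & \times & \times & 0 & \times \\
\times & \times & 0 & \times & 0 & \times \\
0 & \times & \times & 1 & \times & \times \\
1 & \times & 1 & \times & \times & \times \\
\times & \times & 1 & \times & 0 & \times \\
\times & 0 & \times & 0 & \times & \times \\
\times & \times & \times & 0 & \times & 0
\end{bmatrix}
$$
The MEC formulation is concerned with minimizing $Z$ over $h$,
$$
Z = \sum_{i=1}^{m} \min\bigl(\mathrm{hd}(r_i, h),\, \mathrm{hd}(r_i, \bar{h})\bigr), \qquad \mathrm{hd}(r_i, h) = \sum_{j=1}^{n} d(r_{i,j}, h_j)
$$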
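The MEC objective can be evaluated directly, as in the sketch below. It assumes the usual convention that $d(r_{i,j}, h_j) = 1$ when $r_{i,j}$ is binary and differs from $h_j$, and 0 otherwise (uncovered sites never count); the slides do not spell this out. The small matrix is hypothetical toy data, and the exhaustive search is shown only to make the objective concrete, since it scales as $2^n$.

```python
from itertools import product

X = None  # a site not covered by the read (the "×" entry)

def hd(read, hap):
    """Generalized Hamming distance: mismatches counted on covered sites only."""
    return sum(1 for r, h in zip(read, hap) if r is not X and r != h)

def mec_score(R, hap):
    """Z = sum over reads of the distance to the closer of hap and its complement."""
    comp = [1 - b for b in hap]
    return sum(min(hd(r, hap), hd(r, comp)) for r in R)

def brute_force_mec(R):
    """Exhaustive minimization of Z; feasible only for very small n."""
    n = len(R[0])
    # Fix h[0] = 0, since h and its complement always give the same score.
    return min((mec_score(R, (0,) + rest), (0,) + rest)
               for rest in product((0, 1), repeat=n - 1))

# Hypothetical reads; the last row cannot be reconciled with the others.
R = [
    [0, 1, X, X],
    [X, 1, 0, X],
    [1, 0, X, X],
    [X, 0, 1, 0],
    [X, X, 0, 0],
]

print(mec_score(R, [0, 1, 0, 1]))   # cost of one candidate haplotype
best_score, best_hap = brute_force_mec(R)
print(best_score, best_hap)
```

Several haplotypes may attain the same minimum, so the minimizer returned by the search is not necessarily unique.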
Structure of the Data Matrix
Consider the error-free SNP fragment matrix
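$$
R = \begin{bmatrix}
\times & \times & 0 & \times & \times & 1 \\
\times & 1 & \times & \times & 0 & \times \\
\times & \times & 0 & \times & 0 & \times \\
0 & \times & \times & 1 & \times & \times \\
1 & \times & 1 & \times & \times & \times \\
\times & \times & 1 & \times & 0 & \times \\
\times & 0 & \times & 0 & \times & \times \\
\times & \times & \times & 0 & \times & 0
\end{bmatrix}
$$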
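Let $h = [0\;1\;0\;1\;0\;1]$, and the "origin" of the reads in $R$ be $s = [0\;0\;0\;0\;1\;1\;1\;1]$. Then for a binary $R_{i,j}$ it holds
$$
\begin{array}{cc|c}
s_i & h_j & R_{i,j} \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{array}
$$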
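The table above is the truth table of exclusive OR, so in the error-free case each observed entry is $R_{i,j} = s_i \oplus h_j$. A minimal sketch of this generative view follows; `h` and `s` are taken from the slide, while the coverage sets in `covered` are hypothetical and chosen only for illustration.

```python
def observe(s, h, covered):
    """Error-free fragment matrix: R[i][j] = s[i] XOR h[j] on covered sites, None elsewhere."""
    return [[(s_i ^ h[j]) if j in cols else None for j in range(len(h))]
            for s_i, cols in zip(s, covered)]

h = [0, 1, 0, 1, 0, 1]
s = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical coverage: which SNP sites (0-based) each read spans.
covered = [{0, 1}, {1, 2}, {2, 3}, {4, 5}, {0, 1}, {1, 2}, {3, 4}, {4, 5}]

R = observe(s, h, covered)
for row in R:
    print(" ".join("x" if v is None else str(v) for v in row))

# The relation can be inverted: any covered entry reveals the read's origin.
recovered = [next(v ^ h[j] for j, v in enumerate(row) if v is not None) for row in R]
print(recovered == s)
```

Because XOR is its own inverse, knowing $h$ determines $s$ from any covered entry of a read and vice versa, which is what makes the coding-theoretic interpretation natural.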
Haplotype Assembly as a Decoding Problem
Collect indices $\{(i_k, j_k)\}$ identifying positions where the $m \times n$ matrix $R$ has binary entries ($1 \le k \le M$)