Solving Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation. Alessandra Godi. Martine Labbé. Université Libre de Bruxelles. IASI (CNR) Roma. Airo Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007.
Solving Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation
Alessandra Godi, IASI (CNR), Roma
Martine Labbé, Université Libre de Bruxelles
Airo Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007
The alphabet of life…
Base pairs (A-T, G-C) are complementary
DNA structure = double helix (Watson-Crick)
Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)
Humans have 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes.
Each chromosome includes hundreds of different genes.
In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes.
A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype.
For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect.
Genotype data is easy to collect.
All humans are 99.99% identical.
Diversity? Polymorphism.
A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
SNPs
SNP (Single Nucleotide Polymorphism)
[Figure: aligned DNA fragments of several individuals; the sequences are identical except at the highlighted SNP sites]
SNP (Single Nucleotide Polymorphism)
              SNP 1          SNP 2          SNP 3        SNP 4
Genotype:     A/T            T/G            A            C
Haplotype 1:  A              G              A            C
Haplotype 2:  T              T              A            C
              Heterozygous   Heterozygous   Homozygous   Homozygous
SNP: encoding
[Figure: the SNP columns of the aligned sequences, each encoded as a binary string]
SNP 1    SNP 2    SNP 3    SNP 4
011000   100011   110000   000111

Genotype:          0/1  1/0  1    0
Haplotype 1:       0    1    1    0
Haplotype 2:       1    0    1    0
Encoded genotype:  2    2    1    0    (2 = heterozygous site)
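The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `genotype_of` is hypothetical, and haplotypes are assumed to be given as lists of 0/1 values of equal length.

```python
def genotype_of(h1, h2):
    """Combine two binary haplotypes into a genotype over {0, 1, 2}.

    A site where both haplotypes agree keeps their common value
    (homozygous); a site where they differ is encoded as 2 (heterozygous).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes of the slide give back the encoded genotype 2 2 1 0.
print(genotype_of([0, 1, 1, 0], [1, 0, 1, 0]))  # [2, 2, 1, 0]
```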
Haplotyping of a population
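Given a set of genotypes $G$ (strings in $\{0,1,2\}^n$), find a set of “generating” haplotypes $H$ (strings in $\{0,1\}^n$) such that every genotype in $G$ is explained by a pair of haplotypes in $H$. Under the parsimony criterion, $H$ should contain as few haplotypes as possible.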
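To make the problem statement concrete, here is a brute-force sketch for tiny instances. It is only an illustration under stated assumptions: the helper names (`combine`, `compatible`, `min_haplotype_set`) are hypothetical, genotypes are tuples over {0, 1, 2}, and the enumeration is exponential, so it is not the method proposed in the talk.

```python
from itertools import combinations, product


def combine(h1, h2):
    """Genotype explained by a pair of haplotypes (2 = heterozygous site)."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def compatible(h, g):
    """A haplotype is compatible with g if it agrees with g on every fixed site."""
    return all(gp == 2 or gp == hp for gp, hp in zip(g, h))


def min_haplotype_set(genotypes):
    """Smallest set of haplotypes explaining every genotype (exhaustive search)."""
    n = len(genotypes[0])
    candidates = [h for h in product((0, 1), repeat=n)
                  if any(compatible(h, g) for g in genotypes)]
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            explained = {combine(a, b) for a in subset for b in subset}
            if all(tuple(g) in explained for g in genotypes):
                return set(subset)
    return None


# Three genotypes on two SNPs: 22, 20 and 02.
solution = min_haplotype_set([(2, 2), (2, 0), (0, 2)])
print(len(solution))  # 3
```

In this toy instance the genotypes 20 and 02 force the haplotypes 00, 10 and 01, and the pair (01, 10) then explains 22 at no extra cost.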
The GENOME is the set of genetic information which lies in the DNA sequence of each living organism.
The DNA sequence is a linear arrangement of four different molecules, called nucleotides or bases: A, T, C, G.
The bases are paired with each other by hydrogen bonds.
Differences in the DNA account for the differences between individuals of the same species.
What makes us different from each other is called polymorphism.
At the DNA level, a polymorphism is a nucleotide sequence which varies within a population of chromosomes.
HAPLOTYPING of a population: our approach to the problem using ILP
A new exponential formulation
1. A pure set covering model obtained by applying the Fourier-Motzkin procedure to Gusfield's (2002) model
2. A branch-and-cut procedure to decrease the number of constraints
A new polynomial formulation
A formulation using class representatives
A new polynomial formulation
$G = \{g_1, g_2, \ldots, g_m\}$: genotypes of length $n$
$I = \{h_1, \ldots, h_q\}$: a solution of the problem
Main idea: class representatives
Each haplotype induces an ordered subset of genotypes, and each genotype belongs to exactly two of these subsets:
$h_1 \rightarrow \{g_i, g_j, g_k, \ldots\} = S_i$
$h_2 \rightarrow \{g_i, g_l, g_r, g_s, \ldots\} = S_{i'}$
$h_3 \rightarrow \{g_k, g_l, g_s, g_t, \ldots\} = S_k$
….
The genotype with the smallest index identifies the subset; the prime appears if the corresponding index has already been used.
$K = \{1, 2, \ldots, m\}$
$K' = \{1', 2', \ldots, m'\}$
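The labelling rule can be sketched as follows. This is a hypothetical helper, not code from the talk: it assumes a solution is given as a dictionary mapping each genotype index to the pair of haplotypes (written as strings) that explains it, and it assigns the unprimed label to whichever haplotype is encountered first.

```python
def class_labels(solution):
    """Return (induced subsets, class labels) for the haplotypes of a solution.

    solution: dict {genotype index k: (haplotype a, haplotype b)}.
    Each haplotype is labelled by the smallest genotype index it explains;
    a prime is appended when that index already labels another haplotype.
    """
    induced = {}
    for k in sorted(solution):
        for h in solution[k]:
            induced.setdefault(h, set()).add(k)

    labels, used = {}, set()
    for h, genos in induced.items():
        rep = min(genos)
        labels[h] = str(rep) if rep not in used else f"{rep}'"
        used.add(rep)
    return induced, labels


# Illustrative instance: g1 = 021, g2 = 002, g3 = 012.
solution = {1: ("001", "011"), 2: ("001", "000"), 3: ("011", "010")}
induced, labels = class_labels(solution)
print(labels)
```

Since every genotype belongs to at most two subsets, at most two haplotypes can share the same smallest index, so one prime is always enough.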
A new polynomial formulation
VARIABLES
$y^k_{\{i,j\}} = 1$ if genotype $g_k$ belongs to two subsets of genotypes, one whose smallest-index genotype is $i$ and the other whose smallest-index genotype is $j$; $0$ otherwise
$k \in K$; $i, j \in K \cup K'$
A new polynomial formulation
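Ex: $g_1 = 021$, $g_2 = 002$, $g_3 = 012$
$h_1 = 001 \rightarrow \{g_1, g_2\} = S_1$
$h_2 = 011 \rightarrow \{g_1, g_3\} = S_{1'}$
$y^1_{\{1,1'\}} = 1$
Let us note that some y variables do not exist, e.g. $y^2_{\{1',2'\}} = 0$.
If $y^2_{\{1',2'\}} = 1$, then $S_1 = \{g_1, \ldots\}$, $S_{1'} = \{g_1, g_2, \ldots\}$, $S_2 = \{g_2, \ldots\}$, $S_{2'} = \{g_2, \ldots\}$: genotype $g_2$ would belong to three subsets ($S_{1'}$, $S_2$ and $S_{2'}$) instead of exactly two. Absurd!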
A new polynomial formulation
VARIABLES
$x_i = 1$ if the solution contains a subset of genotypes whose smallest-index genotype is $i$; $0$ otherwise ($i \in K \cup K'$)
$z_{i,p} \in \{0,1\}$: value of the p-th coordinate of the haplotype explaining the subset of genotypes used in the solution whose smallest-index genotype is $i$ ($i \in K \cup K'$, $p \in SNP$)
OBJECTIVE FUNCTION:
$\min \sum_{i \in K \cup K'} x_i$
A new polynomial formulation
CONSTRAINTS:
1. $x_i \ge x_{i'}$, $\forall i \in K$, $i' \in K'$
2. $\sum_{i,j \in K \cup K',\; i \le k,\; j \le k} y^k_{\{i,j\}} \ge 1$, $\forall k \in K$
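A new polynomial formulation
CONSTRAINTS:
3. $\sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\; j < i} y^k_{\{i,j\}} \le x_i$, $\forall k \in K$, $\forall i \in K \cup K'$
3a. $y^k_{\{k,k'\}} \le x_{k'}$, $\forall k \in K$ (case $i = k'$)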
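A new polynomial formulation
CONSTRAINTS:
4a. $z_{i,p} = 0$, $\forall i \in K \cup K'$, $\forall p \in SNP$ s.t. $g_i(p) = 0$
4b. $z_{i,p} = 1$, $\forall i \in K \cup K'$, $\forall p \in SNP$ s.t. $g_i(p) = 1$
4c. $z_{i,p} + z_{j,p} = 1$, $\forall \{i,j\} \subseteq K \cup K'$, $\forall p \in SNP$ s.t. $g_i(p) = 2$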
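A new polynomial formulation
CONSTRAINTS:
5. $z_{i,p} \le 1 - \sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} - \sum_{j \in K \cup K',\; j < i} y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i \in K \cup K'$, $\forall p \in SNP : g_k(p) = 0$
5a. $y^k_{\{k,k'\}} + z_{k',p} \le 1$, $\forall k \in K$ (case $i = k'$), $\forall p \in SNP : g_k(p) = 0$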
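A new polynomial formulation
CONSTRAINTS:
6. $z_{i,p} \ge \sum_{j \in K \cup K',\; j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\; j < i} y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i \in K \cup K'$, $\forall p \in SNP : g_k(p) = 1$
6a. $z_{k',p} \ge y^k_{\{k,k'\}}$, $\forall k \in K$ (case $i = k'$), $\forall p \in SNP : g_k(p) = 1$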
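A new polynomial formulation
CONSTRAINTS:
7. $z_{i,p} + z_{j,p} \ge y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i,j \in K \cup K'$, $\forall p \in SNP : g_k(p) = 2$
7a. $z_{i,p} + z_{j,p} \le 2 - y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i,j \in K \cup K'$, $\forall p \in SNP : g_k(p) = 2$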
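The role of constraints 5 to 7a can be checked numerically on a single genotype. The sketch below is a simplification, not the full model: it assumes that genotype $g_k$ is assigned to exactly one pair of classes $\{i,j\}$, so the sums over $j$ in constraints 5 and 6 reduce to the single value `y`. The function name is hypothetical.

```python
def linking_constraints_hold(g, z_i, z_j, y=1):
    """Check constraints 5, 6, 7 and 7a for one genotype and one pair of classes.

    g   : genotype over {0, 1, 2}
    z_i : coordinates of the haplotype of class i
    z_j : coordinates of the haplotype of class j
    y   : value of the assignment variable of g to the pair {i, j}
    """
    for p, gp in enumerate(g):
        if gp == 0 and not (z_i[p] <= 1 - y and z_j[p] <= 1 - y):
            return False  # constraint 5: both haplotypes must carry 0
        if gp == 1 and not (z_i[p] >= y and z_j[p] >= y):
            return False  # constraint 6: both haplotypes must carry 1
        if gp == 2 and not (y <= z_i[p] + z_j[p] <= 2 - y):
            return False  # constraints 7 and 7a: exactly one haplotype carries 1
    return True


print(linking_constraints_hold([0, 2, 1], [0, 0, 1], [0, 1, 1]))  # True
print(linking_constraints_hold([0, 2, 1], [0, 0, 1], [0, 0, 1]))  # False
```

When `y = 0` every constraint is slack, which is why the z variables of unrelated classes are left free.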
Preliminary results

10x10               Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                15    12      0,01        54        0,12         263        14
Brown model ['05]   15    2       0,05        140       4,85         16,646     1360

15x15               Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                27    22,83   0,01        173       0,08         173        11
Brown model ['05]   27    8       0,02        129       4.25         19.301     2.213

Preliminary results

20x20               Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                16    15      0,2         268       16           573        9
Brown model ['05]   16    3       0,07        598       27.604       16*10^6    540.623
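From Gusfield’s formulation (2002)…
Let $G$ be the genotype set and $\hat{H}$ the set of haplotypes which are compatible with some genotype in $G$.
For each $g \in G$:
$P_g = \{(h_1,h_2) \text{ with } h_1,h_2 \in \hat{H} \mid h_1 \oplus h_2 = g\}$
INTEGER VARIABLES
$x_h = 1$ if $h$ is chosen, $0$ otherwise
$y_{h_1,h_2} = 1$ if $(h_1,h_2)$ is selected, $0$ otherwise
OBJECTIVE FUNCTION
$\min \sum_{h \in \hat{H}} x_h$
CONSTRAINTS
1. $\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1$, $\forall g \in G$
2. $y_{h_1,h_2} \le x_{h_1}$, $\forall (h_1,h_2) \in P_g$, $g \in G$
3. $y_{h_1,h_2} \le x_{h_2}$, $\forall (h_1,h_2) \in P_g$, $g \in G$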
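From Gusfield’s formulation (2002)…
…to a new set covering formulation by using the Fourier-Motzkin procedure
Set-Covering
$\min \sum_{h \in \hat{H}} x_h$
s.t. $\sum_{(h_1,h_2) \in P_g:\; h = h_1 \vee h = h_2} x_h \ge 1$, $\forall g \in G$
$x \in \{0,1\}^n$
For each genotype $g$ there is one inequality for every way of picking a single haplotype $h$ from each pair $(h_1,h_2) \in P_g$.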
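The structure of the set covering inequalities of one genotype can be enumerated explicitly on small examples. The sketch below is illustrative only and uses hypothetical helper names; genotypes are tuples over {0, 1, 2} and haplotypes are tuples over {0, 1}. A genotype with $k \ge 1$ heterozygous sites has $2^{k-1}$ explaining pairs, hence $2^{2^{k-1}}$ inequalities, which is why the formulation is exponential.

```python
from itertools import product


def pairs_for(g):
    """All unordered haplotype pairs (h1, h2) that explain genotype g."""
    twos = [p for p, v in enumerate(g) if v == 2]
    base = [v if v != 2 else 0 for v in g]
    if not twos:
        return [(tuple(base), tuple(base))]
    pairs = []
    # Fix the first heterozygous site of h1 to 0 so each pair is listed once.
    for bits in product((0, 1), repeat=len(twos) - 1):
        h1, h2 = list(base), list(base)
        for p, b in zip(twos, (0,) + bits):
            h1[p], h2[p] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def covering_supports(g):
    """Supports S of the inequalities sum_{h in S} x_h >= 1 associated with g."""
    return {frozenset(choice) for choice in product(*pairs_for(g))}


supports = covering_supports((2, 0, 2))
print(len(pairs_for((2, 0, 2))), len(supports))  # 2 4
```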
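Genotype structure + basic SC theory: facets and valid inequalities
Set-covering for HIP
[Figure: a genotype g split into fixed sites F and free sites N\F]
N is the set of SNPs
$F = \{p \in N : g(p) \in \{0,1\}\}$
Proposition
1. The polytope HSC is full-dimensional iff $\forall g \in G$, $|N \setminus F| = 2$.
2. $x_j \ge 0$ is a facet of HSC iff, for every $g \in G$ for which there exists $h_i$ s.t. $h_j \oplus h_i = g$, we have $|N \setminus F| = 3$.
3. $x_j \le 1$ is a facet $\forall j$.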
Set-covering for HIP
[Figure: two genotypes g and g', each split into fixed and free sites]
N is the set of SNPs
$F = \{p \in N : g(p) \in \{0,1\}\}$
$F' = \{p \in N : g'(p) \in \{0,1\}\}$
$C = (N \setminus F') \cap F$
Theorem
Let us consider a genotype $g$ and a subset $S$ of haplotypes associated with a minimal set covering inequality:
$\sum_{h \in S} x_h \ge 1$
This inequality is facet-defining iff, for each genotype $g' \ne g$, one of the following conditions holds:
1. $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') \ne \emptyset$
2. $|C| = |(N \setminus F') \cap F| \ge 3$
Set-covering for HIP
NOTE: For the following cases:
1st case: $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') = \emptyset$
2nd case: $|C| = |\{p\}| = 1$
3rd case: $C = \emptyset$
the set covering inequality is dominated by another one that can be defined by using a sequential lifting procedure.
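The case analysis on $C$ can be sketched as a small classifier for a pair of genotypes. This is a hedged illustration: the function names are hypothetical, and the conditions follow one reading of the slide, in which the first facet condition requires $(N \setminus F) \cap (N \setminus F')$ to be non-empty and the 1st dominated case requires it to be empty.

```python
def fixed_sites(g):
    """F: the SNP positions where the genotype is homozygous (0 or 1)."""
    return {p for p, v in enumerate(g) if v in (0, 1)}


def classify(g, g_prime):
    """Which condition or case of the theorem the pair (g, g') falls into."""
    sites = set(range(len(g)))
    F, F_prime = fixed_sites(g), fixed_sites(g_prime)
    C = (sites - F_prime) & F                   # free in g', fixed in g
    shared_free = (sites - F) & (sites - F_prime)
    if len(C) >= 3:
        return "condition 2"
    if len(C) == 2:
        return "condition 1" if shared_free else "1st case"
    if len(C) == 1:
        return "2nd case"
    return "3rd case"


print(classify((0, 1, 2, 2), (2, 2, 2, 0)))  # condition 1
print(classify((0, 1, 2), (2, 2, 0)))        # 1st case
```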
Set-covering for HIP: main idea
To overcome the exponential structure of the formulation:
1. Add only set-covering inequalities which are facet-defining
2. Add them within a branch-and-cut procedure
Set-covering for HIP: a branch and cut procedure
x*: a fractional solution of a subproblem of the original one
g: (h1,h2), (h3,h4), (h5,h6), (h7,h8)
All set covering inequalities associated with g have the following structure:
x{1 or 2}+ x{3 or 4} + x{5 or 6}+ x{7 or 8} ≥ 1
Set-covering for HIP: a branch and cut procedure
We want to find a set covering inequality of g that is violated by x*, i.e. such that
min {x*1,x*2} + min {x*3,x*4} + min {x*5,x*6} + min {x*7,x*8} < 1
If it exists, we have found a set covering inequality which cuts off x*!
We choose to add it to the system only if it is facet-defining.
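The separation test above amounts to taking, in each pair, the haplotype with the smaller fractional value. A minimal sketch follows, assuming the pairs of $g$ and the fractional solution are given as plain Python data; the function name, the dictionary layout, and the tolerance `eps` are illustrative assumptions, and the facet-defining check is not included.

```python
def separate(pairs, x_star, eps=1e-9):
    """Most violated set covering inequality of one genotype, or None.

    pairs  : list of (h1, h2) haplotype pairs explaining the genotype
    x_star : dict mapping each haplotype to its fractional value
    Returns the support of the inequality sum_{h in support} x_h >= 1
    if x_star violates it, otherwise None.
    """
    support = [min(pair, key=lambda h: x_star[h]) for pair in pairs]
    lhs = sum(x_star[h] for h in support)
    return support if lhs < 1 - eps else None


pairs = [("h1", "h2"), ("h3", "h4")]
x_star = {"h1": 0.2, "h2": 0.8, "h3": 0.5, "h4": 0.3}
print(separate(pairs, x_star))  # ['h1', 'h4']
```

Because the minimum is taken pair by pair, the returned inequality has the smallest left-hand side among all inequalities of the genotype, so if it is not violated none of them is.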
Branch and Cut preliminary results

Instance            Av. on max # of 2s   # constr. master problem   # constr. reduced problem   # added cuts   Solving time
50 genos, 10 SNPs   5                    >60.000                    7                           30             0.00 sec
50 genos, 30 SNPs   8                    >2512                      7                           200            0.05 sec

Average on 10 samples for each kind of instance generated by MS (Hudson, 2002) with recombination level r = 0
Future Work
On the polynomial formulation:
1. Strengthening of the model by clique inequalities on the genotype conflict graph
2. Cplex Concert Technology
3. More tests vs. other polynomial formulations
On the exponential formulation:
1. Implementation of the lifting procedure
2. More tests in comparison with