March 27, 2022 1 Combinatorial methods in Bioinformatics: the haplotyping problem Paola Bonizzoni DISCo Università di Milano-Bicocca


Page 1:

Combinatorial methods in Bioinformatics: the haplotyping problem

Paola Bonizzoni
DISCo, Università di Milano-Bicocca

Page 2:

Content

Motivation: biological terms
Combinatorial methods in haplotyping
Haplotyping via perfect phylogeny: the PPH problem
Inference of incomplete perfect phylogeny: algorithms
Incomplete pph and missing data
Other models: open problems

Page 3:

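Biological terms

Diploid organism

[Figure: the maternal and paternal haplotypes of a diploid organism at sites i, i+1, i+2, with alleles A, A, A on one haplotype and G, C, A on the other; together they form the genotype, whose sites are homozygous or heterozygous]

Biallelic site i: Value(i) ⊆ {A, C, G, T} and |Value(i)| ≤ 2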

Page 4:

Motivations

Human genetic variations are related to diseases (cancers, diabetes, osteoporosis)
The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes
The human genome project produces genotype sequences of humans
Computational methods to derive haplotypes from genotype data are in demand
Ongoing international HapMap project: find haplotype differences in large-scale population data

Combinatorial methods:
graphs
set-cover problems
optimization problems

Page 5:

Haplotyping: the formal model

Haplotype: m-vector h = <0, 1, …, 0> in {0,1}^m

Genotype: m-sequence g = <{0,1}, …, {0,0}, …, {1,1}>, written over {0,1,*} with {0,1} as *, {0,0} as 0 and {1,1} as 1: g = <*, …, 0, …, 1>

Def. Haplotypes <h, k> solve genotype g iff:
g(i) = * implies h(i) ≠ k(i)
h(i) = k(i) = g(i) otherwise
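The definition can be checked mechanically. The following is a minimal Python sketch, assuming haplotypes and genotypes are written as strings over 0, 1 and *; the function name `solves` is illustrative and does not come from the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair <h, k> solves genotype g.

    Assumed encoding: '0'/'1' at homozygous sites, '*' at heterozygous sites.
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes carry the genotype's value.
            return False
    return True


# The pair from the Examples slide: h, k and g = <0,*,1,*,0,1,1>.
print(solves("0110011", "0011011", "0*1*011"))
```

The order of h and k does not matter, since both conditions are symmetric in the two haplotypes.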

Page 6:

Examples

g = <0,*,1,*,0,1,1>
h = <0,1,1,0,0,1,1>
k = <0,0,1,1,0,1,1>
g is solved by <k,h>: given g and k, the other haplotype h is determined

Clark inference rule

Genotypes: g1 = <0,*,1,*,0,1,1>, g2 = <0,1,*,0,0,1,1>, g3 = <0,1,0,*,0,1,1> (the slide also shows g3 = <0,0,*,*,1,1,1>)
Known haplotype: h1 = <0,0,1,1,0,1,1>
From g1 and h1 infer h2 = <0,1,1,0,0,1,1>
From g2 and h2 infer h3 = <0,1,0,0,0,1,1>

[Figure: genotype g1 resolved by the known haplotype h1 and the inferred haplotype h2]
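The inference chain on this slide can be reproduced with a small greedy loop. This is a sketch of a Clark-style rule under the string encoding assumed here (0, 1, *); the names `compatible`, `complement` and `clark` are hypothetical, and the result generally depends on the order in which genotypes and haplotypes are tried.

```python
def compatible(h, g):
    """h agrees with g on every site that is not '*'."""
    return len(h) == len(g) and all(gi == "*" or gi == hi for hi, gi in zip(h, g))


def complement(h, g):
    """The unique k such that <h, k> solves g (h must be compatible with g)."""
    return "".join(("1" if hi == "0" else "0") if gi == "*" else gi
                   for hi, gi in zip(h, g))


def clark(genotypes, known):
    """Greedy Clark-style inference.

    Returns (haplotypes, unresolved): the known haplotypes extended with the
    inferred ones, and the genotypes that no available haplotype could resolve.
    """
    haplotypes = list(known)
    pending = list(genotypes)
    progress = True
    while pending and progress:
        progress = False
        for g in list(pending):
            # First available haplotype that can be one half of g.
            h = next((h for h in haplotypes if compatible(h, g)), None)
            if h is None:
                continue
            k = complement(h, g)
            if k not in haplotypes:
                haplotypes.append(k)
            pending.remove(g)
            progress = True
    return haplotypes, pending


haps, unresolved = clark(["0*1*011", "01*0011", "010*011"], ["0011011"])
print(haps, unresolved)
```

Starting from h1, the loop infers h2 from g1 and h3 from g2 as on the slide, then resolves g3 = <0,1,0,*,0,1,1> with h3. A genotype such as <0,0,*,*,1,1,1> stays unresolved because none of the available haplotypes is compatible with it.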

Page 7:

Haplotype inference: the general problem

Problem HI:
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes.
Solution: a set H' of haplotypes that solves each genotype g in G, s.t. H ⊆ H'.

H’ derives from an inference RULE

Page 8:

Types of inference rules

Clark's rule: haplotypes solve g by an iterative rule
Gusfield coalescent model: haplotypes are related to genotypes by a tree model
Pedigree data: haplotypes are related to genotypes by a directed graph

Page 9:

Mendelian law and Recombination

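[Figure: Mendelian law. A father with haplotypes A and B and a mother with haplotypes C and D; the four possible children C1, C2, C3, C4 carry AC, AD, BC and BD. Recombination: parent haplotypes AC and BD, child haplotypes AD and BC]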

Page 10:

Pedigree

Pedigree, nuclear family, founder

Page 11:

Pedigree

Pedigree, nuclear family, founder

[Figure: pedigree diagram with labels: father, mother, children, ID number, genotypes, founders, nuclear family, family trio, loop, mating node]

Page 12:

Haplotyping from genotypes: The problem & methods

Problem:
Input: genotype data (possibly with missing data).
Output: haplotypes.

Input data:
Data with pedigree (dependent).
Data without pedigree info (independent).

Statistical methods
Find the most likely haplotypes based on genotype data.
Adv: solid theoretical bases
Disadv: computation intensive

Rule-based methods
Define rules based on some plausible assumptions and find those haplotypes consistent with these rules.
Adv: usually simple, thus very fast
Disadv: no numerical assessment of the reliability of the results

Page 13:

HI by the perfect phylogeny model

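IDEA: genotypes are the mating of haplotypes in a tree

G: g1 = 0,1,*,*,1 and g2 = *,0,0,0,1
H: 0,1,1,0,1 and 0,1,0,1,1 (solving g1); 1,0,0,0,1 and 0,0,0,0,1 (solving g2)

[Figure: tree T containing the haplotypes of H, with a node labelled 00000]

Given G, find H and T that explain G!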

Page 14:

Perfect Phylogeny models

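Input data: 0-1 matrix A (columns: characters, rows: species)
Output data: phylogeny for A

A    c1  c2  c3  c4  c5
s1    1   1   0   0   0
s2    0   0   1   0   0
s3    1   1   0   0   1
s4    0   0   1   1   0

[Figure: tree with root R. One edge labelled c1, c2 leads to leaf s1 and, through edge c5, to leaf s3; another edge labelled c3 leads to leaf s2 and, through edge c4, to leaf s4. The path c3 c4 from the root reaches s4]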

Page 15:

Perfect phylogeny

Def. A pp T for a 0-1 matrix A:

each row si labels exactly one leaf of T
each column cj labels exactly one edge of T
each internal edge is labelled by at least one column cj
row si gives the 0,1 path from the root to si

[Figure: tree with leaves s1, s2, s3, s4 and edges labelled c1, c2, c3, c4, c5; the path c3 c4 from the root to s4 is annotated 0 0 1 1]

Page 16:

pp model: another view

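L(x), the cluster of x: the set of leaves of the subtree T_x rooted at x

[Figure: tree with leaves s1, s2, s3, s4 and an internal node x]

A pp is associated to a tree-family (S,C) with S = {s1, …, sn} and C = {S' ⊆ S: S' is a cluster}, s.t. for all X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.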

Page 17:

pp : another view

A tree-family (S,C) is represented by a 0-1 matrix B = (b_ji), with rows s_j and columns c_i:

0 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0

• each column c_i corresponds to a set S' in C: s_j ∈ S' iff b_ji = 1
• for each set in C there is at least one column

Lemma

A 0-1 matrix has a pp iff it represents a tree-family
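The lemma suggests a direct, if slow, test: read each column as the set of rows holding a 1 and check that any two such sets are disjoint or nested. The sketch below assumes the matrix is a list of rows of 0/1 integers; `clusters` and `is_tree_family` are illustrative names, and the pairwise check costs O(nm^2), so it illustrates the characterization rather than giving an efficient algorithm.

```python
from itertools import combinations


def clusters(matrix):
    """One set per column: the indices of the rows that hold a 1 in it."""
    n_cols = len(matrix[0]) if matrix else 0
    return [frozenset(j for j, row in enumerate(matrix) if row[i] == 1)
            for i in range(n_cols)]


def is_tree_family(matrix):
    """True if every two column sets are disjoint or one contains the other."""
    for x, y in combinations(clusters(matrix), 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


# The matrix shown on this slide.
B = [[0, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0]]
print(is_tree_family(B))
```

For the matrix B above the column sets are pairwise nested or disjoint, so the check succeeds; two columns whose sets overlap without containment make it fail.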

Page 18:

Haplotyping by the pp

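A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes:

rows si: haplotypes
columns ci: SNPs

[Figure: phylogenetic tree whose nodes are labelled by haplotypes (00000, 01000, 11000 and 01001 are legible); a row si labels a leaf and a column ci labels the edge of SNP site i]

0-1 switch in position i only once in the tree!!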

Page 19:

Haplotyping and the pp: observations

The root of T may not be the haplotype 000000
0-1 switch or 1-0 switch (directed case)

[Figure: two trees over 5-site haplotypes, one illustrating a 0-1 switch and one a 1-0 switch; legible node labels: 01100, 11000, 01000, 00011, 01010, 11010, 01001, 11001, 00000]

Page 20:

HI problem in the pp model

Input data: a 0-1-* matrix B (n × m) of genotypes G
Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows <h,k> in B'
(2) B' has a pp (tree family)

DECISION Problem

[Figure: a genotype matrix over 0, 1, * (entries run together in extraction: 01*1*001*001*11*110000*1*1*) with its haplotype matrix and tree still unknown (???)]

Page 21:

An example

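Genotypes:
a   * *
b   0 *
c   1 0

Haplotypes:
a   1 0
a'  0 1
b   0 1
b'  0 0
c   1 0
c'  1 0

[Figure: tree over the six haplotypes a, a', b, b', c, c'; a, c and c' share the row 10, a' and b share the row 01, and b' is 00]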
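For an instance this small the decision problem can be settled by exhaustive search. The sketch below is not one of the published PPH algorithms: it enumerates every way of resolving the * sites and keeps the first haplotype set whose columns form a tree family, which assumes the directed case with the all-zero haplotype as root. The names `resolutions` and `pph_bruteforce` are hypothetical, and the running time is exponential in the number of * entries.

```python
from itertools import combinations, product


def is_tree_family(rows):
    """Column sets of the 0-1 rows (strings) are pairwise disjoint or nested."""
    m = len(rows[0]) if rows else 0
    sets = [frozenset(j for j, r in enumerate(rows) if r[i] == "1") for i in range(m)]
    return all(not (x & y) or x <= y or y <= x for x, y in combinations(sets, 2))


def resolutions(g):
    """All unordered haplotype pairs <h, k> that solve genotype g."""
    stars = [i for i, c in enumerate(g) if c == "*"]
    if not stars:
        return [(g, g)]
    pairs = []
    # h carries 0 at the first * site, so each unordered pair is listed once.
    for bits in product("01", repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, ("0",) + bits):
            h[pos] = b
            k[pos] = "1" if b == "0" else "0"
        pairs.append(("".join(h), "".join(k)))
    return pairs


def pph_bruteforce(genotypes):
    """Haplotype pairs (one per genotype) admitting a pp, or None if none exists."""
    for choice in product(*(resolutions(g) for g in genotypes)):
        rows = [h for pair in choice for h in pair]
        if is_tree_family(rows):
            return list(choice)
    return None


print(pph_bruteforce(["**", "0*", "10"]))
```

On the genotypes a, b, c of this slide the search rejects the resolution of a into 00 and 11 and returns the pairs shown above, up to the order of the two haplotypes within a pair.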

Page 22:

The pph problem: solutions

An undirected algorithm: Gusfield, Recomb 2002
An O(nm^2) algorithm: Karp et al., Recomb 2003
A linear time O(nm) algorithm ?? Optimal algorithm

A related problem: the incomplete directed pp (IDP)

Inferring a pp from a 0-1-* matrix
O(nm + k log^2(n + m)) algorithm: Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004

Page 23:

IDP problem

Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists"

Example (rows S1, S2, S3; columns 1 to 5), resolving the ? entries step by step:

1 ? 0 0 1      1 0 0 0 1      1 0 0 0 1      1 0 0 0 1      1 0 0 0 1
? ? 0 1 0  ->  ? ? 0 1 0  ->  1 ? 0 1 0  ->  1 1 0 1 0  ->  1 1 0 1 0
? 0 1 ? ?      ? 0 1 ? ?      ? 0 1 ? ?      ? 0 1 ? ?      1 0 1 0 1

[Figure: the resulting tree, with edges labelled C1, …, C5 and leaves S1, S2, S3]

OPEN PROBLEM: find an optimal algorithm ??

Page 24:

Decision algorithms for incomplete pp

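Based on: characterization of a 0-1 matrix A that has a pp
- Tree family
- Forbidden submatrix: gives a "no" certificate

Forbidden submatrix on two columns X, Y: three rows reading 1 0, 1 1 and 0 1

[Figure: the row patterns 00, 01, 10, 11 over the columns X, Y]

Bipartite graph G(A) = (S,C,E) with E = {(si,cj): bij = 1}

Forbidden subgraph: two columns c, c' and three rows s1, s2, s3 whose entries in (c, c') are 10, 11 and 01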

Page 25:

Test: does a 0-1 matrix A have a pp?

O(nm) algorithm (Gusfield 1991)
Steps:
1. Given A, order {c1, …, cm} as (decreasing) binary numbers, obtaining A'
2. Let L(i,j) = k, where k = max{l < j: A'[i,l] = 1} (k = 0 if there is no such l)
3. Let index(j) = max{L(i,j): i s.t. A'[i,j] = 1}
4. Then apply the theorem.

TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1
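The four steps translate almost line by line into code. The sketch below assumes the matrix is a list of rows of 0/1 integers and reads each column top-down as a binary number; merging duplicate columns and using a comparison sort are choices made here for brevity, so this version runs in O(nm log m) rather than the O(nm) obtained with a radix sort. The function name is illustrative.

```python
def has_perfect_phylogeny(matrix):
    """Test a 0-1 matrix (list of rows of 0/1 ints) for a pp, following steps 1 to 4."""
    if not matrix or not matrix[0]:
        return True
    n = len(matrix)
    # Step 1: columns as binary numbers (first row = most significant bit),
    # sorted in decreasing order. Duplicate columns are merged.
    columns = sorted({tuple(row[j] for row in matrix) for j in range(len(matrix[0]))},
                     reverse=True)
    m = len(columns)
    # Step 2: L[i][j] = largest column l < j with a 1 in row i (0 if none),
    # recorded for cells that hold a 1. Columns are numbered from 1.
    L = [[0] * (m + 1) for _ in range(n)]
    index = [0] * (m + 1)
    for i in range(n):
        last = 0
        for j in range(1, m + 1):
            if columns[j - 1][i] == 1:
                L[i][j] = last
                # Step 3: index(j) = max of L(i, j) over rows with a 1 in column j.
                index[j] = max(index[j], last)
                last = j
    # Step 4: the theorem.
    return all(L[i][j] == index[j]
               for i in range(n) for j in range(1, m + 1)
               if columns[j - 1][i] == 1)


A = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0]]
print(has_perfect_phylogeny(A))
```

Applied to the three-row matrix with rows 1 0, 1 1 and 0 1, the test fails at the cell where L(i,j) = 0 but index(j) = 1, which is exactly the forbidden configuration.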

Page 26:

Idea:

Page 27:

The IDP algorithm

[Figure: bipartite graph on the columns c, c' and the rows s1, s2, s3]

Page 28:

Other HI problems via the pp model

Incomplete 0-1-*-? matrix because of missing data:
haplotypes pp (Ihpp): haplotype rows
genotype pp (Igpp): genotype rows

Algorithms:

Ihpp = IDP given a row as a root (polynomial time)

NP-complete otherwise

Igpp has polynomial solution under rich data hypothesis (Karp et al. Recomb 2004 – Icalp 2004 )

NP-complete otherwise

Page 29:

HI problem and other models

Haplotype inference in pedigree data under the recombination model

[Figure: maternal and paternal haplotypes drawn as columns of 0/1 alleles, and the child haplotype obtained from them through a recombination]

Page 30:

Pedigree graph

[Figure: a single-mating pedigree tree, a nuclear family (father, mother, child), and a pedigree graph containing a mating loop]

Page 31:

Haplotype inference in pedigree

[Figure: a pedigree whose members carry two-site entries, shown unphased (00, 01, 10, 11) and phased as paternal|maternal pairs (0|0, 0|1, 1|0, 1|1)]

Page 32:

Problems:

MPT-MRHI (Pedigree tree multi-mating minimum recombination HI)
SPT-MRHI (Pedigree tree single-mating minimum recombination HI)

OPEN

NP-complete even if the graph is acyclic, but unbounded number of children…

Page 33:

Conclusions

Page 34:

References