Combinatorial Methods for Disease Association Search and Susceptibility Prediction


Alexander Zelikovsky

joint work with Dumitru Brinza

Department of Computer Science


Outline

- SNPs, Haplotypes and Genotypes
- Disease Association Search
    - Genome-wide association search challenges
    - Problem formulation
    - Exhaustive & Combinatorial Search
    - Optimization formulation & complimentary greedy search
- Predicting susceptibility to complex diseases
    - Problem formulation/cross-validation
    - Previous methods: SVM, RF, LP
    - Optimum clustering and prediction via model-fitting
- Conclusions


SNP, Haplotypes, Genotypes

- Length of the human genome: ≈ 3 × 10^9
- Number of single nucleotide polymorphisms (SNPs): ≈ 1 × 10^7
- SNPs are mostly biallelic, e.g., A/C
- Minor allele frequency should be considerable, e.g., > 0.1%
- Difference between ALL people: 0.25% (between any two: 0.1%)
- Diploid = two different copies of each chromosome
- Haplotype = description of a single copy (expensive)
    - example: 00110101 (0 is for major, 1 is for minor allele)
- Genotype = description of the mixed two copies
    - example: 01122110 (0 = 00, 1 = 11, 2 = 01)
- International HapMap project: http://www.hapmap.org/
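The genotype encoding above (0 = 00, 1 = 11, 2 = 01) can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` and the second haplotype are illustrative assumptions, not part of the slides.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Mix two haplotypes (strings over {0, 1}) into one genotype string.

    Encoding from the slide: 0 = both copies major, 1 = both copies minor,
    2 = the two copies differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    sites = []
    for a, b in zip(h1, h2):
        # Homozygous sites keep the shared allele; heterozygous sites become 2.
        sites.append(a if a == b else "2")
    return "".join(sites)


# Hypothetical pair of haplotypes; the first one is the slide's example.
example = genotype_from_haplotypes("00110101", "01100111")
```

Note that the mapping is not invertible: a genotype with k heterozygous sites is consistent with 2^(k-1) haplotype pairs, which is why haplotypes are the more expensive description.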


Challenges of Disease Association

- Monogenic disease
    - A mutated gene is entirely responsible for the disease.
    - Typically rare in population: < 0.1%.
- Complex disease
    - Interaction of multiple genes
        - 2-SNP interaction analysis for a genome-wide scan with 1 million SNPs (3 kb coverage) has 10^12 pairwise tests
    - Multiple independent causes
        - Each cause explains < 10-20% of cases
    - Common: > 0.1%. In New York City, 12% of the population has Type 2 Diabetes
- Multiple testing adjustment
    - Reason for non-reproducible findings
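The scale of the pairwise scan can be checked with a few lines of arithmetic. This is a sketch only: the significance level `alpha = 0.05` is an assumed conventional value, not a number from the slides.

```python
from math import comb

n_snps = 1_000_000                 # genome-wide scan size quoted on the slide
unordered_pairs = comb(n_snps, 2)  # 2-SNP combinations, order ignored
ordered_pairs = n_snps * n_snps    # the 10^12 order of magnitude on the slide

alpha = 0.05                       # assumed family-wise significance level
# Bonferroni: each single test must pass alpha divided by the number of tests.
per_test_threshold = alpha / unordered_pairs
```

With about 5 × 10^11 unordered pairs, a single pairwise test would need a p-value near 10^-13 to survive a Bonferroni correction, which is one way to see why multiple-testing adjustment dominates the analysis.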


Disease Association Search Problem

Sample population S of individual genotypes (rows = individuals, columns = SNPs):

    Genotype            Disease Status
    0101201020102210    -1
    0220110210120021    -1
    0200120012221110    -1
    0020011002212101    -1
    1101202020100110     1
    0120120010100011     1
    0210220002021112     1
    0021011000212120     1

- Non-diseased genotypes: H
- Diseased genotypes: D
- Risk/resistance factor = multi-SNP combination (MSC):
    - a subset of SNP-columns of S
    - the values of these SNPs: 0, 1, or 2
- Cluster C = subset of S with an MSC; d(C) = number of diseased and h(C) = number of non-diseased genotypes in C

PROBLEM: Find all MSCs significantly associated with the disease
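A minimal sketch of these definitions follows. It assumes an MSC is stored as a dictionary from SNP column index to required value, and that status 1 marks diseased and -1 non-diseased genotypes; the toy sample and the helper names are hypothetical.

```python
def cluster(S, msc):
    """Indices of the genotypes in S that carry every SNP value of the MSC."""
    return [i for i, g in enumerate(S)
            if all(g[col] == val for col, val in msc.items())]


def d_and_h(S, status, msc):
    """Return (d(C), h(C)): diseased and non-diseased counts in the cluster."""
    members = cluster(S, msc)
    d = sum(1 for i in members if status[i] == 1)
    h = sum(1 for i in members if status[i] == -1)
    return d, h


# Hypothetical toy sample: 6 genotypes over 4 SNPs.
S = ["0102", "0112", "2102", "0100", "1102", "0012"]
status = [1, 1, 1, -1, -1, -1]
msc = {0: "0", 3: "2"}   # SNP 0 has value 0 and SNP 3 has value 2
counts = d_and_h(S, status, msc)
```

Here the cluster consists of genotypes 0, 1 and 5, so d(C) = 2 and h(C) = 1.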


Significance of Risk/Resistance Factors

- Measured by
    - Relative risk (RR)
    - Odds ratio (OR)
    - Their p-values
- Unadjusted p-value: probability of the observed case/control distribution among individuals exposed to the risk factor, computed from the binomial distribution
- Multiple-testing adjustment:
    - Bonferroni
        - easy to compute
        - overly conservative
    - Randomization
        - computationally expensive
        - more accurate
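The measures above can be sketched for a cluster with d exposed cases and h exposed controls in a sample with D cases and H controls in total. These are the textbook definitions of RR and OR. The one-sided binomial tail used for the p-value is an assumption about how "computed by binomial distribution" is meant, and all function names are illustrative.

```python
from math import comb


def relative_risk(d, h, D, H):
    """Disease rate among exposed divided by disease rate among unexposed."""
    exposed = d / (d + h)
    unexposed = (D - d) / ((D - d) + (H - h))
    return exposed / unexposed


def odds_ratio(d, h, D, H):
    """Odds of disease among exposed divided by odds among unexposed."""
    return (d * (H - h)) / (h * (D - d))


def binomial_p_value(d, h, D, H):
    """P[X >= d] for X ~ Binomial(d + h, D / (D + H)).

    Assumed null model: each exposed individual is a case with the overall
    case frequency of the sample.
    """
    n, p = d + h, D / (D + H)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(d, n + 1))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

For example, with d = 8, h = 2, D = 20 and H = 30 the sketch gives RR ≈ 2.67, OR ≈ 9.33 and an unadjusted p-value of about 0.0123. The ratios are undefined when a denominator is zero (for instance h = 0), which a real implementation would have to handle.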


Exhaustive & Combinatorial Search

- Exhaustive search is infeasible
    - a sample with n genotypes and m SNPs requires O(n·3^m)
- Combinatorial search
    - Definition: the disease-closure of a multi-SNP combination C is a multi-SNP combination C', with the maximum number of SNPs, whose cluster contains the same set of diseased individuals and the minimum number of non-diseased individuals.
    - Searches only closed clusters
        - Closure of cluster C = C': d(C') = d(C) and h(C') is minimized
    - Avoids checking trivial MSCs
        - Small d(C) implies not looking in subclusters
    - Finds associated MSCs faster, but is still too slow
- Tagging: compress S by extracting the most informative SNPs
    - restore other SNPs from tag SNPs
    - multiple regression method
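One way to compute a disease-closure, following the definition above, is to add every SNP on which all diseased members of the cluster agree. This is a sketch under assumptions: the dictionary representation of an MSC, the status coding (1 = diseased, -1 = non-diseased) and the toy data are all hypothetical.

```python
def disease_closure(S, status, msc):
    """Extend an MSC with every SNP value shared by all its diseased members.

    Adding such SNPs keeps d(C) unchanged and can only remove non-diseased
    genotypes, so the result has the maximum number of SNPs and minimum h(C).
    """
    members = [i for i, g in enumerate(S)
               if all(g[col] == val for col, val in msc.items())]
    diseased = [i for i in members if status[i] == 1]
    if not diseased:
        return dict(msc)   # nothing to preserve; leave the MSC as it is
    closed = dict(msc)
    for col in range(len(S[0])):
        values = {S[i][col] for i in diseased}
        if len(values) == 1:           # all diseased members agree on this SNP
            closed[col] = values.pop()
    return closed


# Hypothetical toy sample: 6 genotypes over 4 SNPs.
S = ["0102", "0112", "2102", "0100", "1102", "0012"]
status = [1, 1, 1, -1, -1, -1]
closed = disease_closure(S, status, {0: "0"})
```

On this toy sample the MSC {SNP 0 = 0} covers two diseased and two non-diseased genotypes, while its closure {SNP 0 = 0, SNP 1 = 1, SNP 3 = 2} keeps both diseased genotypes and excludes both non-diseased ones.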


MLR Tagging

- Stepwise Index SNP Algorithm:
    - Choose as a tag the SNP that best predicts all other SNPs
    - Choose as the next tag the SNP that, together with the first, best predicts all other SNPs, and so on
- The prediction method is based on Multiple Linear Regression (MLR)
- So far outperforms other methods (STAMPA) in quality
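The stepwise selection can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: genotypes are rows of integers in {0, 1, 2}, each non-tag SNP is predicted by an ordinary least-squares fit on the tag SNPs (solved through the normal equations with a tiny ridge term for numerical stability), predictions are rounded to the nearest genotype value, and quality is the number of mispredicted entries. The names `fit_predict` and `select_tags` are hypothetical.

```python
def fit_predict(tag_rows, target):
    """Least-squares fit of target on intercept + tags; rounded predictions."""
    rows = [[1.0] + [float(v) for v in r] for r in tag_rows]
    k = len(rows[0])
    # Normal equations A * beta = c, with a tiny ridge term on the diagonal.
    A = [[sum(r[i] * r[j] for r in rows) + (1e-9 if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    c = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
                c[r] -= f * c[col]
    beta = [c[i] / A[i][i] for i in range(k)]
    # Round to the nearest valid genotype value 0, 1 or 2.
    return [min(2, max(0, round(sum(b * x for b, x in zip(beta, r)))))
            for r in rows]


def select_tags(S, n_tags):
    """Greedily add the SNP that, with the current tags, best predicts the rest."""
    m = len(S[0])
    tags = []
    for _ in range(n_tags):
        best, best_err = None, None
        for cand in range(m):
            if cand in tags:
                continue
            cols = tags + [cand]
            X = [[row[c] for c in cols] for row in S]
            err = 0
            for t in range(m):
                if t not in cols:
                    pred = fit_predict(X, [row[t] for row in S])
                    err += sum(p != row[t] for p, row in zip(pred, S))
            if best_err is None or err < best_err:
                best, best_err = cand, err
        tags.append(best)
    return tags


# Hypothetical sample: SNP 1 copies SNP 0, SNP 2 equals 2 - SNP 0, SNP 3 is independent.
S = [[0, 0, 2, 0], [1, 1, 1, 0], [2, 2, 0, 1],
     [0, 0, 2, 1], [1, 1, 1, 0], [2, 2, 0, 2]]
tags = select_tags(S, 2)
```

On this sample SNP 0 is picked first because it reconstructs SNPs 1 and 2 exactly, and SNP 3 is picked second because no other SNP predicts it.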


Data Sets

- Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
    - Location: 5q31
    - Number of SNPs: 103
    - Population size: 387 (case: 144, control: 243)
- Autoimmune disorders (Ueda et al.)
    - Location: region containing the genes CD28, CTLA4 and ICOS
    - Number of SNPs: 108
    - Population size: 1024 (case: 378, control: 646)
- Tick-borne encephalitis dataset (Barkash et al.)
    - Location: region containing the genes TLR3, PKR, OAS1, OAS2 and OAS3
    - Number of SNPs: 41
    - Population size: 75 (case: 21, control: 54)


Disease association search results

- IES(30): exhaustive search on 30 indexed SNPs chosen with the MLR-based tagging method
- ICS(30): combinatorial search on 30 indexed SNPs chosen with the MLR-based tagging method


Disease Association Search

- Optimum Association Search
    - Problem: find the MSC that is the most associated with the disease
    - Measure: positive predictive value (PPV)
    - = find a (non-)diseased-free cluster of maximum size
- Bad news: generalization of maximum independent set
    - NP-complete and cannot be well approximated
- Hope: the sample S is not arbitrary


Complimentary Greedy Algorithm

- Algorithm
    - Start with C = S (resp. the MSC is empty)
    - Repeat until h(C) = 0 (C is non-diseased-free)
        - Find the 1-SNP combination (1-SC) s maximizing (h(C) - h(C ∩ {s})) / (d(C) - d(C ∩ {s})), i.e., minimizing the payment in diseased individuals for the removal of non-diseased ones
        - Add s to the SNPs of C's MSC
- Analogy: finding an independent set by greedily removing highest-degree vertices
- Extremely fast but inaccurate
- Can be used in susceptibility prediction
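The greedy loop can be sketched as follows. Several details are assumptions added to make the sketch well defined, not statements from the slides: a candidate is a (SNP, value) pair, it must keep at least one diseased genotype and remove at least one non-diseased genotype, a candidate that loses no diseased genotype is treated as having infinite ratio, and ties go to the first candidate found.

```python
def complementary_greedy(S, status):
    """Grow an MSC until its cluster has no non-diseased genotype.

    Returns (msc, cluster indices), or None if no candidate makes progress.
    """
    C = list(range(len(S)))
    msc = {}

    def counts(ids):
        d = sum(1 for i in ids if status[i] == 1)
        return d, len(ids) - d

    while True:
        d, h = counts(C)
        if h == 0:
            return msc, C
        best, best_ratio = None, None
        for col in range(len(S[0])):
            if col in msc:
                continue
            for val in sorted({S[i][col] for i in C}):
                sub = [i for i in C if S[i][col] == val]
                d2, h2 = counts(sub)
                if d2 == 0 or h2 == h:
                    continue   # would lose all diseased or remove no non-diseased
                lost = d - d2
                # Non-diseased removed per diseased genotype given up.
                ratio = float("inf") if lost == 0 else (h - h2) / lost
                if best_ratio is None or ratio > best_ratio:
                    best, best_ratio = (col, val, sub), ratio
        if best is None:
            return None
        col, val, C = best
        msc[col] = val


# Hypothetical toy sample: 6 genotypes over 4 SNPs (1 = diseased, -1 = non-diseased).
S = ["0102", "0112", "2102", "0100", "1102", "0012"]
status = [1, 1, 1, -1, -1, -1]
result = complementary_greedy(S, status)
```

On the toy sample the search ends with the MSC {SNP 0 = 0, SNP 1 = 1, SNP 3 = 2}, whose cluster holds two diseased and no non-diseased genotypes (PPV = 1). Swapping the roles of the two status values gives the search for a disease-resistant MSC.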


Most disease-associated & disease-resistant MSC

Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest PPV. The starred values refer to results of the runtime-constrained exhaustive search.


Genetic Susceptibility Prediction

Given: genotypes of diseased and non-diseased individuals, and the genotype of a testing person.
Find: the disease status of the testing person.

    Genotype            Disease Status
    0101201020102210    -1       (healthy)
    0220110210120021    -1       (healthy)
    0200120012221110    -1       (healthy)
    0020011002212101    -1       (healthy)
    1101202020100110     1       (sick)
    0120120010100011     1       (sick)
    0210220002021112     1       (sick)
    0021011000212120     1       (sick)
    0110211101211201    s(gt)    (testing genotype gt)


Cross-validation

- Leave-one-out test: the disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.
- Leave-many-out test: repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3.

[Figure: example table with columns Genotype, Real Disease Status and Predicted Disease Status; Accuracy = 80%]
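The leave-one-out test can be sketched with a pluggable predictor. The nearest-neighbour rule below is only a hypothetical stand-in so that the loop can be run; it is not one of the prediction methods discussed in the slides, and the toy data are invented.

```python
def leave_one_out_accuracy(genotypes, statuses, predict):
    """Predict each genotype from all the others and return the hit rate."""
    correct = 0
    for i in range(len(genotypes)):
        train_g = genotypes[:i] + genotypes[i + 1:]
        train_s = statuses[:i] + statuses[i + 1:]
        if predict(train_g, train_s, genotypes[i]) == statuses[i]:
            correct += 1
    return correct / len(genotypes)


def nearest_neighbour(train_g, train_s, gt):
    """Stand-in predictor: status of the training genotype closest to gt."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))   # Hamming distance
    best = min(range(len(train_g)), key=lambda j: distance(train_g[j], gt))
    return train_s[best]


# Hypothetical toy data: 5 genotypes over 4 SNPs.
genotypes = ["0000", "0001", "2222", "2221", "0002"]
statuses = [-1, -1, 1, 1, 1]
accuracy = leave_one_out_accuracy(genotypes, statuses, nearest_neighbour)
```

On this toy data four of the five held-out genotypes are predicted correctly. A leave-many-out test would follow the same pattern with repeated random 2/3 versus 1/3 splits instead of single held-out genotypes.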


Quality Measures of Prediction (confusion table)

- Sensitivity: the ability to correctly detect disease. sensitivity = TP/(TP+FN)
- Specificity: the ability to avoid calling normal as disease. specificity = TN/(FP+TN)
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Risk Rate: measurements for risk factors.

    Prediction      Disease +               Disease -
    Test +          True Positive (TP)      False Positive (FP)
    Test -          False Negative (FN)     True Negative (TN)
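The three formulas can be computed directly from paired real and predicted statuses. A minimal sketch, assuming status 1 means diseased and -1 non-diseased; the function names and the example vectors are illustrative.

```python
def confusion_counts(real, predicted):
    """Return (TP, FP, FN, TN) for statuses coded as 1 / -1."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == -1 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == -1)
    tn = sum(1 for r, p in pairs if r == -1 and p == -1)
    return tp, fp, fn, tn


def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn)


def specificity(tp, fp, fn, tn):
    return tn / (fp + tn)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical real and predicted statuses for five individuals.
table = confusion_counts([1, 1, 1, -1, -1], [1, 1, -1, -1, 1])
```

For these vectors the table is TP = 2, FP = 1, FN = 1, TN = 1, giving sensitivity 2/3, specificity 1/2 and accuracy 3/5.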


Prediction Methods

- Support vector machine
- Random forest
- LP-based prediction


Prediction via Clustering

- Drawback of the prediction problem formulation = need for cross-validation, no optimization
- Clustering P = partition into clusters defined by MSCs
    - minimizing the number of errors
    - subject to bounded information entropy -Σ_i (|S_i|/|S|) log(|S_i|/|S|)
- Model-fitting prediction
    - Set the status of the testing genotype to diseased; find the number of errors
    - Set the status of the testing genotype to non-diseased; find the number of errors
    - Predict the status that implies the smaller number of errors
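The model-fitting rule and the entropy bound can be sketched with the clustering step left pluggable. The error counter `single_snp_errors` is a deliberately simple hypothetical stand-in (partition by the value of the single best SNP, count minority-status genotypes per cluster); the slides' clustering uses MSC-defined clusters instead. Resolving ties toward non-diseased is also an assumption.

```python
from math import log


def partition_entropy(cluster_sizes):
    """Information entropy -sum (|S_i|/|S|) log(|S_i|/|S|) of a partition."""
    total = sum(cluster_sizes)
    return -sum((s / total) * log(s / total) for s in cluster_sizes if s > 0)


def model_fitting_predict(genotypes, statuses, gt, count_errors):
    """Try both statuses for gt and keep the one with fewer clustering errors."""
    errors_if_diseased = count_errors(genotypes + [gt], statuses + [1])
    errors_if_healthy = count_errors(genotypes + [gt], statuses + [-1])
    return 1 if errors_if_diseased < errors_if_healthy else -1


def single_snp_errors(genotypes, statuses):
    """Stand-in error count: best single-SNP partition, minority per cluster."""
    best = None
    for col in range(len(genotypes[0])):
        tally = {}   # SNP value -> [diseased count, non-diseased count]
        for g, s in zip(genotypes, statuses):
            pair = tally.setdefault(g[col], [0, 0])
            pair[0 if s == 1 else 1] += 1
        errors = sum(min(d, h) for d, h in tally.values())
        best = errors if best is None else min(best, errors)
    return best


# Hypothetical toy data: 4 genotypes over 2 SNPs.
genotypes = ["00", "01", "20", "21"]
statuses = [-1, -1, 1, 1]
prediction = model_fitting_predict(genotypes, statuses, "20", single_snp_errors)
```

On this toy data the testing genotype 20 fits with zero errors when labelled diseased and with one error when labelled non-diseased, so the sketch predicts diseased. A partition into two equal clusters has entropy log 2.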


Leave-1-out cross-validation results

Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.


ROC curve

Comparison of 5 prediction methods on the (Barkash et al., 2006) data on all SNPs. The area under the CSP curve is 0.81 vs. 0.52 under the SVM curve.


Conclusions

- Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before
- Complimentary greedy search can be used in susceptibility prediction
- Optimization approach to prediction
- The new susceptibility prediction accuracy is 8% higher than the best previously known
- MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility


International Symposium on Bioinformatics Research and Applications

May 6-9, 2007, Georgia State University, Atlanta, Georgia http://www.cs.gsu.edu/ISBRA/

Submissions must not exceed 12 pages in Springer LNCS style. The proceedings of ISBRA 2007 will be published in LNBI.

Important Dates

- Submission Deadline: December 20, 2006
- Notification of Acceptance: January 31, 2007
- Final Version Submission: February 21, 2007

Symposium Organizers

• General Chairs: Dan Gusfield (University of California, Davis) and Yi Pan (Georgia State University)
• Program Chairs: Ion Mandoiu (University of Connecticut) and Alexander Zelikovsky (Georgia State University)

ISBRA provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications


History of ISBRA

ISBRA is the successor of the International Workshop on Bioinformatics Research and Applications (IWBRA), held on

- May 22-25, 2005 in Atlanta, GA and
- May 28-31, 2006 in Reading, UK

in conjunction with the International Conference on Computational Science.

The two editions of IWBRA enjoyed great success, with special issues devoted to full versions of selected papers in

- Springer LNCS Transactions on Computational Systems Biology and
- IEEE/ACM Transactions on Computational Biology and Bioinformatics


Support Vector Machine (SVM) Algorithm

- Learning Task
    - Given: genotypes of patients and healthy persons.
    - Compute: a model that distinguishes whether a person has the disease.
- Classification Task
    - Given: the genotype of a new patient and a learned model.
    - Determine: whether the patient has the disease or not.

[Figure panels: Linear SVM, Non-Linear SVM]


Random Forest Algorithm

Random Forests grows many classification trees. To classify a new object from an input vector, pass the input vector down each tree in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification that has the most votes over all the trees in the forest.

Growing Tree, Split selection and Prediction.

Random sub-sample of training data, Random splitter selection.
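The voting step described above can be sketched in isolation. Only the majority vote is shown: the three "trees" are hypothetical hand-written rules standing in for trained classification trees, and growing trees from random sub-samples with random splitter selection is not implemented here.

```python
from collections import Counter


def forest_vote(trees, x):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a genotype to 1 or -1.
trees = [
    lambda g: 1 if g[0] == "2" else -1,
    lambda g: 1 if g[1] == "2" else -1,
    lambda g: 1 if g[2] != "0" else -1,
]
label = forest_vote(trees, "2201")
```

With an odd number of trees and two classes there is always a strict majority; with ties, `Counter.most_common` returns the class that was counted first.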


LP-based Prediction Algorithm

- Model:
    - Certain haplotypes are susceptible to the disease while others are resistant to the disease.
    - The genotype susceptibility is assumed to be the sum of the susceptibilities of its two haplotypes.
- Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of the weights of its haplotypes is negative and for any case genotype it is positive.
- For each vertex-haplotype h_i assign the weight p_i such that, for any genotype-edge e_{i,j} = (h_i, h_j), the sum p_i + p_j has the sign of s(e_{i,j}), where s(e_{i,j}) ∈ {-1, 1} is the disease status of genotype e_{i,j}. The sum of absolute values of genotype weights is maximized.
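One linear program consistent with this description can be written out as a sketch. It is an assumption-laden reading rather than the authors' exact formulation: the strict sign conditions are relaxed to non-strict inequalities, and the bound -1 ≤ p_i ≤ 1 is added here only to keep the objective bounded.

```latex
\begin{aligned}
\text{maximize}\quad & \sum_{e_{i,j}} s(e_{i,j})\,(p_i + p_j) \\
\text{subject to}\quad & s(e_{i,j})\,(p_i + p_j) \ge 0
    \quad \text{for every genotype-edge } e_{i,j} = (h_i, h_j), \\
& -1 \le p_i \le 1 \quad \text{for every vertex-haplotype } h_i .
\end{aligned}
```

Under the sign constraint, s(e_{i,j})(p_i + p_j) equals |p_i + p_j|, so the objective is the sum of absolute values of the genotype weights while remaining linear in the p_i.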
