Constraints and Global Optimization for Gene Prediction Overlap Resolution
Christian Theil Have
Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University, P.O.Box 260, DK-4000 Roskilde, Denmark
WCB 2011 in Perugia, September 12, 2011
May 11, 2015

We apply constraints and global optimization to the problem of restricting overlaps between gene predictions for prokaryotic genomes. We investigate existing heuristic methods and show how they may be expressed using Constraint Handling Rules. Furthermore, we integrate existing methods in a global optimization procedure expressed as a probabilistic model in the PRISM language. This approach yields an optimal (highest-scoring) subset of predictions that satisfies the constraints. Experimental results indicate accuracy comparable to the heuristic approaches.
Page 2: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Outline

1 Motivation and background
    Gene finding

2 Overlap resolution
    Local heuristics
    Global optimization

3 Implementation

4 Evaluation

5 Summary and future directions

Page 3: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Motivation and background

Motivation

In the LoSt project we explore the use of probabilistic logic programming (with PRISM) for biological sequence analysis.

In particular, we have been concerned with applying these techniques to gene finding for prokaryotes.

Page 4: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Motivation and background Gene finding

Gene finding

The traditional view:

Gene finding is considered a sequence classification task.

It is performed without much context.

It ignores constraints between predicted genes and their placement in the genome.

Problem with overlapping genes:

Over-prediction of overlapping genes and shadow genes.

Page 5: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Two step approach

A divide-and-conquer, two-step approach to gene finding:

1 Gene prediction: A gene finder supplies a set of candidate predictions $p_1, \ldots, p_n$, called the initial set.

2 Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set.

Page 6: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Overlap resolution

Pruning step as a Constraint Satisfaction Problem

Definition

A Constraint Satisfaction Problem is a triple $\langle X, D, C \rangle$. $X$ is a set of $n$ variables, $X = \{x_1, \ldots, x_n\}$, with domains $D = \{D(x_1), \ldots, D(x_n)\}$. The constraints $C$ impose restrictions on the possible assignments for sets of variables. A solution is an assignment of a value $v \in D(x_i)$ to each variable $x_i \in X$, consistent with $C$.

We introduce variables $X = \{x_1, \ldots, x_n\}$, one for each prediction $p_1, \ldots, p_n$ in the initial set. All variables have boolean domains, $\forall x_i \in X: D(x_i) = \{\mathit{true}, \mathit{false}\}$, and $x_i = \mathit{true} \Leftrightarrow p_i \in \text{final set}$.

Page 7: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Overlap resolution Local heuristics

Local heuristic techniques

Local heuristic pruning rules post-process a set of gene predictions.

They delete inconsistent predictions based on various criteria.

Pruning decisions are based on the context of only a subset of the predictions.

CSP formulation

Pruning $x_i$ just means $D(x_i) := D(x_i) \setminus \{\mathit{true}\}$, i.e. $D(x_i) = \{\mathit{false}\}$.

Page 8: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Local heuristic techniques as Constraint Handling Rules

Pruning is conveniently expressed as CHR simplification rules.

1 Constraint store starts out as the initial set.

2 Simplification rules prune predictions from the constraint store, until no more rules apply.

3 Final constraint store represents the final set.

Page 9: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Genemark frame-by-frame

Shmatkov, Anton M.; Melikyan, Arik A.; Chernousko, Felix L.; Borodovsky, Mark. Finding prokaryotic genes by the frame-by-frame algorithm: targeting gene starts and overlapping genes. Bioinformatics, 1999.

Frame-by-frame

    % Of two predictions, drop the one that lies entirely within the other.
    prediction(Left1,Right1), prediction(Left2,Right2) <=>
        Left1 =< Left2,
        Right1 >= Right2
        |
        prediction(Left1,Right1).
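The effect of running this rule to exhaustion can be sketched outside CHR. The Python below is only an illustration, not the author's implementation; the function name `prune_contained` and the tuple representation `(left, right)` are assumptions made for the example. It keeps exactly those predictions that are not contained in any other prediction, which is what remains in the constraint store once the rule no longer applies.

```python
def prune_contained(predictions):
    """Drop every prediction that lies entirely within another prediction.

    `predictions` is an iterable of (left, right) tuples (hypothetical format).
    Duplicates collapse to one copy, as the CHR rule would also remove one
    of two identical predictions.
    """
    store = sorted(set(predictions))
    kept = []
    for left, right in store:
        contained = any(
            (l2, r2) != (left, right) and l2 <= left and r2 >= right
            for l2, r2 in store
        )
        if not contained:
            kept.append((left, right))
    return kept


initial_set = [(1, 100), (10, 50), (40, 200), (60, 90)]
final_set = prune_contained(initial_set)
print(final_set)  # -> [(1, 100), (40, 200)]
```

Because containment is transitive, the result does not depend on the order in which pairs are examined.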

Page 10: Constraints and Global Optimization for Gene Prediction Overlap Resolution

ECOPARSE (1)

Krogh, Anders; Mian, I. Saira; Haussler, David. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research, 22, 1994.

Rule 1

    pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=>
        overlap_len((Left1,Right1),(Left2,Right2),OverlapLen),
        length_ratio((Left1,Right1),(Left2,Right2),Ratio),
        length(Left1,Right1,Length1),
        length(Left2,Right2,Length2),
        OverlapLen > 15, Score1 > Score2,
        ((Length1 > 400, Length2 > 400) ; Ratio > 0.5)
        |
        pred(Left1,Right1,Score1).

Page 11: Constraints and Global Optimization for Gene Prediction Overlap Resolution

ECOPARSE (2)

Rule 2

    pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=>
        overlap_len((Left1,Right1),(Left2,Right2),OverlapLen),
        length_ratio((Left1,Right1),(Left2,Right2),Ratio),
        length(Left1,Right1,Length1),
        length(Left2,Right2,Length2),
        OverlapLen > 15, Ratio =< 0.5, Length1 =< Length2
        |
        pred(Left1,Right1,Score1).

Page 12: Constraints and Global Optimization for Gene Prediction Overlap Resolution

ECOPARSE (3) - Ambiguity!

We have two predictions P1 and P2 that each overlap one end of a third prediction P3 by more than 15 bases. If P1 and P3 are considered before P2 and P3, then P3 is removed by the first rule; consequently P2 no longer overlaps anything and is kept. If they are considered in the opposite order, however, P2 is removed by the second rule and P3 is subsequently removed by the first rule.
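The order dependence can be reproduced with a small Python transcription of the two rule guards. This is a sketch under stated assumptions, not the original CHR program: coordinates are taken as inclusive, `length_ratio` is assumed to be shorter length divided by longer length (the slides do not define it), the coordinates and scores of P1, P2 and P3 are made up to fit the scenario, and `prune` is a hypothetical driver that tries pairs in store order.

```python
from typing import NamedTuple


class Pred(NamedTuple):
    name: str
    left: int
    right: int
    score: float


def length(p):
    return p.right - p.left + 1


def overlap_len(a, b):
    return max(0, min(a.right, b.right) - max(a.left, b.left) + 1)


def length_ratio(a, b):
    # Assumption: shorter length divided by longer length.
    return min(length(a), length(b)) / max(length(a), length(b))


def rule1(a, b):
    """Guard of Rule 1: True means keep a and drop b."""
    return (overlap_len(a, b) > 15 and a.score > b.score
            and ((length(a) > 400 and length(b) > 400)
                 or length_ratio(a, b) > 0.5))


def rule2(a, b):
    """Guard of Rule 2: True means keep a and drop b."""
    return (overlap_len(a, b) > 15 and length_ratio(a, b) <= 0.5
            and length(a) <= length(b))


def prune(store):
    """Apply the first applicable rule, scanning pairs in store order."""
    store = list(store)
    changed = True
    while changed:
        changed = False
        for a in store:
            victim = next((b for b in store
                           if b is not a and (rule1(a, b) or rule2(a, b))), None)
            if victim is not None:
                store.remove(victim)
                changed = True
                break
    return [p.name for p in store]


p1 = Pred("P1", 800, 1100, 0.9)
p3 = Pred("P3", 1000, 1300, 0.5)
p2 = Pred("P2", 1200, 2200, 0.4)

order_a = prune([p1, p3, p2])  # P1/P3 considered first
order_b = prune([p2, p3, p1])  # P2/P3 considered first
print(order_a, order_b)  # -> ['P1', 'P2'] ['P1']
```

The two runs start from the same initial set and end with different final sets, which is exactly the ambiguity described above.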

Page 13: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Local heuristics and confluence

The final set is always the same ⇔ the program is confluent, i.e. it is not sensitive to the order of execution.

With the simple Genemark rule it does not matter in which order the predictions are processed.

With slightly more complicated heuristics (such as ECOPARSE), programs tend not to be confluent.

Conclusion: Execution strategy is important!

Page 14: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Overlap resolution Global optimization

Pruning step as a Constraint Optimization Problem (1)

A CSP may have multiple solutions!

We want the “best” solution, meaning many correct predictions and few incorrect ones.

The set of real genes is not known in advance!

Instead, optimize the probability that the predictions are correct (confidence scores from the gene finder).

This extends the problem to a constraint optimization problem.

Definition

A Constraint Optimization Problem (COP) is a CSP where each solution is associated with a cost and the goal is to find a solution with minimal cost.

Page 15: Constraints and Global Optimization for Gene Prediction Overlap Resolution

COP formulation

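We would like the final set to reflect the relative confidence scores in the predictions assigned by the gene finder and at the same time be consistent with the overlap constraints.

COP formulation

Let the scores of $p_1, \ldots, p_n$ be $s_1, \ldots, s_n$ with $s_i \in \mathbb{R}^+$. Maximize $\sum_{i=1}^{n} s_i$, where the sum is taken over the predictions kept in the final set ($x_i = \mathit{true}$), subject to $C$. Maximizing this total score is equivalent to minimizing its negation as a cost.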
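For a handful of predictions the COP can be solved by plain enumeration, which makes the objective concrete. The sketch below is illustrative only: `best_subset`, the `(left, right)` tuples, the scores, and the single constraint "no two kept predictions overlap by more than `max_overlap` bases" are all assumptions chosen for the example, and exhaustive search is of course not the method used in the talk.

```python
from itertools import product


def overlap_len(a, b):
    """Overlap in bases between two (left, right) intervals, inclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def best_subset(preds, scores, max_overlap):
    """Try every boolean assignment; return (best total score, kept indices)."""
    n = len(preds)
    best = (0.0, ())
    for assignment in product([False, True], repeat=n):
        kept = [i for i in range(n) if assignment[i]]
        consistent = all(
            overlap_len(preds[i], preds[j]) <= max_overlap
            for k, i in enumerate(kept)
            for j in kept[k + 1:]
        )
        total = sum(scores[i] for i in kept)
        if consistent and total > best[0]:
            best = (total, tuple(kept))
    return best


preds = [(1, 300), (200, 600), (550, 900)]
scores = [0.6, 0.9, 0.7]
total, kept = best_subset(preds, scores, max_overlap=50)
print(kept)  # -> (0, 2)
```

Here the middle prediction has the highest individual score but conflicts with both neighbours, so a strategy that keeps it first ends with a total of 0.9, while the global optimum keeps the two outer predictions.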

Page 16: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Variable ordering

Assume an ordering on the variables,

Initial set predictions $p_1, \ldots, p_n$ are sorted by the position of their left-most base, such that $\forall p_i, p_j: i < j \Rightarrow \text{left-most}(p_i) \le \text{left-most}(p_j)$.

The variables x1 . . . xn of the CSP/COP are given the same ordering.

Page 17: Constraints and Global Optimization for Gene Prediction Overlap Resolution

COP implementation with Markov Chain (1)

We propose to use a (constrained) Markov chain for the COP.

The Markov chain has a begin state, an end state and two states for each variable $x_i$, corresponding to the two values of its boolean domain $D(x_i)$.

The state corresponding to $x_i = \mathit{true}$ is denoted $\alpha_i$ and the state corresponding to $x_i = \mathit{false}$ is denoted $\beta_i$.

In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP.

Page 18: Constraints and Global Optimization for Gene Prediction Overlap Resolution

From confidence scores to transition probabilities

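$P(\alpha_1 \mid \mathit{begin}) = \sigma_1$

$P(\beta_1 \mid \mathit{begin}) = 1 - \sigma_1$

$P(\mathit{end} \mid \alpha_n) = P(\mathit{end} \mid \beta_n) = 1$

$P(\alpha_i \mid \alpha_{i-1}) = P(\alpha_i \mid \beta_{i-1}) = \sigma_i$

$P(\beta_i \mid \alpha_{i-1}) = P(\beta_i \mid \beta_{i-1}) = 1 - \sigma_i$

$\sigma_i = 0.5 + \lambda + \dfrac{(0.5 - \lambda) \times (s_i - \min(s_1, \ldots, s_n))}{\max(s_1, \ldots, s_n) - \min(s_1, \ldots, s_n)}$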
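The score-to-probability mapping is a min-max rescaling of the confidence scores into the interval from $0.5 + \lambda$ to $1$. A minimal Python sketch follows; the function name `transition_probabilities` and the decision to raise an error when all scores are equal are assumptions of this example, since the slides do not say how that case is handled.

```python
def transition_probabilities(scores, lam):
    """Map confidence scores s_i to sigma_i = P(alpha_i | previous state).

    The lowest score maps to 0.5 + lam, the highest to 1.0.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: the formula is undefined without any score spread.
        raise ValueError("scores must not all be equal")
    return [0.5 + lam + (0.5 - lam) * (s - lo) / (hi - lo) for s in scores]


sigma = transition_probabilities([2.0, 5.0, 8.0], lam=0.1)
print([round(x, 3) for x in sigma])  # -> [0.6, 0.8, 1.0]
```

With $0 < \lambda < 0.5$ every $\sigma_i$ exceeds 0.5, so in the absence of constraints the chain prefers the $\alpha$ state for every prediction.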

Page 19: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Constraining the Markov chain

Constraints are defined on states that are not allowed to occur together in a path.

The CHR rules are similar to those of the local heuristics.

Instead of removing predictions, they define conditions for inconsistency (inconsistency rules).

Inconsistency rules

Inconsistency rules match predictions corresponding to α and β states in the head of the rule. The guard of the rule ensures that the additional criteria for rule application are met, and the body of every rule is failure.

Such rules are necessarily confluent!

Page 20: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Example: Glimmer 3

Delcher, Arthur L.; Bratke, Kirsten A.; Powers, Edwin C.; Salzberg, Steven L. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics, 23, 2007.

Glimmer's inconsistency rule

    % Two accepted predictions may not overlap by more than 110 bases.
    alpha(Left1,Right1), alpha(Left2,Right2) <=>
        overlap_length((Left1,Right1),(Left2,Right2),OverlapLen),
        OverlapLen > 110
        |
        fail.

Page 21: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Inconsistency rules, Genemark and ECOPARSE

Genemark inconsistency rules

    alpha(Left1,Right1), alpha(Left2,Right2) <=>
        Left1 =< Left2, Right1 >= Right2 | fail.

    beta(Left1,Right1), alpha(Left2,Right2) <=>
        Left1 =< Left2, Right1 >= Right2 | fail.

ECOPARSE inconsistency rules

    alpha(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.

    beta(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.

Page 22: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Implementation

Implementation in PRISM

Implemented as a constrained Markov chain in PRISM.

PRISM Markov chain

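    % Each prediction is either accepted (alpha) or rejected (beta).
    values(choose(_), [alpha,beta]).

    % Past the last prediction: the end state has been reached.
    markov_chain(P,_) :-
        last_prediction_id(LPID), P > LPID.

    markov_chain(P,Store1) :-
        last_prediction_id(N), P =< N,
        msw(choose(P), AlphaBeta),           % probabilistic choice of state
        X =.. [ AlphaBeta, P ],              % builds alpha(P) or beta(P)
        check_constraints(X,Store1,Store2),  % fails if X is inconsistent
        NextPrediction is P + 1,
        markov_chain(NextPrediction,Store2).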

Page 23: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Constraint checking and relevant recent states

Maintain a list of recent states ($m_i$) sorted by the right-most position of the corresponding predictions.

Constraints are only checked for predictions corresponding to elements of $m_i$.

In step $i$, we construct $m_i$ as the maximal prefix of $x_i + m_{i-1}$ such that $x_j \in m_i \iff \text{right-most}(p_j) \ge \text{left-most}(p_i)$. If constraint propagation fails, then the PRISM derivation fails and the (partial) path it represents is pruned from the solution space.

The most probable consistent path is found using PRISM's generic adaptation of the Viterbi algorithm for PRISM programs.
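The combination of a recent-state window and a most-probable-path search can be sketched in Python. This is not PRISM's Viterbi routine and not the author's code; it is a memoised search under simplifying assumptions: only a Glimmer-style rule (two accepted predictions may not overlap by more than `max_overlap` bases) is checked, so the window only needs to remember accepted predictions, and the names `best_path`, `preds` and `sigma` are made up for the example.

```python
import math
from functools import lru_cache


def best_path(preds, sigma, max_overlap):
    """Most probable consistent alpha/beta path.

    preds: (left, right) tuples sorted by left-most position.
    sigma: sigma[i] = probability of choosing alpha for prediction i.
    Returns (log probability, tuple of "alpha"/"beta" states).
    """
    n = len(preds)

    def overlap_len(i, j):
        (l1, r1), (l2, r2) = preds[i], preds[j]
        return max(0, min(r1, r2) - max(l1, l2) + 1)

    def log(p):
        return math.log(p) if p > 0 else -math.inf

    @lru_cache(maxsize=None)
    def search(i, window):
        if i == n:
            return 0.0, ()
        # Keep only recent accepted predictions that still reach p_i.
        window = tuple(j for j in window if preds[j][1] >= preds[i][0])
        # beta: rejecting p_i never violates the rule.
        rest, path = search(i + 1, window)
        best = (log(1 - sigma[i]) + rest, ("beta",) + path)
        # alpha: allowed only if consistent with every state in the window.
        if all(overlap_len(i, j) <= max_overlap for j in window):
            rest, path = search(i + 1, window + (i,))
            cand = log(sigma[i]) + rest
            if cand > best[0]:
                best = (cand, ("alpha",) + path)
        return best

    return search(0, ())


preds = [(1, 300), (200, 600), (550, 900)]
sigma = [0.7, 0.9, 0.8]
logp, states = best_path(preds, sigma, max_overlap=50)
print(states)  # -> ('alpha', 'beta', 'alpha')
```

Inconsistent branches are simply never explored, which mirrors how a failing derivation removes a partial path from the solution space.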

Page 24: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Evaluation

Are the constraints realistic assumptions?

Definition

We consider constraints safe when they prune only false positives.

None of the examined constraints is safe with respect to the “known” genes in E. coli (NC_000913).

The Genemark constraint removes three completely overlapped known genes.

Four overlaps longer than 110 bases are potentially removed by the Glimmer 3 constraint.

93 overlaps longer than 15 bases are potentially removed by the ECOPARSE constraints.

Page 25: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Experimental results

Prediction on E. coli using a simplistic codon-frequency-based gene finder.

Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules [1].

    Method                #predictions   Sensitivity   Specificity   Time (seconds)
    initial set           10799          0.7625        0.2926        n/a
    Genemark rules        5823           0.7558        0.5379        1.4
    ECOPARSE rules        4981           0.7148        0.5947        1.7
    global optimization   5222           0.7201        0.5714        75

Sensitivity = fraction of known reference genes predicted.

Specificity = fraction of predicted genes that are correct.

[1] Note that the results for the ECOPARSE heuristic may vary depending on execution strategy; in the results above, predictions with a lower left position are considered first.
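The two measures as defined above can be computed as follows. This is a small sketch with toy data, not the evaluation script behind the table; it assumes a prediction counts as correct only when its `(left, right)` coordinates match a reference gene exactly, a criterion the slides do not spell out.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) as defined on the slide.

    sensitivity = correct predictions / number of reference genes
    specificity = correct predictions / number of predictions
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    return true_positives / len(reference), true_positives / len(predicted)


reference = [(1, 300), (550, 900), (1000, 1400), (1500, 1800)]
predicted = [(1, 300), (550, 900), (200, 600)]
sens, spec = sensitivity_specificity(predicted, reference)
print(sens)  # -> 0.5
```

Note that specificity in this sense is the quantity often called precision.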

Page 26: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Summary and future directions

Summary

We presented

a unified view of overlap resolution using Constraint Handling Rules,

a novel way to post-process gene prediction results based on constrained global optimization

that provides an optimality guarantee with respect to gene finder confidence scores

and has similar accuracy to the heuristic methods.

Page 27: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Future directions

Better modeling of transition probabilities using assumed gene-finder false positive rates.

Soft constraints, since the constraints are unsafe.

Use the technique as a combiner: include predictions from more gene finders in the initial set.

Include and experiment with other constraints.

Page 28: Constraints and Global Optimization for Gene Prediction Overlap Resolution

Questions

Questions?
