Weighing Evidence in the Absence of a Gold Standard
Phil Long
Genome Institute of Singapore
(joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu)
Feb 07, 2016
Problem: Ortholog mapping
• Pair genes in one organism with their equivalent counterparts in another
• Useful for supporting medical research using animal models
A little molecular biology
• DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes
• Regions of DNA, called genes, encode proteins
• Proteins are biochemical workhorses
• Proteins are made up of amino acids
  • also strung together linearly
  • fold up to form a 3D structure
Mutations and evolution
• Speciation often proceeds roughly as follows:
  • one species is separated into two populations
  • the separate populations’ genomes drift apart through mutation
  • important parts (e.g. genes) drift less
• Orthologs have a common evolutionary ancestor
• Genes are sometimes copied:
  • the original retains its function
  • the copy drifts or dies out
• Mutations occur at both fine and coarse grain
Evidence of orthology
• (protein) sequence similarity
• comparison with third organism
• conservation of synteny
…
Conserved synteny
• Neighbor relationships are often preserved
• Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs
Plan
• Identify numerical features corresponding to:
  • sequence similarity
  • common similarity to a third organism
  • conservation of synteny
• “Learn” a mapping from feature values to a prediction
Problem – no “gold standard”
• for mouse-human orthology, the Jackson database is reasonable
• for human-zebrafish? human-pombe?
Another “no gold standard” problem: protein-protein interactions
• Sources of evidence:
• Yeast two-hybrid
• Rosetta Stone
• Phage display
• All yield errors
…
Related Theoretical Work [MV95] – Problem
• Goal:
  • given m training examples generated as below
  • output an accurate classifier h
• Training example generation:
  • all variables are {0,1}-valued
  • Y is chosen randomly, then fixed
  • X1,...,Xn are chosen independently with Pr(Xi = Y) = pi, where pi is
    • unknown
    • the same when Y is 0 or 1 (crucial for the analysis)
  • only X1,...,Xn are given to the training algorithm
Related Theoretical Work [MV95] – Results
• If n ≥ 3, can approach Bayes error (best possible for source) as m gets large
• Idea:
  • a variable is “good” if it often agrees with the others
• can e.g. solve for Pr(X1 = Y) as function of Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3)
• can estimate Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3) from the training data
• can plug in to get estimates of Pr(X1 = Y),...,Pr(Xn = Y)
• can use resulting estimates of Pr(X1 = Y),...,Pr(Xn = Y) to approximate optimal classifier for source
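The solve-and-plug-in step above can be sketched for three variables (a minimal illustration, not [MV95]’s implementation; the function name is hypothetical). Writing qi = 2 Pr(Xi = Y) − 1, independence gives Pr(Xi = Xj) = (1 + qi·qj)/2, so the observable pairwise agreement rates determine the products qi·qj, and hence each qi:

```python
import math

def estimate_accuracies(x1, x2, x3):
    """Estimate Pr(Xi = Y) for three noisy {0,1} variables from their
    pairwise agreement rates alone, without ever seeing Y.
    Uses Pr(Xi = Xj) = (1 + q_i q_j)/2 with q_i = 2 Pr(Xi = Y) - 1.
    Assumes each q_i > 0, i.e. every variable beats random guessing."""
    m = len(x1)
    def agree(a, b):
        return sum(ai == bi for ai, bi in zip(a, b)) / m
    c12 = 2 * agree(x1, x2) - 1   # estimates q1 * q2
    c13 = 2 * agree(x1, x3) - 1   # estimates q1 * q3
    c23 = 2 * agree(x2, x3) - 1   # estimates q2 * q3
    q1 = math.sqrt(c12 * c13 / c23)
    q2 = c12 / q1
    q3 = c13 / q1
    return [(1 + q) / 2 for q in (q1, q2, q3)]
```

With more than three variables, the same identity can be solved for every pair and the estimates averaged, which is what makes the n ≥ 3 condition on the slide above essential.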
In our problem(s)...
• Pr(Y = 1) small
• X1,...,Xn continuous-valued
• Reasonable to assume X1,...,Xn conditionally independent given Y
• Reasonable to assume Pr(Y = 1 | Xi = x) increasing in x, for all i
• Sufficient to sort training examples in order of associated conditional probabilities that Y = 1
Key Idea
• Suppose Pr(Y = 1) is known
• For each variable i:
  • set the threshold so that Pr(Ui = 1) = Pr(Y = 1)
  • then Pr(Y = 1 and Ui = 0) = Pr(Y = 0 and Ui = 1)
• Can solve for these error probabilities for all i in terms of the probabilities that the Ui’s agree, ...
[Figure: examples sorted by Xi, shown as −’s and +’s; the threshold splits them into a Ui = 0 region (left) and a Ui = 1 region (right)]
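The thresholding step can be sketched as follows (a minimal illustration; the function name is hypothetical): pick θi so that the fraction of examples with Ui = 1 matches the assumed base rate Pr(Y = 1), which forces the two error masses to be equal.

```python
def binarize_at_base_rate(xs, prior):
    """Threshold the scores xs so that roughly a `prior` fraction of
    examples gets U = 1, matching the assumed Pr(Y = 1).
    With matching rates, Pr(Y = 1, U = 0) = Pr(Y = 0, U = 1):
    every positive pushed below the threshold pushes some
    negative above it."""
    k = max(1, round(prior * len(xs)))        # how many examples get U = 1
    theta = sorted(xs, reverse=True)[k - 1]   # the k-th largest score
    return [1 if x >= theta else 0 for x in xs], theta
```

Ties at the threshold can make the U = 1 fraction slightly exceed `prior`; a sketch rather than a production quantile routine.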
Final Plan (informal)
• Assume various values of Pr(Y = 1); predict orthologs given each
• For pairs of genes predicted to be orthologs even when Pr(Y = 1) assumed small, confidently predict orthology
• For pairs of genes predicted to be orthologs only when Pr(Y = 1) assumed pretty big, predict orthology more tentatively
Final Plan – Probabilistic Viewpoint
• Consider a hidden variable Z:
  • takes values uniformly distributed in [0,1]
  • interpretation: how “obviously orthologous”
• Assumptions:
  • Pr(Y = 1 | Z = z) is increasing in z
  • for all z, Pr(Z ≥ z | Xi = x) is increasing in x
• For various z:
  • let Vz = 1 if Z ≥ z, Vz = 0 otherwise
  • let Uz,i = 1 if Xi ≥ θz,i, Uz,i = 0 otherwise, where θz,i is chosen so that Pr(Uz,i = 1) = Pr(Vz = 1)
• Interpretations:
  • Vz is “in the top 100(1−z)% most likely to have Y = 1 overall”
  • Uz,i is “in the top 100(1−z)% most likely to have Y = 1 given Xi”
Final Plan - Algorithm
• Estimate the conditional probability that Vz = 1, i.e. that Z ≥ z, for each training example, using the estimated probabilities that pairs of Uz,i’s agree
• Sum these over z to estimate each example’s Z; sort by the estimates.
Practical problem
• Small errors in the estimates of Pr(Uz,i = Uz,j) can lead to large errors in the estimates of Pr(Uz,i = Vz) (in fact, the program crashes)
• Solution:
  • the case where Pr(Vz = 1) is small is the important one (confident predictions)
  • in that regime we can approximate: Pr(Uz,i ≠ Vz) ≈ ½ (Pr(Uz,i ≠ Uz,j) + Pr(Uz,i ≠ Uz,k) − Pr(Uz,j ≠ Uz,k))
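A minimal sketch of this approximation (the function name is hypothetical): when errors are rare and rarely coincide, Pr(Uz,i ≠ Uz,j) ≈ Pr(Uz,i ≠ Vz) + Pr(Uz,j ≠ Vz), so the three pairwise disagreement rates can be combined to isolate the error rate of a single variable:

```python
def approx_error_rate(ui, uj, uk):
    """Approximate Pr(U_i != V) from pairwise disagreement rates alone,
    assuming errors are rare and rarely overlap, so that
    Pr(U_i != U_j) ~ Pr(U_i != V) + Pr(U_j != V).
    Then 1/2 (d_ij + d_ik - d_jk) ~ Pr(U_i != V)."""
    m = len(ui)
    def dis(a, b):
        return sum(x != y for x, y in zip(a, b)) / m
    return 0.5 * (dis(ui, uj) + dis(ui, uk) - dis(uj, uk))
```

Unlike the exact solve, this never divides by a small estimated quantity, which is what avoids the blow-ups (and crashes) mentioned above.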
Evaluation: Artificial Source
• Examples are generated using a randomly chosen probability distribution:
  • Pr(Y = 1) = 0.1, n = 5
  • for each i:
    • choose μi uniformly from [min μ, max μ]
    • set the distributions for the ith variable: Xi | Y = 0 ~ N(−μi, 1), Xi | Y = 1 ~ N(μi, 1)
• Evaluate using the area under the ROC curve
• Repeat 100 times and average
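The artificial source can be sketched as follows (a minimal illustration with a hypothetical function name):

```python
import random

def artificial_example(mus):
    """One example from the artificial source: Pr(Y = 1) = 0.1, and,
    given Y, each X_i ~ Normal(mu_i, 1) if Y = 1, Normal(-mu_i, 1) if Y = 0,
    independently across i."""
    y = 1 if random.random() < 0.1 else 0
    xs = [random.gauss(mu if y == 1 else -mu, 1.0) for mu in mus]
    return xs, y
```

Smaller μi means more overlap between the two class-conditional distributions, which is how the rows of the results table below vary the difficulty of the source.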
ROC curve
[Figure: true positives vs. false positives, both axes running from 0 to 1; the shaded region under the curve is the area under the ROC curve]
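The evaluation metric can be computed directly from scores and labels: the AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half (a minimal O(|pos|·|neg|) sketch; the function name is hypothetical):

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    example is scored higher than a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This rank-based view is why sorting examples by Pr(Y = 1 | X), as on the earlier slide, is sufficient: AUC depends only on the ordering, not on calibrated probabilities.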
Results: Artificial Source
m     min μ  max μ  peer AUC  opt AUC (w/ Y’s)
1000  0.2    1.0    .940      .985
1000  0.1    0.5    .811      .881
1000  0.05   0.25   .635      .818
1000  0.02   0.1    .611      .753
Evaluation: mouse-human ortholog mapping
• Use the Jackson mouse-human ortholog database as a “gold standard”
• Apply the algorithm, post-processing to map each gene to a unique ortholog
• Compare with an analogous BLAST-only algorithm
• Plot the ROC curve
• Treat anything not in the database as a non-ortholog
  • some “false positives” are in fact correct
  • so the error rate is overestimated
Results: mouse-human ortholog mapping
Open problems
• Given our assumptions, is there an algorithm for learning using random examples that always approaches the optimal AUC given knowledge of the source?
• Is discretizing the independent variables necessary?
• How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)