Link Discovery Tutorial
Part II: Accuracy
Axel-Cyrille Ngonga Ngomo (1), Irini Fundulaki (2), Mohamed Ahmed Sherif (1)
(1) Institute for Applied Informatics, Germany
(2) FORTH, Greece
October 18th, 2016
Kobe, Japan
Ngonga Ngomo et al. (InfAI & FORTH) LD Tutorial: Accuracy October 17, 2016 1 / 54
Table of Contents
1 Introduction
2 Raven
3 Eagle
4 Coala
5 Summary and Conclusion
Introduction: Link Discovery as Classification Task
Definition (Declarative Link Discovery)
  Given sets S and T of resources and a relation R
  Find M = {(s, t) ∈ S × T : R(s, t)}
  Here, find M′ = {(s, t) ∈ S × T : δ(s, t) ≥ τ}
Definition (Classification perspective)
  Given sets S and T of resources and a relation R
  Find M = {(s, t) ∈ S × T : C(s, t) = +1}
  Here, C(s, t) = +1 ↔ σ(s, t) ≥ θ
Classical machine learning problem [Ngo+11; NL12]
Dedicated techniques perform better
Supervised, active and unsupervised techniques possible
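The classification perspective can be illustrated with a minimal Python sketch. It assumes a trigram-based similarity as σ and toy resource labels; the names `trigram_similarity` and `discover_links` are hypothetical and do not come from the tutorial.

```python
from itertools import product


def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, a value in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def discover_links(source, target, sigma, theta):
    """Return {(s, t) in S x T : sigma(s, t) >= theta} by brute force."""
    return {(s, t) for s, t in product(source, target) if sigma(s, t) >= theta}


# Toy data (assumption): resources are represented by their labels only.
S = ["Aspirin", "Ibuprofen"]
T = ["aspirin", "Paracetamol"]
links = discover_links(S, T, trigram_similarity, theta=0.9)
```

The brute-force loop over S × T is only for illustration; efficient execution of link specifications is the topic of the previous part of the tutorial.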
Introduction: Challenge
Challenges
1 Creation of labeled training data is tedious
2 Need automated means for class and property matching
3 Need for efficient execution of link specifications
4 Dedicated machine learning approaches necessary
Solutions
1 Use an active learning approach for link discovery
2 Rely on the hospital/resident algorithm
3 See previous section
4 Topic of this section
RAVEN: Approach
Definition (Classification perspective)
  Given sets S and T of resources and a relation R
  Find M = {(s, t) ∈ S × T : C(s, t) = +1}
  Here, C(s, t) = +1 ↔ σ(s, t) ≥ θ
Learning the classifier C involves learning
1 two sets of restrictions that specify the sets S and T, respectively,
2 the components σ1, ..., σn of a complex similarity measure σ, and
3 a set of thresholds θ1, ..., θn for σ1, ..., σn
Assumptions
  Restrictions are class restrictions
  Classifier shape is given (e.g., linear combination)
RAVEN: Approach
Class and Property Restrictions
  Define class similarity function
  Solve corresponding hospital-resident problem
  Based on extension of stable marriage problem
RAVEN: Approach
Class Restrictions
  Similarity function
    String similarity
    Number of shared property values amongst instances
    ...
  Solve corresponding hospital-resident problem

Source    Target     S            T
Drugbank  Diseasome  Targets      Genes
Sider     Diseasome  Side-Effect  Diseases
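As an illustration of the hospital-resident step, here is a minimal sketch of resident-proposing deferred acceptance, with source classes as residents and target classes as hospitals. The similarity scores, the class names other than those in the table above, and the capacity of 1 are made-up assumptions; this is a generic textbook procedure, not necessarily the exact variant used in RAVEN.

```python
def hospital_resident(resident_prefs, hospital_prefs, capacity):
    """Resident-proposing deferred acceptance; returns {hospital: [residents]}."""
    rank = {h: {r: i for i, r in enumerate(prefs)}
            for h, prefs in hospital_prefs.items()}
    next_choice = {r: 0 for r in resident_prefs}
    assigned = {h: [] for h in hospital_prefs}
    free = list(resident_prefs)
    while free:
        r = free.pop()
        prefs = resident_prefs[r]
        if next_choice[r] >= len(prefs):
            continue  # r has exhausted its list and stays unmatched
        h = prefs[next_choice[r]]
        next_choice[r] += 1
        if r not in rank[h]:
            free.append(r)  # h does not list r; r tries its next choice
            continue
        assigned[h].append(r)
        if len(assigned[h]) > capacity[h]:
            # Over capacity: reject the resident that h ranks lowest.
            worst = max(assigned[h], key=lambda x: rank[h][x])
            assigned[h].remove(worst)
            free.append(worst)
    return assigned


# Hypothetical class similarities between source and target classes.
sim = {("Targets", "Genes"): 0.8, ("Targets", "Diseases"): 0.1,
       ("Drugs", "Genes"): 0.3, ("Drugs", "Diseases"): 0.2}
sources, targets = ["Targets", "Drugs"], ["Genes", "Diseases"]

# Preference lists are obtained by sorting on decreasing similarity.
resident_prefs = {s: sorted(targets, key=lambda t: -sim[(s, t)]) for s in sources}
hospital_prefs = {t: sorted(sources, key=lambda s: -sim[(s, t)]) for t in targets}
matching = hospital_resident(resident_prefs, hospital_prefs,
                             capacity={t: 1 for t in targets})
```

With these scores, both source classes prefer Genes, and Genes keeps Targets because it ranks it higher.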
RAVEN: Approach
Learning Threshold
  Active perceptron learning
  Begin with educated guess, e.g., θi = 0.9
  Update thresholds based on most informative examples
Steps
1 Guess initial classifier
2 Pick most informative examples, i.e., unclassified and closest to boundary
3 Ask for classification from oracle
4 Update classifier
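The four steps can be sketched as a small active learning loop. This is a generic perceptron update under stated assumptions (a linear classifier w·x − θ over similarity vectors, a callable `oracle` returning +1 or −1); RAVEN's exact update rule may differ, and all function names are hypothetical.

```python
def score(weights, theta, x):
    """Signed distance-like score: positive side means 'link'."""
    return sum(w * xi for w, xi in zip(weights, x)) - theta


def predict(weights, theta, x):
    return 1 if score(weights, theta, x) >= 0 else -1


def most_informative(weights, theta, unlabeled, k):
    """Pick the k unlabeled examples closest to the decision boundary."""
    return sorted(unlabeled, key=lambda x: abs(score(weights, theta, x)))[:k]


def active_learning(weights, theta, candidates, oracle, eta=0.02, k=10, iterations=5):
    unlabeled = list(candidates)
    for _ in range(iterations):
        queries = most_informative(weights, theta, unlabeled, k)
        if not queries:
            break
        for x in queries:
            unlabeled.remove(x)
            y = oracle(x)  # ask the oracle for the true class
            if y != predict(weights, theta, x):
                # Perceptron update on a misclassified example.
                weights = [w + eta * y * xi for w, xi in zip(weights, x)]
                theta -= eta * y
    return weights, theta


# Toy run: one similarity dimension, initial guess theta = 0.9.
candidates = [(0.88,), (0.2,), (0.99,)]
oracle = lambda x: 1 if x[0] >= 0.8 else -1
new_weights, new_theta = active_learning([1.0], 0.9, candidates, oracle,
                                         eta=0.02, k=1, iterations=1)
```

In the toy run, the example 0.88 lies closest to the boundary, is labeled positive by the oracle, and pulls the threshold down.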
RAVEN: Evaluation
Evaluation on Diseases (Diseasome to DBpedia)
  Learning rate = 0.02
  10 questions/iteration
  F-measure of up to 92%
[Figure: P (%), R (%) and F (%) on a 0 to 100 axis against the number of iterations (1 to 25)]
RAVEN: Evaluation
  Learning rate = 0.02
  10 questions/iteration
[Figure: runtime (ms; axis ticks at 10, 100, 1000) against the number of iterations (1 to 25) for Diseases, Drugs and Side Effects]
Eagle: Efficient Active Learning of Link Specifications using Genetic Programming
  Provides means for automatic class and property matching
  Minimizes human labeling effort through active learning
  Allows learning generic specifications (a limitation of RAVEN)
  Similar approaches: [NIK+12; ISE+12]
Eagle: Formal Definition
Same formal setting as RAVEN
  Two sets of restrictions that specify the sets S and T, respectively,
  a specification of mapping properties (p1, q1), ..., (pn, qn) for the elements of S and T, and
  a specification of a complex similarity measure σ as the combination of several atomic similarity measures σ1, ..., σn and of a set of thresholds θ1, ..., θn such that θi is the threshold for σi.
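Eagle: LS example
Can learn generic classifier type
Atomic specifications in the example:
  f(levenshtein(:title, :title), 0.53)
  f(cosine(:venue, :year), 1.00)
  f(jaccard(:title, :authors), 0.43)
  f(trigrams(:title, :year), 1.00)
[Figure: specification tree combining these atomic specifications with the operators ⊓, ⊔ and \]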
Eagle: Idea & Goal
  Idea: Specifications are trees
  Goal: Learn elements of trees through genetic operations until best LS is found
[Figure: example specification tree with root ⊓ and children (m4, θ4) over (p3, q3) and (m2, θ2) over (p2, q2)]
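To make the tree idea concrete, the following sketch models a link specification as a small tree whose leaves are atomic specifications (measure, property pair, threshold) and whose inner nodes are the operators ⊓, ⊔ and \. The class names `Atomic` and `Combined`, the two toy measures and the sample resources are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Atomic:
    """Leaf: accept (s, t) if measure(s[p], t[q]) >= theta."""
    measure: Callable[[str, str], float]
    p: str
    q: str
    theta: float

    def accepts(self, s, t):
        return self.measure(s[self.p], t[self.q]) >= self.theta


@dataclass(frozen=True)
class Combined:
    """Inner node: 'and' (⊓), 'or' (⊔) or 'minus' (\\) of two sub-specifications."""
    op: str
    left: "Spec"
    right: "Spec"

    def accepts(self, s, t):
        l, r = self.left.accepts(s, t), self.right.accepts(s, t)
        if self.op == "and":
            return l and r
        if self.op == "or":
            return l or r
        if self.op == "minus":
            return l and not r
        raise ValueError(f"unknown operator: {self.op}")


Spec = Union[Atomic, Combined]


def exact(a, b):
    return 1.0 if a == b else 0.0


def token_jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


spec = Combined("and",
                Atomic(token_jaccard, "title", "title", 0.5),
                Atomic(exact, "year", "year", 1.0))
s = {"title": "Link Discovery Tutorial", "year": "2016"}
t_same = {"title": "link discovery tutorial part II", "year": "2016"}
t_other = {"title": "link discovery tutorial part II", "year": "2015"}
```

Here `spec.accepts(s, t_same)` holds because both leaves are satisfied, while `t_other` fails the year leaf.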
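Eagle Algorithm: Step 1: Generate initial population
  Random process (property pairs, thresholds)
  Compute fitness
  Fitness = F-measure w.r.t. known data
[Figure: initial population of four individuals: (m1, θ1) over (p1, q1); (m2, θ2) over (p2, q2); (m3, θ3) over (p3, q3); and a tree with root ⊓ combining (m4, θ4) over (p3, q3) with (m5, θ5) over (p2, q2)]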
Eagle Algorithm: Step 2: Evolve population
  Tournament between two individuals
  Two operators: mutation and crossover
[Figure: the population across successive builds. Mutation: the threshold of (m1, θ1) becomes θ1 + α. Crossover: in the tree with root ⊓, the subtree (m5, θ5) is replaced by (m2, θ2) over (p3, q3) taken from another individual]
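The two operators and the tournament can be sketched on a deliberately simple encoding: an atomic specification is a tuple (measure, p, q, θ) and an operator node is a tuple (op, left, right). The encoding, the function names and the concrete numbers are assumptions; a real genetic programming setup would pick mutation points and crossover partners at random.

```python
def mutate_threshold(atom, alpha):
    """Mutation: shift the threshold by alpha, clamped to [0, 1]."""
    measure, p, q, theta = atom
    return (measure, p, q, min(1.0, max(0.0, theta + alpha)))


def crossover(parent, donor_subtree, side):
    """Crossover: replace the left or right child of an operator node."""
    op, left, right = parent
    if side == "left":
        return (op, donor_subtree, right)
    return (op, left, donor_subtree)


def tournament(a, b, fitness):
    """Tournament between two individuals: the fitter one wins."""
    return a if fitness(a) >= fitness(b) else b


# Individuals named after the figure; the threshold values are made up.
ind1 = ("m1", "p1", "q1", 0.5)
ind2 = ("m2", "p3", "q3", 0.4)
tree = ("and", ("m4", "p3", "q3", 0.7), ("m5", "p2", "q2", 0.8))

mutated = mutate_threshold(ind1, alpha=0.25)      # (m1, θ1 + α)
child = crossover(tree, ind2, side="right")       # (m5, θ5) replaced by (m2, θ2)

# Hypothetical fitness values for the tournament.
scores = {mutated: 0.7, ind2: 0.6}
winner = tournament(mutated, ind2, scores.get)
```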
Eagle Algorithm: Step 3: Computation of most informative links
  Previous approaches define the amount of information of a link as its closeness to the decision boundary
  Here, use disagreement amongst the elements of a population of size n
    δ((s, t)) = (n − |{M_i^t : (s, t) ∈ M_i^t}|) (n − |{M_i^t : (s, t) ∉ M_i^t}|)
  Function is maximal when n/2 individuals count (s, t) as positive and n/2 as negative
  Can be modeled with other functions such as entropy
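The disagreement function can be computed directly from the mappings produced by the n individuals of the population. The sketch below assumes each mapping is a Python set of (s, t) pairs; the helper names and the toy mappings are hypothetical.

```python
def disagreement(link, mappings):
    """delta(link) = (n - #mappings containing link) * (n - #mappings not containing it)."""
    n = len(mappings)
    positive = sum(1 for m in mappings if link in m)
    negative = n - positive
    return (n - positive) * (n - negative)


def k_most_informative(mappings, k):
    """Rank all links returned by any individual by decreasing disagreement."""
    candidates = set().union(*mappings)
    return sorted(candidates, key=lambda l: (-disagreement(l, mappings), l))[:k]


# Four individuals (n = 4) and the links each of them returns.
mappings = [
    {("s1", "t1"), ("s2", "t2")},
    {("s1", "t1"), ("s2", "t2"), ("s3", "t3")},
    {("s1", "t1")},
    {("s1", "t1")},
]
top = k_most_informative(mappings, k=1)
```

The link (s2, t2) is returned by exactly half of the individuals and therefore gets the maximal value n/2 · n/2 = 4, whereas (s1, t1), on which all individuals agree, gets 0.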
Eagle Algorithm: Step 4: Active Learning
  Compute δ((s, t)) for all (s, t) returned by an LS
  Pick the k most informative links
  Ask the user to label them
  Update the lists of positive and negative examples
[Figure: population after mutation and crossover: (m1, θ1 + α), (m2, θ2), (m3, θ3) and the tree with root ⊓ combining (m4, θ4) and (m2, θ2)]
Eagle Algorithm: Step 5: Remove least fit elements
  Fitness = F-measure w.r.t. known data
[Figure: population after mutation and crossover: (m1, θ1 + α), (m2, θ2), (m3, θ3) and the tree with root ⊓ combining (m4, θ4) and (m2, θ2)]
  If termination conditions not met, go to Step 2
  Else terminate and pick fittest LS
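A minimal sketch of the fitness computation and the removal of the least fit elements, assuming that "known data" means the positive and negative examples labeled so far and that each individual is represented by the set of links it returns. The function names and the toy population are hypothetical.

```python
def f_measure(predicted, positives, negatives):
    """F-measure of a mapping, evaluated only on the labeled examples."""
    tp = len(predicted & positives)   # labeled positive and returned
    fp = len(predicted & negatives)   # labeled negative but returned
    fn = len(positives - predicted)   # labeled positive but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def remove_least_fit(population, positives, negatives, keep):
    """Keep the `keep` individuals with the highest fitness."""
    ranked = sorted(population.items(),
                    key=lambda item: f_measure(item[1], positives, negatives),
                    reverse=True)
    return dict(ranked[:keep])


positives = {"a", "b", "d"}
negatives = {"c"}
population = {
    "ls1": {"a", "b", "c"},   # P = 2/3, R = 2/3
    "ls2": {"a", "b", "d"},   # P = 1,   R = 1
    "ls3": {"c"},             # no true positives
}
survivors = remove_least_fit(population, positives, negatives, keep=2)
```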
Eagle Algorithm: Unsupervised learning
  Measure degree of monogamy of links [NIK+12]
  Only works for 1-1 relations, e.g., owl:sameAs
Experimental setup
  Compared different population sizes (20, 100)
  Compared random annotation with active learning
  Mutation and crossover rates = 0.6
  Maximal number of iterations = 50
Eagle: Experiments and Results
[Figures: results for Dailymed-Drugbank, DBpedia-LinkedMDB and DBLP-ACM]
Larger population leads to
  better results, yet
  longer runtimes
A population size of 100 seems sufficient for most linked datasets
EAGLE is more time-efficient than the state of the art
  337 s for ACM-DBLP (n = 100) vs.
  1553 s for Marlin (ADTree)
  2196 s for Marlin (SVM)
  4320 s for Febrl (SVM)
Active learning clearly outperforms random annotation
Coala: Correlation-Aware Active Learning of Link Specifications
Coala Evaluation: Experimental Setup
Used EAGLE as active learning approach
  Mutation and crossover rate = 0.6
  Selection rate = 0.7
  Not deterministic ⇒ ran each experiment 5 times
  5 queries to oracle per iteration
  10 iterations overall
  2 population sizes: 20 and 100
  50 generations between iterations
Two real-world and three synthetic datasets
Single thread of a server (JDK 1.7, Ubuntu 10.0.4, AMD Opteron 2 GHz, 2 GB/experiment)
Coala Evaluation: Parameters for WD
Ran experiments on DBLP-ACM
Population = 20
r ∈ {2, 4, 8, 16, 32}