Link Discovery Tutorial Part II: Accuracy

Axel-Cyrille Ngonga Ngomo(1), Irini Fundulaki(2), Mohamed Ahmed Sherif(1)

(1) Institute for Applied Informatics, Germany
(2) FORTH, Greece

October 18th, 2016, Kobe, Japan

Ngonga Ngomo et al. (InfAI & FORTH) LD Tutorial: Accuracy October 17, 2016 1 / 54

Table of Contents

1 Introduction

2 Raven

3 Eagle

4 Coala

5 Summary and Conclusion





Introduction: Link Discovery as Classification Task

Definition (Declarative Link Discovery)
- Given sets S and T of resources and a relation R
- Find $M = \{(s, t) \in S \times T : R(s, t)\}$
- Here, find $M' = \{(s, t) \in S \times T : \delta(s, t) \geq \tau\}$

Definition (Classification perspective)
- Given sets S and T of resources and a relation R
- Find $M = \{(s, t) \in S \times T : C(s, t) = +1\}$
- Here, $C(s, t) = +1 \leftrightarrow \sigma(s, t) \geq \theta$

Remarks
- Classical machine learning problem [Ngo+11; NL12]
- Dedicated techniques perform better
- Supervised, active and unsupervised techniques possible
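The classification view can be illustrated with a minimal Python sketch. It assumes a single similarity function σ (a character-trigram Jaccard similarity, chosen purely for illustration) and a hand-picked threshold θ. The function names and the sample labels are hypothetical and not part of the tutorial.

```python
from typing import Callable, Iterable, Set, Tuple


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams (illustrative choice of sigma)."""
    def trigrams(text: str) -> Set[str]:
        padded = f"  {text.lower()} "
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def discover_links(
    sources: Iterable[str],
    targets: Iterable[str],
    sigma: Callable[[str, str], float],
    theta: float,
) -> Set[Tuple[str, str]]:
    """Return M = {(s, t) in S x T : sigma(s, t) >= theta}, i.e. C(s, t) = +1."""
    target_list = list(targets)
    return {(s, t) for s in sources for t in target_list if sigma(s, t) >= theta}


# Hypothetical resource labels, for illustration only.
sources = ["Kobe", "Leipzig"]
targets = ["kobe", "Heraklion"]
links = discover_links(sources, targets, trigram_similarity, theta=0.9)
```

This brute-force version compares every pair in S × T. It only shows what the classifier decides, not how a link discovery framework would execute the specification efficiently.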












Introduction: Challenges

Challenges
1 Creation of labeled training data is tedious
2 Need automated means for class and property matching
3 Need for efficient execution of link specifications
4 Dedicated machine learning approaches necessary

Solutions
1 Use active learning approach for link discovery
2 Rely on hospital/resident algorithm
3 See previous section
4 Topic of this section





RAVEN: Approach

Learning a classifier C involves learning
1 two sets of restrictions that specify the sets S and T, respectively,
2 the components σ1, ..., σn of a complex similarity measure σ,
3 a set of thresholds θ1, ..., θn for σ1, ..., σn

Assumptions
- Restrictions are class restrictions
- Classifier shape is given (e.g., linear combination)







RAVEN: Approach

Class and Property Restrictions
- Define class similarity function
- Solve corresponding hospital-resident problem
- Based on extension of stable marriage problem





RAVEN: Approach

Class Restrictions
- Similarity function
  - String similarity
  - Number of shared property values amongst instances
  - ...
- Solve corresponding hospital-resident problem

Source     Target      S              T
Drugbank   Diseasome   Targets        Genes
Sider      Diseasome   Side-Effect    Diseases
DBpedia    Dailymed    Organization   Organization
Sider      Dailymed    Drugs          Offer
Drugbank   DBpedia     Targets        Protein

- Property mapping is similar
- Leads to σ1, ..., σn
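The hospital-resident step can be sketched as a deferred-acceptance procedure. The sketch below is a generic textbook version and not RAVEN's actual implementation. It assumes that source classes act as residents, target classes act as hospitals with a capacity, and both sides rank each other by the same class similarity score. The class names and scores in the example are made up for illustration.

```python
from typing import Dict, List


def hospital_resident(
    sim: Dict[str, Dict[str, float]],
    capacity: Dict[str, int],
) -> Dict[str, List[str]]:
    """Resident-proposing deferred acceptance.

    sim[r][h] is the similarity between source class r and target class h.
    Residents propose in order of decreasing similarity; a hospital that is
    over capacity rejects its least similar resident.
    """
    prefs = {r: sorted(row, key=row.get, reverse=True) for r, row in sim.items()}
    next_choice = {r: 0 for r in sim}
    assigned: Dict[str, List[str]] = {h: [] for h in capacity}
    free = list(sim)
    while free:
        r = free.pop()
        if next_choice[r] >= len(prefs[r]):
            continue  # r was rejected everywhere and stays unmatched
        h = prefs[r][next_choice[r]]
        next_choice[r] += 1
        assigned[h].append(r)
        if len(assigned[h]) > capacity[h]:
            worst = min(assigned[h], key=lambda x: sim[x][h])
            assigned[h].remove(worst)
            free.append(worst)
    return assigned


# Hypothetical class similarities (not taken from the tutorial's datasets).
sim = {
    "Targets": {"Genes": 0.9, "Diseases": 0.2},
    "Drugs": {"Genes": 0.6, "Diseases": 0.5},
    "Enzymes": {"Genes": 0.7, "Diseases": 0.1},
}
matching = hospital_resident(sim, capacity={"Genes": 1, "Diseases": 1})
```

With capacity 1 on both target classes, one source class necessarily remains unmatched in this toy input.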




RAVEN: Approach

Learning Threshold
- Active perceptron learning
- Begin with educated guess, e.g., θi = 0.9
- Update thresholds based on most informative examples

Steps (one figure per step in the original slides)
1 Guess initial classifier
2 Pick most informative examples, i.e., unclassified and closest to boundary
3 Ask for classification from oracle
4 Update classifier
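The four steps can be sketched as a small active-learning loop. This is a simplified illustration, not RAVEN's code: it assumes a linear classifier with one global threshold (RAVEN learns one threshold per atomic measure), a standard perceptron update, and an oracle given as a Python function. All names and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def score(weights: Sequence[float], x: Sequence[float]) -> float:
    """Linear combination of the atomic similarity values of one candidate."""
    return sum(w * xi for w, xi in zip(weights, x))


def active_perceptron(
    candidates: List[Sequence[float]],
    oracle: Callable[[Sequence[float]], int],
    weights: Sequence[float],
    theta: float,
    learning_rate: float = 0.02,
    questions: int = 10,
    iterations: int = 5,
) -> Tuple[List[float], float]:
    weights = list(weights)  # step 1: start from the initial guess
    labeled = set()
    for _ in range(iterations):
        # Step 2: most informative = unlabeled and closest to the boundary.
        unlabeled = [i for i in range(len(candidates)) if i not in labeled]
        unlabeled.sort(key=lambda i: abs(score(weights, candidates[i]) - theta))
        for i in unlabeled[:questions]:
            x = candidates[i]
            truth = oracle(x)  # step 3: the oracle answers +1 or -1
            predicted = 1 if score(weights, x) >= theta else -1
            labeled.add(i)
            if predicted != truth:
                # Step 4: perceptron update on misclassified examples only.
                weights = [w + learning_rate * truth * xi for w, xi in zip(weights, x)]
                theta -= learning_rate * truth
    return weights, theta


# Toy data: one similarity value per candidate; true links have value >= 0.5.
candidates = [(i / 10,) for i in range(11)]
new_weights, new_theta = active_perceptron(
    candidates, lambda x: 1 if x[0] >= 0.5 else -1,
    weights=[1.0], theta=0.9, questions=3, iterations=5,
)
```

Starting from θ = 0.9, the loop lowers the threshold whenever the oracle reveals a missed link, so the boundary moves toward the true cut-off.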


RAVEN: Evaluation

- Evaluation on Diseases (Diseasome to DBpedia)
- Learning rate = 0.02
- 10 questions/iteration
- F-measure of up to 92%

[Figure: P (%), R (%) and F (%) on a 0 to 100 scale over the number of iterations (1 to 25)]


RAVEN: Evaluation

- Learning rate = 0.02
- 10 questions/iteration

[Figure: runtime (ms, axis ticks 10, 100, 1000) over the number of iterations (1 to 25) for Diseases, Drugs and Side Effects]





Eagle: Efficient Active Learning of Link Specifications using Genetic Programming

Eagle
- Provides means for automatic class and property matching
- Minimizes human labeling effort through active learning
- Allows learning generic specifications (a limitation of RAVEN)
- Similar approaches: [NIK+12; ISE+12]


Eagle: Formal Definition

Same formal setting as RAVEN:
- two sets of restrictions that specify the sets S and T, respectively,
- a specification of mapping properties (p1, q1), ..., (pn, qn) for the elements of S and T, and
- a specification of a complex similarity measure σ as the combination of several atomic similarity measures σ1, ..., σn and of a set of thresholds θ1, ..., θn such that θi is the threshold for σi.


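Eagle: LS Example

Can learn generic classifier type

Atomic filters in the example specification:
- f(levenshtein(:title, :title), 0.53)
- f(cosine(:venue, :year), 1.00)
- f(jaccard(:title, :authors), 0.43)
- f(trigrams(:title, :year), 1.00)

[Figure: specification tree combining the four atomic filters with the operators ⊓, ⊔ and \]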


Eagle: Idea & Goal

- Idea: Specifications are trees
- Goal: Learn elements of trees through genetic operations until best LS is found

[Figure: example tree with root ⊓ and children (m4, θ4) over (p3, q3) and (m2, θ2) over (p2, q2)]


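Eagle Algorithm, Step 1: Generate initial population

- Random process (property pairs, thresholds)
- Compute fitness
- Fitness = F-measure w.r.t. known data

[Figure: initial population of four specification trees]
- (m1, θ1) over (p1, q1)
- (m2, θ2) over (p2, q2)
- (m3, θ3) over (p3, q3)
- ⊓ of (m4, θ4) over (p3, q3) and (m5, θ5) over (p2, q2)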


Eagle Algorithm, Step 2: Evolve population

- Tournament between two individuals
- Two operators: mutation and crossover

[Figure: population before and after the genetic operators]
- Mutation: (m1, θ1) over (p1, q1) becomes (m1, θ1 + α)
- Crossover: in the tree ⊓ of (m4, θ4) and (m5, θ5), the subtree (m5, θ5) over (p2, q2) is replaced by (m2, θ2) over (p3, q3)
- Unchanged: (m3, θ3) over (p2, q2) and (m2, θ2) over (p3, q3)
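The two operators can be sketched on a small tree data structure. This is a minimal illustration under stated assumptions: specifications are immutable trees, mutation shifts a threshold by α and clamps it to [0, 1], and crossover swaps in a donor subtree. The class and function names are hypothetical, and the tournament selection that chooses which individuals are evolved is omitted.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Atomic:
    """Leaf of a link specification: measure m over properties (p, q) with threshold."""
    measure: str
    source_property: str
    target_property: str
    threshold: float


@dataclass(frozen=True)
class Combined:
    """Inner node combining two sub-specifications with an operator such as ⊓."""
    operator: str
    left: "Spec"
    right: "Spec"


Spec = Union[Atomic, Combined]


def mutate_threshold(spec: Atomic, alpha: float) -> Atomic:
    """Mutation: shift the threshold by alpha, clamped to [0, 1]."""
    return replace(spec, threshold=min(1.0, max(0.0, spec.threshold + alpha)))


def crossover_right(parent: Combined, donor: Spec) -> Combined:
    """Crossover: replace the right subtree of parent by a subtree from a donor."""
    return replace(parent, right=donor)


# Hypothetical individuals mirroring the shape of the figure.
m1 = Atomic("m1", "p1", "q1", threshold=0.5)
m2 = Atomic("m2", "p3", "q3", threshold=0.8)
m4 = Atomic("m4", "p3", "q3", threshold=0.7)
m5 = Atomic("m5", "p2", "q2", threshold=0.6)
tree = Combined("⊓", m4, m5)

mutated = mutate_threshold(m1, alpha=0.25)   # (m1, θ1) becomes (m1, θ1 + α)
child = crossover_right(tree, m2)            # subtree (m5, θ5) replaced by (m2, θ2)
```

Frozen dataclasses keep parents intact, so the same individual can take part in several tournaments without side effects.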


Eagle Algorithm, Step 3: Computation of most informative links

- Previous approaches define the amount of information of a link as its closeness to the decision boundary
- Here, use disagreement amongst the elements of a population of size n:

$$\delta((s, t)) = \left(n - \left|\{M_i^t : (s, t) \in M_i^t\}\right|\right)\left(n - \left|\{M_i^t : (s, t) \notin M_i^t\}\right|\right)$$

- Function is maximal when n/2 individuals count (s, t) as positive and n/2 as negative
- Can be modeled with other functions such as entropy
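The disagreement function is straightforward to compute once every individual of the population has produced a mapping. The following sketch assumes each mapping is a Python set of (s, t) pairs; the function names and the toy mappings are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Link = Tuple[str, str]


def disagreement(link: Link, mappings: Sequence[Set[Link]]) -> int:
    """delta(link) = (n - #mappings containing link) * (n - #mappings not containing it)."""
    n = len(mappings)
    positive = sum(1 for mapping in mappings if link in mapping)
    negative = n - positive
    return (n - positive) * (n - negative)


def most_informative(links: Sequence[Link], mappings: Sequence[Set[Link]], k: int) -> List[Link]:
    """Return the k links on which the population disagrees most."""
    return sorted(links, key=lambda link: disagreement(link, mappings), reverse=True)[:k]


# Toy population of n = 4 mappings.
mappings = [
    {("s1", "t1"), ("s2", "t2")},
    {("s1", "t1")},
    {("s1", "t1"), ("s2", "t2")},
    {("s1", "t1"), ("s3", "t3")},
]
candidates = [("s1", "t1"), ("s2", "t2"), ("s3", "t3")]
top = most_informative(candidates, mappings, k=2)
```

In the toy population, ("s1", "t1") is accepted by all four individuals and scores 0, while ("s2", "t2") splits the population evenly and reaches the maximum n²/4 = 4.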


Eagle Algorithm, Step 4: Active Learning

- Compute δ((s, t)) for all (s, t) returned by an LS
- Pick the k most informative links
- Require labeling from the user
- Update the lists of positive and negative examples

[Figure: current population: (m1, θ1 + α) over (p1, q1); (m2, θ2) over (p3, q3); (m3, θ3) over (p2, q2); ⊓ of (m4, θ4) over (p3, q3) and (m2, θ2) over (p2, q2)]


Eagle Algorithm, Step 5: Remove least fit elements

- Fitness = F-measure w.r.t. known data
- If termination conditions are not met, go to Step 2
- Else terminate and pick the fittest LS

[Figure: population from Step 4, from which the least fit specifications are removed]



Eagle Algorithm: Unsupervised learning

- Measure degree of monogamy of links [NIK+12]
- Only works for 1-1 relations, e.g., owl:sameAs

$$P(M) = \frac{\left|\{s \mid \exists t : (s, t) \in M\}\right|}{\sum_{s} \left|\{t : (s, t) \in M\}\right|}, \qquad R(M) = \frac{\left|\{t \mid \exists s : (s, t) \in M\}\right|}{\sum_{t} \left|\{s : (s, t) \in M\}\right|} \qquad (1)$$

$$F_\beta(M) = (1 + \beta^2)\,\frac{P(M)\,R(M)}{\beta^2 P(M) + R(M)} \qquad (2)$$

[Figure: sources s1, s2, s3 and targets t1, t2, t3, t4 connected by four links]

For the mapping in the figure: P = 3/4, R = 2/4, F = 3/5.
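The pseudo-measures can be checked with a few lines of Python. Because a mapping is a set of links, both sums in the denominators equal the number of links |M|. The sketch uses exact fractions. The concrete link set is an assumption: the edges of the figure are not recoverable from the text, so the example merely picks four links that touch three distinct sources and two distinct targets, which is what P = 3/4 and R = 2/4 require.

```python
from fractions import Fraction
from typing import Set, Tuple

Link = Tuple[str, str]


def pseudo_precision(mapping: Set[Link]) -> Fraction:
    """Distinct linked sources divided by the number of links."""
    if not mapping:
        return Fraction(0)
    return Fraction(len({s for s, _ in mapping}), len(mapping))


def pseudo_recall(mapping: Set[Link]) -> Fraction:
    """Distinct linked targets divided by the number of links."""
    if not mapping:
        return Fraction(0)
    return Fraction(len({t for _, t in mapping}), len(mapping))


def pseudo_f_measure(mapping: Set[Link], beta: int = 1) -> Fraction:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    p, r = pseudo_precision(mapping), pseudo_recall(mapping)
    b2 = Fraction(beta) ** 2
    denominator = b2 * p + r
    return Fraction(0) if denominator == 0 else (1 + b2) * p * r / denominator


# Assumed link set: 4 links, 3 distinct sources, 2 distinct targets.
example = {("s1", "t1"), ("s1", "t2"), ("s2", "t1"), ("s3", "t2")}
p = pseudo_precision(example)
r = pseudo_recall(example)
f = pseudo_f_measure(example)
```

A perfectly one-to-one mapping yields P = R = F = 1, which is why the measure rewards "monogamous" links.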

Eagle: Experiments and Results

Experimental Setup
- Compared batch learning and genetic programming
- Used 3 different data sets
  1 Dailymed-Drugbank (LATC)
  2 DBpedia-LinkedMDB (LATC)
  3 DBLP-ACM
- Compared different population sizes (20, 100)
- Compared random annotation with active learning
- Mutation and crossover rates = 0.6
- Maximal number of iterations = 50

[Figures: result plots for Dailymed-Drugbank, DBpedia-LinkedMDB and DBLP-ACM]

Results
- Larger population leads to
  - better results, yet
  - longer runtimes
- A population size of 100 seems sufficient for most linked data sets
- EAGLE is more time-efficient than the state of the art: 337s for ACM-DBLP (n=100) vs.
  - 1553s for Marlin (ADTree)
  - 2196s for Marlin (SVM)
  - 4320s for Febrl (SVM)
- Active learning clearly outperforms random annotation





Coala: Correlation-Aware Active Learning of Link Specifications

Coala: Learning Complex Specifications

- Supervised (mostly active, e.g., RAVEN, EAGLE, SILK)
- Unsupervised (e.g., KnoFuss, EUCLID, EAGLE)

Insight
- Choice of the right example is key for learning
- So far, only informativeness is used

Question
- Can we do better by using more information?
  - Higher F-measure
  - Often slower


Coala Approach: Basic Idea

Use similarity of link candidates when selecting most informative examples








Coala: Similarity of Candidates

- Link candidate x = (s, t) can be regarded as a vector $(\sigma_1(x), \ldots, \sigma_n(x)) \in [0, 1]^n$.
- Similarity of link candidates x and y:

$$\mathrm{sim}(x, y) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} \left(\sigma_i(x) - \sigma_i(y)\right)^2}} \qquad (3)$$

- Allows exploiting both intra- and inter-class similarity
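Equation (3) maps the Euclidean distance between two candidate vectors into (0, 1]. A minimal sketch, with a hypothetical function name and made-up similarity vectors:

```python
import math
from typing import Sequence


def candidate_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """sim(x, y) = 1 / (1 + Euclidean distance between the similarity vectors)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


# Two link candidates described by n = 2 atomic similarity values each.
x = (1.0, 0.0)
y = (0.4, 0.8)
same = candidate_similarity(x, x)    # identical candidates
apart = candidate_similarity(x, y)   # Euclidean distance is about 1
```

Identical candidates get similarity 1, and the value decreases monotonically as the candidates move apart in the similarity space.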


Coala: Graph Clustering

- Rationale: Use intra-class similarity
- Approach
  - Cluster elements of S+ and S− independently
  - Choose one element per cluster as representative
  - Present oracle with most informative representatives

[Figure: similarity graphs over S+ and S− with nodes labeled a to l and edge weights 0.25, 0.8 and 0.9]


Coala: BorderFlow

- G = (V, E, ω) with V = S+ or V = S−
- ω(x, y) = sim(x, y)
- Keep best ec edges for each x ∈ V
- Seed-based algorithm
- Goal: Maximize the borderflow ratio

$$bf(X) = \frac{\Omega(b(X), X)}{\Omega(b(X), n(X))}$$

[Figures: step-by-step growth of a cluster X on an example graph]

http://sourceforge.net/projects/cugar-framework/
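The borderflow ratio can be sketched as follows. The slides do not define the symbols, so the code relies on assumptions: b(X) is read as the set of border nodes of X (nodes in X with at least one edge leaving X), n(X) as the direct neighbours of X outside X, and Ω(A, B) as the total weight of edges from A to B. The function names and the example graph are hypothetical, and the seed-based search that grows X is not shown.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: edge weight}


def keep_best_edges(graph: Graph, ec: int) -> Graph:
    """Keep only the ec highest-weighted outgoing edges of every node."""
    return {
        x: dict(sorted(nbrs.items(), key=lambda kv: kv[1], reverse=True)[:ec])
        for x, nbrs in graph.items()
    }


def flow(graph: Graph, from_nodes: Set[str], to_nodes: Set[str]) -> float:
    """Omega(A, B): total weight of the edges leading from A into B."""
    return sum(
        w for x in from_nodes for y, w in graph.get(x, {}).items() if y in to_nodes
    )


def borderflow_ratio(graph: Graph, cluster: Set[str]) -> float:
    """bf(X) = Omega(b(X), X) / Omega(b(X), n(X)); infinite if nothing flows out."""
    border = {x for x in cluster if any(y not in cluster for y in graph.get(x, {}))}
    neighbours = {y for x in cluster for y in graph.get(x, {}) if y not in cluster}
    outward = flow(graph, border, neighbours)
    inward = flow(graph, border, cluster)
    return math.inf if outward == 0 else inward / outward


# Hypothetical symmetric similarity graph.
graph: Graph = {
    "a": {"b": 0.8, "c": 0.9},
    "b": {"a": 0.8, "c": 0.8},
    "c": {"a": 0.9, "b": 0.8, "d": 0.25},
    "d": {"c": 0.25},
}
ratio = borderflow_ratio(graph, {"a", "b", "c"})
pruned = keep_best_edges(graph, ec=1)
```

For the cluster {a, b, c}, only c touches the outside, so the ratio compares the weight c sends into the cluster (0.9 + 0.8) with the weight it sends to d (0.25).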


Coala: Spreading Activation

- Rationale: Use both inter- and intra-class similarity
- Approach
  - $M_0$: $m_{ij} = \mathrm{sim}(x_i, x_j)$ with $(x_i, x_j) \in (S^+ \cup S^-)^2$
  - $A_0$: $a_i = \mathrm{ifm}(x_i)$
  - $A_t = A_{t-1} + M_{t-1} A_{t-1}$ (spread activation)
  - $A_t = A_t / \max(A_t)$ (normalize)
  - $M_t = M_{t-1}^{\circ r}$ (weight decay, element-wise power r)

[Figure: example graph over S+ and S− before and after 3 iterations; initial edge weights range from 0.25 to 0.9, and low edge weights decay to values such as 3.9*10^-3 and 1.5*10^-5]
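The update rules can be sketched in plain Python with nested lists. The sketch assumes that the weight decay raises every matrix entry to the power r; this reading is consistent with the figure if r = 2, since 0.5 raised to the power 8 is about 3.9*10^-3 and 0.25 raised to the power 8 is about 1.5*10^-5 after three iterations. The function name and the toy input are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def spread_activation(
    similarity: Sequence[Sequence[float]],
    activation: Sequence[float],
    r: float,
    iterations: int,
) -> Tuple[List[float], Matrix]:
    """Run spreading activation with weight decay and return (A_t, M_t)."""
    m = [list(row) for row in similarity]
    a = list(activation)
    for _ in range(iterations):
        # A_t = A_{t-1} + M_{t-1} A_{t-1}
        a = [ai + sum(mij * aj for mij, aj in zip(row, a)) for ai, row in zip(a, m)]
        # A_t = A_t / max(A_t)
        peak = max(a)
        if peak > 0:
            a = [ai / peak for ai in a]
        # M_t = element-wise power r of M_{t-1} (assumed form of the weight decay)
        m = [[mij ** r for mij in row] for row in m]
    return a, m


# Two candidates with similarity 0.5; initial activations stand in for ifm(x_i).
similarity = [[1.0, 0.5], [0.5, 1.0]]
final_activation, final_matrix = spread_activation(
    similarity, activation=[1.0, 0.5], r=2, iterations=3
)
```

Because weights lie in [0, 1], raising them to a power greater than 1 shrinks weak edges much faster than strong ones, so later iterations spread activation mostly between highly similar candidates.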


Coala Evaluation: Experimental Setup

- Used EAGLE as active learning approach
  - Mutation and crossover rate = 0.6
  - Selection rate = 0.7
  - Not deterministic ⇒ ran each experiment 5 times
  - 5 queries to oracle per iteration
  - 10 iterations overall
  - 2 population sizes: 20 and 100
  - 50 generations between iterations
- Two real-world and three synthetic datasets
- Single thread of a server (JDK1.7, Ubuntu 10.0.4, AMD Opteron 2GHz, 2GB/Experiment)


Coala Evaluation: Parameters for WD (weight decay)

- Ran experiments on DBLP-ACM
- Population = 20
- r ∈ {2, 4, 8, 16, 32}

[Figure: F-score (axis 0.5 to 1) and runtime in seconds (axis 0 to 1,000) over x-axis values 10 to 100; series f(2), f(4), f(8), f(16), f(32) and d(2), d(4), d(8), d(16), d(32)]


Coala Evaluation: Parameters for CL (clustering)

- Ran experiments on DBLP-ACM
- Population = 20
- ec ∈ {1, 2, 3, 4, 5}

[Figure: F-score (axis 0.5 to 1) and runtime in seconds (axis 0 to 2,000) over x-axis values 10 to 100; series f(1), f(2), f(3), f(4), f(5) and d(1), d(2), d(3), d(4), d(5)]


Coala Evaluation: F-Scores

- Population = 100, final values
- Better results, yet unclear when to use WD or CL

DataSet      EAGLE        WD           CL
Abt          0.19±0.04    0.25±0.04    0.23±0.04
DBLP         0.91±0.03    0.96±0.01    0.96±0.02
Person1      0.86±0.02    0.89±0.01    0.81±0.18
Person2      0.74±0.03    0.71±0.08    0.77±0.03
Restaurant   0.89±0.0     0.86±0.02    0.89±0.0





Summary and Conclusion

Large number of challenges to learning accurate specifications
1 Reduce labeling effort ⇒ Active learning
2 Learn complex specifications ⇒ Genetic programming
3 Learn specifications efficiently ⇒ See previous slides

Challenges include
1 Determinism
2 Deep learning
3 Self-checking
4 ...


Acknowledgment

This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).

