The Lazy Traveling Salesman
Memory Management for Large-Scale Link Discovery
Axel-Cyrille Ngonga Ngomo and Mofeed Hassan
University of Leipzig, Institute for Applied Informatics
May 28th, 2016, Crete, Greece
Ngonga Ngomo and Hassan (InfAI), GNOME, November 10, 2016
31+ billion triples
≈ 0.5 billion links
owl:sameAs in most cases
Why is it difficult?
Definition (Link Discovery)
Given sets S and T of resources and relation R
Find M = {(s, t) ∈ S × T : R(s, t)}
Common approach: Find M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ}
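The common approach above can be sketched as a brute-force computation of M′. This is only an illustration of the definition: the function names, the trigram-based Jaccard similarity and the sample city names are assumptions, not part of the slides. It performs |S| · |T| comparisons, which is the quadratic behaviour the following slides set out to avoid.

```python
from itertools import product


def naive_link_discovery(S, T, sigma, theta):
    """Return M' = {(s, t) in S x T : sigma(s, t) >= theta} by brute force."""
    return {(s, t) for s, t in product(S, T) if sigma(s, t) >= theta}


def trigrams(text):
    """Set of character trigrams of a lower-cased string (hypothetical helper)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(max(len(text) - 2, 1))}


def jaccard(a, b):
    """Assumed similarity function: Jaccard index over character trigrams."""
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)


# Hypothetical sample data, chosen only for illustration.
S = ["Leipzig", "Heraklion", "Berlin"]
T = ["leipzig", "Iraklion", "Bern"]

links = naive_link_discovery(S, T, jaccard, theta=0.5)
print(sorted(links))
```

Every pair in S × T is scored here, so the number of similarity computations grows with the product of the two set sizes.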
Why is it difficult?
1 Time complexity
  Large number of triples
  Quadratic a-priori runtime
  69 days for mapping cities from DBpedia to Geonames
  Solutions usually in-memory
  Insufficient memory on commodity hardware
2 Complexity of specifications
  Combination of several attributes required for high precision
  Tedious discovery of most adequate mapping
  Dataset-dependent similarity functions
Problem Statement
Assumptions
  Constant memory C
  |S| + |T| > |C|
Goal
  Devise time-efficient approach to compute M′
  Ensure completeness of results
Solution: Gnome
Divide and Merge
Time-Efficient Link Discovery
Insight: Most approaches rely on divide-and-merge paradigm
Example: HR3
σ(s, t) ≥ θ ⇔ δ(s, t) ≤ ∆
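One way such an equivalence can arise is sketched below. It assumes, for illustration only since the slide does not state it, that the similarity is derived from the distance as σ = 1/(1 + δ).

```latex
\begin{align*}
\sigma(s,t) \ge \theta
  &\iff \frac{1}{1 + \delta(s,t)} \ge \theta \\
  &\iff 1 + \delta(s,t) \le \frac{1}{\theta} \\
  &\iff \delta(s,t) \le \frac{1 - \theta}{\theta} =: \Delta,
  \qquad \theta \in (0, 1].
\end{align*}
```

Under this assumption, a similarity threshold θ = 0.8 corresponds to a distance threshold Δ = 0.25.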
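Divide and Merge
1 Define $\mathcal{S} = \{S_1, \ldots, S_n\}$ with $S_i \subseteq S \wedge \bigcup_i S_i = S$
2 Define $\mathcal{T} = \{T_1, \ldots, T_m\}$ with $T_j \subseteq T \wedge \bigcup_j T_j = T$
3 Find mapping function $\mu : \mathcal{S} \to 2^{\mathcal{T}}$ such that
  elements of each $S_i$ must only be compared with elements of sets in $\mu(S_i)$
  the union of the results over all $S_i \in \mathcal{S}$ is exactly $M'$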
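The three steps can be illustrated with a deliberately simplified one-dimensional case. This is not HR3 itself but a minimal sketch under stated assumptions: resources are plain numbers, δ is the absolute difference, blocks are intervals of width Δ, and the names `divide`, `mu` and `divide_and_merge` are hypothetical.

```python
from collections import defaultdict


def divide(points, delta):
    """Steps 1 and 2: block k holds all points p with floor(p / delta) == k."""
    blocks = defaultdict(list)
    for p in points:
        blocks[int(p // delta)].append(p)
    return blocks


def mu(i, target_blocks):
    """Step 3: source block i only needs the target blocks i-1, i and i+1.

    If |s - t| <= delta, the block indices of s and t differ by at most 1,
    so no matching pair is lost.
    """
    return [j for j in (i - 1, i, i + 1) if j in target_blocks]


def divide_and_merge(S, T, delta):
    """Compare each source block with mu(block) only and merge the results."""
    s_blocks, t_blocks = divide(S, delta), divide(T, delta)
    result = set()
    for i, s_block in s_blocks.items():
        for j in mu(i, t_blocks):
            result |= {(s, t) for s in s_block for t in t_blocks[j]
                       if abs(s - t) <= delta}
    return result


# Hypothetical sample data.
S = [0.2, 1.8, 5.0]
T = [0.9, 2.5, 7.3]
matches = divide_and_merge(S, T, delta=1.0)
print(sorted(matches))
```

The source block containing 5.0 has no neighbouring target block, so it is never compared with anything, while the merged result is still identical to the brute-force result.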
Task Graph
Definition
A task $e_{ij}$ stands for comparing $S_i$ with $T_j \in \mu(S_i)$
Task graph $G = (V, E, w_v, w_e)$, with
  $V = \mathcal{S} \cup \mathcal{T}$
  $w_v(v) = |v|$
  $w_e(e_{ij}) = |S_i||T_j|$
[Figure: example task graph with node weights S1 = 3, S2 = 2, S3 = 4, T1 = 2, T2 = 1, T3 = 3 and edge weights 6, 3, 4, 2, 4, 12, 6]
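The task graph of the example can be sketched as two dictionaries. The set sizes are the node weights shown in the figure. The edge set (the mapping `mu`) is an assumption: it is inferred here from the edge weights, each of which equals a product |Si| · |Tj|, and is not listed explicitly on the slide. The function name `build_task_graph` is hypothetical.

```python
# Set sizes taken from the node weights of the example figure.
sizes = {"S1": 3, "S2": 2, "S3": 4, "T1": 2, "T2": 1, "T3": 3}

# Assumed mapping function mu, inferred from the edge weights in the figure.
mu = {
    "S1": ["T1", "T2"],
    "S2": ["T1", "T2", "T3"],
    "S3": ["T2", "T3"],
}


def build_task_graph(sizes, mu):
    """Return (node_weights, edge_weights) of the task graph."""
    node_weights = dict(sizes)                      # w_v(v) = |v|
    edge_weights = {
        (s, t): sizes[s] * sizes[t]                 # w_e(e_ij) = |Si| * |Tj|
        for s, targets in mu.items()
        for t in targets
    }
    return node_weights, edge_weights


node_w, edge_w = build_task_graph(sizes, mu)
for task, weight in edge_w.items():
    print(task, weight)
```

The edge weight is the number of comparisons a task requires, while the node weights describe how much memory the two sets of a task occupy.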
Problem
Locality maximization
Question: What if V does not fit in memory?
Insight: Main bottleneck is access to hard drive.
Solution:
1 Find groups of nodes that fit in memory and
2 Compute sequence of groups that minimizes hard drive access
Clustering: Naïve Approach
Approach
  Cluster by Si
Example: |C| = 7
[Figure: the example task graph with its tasks grouped by Si; task weights per group in reading order: 6, 3, 4, 2, 6, 4, 12]
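Clustering by Si can be sketched as a simple grouping of tasks by their source set. The task list below is an assumed reconstruction of the example graph (edges inferred from the edge weights), and `footprint` is a hypothetical helper that sums the sizes of the distinct sets a cluster touches so it can be compared with |C|.

```python
from collections import defaultdict

sizes = {"S1": 3, "S2": 2, "S3": 4, "T1": 2, "T2": 1, "T3": 3}

# Assumed tasks (source, target) -> number of comparisons |Si| * |Tj|.
tasks = {
    ("S1", "T1"): 6, ("S1", "T2"): 3,
    ("S2", "T1"): 4, ("S2", "T2"): 2, ("S2", "T3"): 6,
    ("S3", "T2"): 4, ("S3", "T3"): 12,
}


def cluster_by_source(tasks):
    """Naive clustering: one cluster per source set Si."""
    clusters = defaultdict(list)
    for source, target in tasks:
        clusters[source].append((source, target))
    return list(clusters.values())


def footprint(cluster, sizes):
    """Memory needed to hold every set that the cluster's tasks touch."""
    return sum(sizes[v] for v in {v for task in cluster for v in task})


clusters = cluster_by_source(tasks)
weights = [[tasks[e] for e in cluster] for cluster in clusters]
footprints = [footprint(cluster, sizes) for cluster in clusters]
print(weights)
print(footprints)
```

Grouping by source set alone ignores the capacity, so a cluster built this way is not guaranteed to fit into |C|.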
Clustering: Greedy Approach
Greedy Approach
  Start with largest task
  Add connected largest tasks until none fits in C
Example: |C| = 7
[Figure: the example task graph clustered greedily; task weights in reading order: 12, 6, 4, 6, 3, 2, 4]
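The greedy rule can be sketched as follows. This is one possible reading of the two bullet points, not necessarily the authors' implementation: a task counts as connected if it shares a set with the current cluster, a task fits if the summed size of all sets in the cluster stays within the capacity, and every single task is assumed to fit on its own. The task list is an assumed reconstruction of the example graph.

```python
sizes = {"S1": 3, "S2": 2, "S3": 4, "T1": 2, "T2": 1, "T3": 3}

# Assumed tasks (source, target) -> number of comparisons.
tasks = {
    ("S1", "T1"): 6, ("S1", "T2"): 3,
    ("S2", "T1"): 4, ("S2", "T2"): 2, ("S2", "T3"): 6,
    ("S3", "T2"): 4, ("S3", "T3"): 12,
}


def footprint(nodes, sizes):
    """Memory needed to hold the given sets."""
    return sum(sizes[v] for v in nodes)


def greedy_clusters(tasks, sizes, capacity):
    """Seed each cluster with the largest unassigned task, then keep adding
    the largest connected task that still fits into memory."""
    unassigned = dict(tasks)
    clusters = []
    while unassigned:
        seed = max(unassigned, key=unassigned.get)
        del unassigned[seed]
        cluster, nodes = [seed], set(seed)
        while True:
            candidates = [
                e for e in unassigned
                if nodes & set(e)                                   # connected
                and footprint(nodes | set(e), sizes) <= capacity    # fits
            ]
            if not candidates:
                break
            best = max(candidates, key=unassigned.get)
            del unassigned[best]
            cluster.append(best)
            nodes |= set(best)
        clusters.append(cluster)
    return clusters


clusters = greedy_clusters(tasks, sizes, capacity=7)
for cluster in clusters:
    print(cluster)
```

With this reading every cluster respects the capacity by construction, at the price of producing some clusters that contain a single task.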
Scheduling
Output of clustering is sequence $G_1, \ldots, G_N$ of clusters
Intuition: Consecutive clusters should share data
Overlap of two clusters: $o(G_i, G_j) = \sum_{v \in V(G_i) \cap V(G_j)} |v|$
Overlap of a sequence: $o(G_1, \ldots, G_N) = \sum_{i=1}^{N-1} o(G_i, G_{i+1})$
Goal
  Maximize overlap of generated sequence
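The two overlap definitions translate directly into code. In this sketch a cluster is a list of (source, target) tasks, V(G) is the set of sets its tasks touch, and the three clusters are hypothetical examples using the set sizes of the example graph.

```python
sizes = {"S1": 3, "S2": 2, "S3": 4, "T1": 2, "T2": 1, "T3": 3}


def nodes_of(cluster):
    """V(G): all source and target sets touched by the cluster's tasks."""
    return {v for task in cluster for v in task}


def overlap(g_i, g_j, sizes):
    """o(G_i, G_j): summed size of the sets shared by two clusters."""
    return sum(sizes[v] for v in nodes_of(g_i) & nodes_of(g_j))


def sequence_overlap(clusters, sizes):
    """o(G_1, ..., G_N): sum of overlaps of consecutive clusters."""
    return sum(overlap(a, b, sizes) for a, b in zip(clusters, clusters[1:]))


# Hypothetical clusters for illustration.
g1 = [("S1", "T1"), ("S2", "T1")]
g2 = [("S2", "T3"), ("S2", "T2")]
g3 = [("S3", "T3")]

print(sequence_overlap([g1, g2, g3], sizes))
print(sequence_overlap([g1, g3, g2], sizes))
```

In the first order, S2 is shared between g1 and g2 and T3 between g2 and g3, so both can stay in memory across the transition. The second order loses the shared S2, which is why the order of the clusters matters.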
Scheduling
Best-Effort
  Select random pair of clusters
  If permutation improves overlap, then permute
  Relies on local knowledge for scalability
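The best-effort idea can be sketched as a random pairwise swap that is kept only when the total overlap grows. This is a simplified illustration with assumed names and parameters (`iterations`, `seed`). For brevity it recomputes the overlap of the whole sequence after each swap, whereas an approach relying on local knowledge would only need to re-evaluate the neighbours of the two swapped positions.

```python
import random

sizes = {"S1": 3, "S2": 2, "S3": 4, "T1": 2, "T2": 1, "T3": 3}


def nodes_of(cluster):
    return {v for task in cluster for v in task}


def overlap(g_i, g_j, sizes):
    return sum(sizes[v] for v in nodes_of(g_i) & nodes_of(g_j))


def sequence_overlap(clusters, sizes):
    return sum(overlap(a, b, sizes) for a, b in zip(clusters, clusters[1:]))


def best_effort_schedule(clusters, sizes, iterations=1000, seed=0):
    """Swap random pairs of clusters and keep a swap only if it improves
    the overlap of the sequence."""
    order = list(clusters)
    best = sequence_overlap(order, sizes)
    if len(order) < 2:
        return order, best
    rng = random.Random(seed)
    for _ in range(iterations):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        candidate = sequence_overlap(order, sizes)
        if candidate > best:
            best = candidate
        else:
            order[i], order[j] = order[j], order[i]   # undo the swap
    return order, best


# Hypothetical clusters for illustration.
clusters = [
    [("S3", "T3")],
    [("S1", "T1"), ("S2", "T1")],
    [("S2", "T3"), ("S2", "T2")],
    [("S3", "T2")],
    [("S1", "T2")],
]
initial = sequence_overlap(clusters, sizes)
schedule, score = best_effort_schedule(clusters, sizes)
print(initial, score)
```

Because a swap is only accepted when it strictly improves the overlap, the result is never worse than the input order, but it may be a local optimum rather than the best possible sequence.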
Conclusion
  Sub-quadratic growth of runtime
  Runtime grows linearly with number of mappings
  For LGD, 360–370 mappings/s
Conclusion and Future Work
Presented Gnome
  Two-step approach for link discovery
  Relies on divide-and-merge paradigm
  Ensures LD on datasets of arbitrary size
  Compared with state-of-the-art caching
Future Work
  Parallel implementation
  Combination with blocking