Expanding the YAGO knowledge base
Thomas Rebele
Télécom ParisTech
2018-07-19
Sections:
◮ The YAGO knowledge base
◮ Outline
◮ Using YAGO for the humanities
◮ Adding Words to Regexes
◮ Answering Queries with Unix Shell
◮ Conclusion
What is a knowledge base?
Example facts:
◮ (Albert Einstein, married, Mileva Maric)
◮ (Albert Einstein, has advisor, Alfred Kleiner)
◮ (Albert Einstein, won prize, Nobel Prize in Physics)
Applications of knowledge bases:
◮ Question answering
◮ Semantic search
◮ Text analysis
◮ Machine translation
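One way to picture such a knowledge base is as a set of (subject, predicate, object) triples with a lookup index. The following is a minimal sketch under that assumption; the names `FACTS`, `build_index`, and `objects` are made up for illustration and are not part of YAGO.

```python
from collections import defaultdict

# Facts from the slide, stored as (subject, predicate, object) triples.
FACTS = [
    ("Albert Einstein", "married", "Mileva Maric"),
    ("Albert Einstein", "has advisor", "Alfred Kleiner"),
    ("Albert Einstein", "won prize", "Nobel Prize in Physics"),
]


def build_index(facts):
    """Index the facts by (subject, predicate) for simple lookups."""
    index = defaultdict(list)
    for subject, predicate, obj in facts:
        index[(subject, predicate)].append(obj)
    return index


def objects(index, subject, predicate):
    """Return all objects o such that (subject, predicate, o) is a known fact."""
    # .get() avoids creating empty entries for unknown keys.
    return index.get((subject, predicate), [])


INDEX = build_index(FACTS)
print(objects(INDEX, "Albert Einstein", "won prize"))
```

A question-answering application would translate a question such as "Which prize did Albert Einstein win?" into a lookup of this kind.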
What is YAGO?
◮ Knowledge base with 10 million entities and >210 million facts
◮ Automatically extracted from Wikipedia, WordNet, and GeoNames
◮ Multilingual facts from 10 languages
◮ Focus on precision
◮ Developed by the Max Planck Institute for Informatics and Télécom ParisTech
What is YAGO?
Example: the Wikipedia article "Albert Einstein" (languages: English, German, French)
◮ Text: Albert Einstein was a physicist. His work influenced the philosophy of science. He developed the theory of relativity.
◮ Infobox: Spouse(s): Mileva Maric; Doctoral advisor(s): Alfred Kleiner
◮ Categories: Nobel laureates in Physics
Automatic extraction yields the facts:
◮ (Albert Einstein, married, Mileva Maric)
◮ (Albert Einstein, has advisor, Alfred Kleiner)
◮ (Albert Einstein, won prize, Nobel Prize in Physics)
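The extraction step can be illustrated with a toy rule-based mapping from infobox attributes and categories to facts. This is only a sketch of the idea, not YAGO's actual extractors: the two mapping tables and the function `extract_facts` are hypothetical and cover just the example from the slide.

```python
# Hypothetical mapping from infobox attribute names to relation names.
ATTRIBUTE_TO_RELATION = {
    "Spouse(s)": "married",
    "Doctoral advisor(s)": "has advisor",
}

# Hypothetical mapping from category names to (relation, object) pairs.
CATEGORY_TO_FACT = {
    "Nobel laureates in Physics": ("won prize", "Nobel Prize in Physics"),
}


def extract_facts(entity, infobox, categories):
    """Turn the infobox and categories of one article into triples."""
    facts = []
    for attribute, value in infobox.items():
        relation = ATTRIBUTE_TO_RELATION.get(attribute)
        if relation is not None:  # skip attributes without a known relation
            facts.append((entity, relation, value))
    for category in categories:
        if category in CATEGORY_TO_FACT:
            relation, obj = CATEGORY_TO_FACT[category]
            facts.append((entity, relation, obj))
    return facts


EINSTEIN_FACTS = extract_facts(
    "Albert Einstein",
    {"Spouse(s)": "Mileva Maric", "Doctoral advisor(s)": "Alfred Kleiner"},
    ["Nobel laureates in Physics"],
)
print(EINSTEIN_FACTS)
```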
Involvement
◮ I joined the project in 2014
◮ Maintenance and development
◮ Contributed to the open source release in 2017 at https://github.com/yago-naga/yago3/
◮ Coordinated / contributed to the evaluation
  ◮ Ground truth: Wikipedia
  ◮ 98% of the facts in the sample were correct
Publication: ISWC 2016 (resource paper)
Thomas Rebele Fabian Suchanek Johannes Hoffart Joanna Biega Erdal Kuzey Gerhard Weikum
Figure: Relative population size, by century. The y-axis is scaled by a quadratic function.
Using YAGO for the humanities: Summary
◮ Extension of YAGO:
  ◮ More people with residences (+201%, 97% precision)
  ◮ More people with genders (+35%, 98% precision)
  ◮ More birth and death dates (+8%/10%, 100% precision)
◮ Case studies:
  ◮ Births per month
  ◮ Life span over time
  ◮ Relative population size over time
◮ Interdisciplinary project
Publication: ISWC 2017 (workshop paper)
Thomas Rebele Arash Nekoei Fabian Suchanek
We often had to repair regular expressions (e.g., for matching dates). Can we automate this step?
Figure: Median age by year (x-axis: Year, 100 to 1900; y-axis: Median age, 50 to 80), shown for male and female.
Contributions:
◮ Extracting more information about residences, gender, birth and death dates
◮ Repairing regular expressions by adding missing words
◮ Preprocessing tabular data by transforming queries to Bash scripts
Adding Words to Regexes: Introduction
Why does YAGO not know the ISBN numbers of my books?
◮ We want to find ISBN numbers in Wikipedia to include them in YAGO
◮ We try the regex ISBN(978|979)?\d{10}
◮ Why does the regex not find I978-2-1234-5680-3 ?
◮ How can we modify the regex automatically to match the word?
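The mismatch is easy to reproduce with Python's `re` module. The sketch below uses the regex and the word from the slide; the unhyphenated word `ISBN9782123456803` is an assumed example of a word the original regex does accept.

```python
import re

# The regex from the slide.
ISBN_REGEX = re.compile(r"ISBN(978|979)?\d{10}")


def matches(word):
    """True if the whole word is in the language of the regex."""
    return ISBN_REGEX.fullmatch(word) is not None


# Assumed example of a word the regex accepts: prefix "ISBN", then 13 digits.
print(matches("ISBN9782123456803"))
# The word from the slide: "SBN" is missing and hyphens separate the groups.
print(matches("I978-2-1234-5680-3"))
```

The second call fails for two reasons at once: the regex insists on the letters `SBN`, and it has no place for the hyphens.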
Adding Words to Regexes: Problem statement
Problem statement, first try:
Given
◮ a regular expression r, e.g., ISBN(978|979)?\d{10}, and
◮ a set of positive examples E+, e.g., { I978-2-1234-5680-3 },
find a regular expression r′ such that
◮ L(r) ⊆ L(r′)
◮ E+ ⊆ L(r′)
Solution:
r′ = .∗ (satisfies both conditions trivially, but matches every string)
Problem statement:
Given
◮ a regular expression r, e.g., ISBN(978|979)?\d{10},
◮ a set of positive examples E+, e.g., { I978-2-1234-5680-3 }, and
◮ a set of negative examples E−, e.g., { 0612345678 },
find a regular expression r′ such that
◮ L(r) ⊆ L(r′)
◮ E+ ⊆ L(r′)
◮ L(r′) ∩ E− is small
Evaluation:
◮ Precision of r′ ≥ or ≈ precision of r
◮ Recall of r′ ≥ recall of r
(w.r.t. the intended meaning of the regex)
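The evaluation criteria can be made concrete with a small helper that computes precision and recall of a regex on labeled words. This is a sketch under assumptions: the tiny labeled set is invented for illustration, and returning 1.0 for an empty denominator is a convention chosen here, not something stated on the slides.

```python
import re


def precision_recall(regex, labeled_words):
    """Precision and recall of `regex` on a dict {word: is_intended_match}."""
    pattern = re.compile(regex)
    matched = {w for w in labeled_words if pattern.fullmatch(w)}
    relevant = {w for w, intended in labeled_words.items() if intended}
    true_positives = len(matched & relevant)
    # Convention chosen for this sketch: an empty denominator yields 1.0.
    precision = true_positives / len(matched) if matched else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


# Invented labels: two intended ISBN words and one phone-like number.
LABELED = {
    "ISBN9782123456803": True,
    "I978-2-1234-5680-3": True,
    "0612345678": False,
}

print(precision_recall(r"ISBN(978|979)?\d{10}", LABELED))  # original regex r
print(precision_recall(r".*", LABELED))                    # trivial repair
```

On this toy data the trivial repair `.*` reaches full recall but loses precision, which is why the negative examples E− are part of the problem statement.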
Adding Words to Regexes: What is new in our approach
Previous approaches:
regex + E+ + E− → regex
Our approach:
regex + E+ + E− → regex
Rationale: creating a large set of positive examples is difficult
Adding Words to Regexes: Approximate regex matching
Step 1: match string and regex approximately [Myers et al. 1989]
Regex ISBN(978|979)?\d{10} as a syntax tree (. denotes concatenation):
.
  . I S B N
  ?
    |
      . 9 7 8
      . 9 7 9
  . \d \d ... \d \d
String: I 9 7 8 - 2 - 1 2 3 4 - 5 6 8 0 - 3
Adding Words to Regexes: Finding the gaps
Step 2: find the gaps
◮ Between regex leaves (here: S B N)
◮ Between characters of the string (here: -)
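Steps 1 and 2 can be approximated with a much simpler, swapped-in technique: instead of matching the string against the regex itself (as the approximate matching algorithm cited on the slide does), the sketch below aligns the string with one concrete word of the regex's language using `difflib.SequenceMatcher` from the standard library. The word `ISBN9782123456803` and the function `find_gaps` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def find_gaps(regex_word, target):
    """Align a word accepted by the regex with the target string.

    Returns two lists: parts of `regex_word` that the target lacks, and
    parts of the target that `regex_word` lacks.
    """
    missing_in_target = []
    missing_in_regex = []
    matcher = SequenceMatcher(a=regex_word, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing_in_target.append(regex_word[i1:i2])
        if tag in ("insert", "replace"):
            missing_in_regex.append(target[j1:j2])
    return missing_in_target, missing_in_regex


# Assumed word from L(r) versus the word from the slide.
GAPS = find_gaps("ISBN9782123456803", "I978-2-1234-5680-3")
print(GAPS)
```

For this pair the alignment reports `SBN` as the part the string lacks and the four hyphens as the parts the regex lacks, which corresponds to the two kinds of gaps named above.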
Adding Words to Regexes: Add missing parts
Step 3 (simple approach): adapt the regex so that it includes the missing parts
◮ The gap S B N becomes optional: I (S B N)?
◮ Each gap - is inserted as an optional leaf -?: after (978|979)? and between the digits
Repaired syntax tree:
.
  .
    I
    ?
      . S B N
  ?
    |
      . 9 7 8
      . 9 7 9
  ?
    -
  . \d -? \d ... \d -? \d
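The effect of the simple repair can be checked directly. In the sketch below, `REPAIRED` is an assumed reading of the repaired tree: the slide abbreviates the digit sequence with "...", so the optional hyphens are placed exactly where the example word has them.

```python
import re

ORIGINAL = re.compile(r"ISBN(978|979)?\d{10}")
# Assumed reading of the repaired tree: "SBN" optional, optional hyphens
# at the positions where the example word contains them.
REPAIRED = re.compile(r"I(SBN)?(978|979)?-?\d-?\d{4}-?\d{4}-?\d")


def accepted_by(pattern, words):
    """Return the words that the pattern matches completely."""
    return [w for w in words if pattern.fullmatch(w)]


WORDS = [
    "ISBN9782123456803",   # assumed word from the original language
    "ISBN2123456803",      # assumed word from the original language, no prefix
    "I978-2-1234-5680-3",  # positive example from the slide
    "0612345678",          # negative example from the slide
]

print(accepted_by(ORIGINAL, WORDS))
print(accepted_by(REPAIRED, WORDS))
```

The repaired regex still accepts the words of the original regex, additionally accepts the positive example, and keeps rejecting the negative example.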
Step 3 (adaptive approach): adapt the regex so that it includes the missing parts
Example for a concatenation:
[Diagram: a concatenation node . with children a b ... c d and gap set {g1, g2, g3} is rewritten (→) into a concatenation . that contains an optional node ?; gap g2 is split into g2^s and g2^e, giving the gap sets {g1, g2^s} and {g2^e, g3}.]
Adding Words to Regexes: Feedback function
Now we want to find URLs:
◮ We try the regex r = http://[a-zA-Z\.]+
◮ It does not find s = wikipedia.org
◮ Repaired regex r′ = (http://)?[a-zA-Z\.]+
Problem:
◮ r′ finds all words
◮ Precision drops
Solution: use feedback on the set of negative examples E−
◮ Determine the parts of the regex that we can make optional
◮ We use the number of false positives, i.e.,
f(r′) = |E− ∩ L(r′)| ≤ α |E− ∩ L(r)|
◮ If f(r′) = false, add the word as a disjunction instead:
http://[a-zA-Z\.]+|wikipedia.org
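The feedback test and the fallback to a disjunction can be sketched as follows. The functions `false_positives`, `feedback`, and `repair` as well as the negative examples are assumptions for illustration; the slides only give the formula and the URL example. The sketch escapes the added word and wraps the original regex in a non-capturing group so that the disjunction is safe for arbitrary regexes.

```python
import re


def false_positives(regex, negatives):
    """Number of negative examples matched by the regex: |E- ∩ L(regex)|."""
    pattern = re.compile(regex)
    return sum(1 for word in negatives if pattern.fullmatch(word))


def feedback(original, candidate, negatives, alpha=1.0):
    """f(r') from the slide: accept r' if it adds few false positives."""
    allowed = alpha * false_positives(original, negatives)
    return false_positives(candidate, negatives) <= allowed


def repair(original, candidate, word, negatives, alpha=1.0):
    """Use the candidate if f(r') holds, else add the word as a disjunction."""
    if feedback(original, candidate, negatives, alpha):
        return candidate
    return f"(?:{original})|{re.escape(word)}"


# Invented negative examples: ordinary words that are not URLs.
NEGATIVES = ["hello", "world", "e.g."]
REPAIRED_URL = repair(
    r"http://[a-zA-Z\.]+",
    r"(http://)?[a-zA-Z\.]+",
    "wikipedia.org",
    NEGATIVES,
)
print(REPAIRED_URL)
```

Here the candidate with the optional `http://` matches all three negative examples while the original matches none, so the word is added as a disjunction instead.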
Adding Words to Regexes: Experiments
Input data:
◮ Datasets: ReLIE [Li et al., 2008], Enron [Babbar et al., 2010], and Wikipedia infobox attributes
◮ In total 8 tasks (e.g., phone numbers, software names, dates)
◮ In total 52 regexes
Approaches:
◮ Dis: r|s1|···|sn
◮ Star: .*
◮ B&S: [Babbar et al., 2010] (reimplementation)
◮ Simple
◮ Adaptive
baseline adaptive
measure original dis star B&S simple α = 1.0 α = 1.1 α = 1.20