David W. Embley
Brigham Young University
Provo, Utah, USA
WoK: A Web of Knowledge
A Web of Pages
A Web of Facts
Birthdate of my great-grandpa Orson
Price and mileage of red Nissans, 1990 or newer
Location and size of chromosome 17
US states with property crime rates above 1%
• Fundamental questions
– What is knowledge?
– What are facts?
– How does one know?
• Philosophy
– Ontology
– Epistemology
– Logic and reasoning
Toward a Web of Knowledge
• Existence asks “What exists?”
• Concepts, relationships, and constraints with a formal foundation
Ontology
• The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?”
• Populated conceptual model
Epistemology
• Principles of valid inference asks: “What is known?” and “What can be inferred?”
• For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
Logic and Reasoning
Find price and mileage of red Nissans, 1990 or newer
• Distill knowledge from the wealth of digital web data
• Annotate web pages
• Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Making this Work How?
[Diagram labels: Fact, Fact, Fact, …; Annotation, Annotation, …]
Turning Raw Symbols into Knowledge
• Symbols: $ 11,500 117K Nissan CD AC
• Data: price(11,500) mileage(117K) make(Nissan)
• Conceptualized data:
– Car(C123) has Price($11,500)
– Car(C123) has Mileage(117,000)
– Car(C123) has Make(Nissan)
– Car(C123) has Feature(AC)
• Knowledge
– “Correct” facts
– Provenance
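The progression from raw symbols to conceptualized data with provenance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fact` class, the `conceptualize` function, the recognizer patterns, and the example URL are all hypothetical and are not taken from the WoK system.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str    # object identifier, e.g. "C123"
    relation: str   # relationship name, e.g. "Price"
    value: object   # internal representation of the value
    source: str     # provenance: where the symbols were found


def conceptualize(symbols: str, car_id: str, source: str) -> list[Fact]:
    """Turn raw symbols into facts about one car (hypothetical recognizers)."""
    facts = []
    price = re.search(r"\$\s*(\d{1,3}(?:,\d{3})*)", symbols)
    if price:
        facts.append(Fact(car_id, "Price", int(price.group(1).replace(",", "")), source))
    mileage = re.search(r"\b(\d+)K\b", symbols)
    if mileage:
        facts.append(Fact(car_id, "Mileage", int(mileage.group(1)) * 1000, source))
    make = re.search(r"\b(Nissan|Toyota|Honda)\b", symbols)
    if make:
        facts.append(Fact(car_id, "Make", make.group(1), source))
    if re.search(r"\bAC\b", symbols):
        facts.append(Fact(car_id, "Feature", "AC", source))
    return facts


# The URL is a made-up placeholder standing in for the annotated page.
SOURCE = "http://example.com/ad42"
facts = conceptualize("$ 11,500 117K Nissan CD AC", "C123", SOURCE)
for fact in facts:
    print(f"Car({fact.subject}) has {fact.relation}({fact.value})  [from {fact.source}]")
```

Keeping the source with every fact is what lets a user check a claimed fact against the page it came from.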
Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Data Extraction Demo
Semantic Annotation Demo
Free-Form Query Demo
Explanation: How it Works
• Extraction Ontologies
• Semantic Annotation
• Free-Form Query Interpretation
Extraction Ontologies
Object sets
Relationship sets
Participation constraints
Lexical
Non-lexical
Primary object set
Aggregation
Generalization/Specialization
Extraction Ontologies
External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
Key Word Phrase
Left Context: $
Data Frame:
Internal Representation: float
Values
Key Words: ([Pp]rice)|([Cc]ost)| …
Operators
Operator: >
Key Words: (more\s*than)|(more\s*costly)|…
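A data frame of this kind can be approximated as a small record that bundles value patterns, context keywords, an internal representation, and operator keywords. The sketch below is an assumption-laden illustration: the `DataFrame` class and the helper functions are hypothetical names, and the value pattern is a variant written for this example (it accepts comma-grouped thousands) rather than the exact expression shown on the slide.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataFrame:
    name: str
    value_pattern: str                         # external representation (regex)
    keyword_pattern: str                       # context keywords (regex)
    convert: Callable[[str], float] = float    # internal representation
    operators: dict = field(default_factory=dict)  # operator -> keyword regex


# Hypothetical Price data frame; the patterns are illustrative only.
PRICE = DataFrame(
    name="Price",
    value_pattern=r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)",
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    convert=lambda s: float(s.replace(",", "")),
    operators={">": r"more\s*than|more\s*costly"},
)


def extract_values(frame: DataFrame, text: str) -> list:
    """Return every value in the text, converted to the internal representation."""
    return [frame.convert(m.group(1)) for m in re.finditer(frame.value_pattern, text)]


def find_operators(frame: DataFrame, text: str) -> list:
    """Return the operators whose keyword phrases occur in the text."""
    return [op for op, pattern in frame.operators.items() if re.search(pattern, text)]


print(extract_values(PRICE, "Price: $11,500 or best offer"))
print(find_operators(PRICE, "cars that cost more than $5,000"))
```

Because the recognizers are declared as data rather than coded per page, the same frame can be applied to any page in the domain.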
Generality & Resiliency of Extraction Ontologies
• Generality: assumptions about web pages
– Data rich
– Narrow domain
– Document types
  • Single-record documents (hard, but doable)
  • Multiple-record documents (harder)
  • Records with scattered components (even harder)
• Resiliency: declarative
– Still works when web pages change
– Works for new, unseen pages in the same domain
– Scalable, but takes work to declare the extraction ontology
Semantic Annotation
Free-Form Query Interpretation
• Parse Free-Form Query (with respect to data extraction ontology)
• Select Ontology
• Formulate Query Expression
• Run Query Over Semantically Annotated Data
Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized: price, mileage, red, Nissan, 1996, or newer (>= Operator)
Select Ontology
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
• Conjunctive queries and aggregate queries
• Projection on mentioned object sets
• Selection via values and operator keywords
– Color = “red”
– Make = “Nissan”
– Year >= 1996
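The interpretation of the example query can be sketched as keyword and value matching against recognizers taken from an ontology. Everything in this sketch is an assumption made for illustration: the recognizer tables, the `interpret` function, and the rule that an operator phrase must follow the value it modifies are hypothetical, not the system's actual procedure.

```python
import re

# Hypothetical recognizers for a tiny car-ads ontology.
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
KEYWORD_RECOGNIZERS = {
    "Price": r"\b(price|cost)\b",
    "Mileage": r"\b(mileage|miles)\b",
}
OPERATOR_KEYWORDS = {
    "Year": [(">=", r"or\s+newer"), ("<=", r"or\s+older")],
}


def interpret(query: str):
    """Return (projection, selection) for a free-form query."""
    # Object sets mentioned by keyword only are projected.
    projection = [name for name, pattern in KEYWORD_RECOGNIZERS.items()
                  if re.search(pattern, query, re.I)]
    # Object sets mentioned by value become selection constraints.
    selection = []
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = re.search(pattern, query, re.I)
        if not match:
            continue
        op = "="
        for candidate, keyword in OPERATOR_KEYWORDS.get(name, []):
            if re.search(keyword, query[match.end():], re.I):
                op = candidate
        value = int(match.group(1)) if name == "Year" else match.group(1)
        selection.append((name, op, value))
    return projection, selection


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
projection, selection = interpret(query)
print(projection)
print(selection)
```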
Formulate Query Expression
For
Let
Where
Return
Run Query Over Semantically Annotated Data
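Once pages are annotated, a conjunctive query reduces to selection followed by projection over the extracted records. The sketch below assumes, purely for illustration, that annotated data is available as a list of dictionaries; the `run_query` function and the three car records are made up for this example.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def run_query(records, projection, selection):
    """Keep records satisfying every constraint, then project the requested object sets."""
    results = []
    for record in records:
        satisfied = all(name in record and OPS[op](record[name], value)
                        for name, op, value in selection)
        if satisfied:
            results.append({name: record.get(name) for name in projection})
    return results


# Made-up annotated records, not data from the demo.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2900, "Mileage": 150000},
    {"Make": "Honda", "Color": "red", "Year": 2001, "Price": 6200, "Mileage": 90000},
]

hits = run_query(
    cars,
    projection=["Price", "Mileage"],
    selection=[("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)],
)
print(hits)
```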
• How do we create extraction ontologies?
– Manual creation requires several dozen person-hours
– Semi-automatic creation
  • TISP (Table Interpretation by Sibling Pages)
  • TANGO (Table ANalysis for Generating Ontologies)
  • Nested Schemas with Regular Expressions
  • Synergistic Bootstrapping
  • Form-based Information Harvesting
• How do we scale up?
– Practicalities of technology transfer and usage
– Millions of queries over zillions of facts for thousands of ontologies
Great! But Problems Still Need Resolution
Manual Creation
– Library of instance recognizers
– Library of lexicons
Automatic Annotation with TISP (Table Interpretation by Sibling Pages)
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
Recognize Tables
Data Table
Layout Tables (discard)
Nested Data Tables
Locate Table Labels
Examples:
Identification.Gene model(s).Protein
Identification.Gene model(s).Gene Model
Identification.Gene model(s).2
Locate Table Values
Value
Find Label/Value Associations
Example:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
Interpretation Technique: Sibling Page Comparison
[Figure panels comparing sibling pages: Same; Almost Same; Different; Same]
Technique Details
• Unnest tables
• Match tables in sibling pages
– “Perfect” match (layout table: discard)
– “Reasonable” match (sibling table)
• Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
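The core idea of sibling page comparison can be sketched as a cell-by-cell comparison of two tables generated from the same template: cells that stay the same are likely labels, and cells that change are likely values. The functions, the 0.3 threshold, and the sample cell contents below are assumptions for illustration only (just WP:CE28918 comes from the slides); real TISP also unnests tables and adjusts patterns dynamically, which this sketch omits.

```python
def classify_cells(table_a, table_b):
    """Split cell positions into labels (same text in both) and values (different text)."""
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            (labels if cell_a == cell_b else values).append((r, c))
    return labels, values


def match_score(table_a, table_b):
    """Fraction of compared cells whose text is identical in both tables."""
    labels, values = classify_cells(table_a, table_b)
    total = len(labels) + len(values)
    return len(labels) / total if total else 0.0


def classify_table(table_a, table_b, reasonable=0.3):
    """Hypothetical decision rule: identical tables are layout, partial matches are siblings."""
    score = match_score(table_a, table_b)
    if score == 1.0:
        return "layout"      # "perfect" match: nothing varies, so discard
    if score >= reasonable:
        return "sibling"     # "reasonable" match: shared labels, differing values
    return "unrelated"


# Made-up sibling tables; only WP:CE28918 appears in the slides.
page_a = [["Gene Model", "Protein"], ["model-A", "WP:CE28918"]]
page_b = [["Gene Model", "Protein"], ["model-B", "WP:PLACEHOLDER"]]

labels, values = classify_cells(page_a, page_b)
print(labels, values, classify_table(page_a, page_b))
```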
Generated RDF
WoK Demo (via TISP)
Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)
• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology
Recognize Table Information
                                                     Religion
Country      Population         Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)   Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                     15%     84%     1%
Albania      3,510,484          20%       70%     10%
Construct Mini-Ontology
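Normalizing the country/religion table into individual facts is one way to picture the step from a recognized table to a populated mini-ontology. The sketch assumes the column assignment that the percentages suggest (Afghanistan under Shi'a, Sunni, and other; Albania under Albanian Orthodox, Muslim, and Roman Catholic); the `normalize` function and the tuple layout are hypothetical.

```python
HEADER = ["Country", "Population", "Albanian Orthodox", "Muslim",
          "Roman Catholic", "Shi'a Muslim", "Sunni Muslim", "other"]
ROWS = [
    ["Afghanistan", "26,813,057", "", "", "", "15%", "84%", "1%"],
    ["Albania", "3,510,484", "20%", "70%", "10%", "", "", ""],
]


def normalize(header, rows):
    """Flatten the table into (country, relationship, ...) tuples, skipping empty cells."""
    facts = []
    for row in rows:
        country = row[0]
        facts.append((country, "Population", int(row[1].replace(",", ""))))
        for religion, cell in zip(header[2:], row[2:]):
            if cell:
                facts.append((country, "Religion", religion, int(cell.rstrip("%"))))
    return facts


facts = normalize(HEADER, ROWS)
for fact in facts:
    print(fact)
```

The resulting tuples suggest the object sets (Country, Population, Religion, Percent) and the relationships among them that a mini-ontology would declare.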
Discover Mappings
Merge
• Build a page-layout, pattern-based annotator
• Automate layout recognition based on examples
• Auto-generate examples with extraction ontologies
• Synergistically run pattern-based annotator & extraction-ontology annotator
Semi-Automatic Annotation via Synergistic Bootstrapping
(Based on Nested Schemas with Regular Expressions)
Synergistic Execution
Extraction Ontology
Document
Conceptual Annotator
(ontology-based annotation)
PartiallyAnnotated Document
Structural Annotator
(layout-driven annotation)
Annotated Document
Layout Patterns
Pattern Generation
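The interplay between the two annotators can be sketched with a toy example: the conceptual annotator recognizes some values by pattern, the pattern-generation step infers where those values sit in the layout, and the structural annotator then labels every record at that position, including ones the conceptual annotator missed. All names, the regular expression, and the records below are assumptions made for illustration; the real system works on page layout rather than on pre-split fields.

```python
import re
from collections import Counter

PRICE = re.compile(r"\$\s*\d{1,3}(?:,\d{3})*")


def conceptual_annotate(records):
    """Ontology-based step: positions of cells recognized as prices."""
    return {(r, c)
            for r, record in enumerate(records)
            for c, cell in enumerate(record)
            if PRICE.fullmatch(cell.strip())}


def learn_layout(hits):
    """Pattern generation: the column where most recognized prices appear."""
    columns = Counter(c for _, c in hits)
    return columns.most_common(1)[0][0] if columns else None


def structural_annotate(records, column):
    """Layout-driven step: label that column in every record."""
    return {(r, column) for r, record in enumerate(records) if column < len(record)}


# Made-up records; the second price lacks "$" and is missed by the regex.
records = [
    ["Nissan", "1998", "$4,500"],
    ["Honda", "2001", "6200"],
    ["Nissan", "1993", "$2,900"],
]

examples = conceptual_annotate(records)
price_column = learn_layout(examples)
annotated = structural_annotate(records, price_column)
print(sorted(examples), price_column, sorted(annotated))
```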
Form-Based Information Harvesting
• Forms
– General familiarity
– Reasonable conceptual framework
– Appropriate correspondence
  • Transformable to ontological descriptions
  • Capable of accepting source data
• Instance recognizers
– Some pre-existing instance recognizers
– Lexicons
• Automated extraction ontology creation?
Form Creation
Basic form-construction facilities:
• single-entry field
• multiple-entry field
• nested form
• …
Created Sample Form
Generated Ontology View
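One plausible reading of how a form becomes an ontology is that each field becomes an object set related to its enclosing form, with the field kind determining the participation constraint. The mapping below (single-entry gives 0:1, multiple-entry and nested give 0:*) is an assumption made for this sketch, as are the `FormField` class, the `form_to_ontology` function, and the sample Protein form.

```python
from dataclasses import dataclass, field


@dataclass
class FormField:
    name: str
    kind: str                                   # "single", "multiple", or "nested"
    children: list = field(default_factory=list)


def form_to_ontology(form_name, fields):
    """Return (object_sets, relationships) derived from a form definition."""
    object_sets = [form_name]
    relationships = []

    def visit(parent, form_fields):
        for form_field in form_fields:
            object_sets.append(form_field.name)
            max_card = "1" if form_field.kind == "single" else "*"
            relationships.append((parent, form_field.name, f"0:{max_card}"))
            if form_field.kind == "nested":
                visit(form_field.name, form_field.children)

    visit(form_name, fields)
    return object_sets, relationships


# Hypothetical form; only the multi-valued Name field is suggested by the slides.
protein_form = [
    FormField("Organism", "single"),
    FormField("Name", "multiple"),
    FormField("Location", "nested", [FormField("Compartment", "single")]),
]

object_sets, relationships = form_to_ontology("Protein", protein_form)
print(object_sets)
print(relationships)
```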
Source-to-Form Mapping
Almost Ready to Harvest
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
– Split/Merge
– Union/Selection
Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3
Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
Can Now Harvest
Name
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3
Name
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
Harvesting Populates Ontology
Also helps adjust ontology constraints
Can Harvest from Additional Sites
Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
Automating Extraction Ontology Creation
Lexicons
Name
…
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
…
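Harvested names can be pooled into a lexicon that then serves as an instance recognizer for new text. The sketch below is a minimal illustration under assumptions: the page keys, the function names, and the case-insensitive substring matching are all hypothetical choices, and substring matching in particular can over-match short names.

```python
# Names harvested per page (page keys are made-up placeholders).
harvested = {
    "page-1": ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    "page-2": ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
    "page-3": ["Voltage-dependent anion-selective channel protein 3", "VDAC-3", "hVDAC3"],
}


def build_lexicon(pages):
    """Pool every harvested name into one case-insensitive lexicon."""
    return {name.lower() for names in pages.values() for name in names}


def recognize(lexicon, text):
    """Return the lexicon entries that occur in the text (simple substring match)."""
    lowered = text.lower()
    return sorted(name for name in lexicon if name in lowered)


lexicon = build_lexicon(harvested)
found = recognize(lexicon, "Expression of VDAC-3 and KCIP-1 was measured.")
print(found)
```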
Automating Extraction Ontology Creation
Instance Recognizers
Number Patterns
Context Keywords and Phrases
Automatic Source-to-Form Mapping
Automatic Semantic Annotation
Recognize and annotate with respect to an ontology
Ontology Transformations
Transformations to and from all
• Advanced free-form queries with disjunction and negation
• Form-based query language
• Table-based query languages
• Graphical query languages
Practicalities: WoK Query Interfaces (Future Work)
• Won’t just happen without sufficient content
• Niche applications
– Historical Data (e.g. Genealogy)
– Topical Blogs
• Local WoKs
– Intra-organizational effort
– Individual interests
Practicalities: Bootstrapping the WoK (Future Work)
• Potential rapid growth
– Thousands of ontologies
– Millions of simultaneous queries
– Billions of annotated pages
– Trillions of facts
• Search-engine-like caching & query processing
Practicalities: Scalability (Future Work)
• Automatic (or near automatic) creation of extraction ontologies
• Automatic (or near automatic) annotation of web pages
• Simple but accurate query specification without specialized training
Key to Success: Simplicity via Automation
www.deg.byu.edu