Information Extraction and Ontology Learning Guided by Web Directory. Authors: Martin Kavalec, Vojtěch Svátek. Presenter: Mark Vickers.

Dec 20, 2015

Page 1

Information Extraction and Ontology Learning Guided by Web Directory

Authors: Martin Kavalec, Vojtěch Svátek

Presenter: Mark Vickers

Page 2

Outline

Introduction
– Mining Indicator Terms
– Integrating Rainbow
– Ontological Analysis of Web Directories
– IE and Ontology Learning

Future Work

Related Work

Assessment

Page 3

Introduction

Goal:

“…to extract information about (mostly generic) products, services and areas of competence of companies, from the free text chunks embedded in web presentations.”

Taking advantage of:
– Collections of extraction patterns
– Ontologies of problem domains

Approach: Combine Information Extraction With Ontologies
– Ontologies can improve quality of IE
– Extracted information can improve/extend ontologies
– Bootstrapping

Page 4

Introduction

Uses Open Directory (http://dmoz.org)
– Obtain labeled training data
– Lightweight ontologies

“The Open Directory Project is the largest, most comprehensive human-edited directory of the Web.”

Page 5

Page 6

Mining Indicator Terms

Informative terms = generic names of products

Indicator terms = situated near informative terms
– Example: ‘our assortment includes…’
  ‘in our shop you can buy…’

Assumption: Directory headings coincide with informative terms

Purpose: Generate extraction patterns based on indicator terms

They use deeper linguistic techniques

Page 7

Mining Indicator Terms

Example: …/Manufacturing/Materials/Metals/Steel/… (informative terms)

Match headings with text pages to find sentences containing informative terms

Grab nearby words as indicator terms

Generate extraction patterns from indicator terms
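
The first two steps above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names are hypothetical, the elided parts of the example path are left out, and a fixed word window stands in for "nearby" (the slides say the authors work with parse-tree distance instead).

```python
import re

def informative_terms(directory_path):
    """Treat each heading in a directory path as an informative term."""
    return [h.lower() for h in directory_path.strip("/").split("/") if h]

def sentences_with_terms(text, terms):
    """Yield (sentence, term) for every sentence that mentions a term."""
    # Naive sentence splitting on end punctuation; an assumption of this sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        for term in terms:
            if term in words:
                yield sentence, term

def nearby_words(sentence, term, window=3):
    """Return up to `window` words on each side of the term."""
    words = re.findall(r"[a-z]+", sentence.lower())
    i = words.index(term)
    return words[max(0, i - window):i] + words[i + 1:i + 1 + window]

terms = informative_terms("/Manufacturing/Materials/Metals/Steel/")
text = "Our assortment includes stainless steel. Call us today."
for sentence, term in sentences_with_terms(text, terms):
    print(term, nearby_words(sentence, term))
```

The candidate words collected this way would still need filtering before they could serve as indicator terms.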

Page 8

Mining Indicator Terms

Choosing Indicator Terms
– Syntactical analysis: Link Grammar Parser
– Chose verbs occurring closest in parse tree to informative word
– Arrange verbs into a frequency table
– Order by ratio of frequency near informative term to frequency in general
– Chose 8 most promising verbs
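
The frequency-ratio ordering can be illustrated with a small sketch. It assumes the verbs have already been extracted by some parser; `rank_indicator_verbs` and its parameters are hypothetical names, and the toy verb lists are made up for illustration.

```python
from collections import Counter

def rank_indicator_verbs(near_verbs, all_verbs, top_n=8):
    """Order verbs by (count near informative terms) / (count overall)."""
    near = Counter(near_verbs)
    general = Counter(all_verbs)
    scored = []
    for verb, near_count in near.items():
        if general[verb] == 0:
            continue  # cannot form a ratio without a general count
        scored.append((near_count / general[verb], verb))
    # Highest ratio first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [verb for _, verb in scored[:top_n]]

# Toy data: verbs seen next to informative terms vs. verbs seen anywhere.
near_verbs = ["include", "include", "buy", "be"]
all_verbs = ["include"] * 3 + ["buy"] + ["be"] * 10
print(rank_indicator_verbs(near_verbs, all_verbs, top_n=2))
```

A common verb such as "be" ends up at the bottom because it is frequent everywhere, which is the point of dividing by the general frequency.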

Page 9

Mining Indicator Terms

Preliminary Testing
– Sampled 14,500 sentences containing heading terms
– Randomly chose 130 sentences with indicators
– Manually labeled to estimate if informative term was present or not

Example:

“We are equipped to run any grade of corrugated from E-flute to Triplewall, including all government grades.”

Page 10

Mining Indicator Terms

Preliminary Test Results

Coverage
Non-Filtered: 10 – 20 %
Pre-Filtered: 70 – 80 %

Page 11

Integration into Rainbow

RAINBOW (Reusable Architecture for INtelligent Brokering Of Web information access)

– Web Analysis Tasks:
  Sentence Extraction
  Explicit Metadata
  HTML Structure*
  Inline Image*
  Link Topology Structure*
  Page Similarity

– Internal Communication: based on SOAP

– Will use ontologies for verifying semantic consistency of web services provided within the distributed system

Page 12

Integration into Rainbow

Rainbow will help solve the “coverage” problem of directory links pointing to ‘barren’ pages
– Using analysis of:
  Keywords and HTML structure on start-up pages
  URLs of embedded links
– The Metadata Extractor will be navigated towards promising pages.
– For example, looking for ‘about-us’ or ‘profile’ pages to find more syntactically correct text.
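
Navigating towards promising pages by URL could look something like the sketch below. It is an assumption-laden illustration rather than Rainbow's actual logic: only the two hints named on the slide are used, `promising_links` is a hypothetical helper, and the example.com URLs are placeholders.

```python
from urllib.parse import urlparse

# Hints taken from the slide; a real system would presumably use more.
PROMISING_HINTS = ("about-us", "profile")

def promising_links(urls, hints=PROMISING_HINTS):
    """Keep links whose URL path contains one of the hint strings."""
    selected = []
    for url in urls:
        path = urlparse(url).path.lower()
        if any(hint in path for hint in hints):
            selected.append(url)
    return selected

links = [
    "http://example.com/",
    "http://example.com/about-us.html",
    "http://example.com/company/Profile",
    "http://example.com/contact",
]
print(promising_links(links))
```

Matching on the path only, not the host or query string, keeps a domain name that happens to contain "profile" from being selected by accident.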

Page 13

Ontological Analysis of Web Directories

Terms and phrases in a single heading belong to a small set of classes.

Parent-child relations belong to particular classes corresponding to ‘deep’ ontological relations.

- Industries
  - Construction_and_Maintenance
    - Materials_and_supplies
      - Masonry_and_Stone
        - Natural_Stone
          - International_Sources
            - Mexico
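
To make the two observations concrete, the sketch below lists the parent-child pairs in the example path and splits each heading into its terms. Deciding which class or ontological relation each pair belongs to is the actual analysis and is not attempted here; the function names and the "_and_" splitting rule are assumptions of this sketch.

```python
def parent_child_pairs(path):
    """Return (parent, child) heading pairs along a directory path."""
    headings = [h for h in path.strip("/").split("/") if h]
    return list(zip(headings, headings[1:]))

def heading_terms(heading):
    """Split a heading such as 'Masonry_and_Stone' into separate terms."""
    return [part.replace("_", " ") for part in heading.split("_and_")]

path = ("Industries/Construction_and_Maintenance/Materials_and_supplies/"
        "Masonry_and_Stone/Natural_Stone/International_Sources/Mexico")

for parent, child in parent_child_pairs(path):
    print(heading_terms(parent), "->", heading_terms(child))
```

Even in this one path the pairs differ in kind: Masonry_and_Stone to Natural_Stone looks like specialization, while International_Sources to Mexico does not.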

Page 14

Ontological Analysis of Web Directories

Meta-ontology of directory headings

[Diagram; legend: Class, Named Relations, Class-subclass Relations, Reflexive Binary Relations]

Page 15

Ontological Analysis of Web Directories

Interpretation Rules

Page 16

IE and Ontology Learning

Extraction with plain indicator terms and simple heuristics works

But Even Better:
– Learn indicators for each class
– Use ontology analysis to classify indicators found
– Fill in database templates: true IE

Page 17

IE and Ontology Learning

Closed Loop Strategy (diagram labels):
– Human Classifies Directory Headings
– (WordNet)
– Classify Headings
– Learn class-specific indicators

Page 18

Future Work

Complete the information extraction & ontology learning loop.

In relation to the Semantic Web, they want to adapt the technique to the usual standards for explicit metadata
– Example: The extracted information can be turned into RDF triples, with indicator collections accessible over the web
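
As a rough idea of what the RDF output might look like, the sketch below serializes extracted terms as N-Triples lines. The slides do not specify a vocabulary, so the example.org URIs, the `offers` predicate and the `to_ntriples` helper are all hypothetical.

```python
def to_ntriples(subject_uri, predicate_uri, terms):
    """Serialize one (subject, predicate, literal) triple per term."""
    lines = []
    for term in terms:
        # Escape the characters that N-Triples string literals require.
        literal = (term.replace("\\", "\\\\")
                       .replace('"', '\\"')
                       .replace("\n", "\\n")
                       .replace("\r", "\\r"))
        lines.append(f'<{subject_uri}> <{predicate_uri}> "{literal}" .')
    return lines

# Hypothetical company and predicate URIs, for illustration only.
triples = to_ntriples(
    "http://example.org/company/1",
    "http://example.org/vocab#offers",
    ["steel", "natural stone"],
)
print("\n".join(triples))
```

Plain literals are used for simplicity; linking each term to a heading in the directory-derived ontology would need URIs as objects instead.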

Page 19

Related Work

Combining IE and ontologies (without use of web directories)
– Bootstrapping an Ontology-Based Information Extraction System

Advantages of using Link Grammar Parser
– Learning to Generate Semantic Annotation for Domain Specific Sentences

Using Yahoo to classify whole documents
– Turning Yahoo into an Automatic Web-Page Classifier

Similar work aimed at more structured information using search engines
– Extracting Patterns and Relations from the World Wide Web

Bootstrapping and other statistical methods for IE
– Text Classification by Bootstrapping with Keywords
– Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping

Page 20

Assessment

I don’t think indicator term learning is done (even though they say it is)

Relies on ontology learning techniques that have not yet been decided

Need to develop an official directory