Web Crawler Agent (WCA) Presented by Kirk Martinez University of Southampton

Jan 15, 2016


Web Crawler Agent (WCA)

Presented by

Kirk Martinez

University of Southampton


Introduction

• WCA searches for missing information (fragments) on the Web

• WCA structures the information it finds as ontology relations, e.g. “place_of_birth” (Person, Place)

• Techniques used: NLP (Natural Language Processing), information extraction, relation extraction, question answering


Overview

Ontology

• class - relation - class

• Person - work_in - Place

• Person - date_of_birth - Date

• …

Ontology instance

• work_in (Alonso, ‘Granada’)

• date_of_birth (Rembrandt, ?)

Start Ontotriple

• Missing instance: date_of_birth (Rembrandt, ?)

• Web Crawler Agent searches the Web

• Extract “15-July-1606” as answer
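
The overview can be read as a small data model: a schema of class - relation - class patterns, a set of instances, and a scan for instances whose value is still "?". The following is a minimal Python sketch of that reading. The names `SCHEMA`, `Triple` and `find_missing` are hypothetical, and the slides do not show how WCA actually stores its ontology.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema: relation name -> (subject class, object class).
# Assumption: date_of_birth ranges over a date, as the answer
# "15-July-1606" suggests.
SCHEMA = {
    "work_in": ("Person", "Place"),
    "date_of_birth": ("Person", "Date"),
}


@dataclass
class Triple:
    relation: str
    subject: str
    value: Optional[str] = None  # None marks a missing value ("?" on the slide)

    def __str__(self) -> str:
        shown = "?" if self.value is None else repr(self.value)
        return f"{self.relation} ({self.subject}, {shown})"


def find_missing(instances):
    """Return the instances whose value is still unknown."""
    return [t for t in instances if t.value is None]


instances = [
    Triple("work_in", "Alonso", "Granada"),
    Triple("date_of_birth", "Rembrandt"),
]

missing = find_missing(instances)
for t in missing:
    print(t)  # date_of_birth (Rembrandt, ?)

# Once an answer has been extracted, the gap is filled in place.
missing[0].value = "15-July-1606"
```

Using `None` for the unknown value keeps "missing" distinct from an empty string, so the scan for gaps stays a single comparison.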


Is it something like “Google”?

• Search “date_of_birth” (when Rembrandt was born) with Google


Searching information with Google

• “Old” Web search (e.g. Google) is good for retrieving documents but NOT for extracting concise answers (e.g. “15-July-1606”)

• No analysis is done to “understand” the documents (e.g. “Rembrandt” can also be the name of a hotel or a bookstore)


Information extraction on the Web

• Data on the Web may be low quality and repeated, e.g. two sources give different dates for Georges Seurat’s death:

– 29 March 1891 (http://www.ibiblio.org/wm/paint/auth/seurat/)

– 19 March 1891 (http://www.rickdoble.net/influence/20seurat.htm)

• WCA depends on:

– Well-structured sentences and documents

– Good named-entity recognisers
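
The Seurat example shows why a single extracted value cannot simply be trusted. One plausible way to cope is to collect candidate dates from several text snippets and count how many snippets support each one. This is an assumption for illustration, not something the slides describe. The sketch below uses a plain regular expression in place of a real named-entity recogniser, and the snippets are made-up sentences, not quotes from the pages cited above.

```python
import re
from collections import Counter

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches dates such as "29 March 1891" or "29, March 1891".
DATE_RE = re.compile(rf"\b(\d{{1,2}}),?\s+({MONTHS})\s+(\d{{4}})\b")


def extract_dates(text):
    """Return all dates found in text, normalised to 'D Month YYYY'."""
    return [f"{int(d)} {m} {y}" for d, m, y in DATE_RE.findall(text)]


def rank_candidates(snippets):
    """Count how many snippets mention each candidate date."""
    votes = Counter()
    for snippet in snippets:
        # A set ensures one snippet casts at most one vote per date.
        votes.update(set(extract_dates(snippet)))
    return votes.most_common()


# Illustrative snippets only (hypothetical text).
snippets = [
    "Georges Seurat died in Paris on 29 March 1891.",
    "Seurat, who died on 29, March 1891, was a painter.",
    "He died on 19 March 1891.",
]

print(rank_candidates(snippets))
# [('29 March 1891', 2), ('19 March 1891', 1)]
```

A majority among sources does not guarantee correctness, since pages often copy one another; the count is only a simple signal that could feed a later verification step.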


Web crawler agent searches the Web for the missing value
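
Before the agent can search, a missing instance such as date_of_birth (Rembrandt, ?) has to be turned into something a search engine accepts. The sketch below assumes a hypothetical table of question templates, one per relation, with a keyword fallback. `TEMPLATES` and `build_query` are illustrative names, and the slides do not say how WCA forms its queries.

```python
# Hypothetical question templates, keyed by relation name.
TEMPLATES = {
    "date_of_birth": "when was {subject} born",
    "place_of_birth": "where was {subject} born",
    "work_in": "where did {subject} work",
}


def build_query(relation, subject):
    """Turn a missing instance relation(subject, ?) into a search query."""
    template = TEMPLATES.get(relation)
    if template is None:
        # Fallback: plain keywords derived from the relation name.
        return f"{subject} {relation.replace('_', ' ')}"
    return template.format(subject=subject)


print(build_query("date_of_birth", "Rembrandt"))  # when was Rembrandt born
print(build_query("date_of_death", "Seurat"))     # Seurat date of death
```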


Future work

• Verification

• Performance

• Autonomy