1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)
Dec 20, 2015

Page 1: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

1

Information Extraction

(Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

Page 2: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

2

Information Extraction (IE)

• Identify specific pieces of information (data) in an unstructured or semi-structured textual document.

• Transform unstructured information in a corpus of documents or web pages into a structured database.

• Applied to different types of text:
– Newspaper articles
– Web pages
– Scientific articles
– Newsgroup messages
– Classified ads
– Medical notes

Page 3: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

3

MUC

• DARPA funded significant efforts in IE in the early to mid-1990s.

• Message Understanding Conference (MUC) was an annual event/competition where results were presented.

• Focused on extracting information from news articles:
– Terrorist events
– Industrial joint ventures
– Company management changes

• Information extraction is of particular interest to the intelligence community (CIA, NSA).

Page 4: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

4

Other Applications

• Job postings:
– Newsgroups: Rapier from austin.jobs
– Web pages: Flipdog
• Job resumes:
– BurningGlass
– Mohomine
• Seminar announcements
• Company information from the web
• Continuing education course info from the web
• University information from the web
• Apartment rental ads
• Molecular biology information from MEDLINE

Page 5: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

5

Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <[email protected]>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888
[email protected]

Sample Job Posting

Page 6: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

6

Extracted Job Template

computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

Page 7: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

7

Amazon Book Description

….</td></tr></table><b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br><font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br><a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a><font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br><b>Our Price: <font color=#990000>$11.96</font></b><br><b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>

Page 8: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

8

Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:

Page 9: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

9

Web Extraction

• Many web pages are generated automatically from an underlying database.

• Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).

• However, output is intended for human consumption, not machine interpretation.

• An IE system for such generated pages allows the web site to be viewed as a structured database.

• An extractor for a semi-structured web site is sometimes referred to as a wrapper.

• Process of extracting from such pages is sometimes referred to as screen scraping.

Page 10: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

10

Web Extraction using DOM Trees

• Web extraction may be aided by first parsing web pages into DOM trees.

• Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.

• May still need regex patterns to identify proper portion of the final CharacterData node.

Page 11: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

11

Sample DOM Tree Extraction

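[Figure: DOM tree. HTML has children HEADER and BODY. BODY has children B, whose character data is “Age of Spiritual Machines”, and FONT. FONT contains the character data “by” and an A element whose character data is “Ray Kurzweil”. Legend: Element nodes and Character-Data nodes.]

Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData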
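The two paths above can be tried out with a minimal sketch like the following. It uses only Python's standard `html.parser`; the `PathExtractor` class, the `extract` helper and the tiny `PAGE` string are illustrative assumptions, not part of the original slides, and the page uses `head` where the slide's figure says HEADER. The sketch does not handle void elements such as `<br>`, which a real wrapper would need to.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect character data found directly under one root-to-node element path."""

    def __init__(self, path):
        super().__init__()
        self.path = [tag.lower() for tag in path]
        self.stack = []    # currently open elements, root first
        self.matches = []  # text found at the requested path

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.path:
            self.matches.append(text)


def extract(html, path):
    """Return all text nodes whose element path from the root equals `path`."""
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical, simplified page mirroring the tree on this slide.
PAGE = (
    "<html><head><title>Book</title></head><body>"
    "<b>Age of Spiritual Machines</b>"
    "<font>by <a href='#'>Ray Kurzweil</a></font>"
    "</body></html>"
)

title = extract(PAGE, ["html", "body", "b"])
author = extract(PAGE, ["html", "body", "font", "a"])
print(title, author)
```

Matching the whole path rather than just the last tag is what keeps the “by” text under FONT from being returned as the author.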

Page 12: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

12

Template Types

• Slots in template typically filled by a substring from the document.

• Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
– Terrorist act: threatened, attempted, accomplished.
– Job type: clerical, service, custodial, etc.
– Company type: SEC code
• Some slots may allow multiple fillers.
– Programming language
• Some domains may allow multiple extracted templates per document.
– Multiple apartment listings in one ad

Page 13: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

13

Simple Extraction Patterns

• Specify an item to extract for a slot using a regular expression pattern.
– Price pattern: “\$\d+(\.\d{2})?\b”
• May require preceding (pre-filler) pattern to identify proper context.
– Amazon list price:
• Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
• Filler pattern: “\$\d+(\.\d{2})?\b”
• May require succeeding (post-filler) pattern to identify the end of the filler.
– Amazon list price:
• Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
• Filler pattern: “.+”
• Post-filler pattern: “</span>”
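As a rough illustration of pre-filler, filler and post-filler patterns, here is a small Python sketch using the standard `re` module on a fragment of the Amazon page shown earlier. The helper name `extract_slot` and the variable names are assumptions made for this example. The sketch uses the lazy filler `.+?` instead of the slide's greedy `.+`, so that the match stops at the first post-filler rather than the last one on the line.

```python
import re

# Fragment of the Amazon book page from the earlier slide.
SNIPPET = (
    '<b>List Price:</b> <span class=listprice>$14.95</span><br>'
    '<b>Our Price: <font color=#990000>$11.96</font></b><br>'
)

PRICE = r"\$\d+(?:\.\d{2})?"


def extract_slot(text, filler, pre="", post=""):
    """Return the first filler that appears between the literal pre/post context."""
    pattern = re.escape(pre) + "(" + filler + ")" + re.escape(post)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Filler pattern alone: finds the first price on the page, whatever it means.
any_price = extract_slot(SNIPPET, PRICE)

# Pre-filler context pins the price to the "List Price" field.
list_price = extract_slot(
    SNIPPET, PRICE, pre="<b>List Price:</b> <span class=listprice>"
)

# Generic filler delimited by both a pre-filler and a post-filler.
our_price = extract_slot(
    SNIPPET, r".+?", pre="<b>Our Price: <font color=#990000>", post="</font>"
)

print(any_price, list_price, our_price)
```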

Page 14: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

14

Simple Template Extraction

• Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. Assumes slots always occur in a fixed order.
– Title
– Author
– List price
– …
• Alternatively, make patterns specific enough to identify each filler when always starting from the beginning of the document.
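The first strategy, extracting slots in order, might look like the sketch below. The page fragment is a simplified, hypothetical version of the Amazon example, and `SLOT_PATTERNS` and `extract_in_order` are names invented for this illustration.

```python
import re

# Simplified, hypothetical page fragment with slots in a fixed order.
PAGE = (
    '<b class="sans">The Age of Spiritual Machines</b><br>'
    'by <a href="/author">Ray Kurzweil</a><br>'
    '<b>List Price:</b> <span class=listprice>$14.95</span><br>'
)

# (slot name, pattern with exactly one capturing group for the filler),
# listed in the order the slots are expected to appear.
SLOT_PATTERNS = [
    ("title", r'<b class="sans">(.+?)</b>'),
    ("author", r"by <a [^>]*>(.+?)</a>"),
    ("list_price", r"<span class=listprice>(\$\d+(?:\.\d{2})?)</span>"),
]


def extract_in_order(text, slot_patterns):
    """Fill each slot, resuming the search where the previous filler ended."""
    template = {}
    position = 0
    for slot, pattern in slot_patterns:
        match = re.compile(pattern).search(text, position)
        if match is None:
            template[slot] = None  # slot missing; keep searching from same spot
            continue
        template[slot] = match.group(1)
        position = match.end(1)    # end of the filler just extracted
    return template


template = extract_in_order(PAGE, SLOT_PATTERNS)
print(template)
```

Because each search resumes after the previous filler, the individual patterns can stay short; the cost is that a page with slots in a different order will be extracted incorrectly.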

Page 15: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

15

Pre-Specified Filler Extraction

• If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
– Job category
– Company type

• Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
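To make the idea concrete, the sketch below fills a "job type" slot by classifying the whole posting. The slides do not say which classifier to use; a tiny multinomial Naive Bayes with add-one smoothing is assumed here, and the training postings are made-up toy data.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: iterable of (text, category). Returns the fitted counts."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category (slot filler) with the highest posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[category][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training set: posting text paired with its job type.
TRAINING = [
    ("typing filing and answering phones", "clerical"),
    ("data entry and filing for a busy office", "clerical"),
    ("waiting tables and serving customers", "service"),
    ("serving food and greeting customers", "service"),
    ("cleaning floors and emptying trash", "custodial"),
    ("cleaning offices and mopping floors", "custodial"),
]

model = train(TRAINING)
job_type = classify("office filing and typing", model)
print(job_type)
```

The predicted category is written into the slot directly, so the filler never has to appear as a substring of the posting.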

Page 16: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

16

Learning for IE

• Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.

• Alternative is to use machine learning:
– Build a training set of documents paired with human-produced filled extraction templates.
– Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Page 24: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

25

4th Nov, 2002.

Happy Deepawali! & Halloween

10/31

Page 25: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

Finding “Sweet Spots” in computer-mediated cooperative work

• It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop
– All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
– …and the human very gratefully does the in-depth analysis on those few potential solutions
• Examples:
– The incredible success of the “Bag of Words” model!
• Bag of letters would be a disaster ;-)
• Bag of sentences and/or NLP would be good
– ..but only to your discriminating and irascible searchers ;-)

Page 26: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs

• A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
– It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there ever was one ;-)
• Remember the mice in The Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
– Collaborative knowledge compilation (Wikipedia!)
– Collaborative curation
– Collaborative tagging
– Paid collaboration/contracting
• Many big open issues
– How do you pose the problem such that it can be solved using collaborative computing?
– How do you “incentivize” people into letting you steal their brain cycles?
• Pay them! (Amazon mturk.com)

Page 27: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

Tapping into the Collective Unconscious

• Another thread of exciting research is driven by the realization that the web is not random at all!
– It is written by humans
– …so analyzing its structure and content allows us to tap into the collective unconscious ..
• Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
• Examples:
– Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
– Analyzing the link structure of the web graph to discover communities
• DoD and NSA are very much into this as a way of breaking terrorist cells
– Analyzing the transaction patterns of customers (collaborative filtering)

Page 28: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

30

Automated Support for “Semantic Web”

• Semantic web needs:
– Tagged data
– Background knowledge
• (Blue sky approaches to) automate both
– Automated tagging
• Start with a background ontology and tag other web pages
– SemTag/Seeker
– Knowledge extraction
• Extract base-level knowledge (“facts”) directly from the web

Page 29: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

31

Extraction from Free Text involves Natural Language Processing

• If extracting from automatically generated web pages, simple regex patterns usually work.

• If extracting from more natural, unstructured, human-written text, some NLP may help.
– Part-of-speech (POS) tagging
• Mark each word as a noun, verb, preposition, etc.
– Syntactic parsing
• Identify phrases: NP, VP, PP
– Semantic word categories (e.g. from WordNet)
• KILL: kill, murder, assassinate, strangle, suffocate
• Off-the-shelf software available to do this!
– The “Brill” tagger

• Extraction patterns can use POS or phrase tags.

Page 30: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

32

I. Generate-n-Test Architecture

Generic extraction patterns (Hearst ’92):
• “…Cities such as Boston, Los Angeles, and Seattle…”
(“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …
• “Detailed information for several countries such as maps, …”
=> require ProperNoun(head(NP))
• “I listen to pretty much all music but prefer country such as Garth Brooks”
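The "generate" step can be approximated with a regular expression. This is a deliberately crude sketch: real systems use a noun-phrase chunker, whereas here a run of capitalized words stands in for ProperNoun(head(NP)), and the names `SUCH_AS`, `PROPER_NP` and `hearst_candidates` are assumptions for this example.

```python
import re

# "C such as NP1, NP2, and NP3": class word, then the list up to end of clause.
SUCH_AS = re.compile(r"\b(\w+) such as ([^.;:!?]+)")

# Crude stand-in for a proper-noun phrase: one or more capitalized words.
PROPER_NP = re.compile(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")


def hearst_candidates(sentence):
    """Return candidate (instance, class) pairs proposed by the pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        class_name = match.group(1).lower()
        # Split the list on commas and on the words "and" / "or".
        for item in re.split(r",|\band\b|\bor\b", match.group(2)):
            proper = PROPER_NP.match(item.strip())
            if proper:  # proper-noun filter drops items such as "maps"
                pairs.append((proper.group(0), class_name))
    return pairs


cities = hearst_candidates("Cities such as Boston, Los Angeles, and Seattle")
maps = hearst_candidates("Detailed information for several countries such as maps")
music = hearst_candidates(
    "I listen to pretty much all music but prefer country such as Garth Brooks"
)
print(cities, maps, music)
```

The proper-noun filter rejects “maps”, but the third sentence still yields the wrong pair (Garth Brooks, country), which is why generated candidates need a separate test step.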

Page 31: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

33

Test

$$\mathrm{PMI}(\text{Seattle}, \text{City}) = \frac{|\mathrm{Hits}(\text{Seattle} + \text{City})|}{|\mathrm{Hits}(\text{Seattle})|}$$

Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01).

Many variations are possible…

Page 32: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

34

Assessment

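$$\mathrm{PMI}(I, D) = \frac{|\mathrm{Hits}(I + D)|}{|\mathrm{Hits}(I)|}$$

• PMI = frequency of I & D co-occurrence
• 5-50 discriminators D_i
• Each PMI for D_i is a feature f_i
• Naïve Bayes evidence combination:

$$P(\phi \mid f_1, f_2, \ldots, f_n) = \frac{P(\phi)\prod_i P(f_i \mid \phi)}{P(\phi)\prod_i P(f_i \mid \phi) + P(\lnot\phi)\prod_i P(f_i \mid \lnot\phi)}$$

where φ denotes the event that the candidate extraction is correct.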

PMI is used for feature selection. NBC is used for learning. Hit counts are used for assessing PMI as well as the conditional probabilities.
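The Naïve Bayes evidence combination can be written as a few lines of Python. The function name and the numeric values below are illustrative assumptions; in practice the prior and the per-feature likelihoods would be estimated from training data.

```python
import math


def combine_evidence(prior, likelihoods_if_correct, likelihoods_if_incorrect):
    """Posterior probability that an extraction is correct, given n features.

    prior                    -- P(correct)
    likelihoods_if_correct   -- [P(f_i | correct)   for each feature i]
    likelihoods_if_incorrect -- [P(f_i | incorrect) for each feature i]
    """
    correct = prior * math.prod(likelihoods_if_correct)
    incorrect = (1.0 - prior) * math.prod(likelihoods_if_incorrect)
    return correct / (correct + incorrect)


# Hypothetical numbers: two discriminator features, uninformative prior.
posterior = combine_evidence(0.5, [0.8, 0.7], [0.2, 0.4])
print(round(posterior, 3))  # 0.875
```

With a prior of 0.5 the numerator is 0.5 × 0.56 = 0.28 and the competing term is 0.5 × 0.08 = 0.04, so the posterior is 0.28 / 0.32 = 0.875.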

Page 33: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

35

Assessment In Action

1. I = “Yakima” (1,340,000)
2. D = <class name>
3. I+D = “Yakima city” (2760)
4. PMI = (2760 / 1.34M) ≈ 0.002

• I = “Avocado” (1,000,000)
• I+D = “Avocado city” (10)

PMI = 0.00001 << 0.002
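The arithmetic on this slide can be checked directly. The sketch below uses the hit counts quoted above; the function name `pmi` is just a label for the ratio |Hits(I + D)| / |Hits(I)|.

```python
def pmi(hits_instance_with_discriminator, hits_instance):
    """Ratio of hits for the instance plus discriminator to hits for the instance."""
    return hits_instance_with_discriminator / hits_instance


# Hit counts quoted on the slide.
yakima = pmi(2760, 1_340_000)   # "Yakima city" vs. "Yakima"
avocado = pmi(10, 1_000_000)    # "Avocado city" vs. "Avocado"

print(f"Yakima:  {yakima:.5f}")
print(f"Avocado: {avocado:.5f}")
print(f"ratio:   {yakima / avocado:.0f}x")
```

2760 / 1,340,000 is about 0.002, roughly two hundred times the Avocado score, so “Yakima” is far more plausible as a city than “Avocado”.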

Page 34: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

36

Some Sources of ambiguity

• Time: “Clinton is the president” (in 1996).

• Context: “common misconceptions..”

• Opinion: Elvis…

• Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
– Dominant senses can mask recessive ones!

– Approach: unmasking. ‘Chicago –City’

Page 35: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

37

Chicago

City Movie

$$\mathrm{PMI}(I, D, C) = \frac{|\mathrm{Hits}(I + D \mid \lnot C)|}{|\mathrm{Hits}(I \mid \lnot C)|}$$

Page 36: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

38

Chicago Unmasked

City sense Movie sense

$$\frac{|\mathrm{Hits}(\text{Chicago} + \text{Movie} \mid \lnot\text{City})|}{|\mathrm{Hits}(\text{Chicago} \mid \lnot\text{City})|}$$

Page 37: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

39

Impact of Unmasking on PMI

Name          Recessive  Original  Unmask  Boost
Washington    city       0.50      0.99     96%
Casablanca    city       0.41      0.93    127%
Chevy Chase   actor      0.09      0.58    512%
Chicago       movie      0.02      0.21    972%

Page 38: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

40

CBioC: Collaborative Bio-Curation

Motivation
• To help extract information nuggets from articles and abstracts and store them in a database.
• The challenge is that the number of articles is huge and keeps growing, and that they require natural language processing.
• The two existing approaches:
– human curation
– use of automatic information extraction systems
• They are not able to meet the challenge, as the first is expensive, while the second is error-prone.

Page 39: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

41

CBioC (cont’d)

Approach:
• We propose a solution that is inexpensive, and that scales up.
• Our approach takes advantage of automatic information extraction methods as a starting point.
• It is based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles.
• We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information.
• We refer to our approach as “Collaborative Curation”.

Page 40: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

42

Using the C-BioCurator System (cont’d)

[Figure: architecture diagram of the Collaborative Bio-Curation System. Labeled components: Extractor Systems; Download Agent (two instances); Text DB; Existing DB; Data Format Exchange System; CBioC Database; CBioC Interface (Browse Facts, Vote Facts, Add/Modify New Facts, Add New Schema, Invoke IntExtractor, User Management). Labeled sources: BioPax, DIP, Reactome, …; Nature, Science, Pubmed, ...]

Page 41: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

What is the main difference between KnowItAll and CBioC?

Assessment: KnowItAll assesses extractions using search-engine hit counts; CBioC uses voting.

Page 42: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

44

Annotation

“The Chicago Bulls announced yesterday that Michael Jordan will. . . ”

“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...”
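A minimal sketch of how such markup could be produced is shown below. It looks labels up in a small dictionary and wraps each occurrence in a `resource` element; the `ONTOLOGY` table and the `annotate` function are assumptions for illustration, and no disambiguation is attempted.

```python
import re

# Label -> ontology reference, using the two references shown on the slide.
ONTOLOGY = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
}


def annotate(text, ontology):
    """Wrap every known label in a <resource ref="..."> element."""
    # Longest labels first, so a longer label wins over one it contains.
    labels = sorted(ontology, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        return '<resource ref="{}">{}</resource>'.format(ontology[label], label)

    return pattern.sub(wrap, text)


SENTENCE = "The Chicago Bulls announced yesterday that Michael Jordan will..."
annotated = annotate(SENTENCE, ONTOLOGY)
print(annotated)
```

Plain string matching like this would tag every “Michael Jordan” the same way, which is exactly the ambiguity that a system such as SemTag has to resolve.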

Page 43: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

45

Semantic Annotation

Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt

The simplest task of metadata extraction on NL is to establish a “type” relation between entities in the NL resources and concepts in ontologies.

Named Entity Identification

Page 44: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

46

Semantics

• Semantic Annotation
– The content of annotation consists of some rich semantic information
– Targeted not only at human readers of resources but also at software agents
– Formal: metadata following structural standards; informal: personal notes written in the margin while reading an article
– Explicit: carries sufficient information for interpretation; tacit: many personal annotations (telegraphic and incomplete)

http://www-scf.usc.edu/~csci586/slides/6

Page 45: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

47

Uses of Annotation

http://www-scf.usc.edu/~csci586/slides/8

Page 46: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

48

Objectives of Annotation

• Generate metadata for existing information
– e.g., author-tag in HTML
– RDF descriptions to HTML
– Content description to multimedia files
• Employ metadata for
– Improved search
– Navigation
– Presentation
– Summarization of contents

http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf

Page 47: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

49

Annotation

Current practice of annotation for knowledge identification and extraction
• is time consuming
• needs annotation by experts
• is complex

Reduce burden of text annotation for Knowledge Management

www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt

Page 48: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

SemTag & Seeker
• WWW-03 Best Paper Prize
• Seeded with TAP ontology (72k concepts) and ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags, automatically disambiguated

Page 49: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

52

SemTag

• Uses broad, shallow knowledge base

• TAP – lexical and taxonomic information about popular objects
– Music
– Movies
– Sports
– Etc.

Page 50: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

53

SemTag

• Problem:
– No write access to original document, so how do you annotate?
• Solution:
– Store annotations in a web-available database

Page 51: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

54

SemTag

• Semantic Label Bureau
– Separate store of semantic annotation information
– HTTP server that can be queried for annotation information
– Example
• Find all semantic tags for a given document
• Find all semantic tags for a particular object
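The two example queries suggest a store indexed both by document and by object. The in-memory sketch below is purely hypothetical: the slides do not describe the bureau's schema or HTTP interface, so the class, its methods and the example URLs are all assumptions.

```python
from collections import defaultdict


class LabelBureau:
    """Toy annotation store kept separate from the annotated documents."""

    def __init__(self):
        self._by_document = defaultdict(list)
        self._by_object = defaultdict(list)

    def add(self, document_url, span, object_uri):
        """Record that `span` in `document_url` refers to `object_uri`."""
        tag = (document_url, span, object_uri)
        self._by_document[document_url].append(tag)
        self._by_object[object_uri].append(tag)

    def tags_for_document(self, document_url):
        """Find all semantic tags for a given document."""
        return list(self._by_document.get(document_url, []))

    def tags_for_object(self, object_uri):
        """Find all semantic tags for a particular object."""
        return list(self._by_object.get(object_uri, []))


bureau = LabelBureau()
BULLS = "http://tap.stanford.edu/BasketballTeam_Bulls"
bureau.add("http://example.com/news/1", "Chicago Bulls", BULLS)
bureau.add("http://example.com/news/2", "the Bulls", BULLS)

print(bureau.tags_for_document("http://example.com/news/1"))
print(len(bureau.tags_for_object(BULLS)))
```

Keeping the tags outside the page is what lets a third party annotate documents it cannot edit.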

Page 52: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

55

SemTag

• Methodology

Page 53: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

56

SemTag

• Three phases
1. Spotting Pass:
– Tokenize the document
– All instances plus 20 word window
2. Learning Pass:
– Find corpus-wide distribution of terms at each internal node of taxonomy
– Based on a representative sample
3. Tagging Pass:
– Scan windows to disambiguate each reference
– Finally determined to be a TAP object
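The spotting pass can be sketched as follows. The slide only says “20 word window”; splitting the window evenly, ten tokens on each side of the label, is an assumption of this sketch, as are the function names and the simple `\w+` tokenizer.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def spotting_pass(text, labels, window=20):
    """Return (label, context) pairs for every label occurrence in `text`.

    The context is up to `window` tokens around the label, assumed here to be
    split evenly before and after it.
    """
    tokens = tokenize(text)
    label_tokens = {label: tuple(tokenize(label)) for label in labels}
    half = window // 2
    spots = []
    for i in range(len(tokens)):
        for label, parts in label_tokens.items():
            if parts and tuple(tokens[i:i + len(parts)]) == parts:
                before = tokens[max(0, i - half):i]
                after = tokens[i + len(parts):i + len(parts) + half]
                spots.append((label, before + after))
    return spots


TEXT = "The jaguar is a big cat that hunts at night"
spots = spotting_pass(TEXT, ["jaguar"], window=4)
print(spots)
```

Each (label, context) pair is what the later passes consume: the learning pass to estimate term distributions, the tagging pass to decide which taxonomy node, if any, the label refers to.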

Page 54: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

59

SemTag

• TBD methodology:
– Each node in the taxonomy is associated with a set of labels
• Cats, Football, Cars all contain “jaguar”
– Each label in the text is stored with a window of 20 words – the context
– Each node has an associated similarity function mapping a context to a similarity
• Higher similarity: more likely to contain a reference

Page 55: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

60

SemTag

• Similarity:
– Built a 200,000 word lexicon (the 200,100 most common words minus the 100 most common)
– 200,000 dimensional vector space
– Training: spots (label, context) and correct node
– Estimated the distribution of terms for nodes
– Standard cosine similarity
– TFIDF vectors (context vs. node)
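A toy version of the context-versus-node comparison is sketched below, using sparse dictionaries in place of 200,000-dimensional vectors. The node term lists are invented for illustration, and the particular TF-IDF weighting (raw count times log(N/df)) is one common choice, not necessarily the one SemTag uses.

```python
import math
from collections import Counter


def tfidf_vector(tokens, document_frequency, num_documents):
    """Sparse TF-IDF vector: raw term count times log(N / df)."""
    counts = Counter(tokens)
    return {
        term: count * math.log(num_documents / document_frequency[term])
        for term, count in counts.items()
        if term in document_frequency
    }


def cosine(u, v):
    """Standard cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term samples for two taxonomy nodes that both contain "jaguar".
NODES = {
    "cats": "jaguar cat jungle prey hunts".split(),
    "cars": "jaguar car engine dealer luxury".split(),
}
document_frequency = Counter(
    term for tokens in NODES.values() for term in set(tokens)
)
node_vectors = {
    name: tfidf_vector(tokens, document_frequency, len(NODES))
    for name, tokens in NODES.items()
}

context = "the jaguar hunts prey in the jungle".split()
context_vector = tfidf_vector(context, document_frequency, len(NODES))

similarities = {
    name: cosine(context_vector, vector) for name, vector in node_vectors.items()
}
best_node = max(similarities, key=similarities.get)
print(best_node, similarities)
```

The word “jaguar” itself gets zero weight because it occurs in both nodes, so the decision rests entirely on the surrounding context words.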

Page 56: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

62

SemTag

• Some internal nodes very popular:
– Associate a measurement of how accurate Sim is likely to be at a node
– Also, how ambiguous the node is overall (consistency of human judgment)

• TBD Algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v

• 82% accuracy on 434 million spots

Page 57: 1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

64

Summary

• Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web
• Extraction complexity depends on whether the text you have is “templated” or “free-form”
– Extraction from templated text can be done by regular expressions
– Extraction from free-form text requires NLP
• Can be done in terms of part-of-speech tagging
• “Annotation” involves connecting terms in free-form text to items in the background knowledge
– It too can be automated