1 SIMS 290-2: Applied Natural Language Processing Marti Hearst October 20, 2004

Dec 21, 2015


SIMS 290-2: Applied Natural Language Processing

Marti Hearst
October 20, 2004

 


Untangling Text Data Mining

(updated from a 1999 lecture)


Outline

Untangling several different fields
– DM, CL, IA, TDM

TDM examples
TDM as Exploratory Data Analysis
New Problems for Computational Linguistics
Our current efforts


Classifying Application Types

                    Patterns                    Non-Novel Nuggets        Novel Nuggets
Non-textual data    Standard data mining        Database queries         Automated Reasoning (AI)
Textual data        Computational linguistics   Information retrieval    Real text data mining


What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)

Fitting models to or determining patterns from very large datasets.

A “regime” which enables people to interact effectively with massive data stores.

Deriving new information from data.


Why Data Mining?

Because the data is there.
Because
– larger disks
– faster CPUs
– high-powered visualization
– networked information
are now widely available.


Knowledge Discovery from Data (KDD)

KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)

Note: data mining is just one step in the process


Data Mining Applications (CACM 39 (11) Special Issue)

Finding patterns across data sets:

Reports on changes in retail sales
– to improve sales

Patterns of sizes of TV audiences
– for marketing

Patterns in NBA play
– to alter, and so improve, performance

Deviations in standard phone calling behavior
– to detect fraud
– for marketing


What is Data Mining?

Potential point of confusion:
The "extracting ore from rock" metaphor does not really apply to the practice of data mining.
If it did, then standard database queries would fit under the rubric of data mining.
In practice, DM refers to:
– finding patterns across large datasets
– discovering heretofore unknown information


What is Text Data Mining?

Many people’s first thought:
Make it easier to find things on the Web.
But this is information retrieval!


Needles in Haystacks

The emphasis in IR is on finding documents that already contain answers to questions.


Information Retrieval
A restricted form of Information Access

The system has only pre-existing, “canned” text passages.
Its response is limited to selecting from these passages and presenting them to the user.
It must select, say, 10 or 20 passages out of millions.
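
A minimal sketch of that restriction, assuming a toy term-overlap score rather than any real IR ranking function: the system can only rank and return passages it already holds, unchanged. The names `score` and `retrieve` are illustrative and do not come from the lecture.

```python
from collections import Counter


def score(query: str, passage: str) -> int:
    """Count occurrences of query terms in the passage (toy relevance score)."""
    terms = Counter(passage.lower().split())
    return sum(terms[t] for t in query.lower().split())


def retrieve(query: str, passages: list[str], k: int = 10) -> list[str]:
    """Return up to k stored passages with a non-zero score, best first."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return [p for p in ranked[:k] if score(query, p) > 0]
```

Nothing new is produced here: every returned string is one of the canned passages, which is the point the slide makes about IR.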


What is Text Data Mining?

The metaphor of extracting ore from rock:
Does make sense for extracting documents of interest from a huge pile.

But does not reflect notions of DM in practice:
– finding patterns across large collections
– discovering heretofore unknown information

What would finding a pattern across a large text collection really look like …?


From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

Bill Gates + MS-DOS in the Bible!


From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

More info:
http://cs.anu.edu.au/~bdm/dilugim/gatesdet.txt
http://cs.anu.edu.au/~bdm/dilugim/torah.html


Real Text DM

The point:
Discovering heretofore unknown information is not what we usually do with text.
(If it weren’t known, it could not have been written by someone!)

However:
There is a field whose goal is to learn about patterns in text for their own sake ...


Computational Linguistics!

Goal: automated language understanding
this isn’t possible (yet)
instead, go for subgoals, e.g.,
– word sense disambiguation
– phrase recognition
– semantic associations

Common current approach:
statistical analyses over very large text collections


Why CL Isn’t TDM

A linguist finds it interesting that “cloying” co-occurs significantly with “Jar Jar Binks” ...

… But this doesn’t really answer a question relevant to the world outside the text itself.
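
As a rough illustration of the kind of statistic a linguist might compute, here is pointwise mutual information (PMI) over document-level co-occurrence. The slide does not say which measure is meant, so PMI is an assumption, and it measures association strength rather than statistical significance. The function name `pmi_table` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs: list[set[str]]) -> dict[tuple[str, str], float]:
    """Map each co-occurring term pair (alphabetical order) to its PMI in bits."""
    n = len(docs)
    term_df = Counter(t for d in docs for t in d)
    pair_df = Counter(p for d in docs for p in combinations(sorted(d), 2))
    return {
        (a, b): math.log2((c / n) / ((term_df[a] / n) * (term_df[b] / n)))
        for (a, b), c in pair_df.items()
    }
```

A positive value means the two terms appear together more often than their individual frequencies would predict. That is a fact about the text, not about the world outside it.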


Why CL Isn’t TDM

We need to use the text indirectly to answer questions about the world

Direct:
Analyze patent text; determine which word patterns indicate various subject categories.

Indirect:
Analyze patent text; find out whether private or public funding leads to more inventions.


Why CL Isn’t TDM

Direct:
Cluster newswire text; determine which terms are predominant

Indirect:
Analyze newswire text; gather evidence about which countries/alliances are dominating which financial sectors


Nuggets vs. Patterns

TDM: we want to discover new information …
… as opposed to discovering which statistical patterns characterize occurrence of known information.

Example: WSD
– not TDM: computing statistics over a corpus to determine what patterns characterize Sense S.
– TDM: discovering the meaning of a new sense of a word.


Nuggets vs. Patterns

Nugget: a new, heretofore unknown item of information.

Pattern: distributions or rules that characterize the occurrence (or non-occurrence) of a known item of information.

Application of rules can create nuggets in some circumstances.


Example: Lexicon Augmentation

Application of a lexico-syntactic pattern:

NP0 such as NP1, {NP2 …, (and | or) NPi }

i >= 1, implies that
for all NPi, i >= 1, hyponym(NPi, NP0)

Extracts a new hyponym relation:
“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
implies hyponym(“Gelidium”, “red algae”)

However, this fact was already known to the author of the text.
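
The pattern above can be sketched as a scan over a sentence that has already been split into noun-phrase chunks. This is a simplified illustration, not the original implementation: the `(tag, text)` chunk format, the hand-chunked example, and the name `hearst_such_as` are all assumptions, and a real system would need a noun-phrase chunker in front of it.

```python
Chunk = tuple[str, str]  # (tag, text); tag is "NP" for noun phrases, "W" otherwise
SEPARATORS = {",", "and", "or"}


def hearst_such_as(chunks: list[Chunk]) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by 'NP0 such as NP1, NP2 ...'."""
    pairs = []
    for i, (tag, text) in enumerate(chunks):
        if tag != "NP":
            continue
        j = i + 1
        if j < len(chunks) and chunks[j][1] == ",":
            j += 1  # optional comma before "such as"
        if [c[1].lower() for c in chunks[j:j + 2]] != ["such", "as"]:
            continue
        j += 2
        while j < len(chunks):
            k = j
            # skip list separators (",", "and", "or") between noun phrases
            while (k < len(chunks) and chunks[k][0] != "NP"
                   and chunks[k][1].lower() in SEPARATORS):
                k += 1
            if k >= len(chunks) or chunks[k][0] != "NP":
                break  # the list of hyponyms has ended
            pairs.append((chunks[k][1], text))
            j = k + 1
    return pairs


# The slide's sentence, chunked by hand for the example.
agar = [
    ("NP", "Agar"), ("W", "is"), ("NP", "a substance"), ("W", "prepared"),
    ("W", "from"), ("NP", "a mixture"), ("W", "of"), ("NP", "red algae"),
    ("W", ","), ("W", "such"), ("W", "as"), ("NP", "Gelidium"), ("W", ","),
    ("W", "for"), ("NP", "laboratory or industrial use"),
]
print(hearst_such_as(agar))
```

The sketch over-generates when an unrelated noun phrase directly follows the list, which is one reason extracted pairs are usually filtered afterwards.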


The Quandary

How do we use text to both
– Find new information not known to the author of the text
– Find information that is not about the text itself


Idea: Exploratory Data Analysis

Use large text collections to gather evidence to support (or refute) hypotheses

Not known to author: links across many texts
Not self-referential: work within the domain of discourse


Example: Etiology

Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise

find causal links among
– titles
– symptoms
– drugs
– results


Swanson Example (1991)

Problem: Migraine headaches (M)
Facts extracted from medical journal titles:
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability
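
One way to picture how these facts combine is as a small graph in which two terms are linked if some title relates them. The sketch below looks for intermediate terms connecting two concepts that are never linked directly. It illustrates the linking idea only: Swanson's actual procedure was only partially automated, and the pair encoding and the names `build_graph` and `linking_terms` are assumptions.

```python
from collections import defaultdict

# Term pairs taken from the facts listed above (CCB = calcium channel blocker,
# SCD = spreading cortical depression, PA = platelet aggregability).
FACTS = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("CCB", "migraine"),
    ("magnesium", "CCB"),
    ("SCD", "migraine"),
    ("magnesium", "SCD"),
    ("migraine", "PA"),
    ("magnesium", "PA"),
]


def build_graph(facts):
    """Undirected adjacency sets: each fact links its two terms."""
    graph = defaultdict(set)
    for left, right in facts:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def linking_terms(facts, a, c):
    """Terms linked to both a and c, provided a and c are not linked directly."""
    graph = build_graph(facts)
    if c in graph[a]:
        return set()  # already linked in a single title: nothing to discover
    return graph[a] & graph[c]


print(sorted(linking_terms(FACTS, "migraine", "magnesium")))
# ['CCB', 'PA', 'SCD', 'stress']
```

No single title mentions both migraine and magnesium, yet four intermediate terms connect them, which is the evidence gathered in the example.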


Gathering Evidence

[Figure: diagram linking migraine to stress, CCB, PA, and SCD, each of which is linked to magnesium]


Gathering Evidence

[Figure: migraine and magnesium connected through stress, CCB, PA, and SCD]


Swanson’s TDM

Two of his hypotheses have received some experimental verification.

His technique
– Only partially automated
– Required medical expertise

Some researchers are pursuing this further.


How to find functions of genes?

Important problem in molecular biology
Have the genetic sequence
Don’t know what it does
But …
– Know which genes it coexpresses with
– Some of these have known function

So … infer function based on function of co-expressed genes

– This idea suggested to me by Michael Walker and others at Incyte Pharmaceuticals
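
A minimal sketch of that inference, assuming the simplest possible rule: tally the known functions of the co-expressed genes and treat the most frequent ones as candidates. The gene names, function labels, and the name `infer_function` are placeholders, not data from the lecture.

```python
from collections import Counter


def infer_function(gene, coexpressed, known_function):
    """Rank candidate functions for `gene` by votes from co-expressed genes."""
    partners = coexpressed.get(gene, [])
    votes = Counter(known_function[p] for p in partners if p in known_function)
    return votes.most_common()


# Placeholder data: one gene of unknown function and four co-expressed genes,
# three of which have a (made-up) annotation.
coexpressed = {"mystery": ["geneA", "geneB", "geneC", "geneD"]}
known_function = {"geneA": "function X", "geneB": "function X", "geneC": "function Y"}

print(infer_function("mystery", coexpressed, known_function))
```

The ranked list is a set of hypotheses to check against the literature and in the lab, not a conclusion.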


Gene Co-expression: Role in the genetic pathway

[Figure: pathway diagrams placing the unknown genes g? and h? alongside PSA, Kall., and PAP]

Other possibilities as well


Make use of the literature

Look up what is known about the other genes.
Different articles in different collections
Look for commonalities
– Similar topics indicated by Subject Descriptors
– Similar words in titles and abstracts

adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...


Developing Strategies

Different strategies seem needed for different situations

First: see what is known about Kallikrein.
7341 documents. Too many.
AND the result with “disease” category
– If result is non-empty, this might be an interesting gene

Now get 803 documents
AND the result with PSA

– Get 11 documents. Better!
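
The narrowing steps above amount to intersecting posting sets in an inverted index. Here is a small sketch with made-up document IDs. It does not reproduce the 7341, 803, and 11 document counts from the slide, and the index contents and the name `narrow` are assumptions.

```python
def narrow(index: dict[str, set[int]], *terms: str) -> set[int]:
    """Boolean AND: IDs of documents that contain every one of the given terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Toy inverted index: term or category label -> IDs of documents carrying it.
index = {
    "kallikrein": {1, 2, 3, 4, 5},
    "disease": {2, 3, 5, 8},
    "PSA": {3, 5, 9},
}

print(len(narrow(index, "kallikrein")))                    # 5
print(len(narrow(index, "kallikrein", "disease")))         # 3
print(len(narrow(index, "kallikrein", "disease", "PSA")))  # 2
```

Each added term can only shrink the result set, so an empty result is itself informative: it suggests that the combination has not been written about.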


Developing Strategies

Look for commonalities among these documents
Manual scan through ~100 category labels
Would have been better if
– Automatically organized
– Intersections of “important” categories scanned for first
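
One simple way to organize the labels automatically, offered as an assumption rather than the tool the lecture had in mind, is to rank each category by how many of the retrieved documents carry it, so that the most widely shared labels are scanned first. The document IDs and the name `rank_categories` are illustrative.

```python
from collections import Counter


def rank_categories(doc_categories: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Sort category labels by document count (descending), then alphabetically."""
    counts = Counter(label for labels in doc_categories.values() for label in labels)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy example: three retrieved documents and their category labels.
docs = {
    "d1": {"prostatic neoplasms", "tumor markers"},
    "d2": {"prostatic neoplasms", "antibodies"},
    "d3": {"prostatic neoplasms", "tumor markers"},
}

for label, count in rank_categories(docs):
    print(count, label)
```

Labels shared by every document rise to the top, which replaces a manual scan of around 100 labels with a short ranked list.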


Try a new tack

Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
– Hope they all talk about diagnostics and prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this cancer


Formulate a Hypothesis

Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
New tack: do some lab tests
– See if mystery gene is similar in molecular structure to the others
– If so, it might do some of the same things they do


Strategies again

In hindsight, combining all three genes was a good strategy.

Store this for later

Might not have worked
Need a suite of strategies
Build them up via experience and a good UI


Text Merging Example
Discovering Hypocritical Congresspersons


Discovering Hypocritical Congresspersons

Feb 1, 1996
US House of Reps votes to pass Telecommunications Reform Act
This contains the CDA (Communications Decency Act)
– Sought to criminalize posting to the Internet any material deemed indecent and patently offensive, with no exception for socially redeeming material.
– Violators subject to fines of $250,000 and 5 years in prison

Eventually struck down by courts

http://www.tbtf.com/resource/hypocrites.html


Discovering Hypocritical Congresspersons

Sept 11, 1998
US House of Reps votes to place the Starr report online
The content would (most likely) have violated the CDA

365 people were members for both votes
284 members voted aye both times
– 185 (94%) Republicans voted aye both times
– 96 (57%) Democrats voted aye both times

http://www.tbtf.com/resource/hypocrites.html



How to find Hypocritical Congresspersons?

This must have taken a lot of work
Hand cutting and pasting
Lots of picky details
– Some people voted on one but not the other bill
– Some people share the same name
Check for different county/state
Still messed up on “Bono”
Taking stats at the end on various attributes
– Which state
– Which party

Tools should help streamline, reuse results
The hardest part?

Knowing to compare these two sets of voting records in the first place.
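
The mechanical part of that work can be sketched as a join of two roll calls. The sketch assumes members are keyed by (name, state) so that two people who share a surname stay distinct, and it uses invented members and votes rather than the real House records. The name `aye_both` is illustrative.

```python
from collections import Counter

Member = tuple[str, str]  # (name, state): the state separates shared surnames


def aye_both(vote_a: dict, vote_b: dict, party: dict) -> dict:
    """Per party: (members voting aye both times, members present for both votes)."""
    both = vote_a.keys() & vote_b.keys()  # drop anyone who missed either vote
    aye = {m for m in both if vote_a[m] == "aye" and vote_b[m] == "aye"}
    total_by_party = Counter(party[m] for m in both)
    aye_by_party = Counter(party[m] for m in aye)
    return {p: (aye_by_party[p], total_by_party[p]) for p in total_by_party}


# Invented roll calls: two different members named Smith, and one member (Lee)
# who voted only on the first bill.
vote_a = {("Smith", "CA"): "aye", ("Smith", "TX"): "nay",
          ("Jones", "NY"): "aye", ("Lee", "WA"): "aye"}
vote_b = {("Smith", "CA"): "aye", ("Smith", "TX"): "aye", ("Jones", "NY"): "aye"}
party = {("Smith", "CA"): "R", ("Smith", "TX"): "D",
         ("Jones", "NY"): "D", ("Lee", "WA"): "R"}

print(aye_both(vote_a, vote_b, party))
```

Code like this handles the picky details. It does nothing for the hard part named above, which is deciding that these two votes are worth comparing at all.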


Summary

Text Data Mining:
Extracting heretofore undiscovered information from large text collections

Information Access is not TDM
IA: locating already known information that is currently of interest

Finding patterns across text is already done in CL
– Tells us about the behavior of language
– Helps build very useful tools!


Summary on Text Data Mining

The future: analyzing what the text is about
We don’t know how; text is tough!
Idea: bring the user into the loop.
Build up piecewise evidence to support hypotheses
Make use of partial domain models.

The Truth is Out There!