SIMS 290-2: Applied Natural Language Processing
Marti Hearst
October 20, 2004
Outline
Untangling several different fields
DM, CL, IA, TDM
TDM examples
TDM as Exploratory Data Analysis
New Problems for Computational Linguistics
Our current efforts
Classifying Application Types
                    Patterns                     Non-Novel Nuggets        Novel Nuggets
Non-textual data    Standard data mining         Database queries         Automated Reasoning (AI)
Textual data        Computational linguistics    Information retrieval    Real text data mining
What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns from very large datasets.
A “regime” which enables people to interact effectively with massive data stores.
Deriving new information from data.
Why Data Mining?
Because the data is there.
Because
larger disks
faster CPUs
high-powered visualization
networked information
are now widely available.
Knowledge Discovery from Data (KDD)
KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
Note: data mining is just one step in the process
Data Mining Applications (CACM 39 (11) Special Issue)
Finding patterns across data sets:
Reports on changes in retail sales
– to improve sales
Patterns of sizes of TV audiences
– for marketing
Patterns in NBA play
– to alter, and so improve, performance
Deviations in standard phone calling behavior
– to detect fraud
– for marketing
What is Data Mining?
Potential point of confusion:
The extracting-ore-from-rock metaphor does not really apply to the practice of data mining.
If it did, then standard database queries would fit under the rubric of data mining.
In practice, DM refers to:
– finding patterns across large datasets
– discovering heretofore unknown information
What is Text Data Mining?
Many people’s first thought:
Make it easier to find things on the Web.
But this is information retrieval!
Needles in Haystacks
The emphasis in IR is on finding documents that already contain answers to questions.
Information Retrieval
A restricted form of Information Access
The system has only pre-existing, “canned” text passages.
Its response is limited to selecting from these passages and presenting them to the user.
It must select, say, 10 or 20 passages out of millions.
What is Text Data Mining?
The metaphor of extracting ore from rock:
Does make sense for extracting documents of interest from a huge pile.
But does not reflect notions of DM in practice:
– finding patterns across large collections
– discovering heretofore unknown information
What would finding a pattern across a large text collection really look like …?
From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)
Bill Gates + MS-DOS in the Bible!
More info:
http://cs.anu.edu.au/~bdm/dilugim/gatesdet.txt
http://cs.anu.edu.au/~bdm/dilugim/torah.html
Real Text DM
The point:
Discovering heretofore unknown information is not what we usually do with text.
(If it weren’t known, it could not have been written by someone!)
However:
There is a field whose goal is to learn about patterns in text for their own sake ...
Computational Linguistics!
Goal: automated language understanding
this isn’t possible (yet)
instead, go for subgoals, e.g.,
– word sense disambiguation
– phrase recognition
– semantic associations
Common current approach:
statistical analyses over very large text collections
Why CL Isn’t TDM
A linguist finds it interesting that “cloying” co-occurs significantly with “Jar Jar Binks” ...
… But this doesn’t really answer a question relevant to the world outside the text itself.
Why CL Isn’t TDM
We need to use the text indirectly to answer questions about the world
Direct:
Analyze patent text; determine which word patterns indicate various subject categories.
Indirect:
Analyze patent text; find out whether private or public funding leads to more inventions.
Why CL Isn’t TDM
Direct:
Cluster newswire text; determine which terms are predominant
Indirect:
Analyze newswire text; gather evidence about which countries/alliances are dominating which financial sectors
Nuggets vs. Patterns
TDM: we want to discover new information …
… As opposed to discovering which statistical patterns characterize occurrence of known information.
Example: WSD
not TDM: computing statistics over a corpus to determine what patterns characterize Sense S.
TDM: discovering the meaning of a new sense of a word.
Nuggets vs. Patterns
Nugget:
a new, heretofore unknown item of information.
Pattern:
distributions or rules that characterize the occurrence (or non-occurrence) of a known item of information.
Application of rules can create nuggets in some circumstances.
Example: Lexicon Augmentation
Application of a lexico-syntactic pattern:
NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1
implies that
for all NPi, i >= 1, hyponym(NPi, NP0)
Extracts out a new hypernym:
“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
implies hyponym(“Gelidium”, “red algae”)
However, this fact was already known to the author of the text.
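A minimal sketch of how the "such as" pattern might be applied in code. It assumes noun phrases have already been bracketed by some upstream chunker; the bracket notation, the regular expression, and the function name `such_as_hyponyms` are illustrative assumptions, not part of the original slides.

```python
import re

# Assumption: an upstream chunker has already marked noun phrases with
# square brackets, e.g. "[red algae], such as [Gelidium]".
NP = r"\[([^\]]+)\]"
SUCH_AS = re.compile(
    r"\[(?P<np0>[^\]]+)\],? such as "
    r"(?P<items>\[[^\]]+\](?:(?:, (?:and |or )?| and | or )\[[^\]]+\])*)"
)

def such_as_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs licensed by 'NP0 such as NP1, ...'."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group("np0")
        # Every bracketed NP in the list after "such as" is a hyponym of NP0.
        for hyponym in re.findall(NP, match.group("items")):
            pairs.append((hyponym, hypernym))
    return pairs

sentence = (
    "[Agar] is [a substance] prepared from [a mixture] of [red algae], "
    "such as [Gelidium], for [laboratory or industrial use]."
)
pairs = such_as_hyponyms(sentence)
```

On the Agar sentence this yields the single pair ("Gelidium", "red algae"). A real system would need an actual noun-phrase chunker in place of the hand-added brackets.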
The Quandary
How do we use text to both
Find new information not known to the author of the text
Find information that is not about the text itself
Idea: Exploratory Data Analysis
Use large text collections to gather evidence to support (or refute) hypotheses
Not known to author: links across many texts
Not self-referential: work within the domain of discourse
Example: Etiology
Given
medical titles and abstracts
a problem (incurable rare disease)
some medical expertise
find causal links among
titles
symptoms
drugs
results
Swanson Example (1991)
Problem: Migraine headaches (M)
Facts extracted from medical journal titles:
stress associated with M
stress leads to loss of magnesium
calcium channel blockers prevent some M
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) implicated in M
high levels of magnesium inhibit SCD
M patients have high platelet aggregability
magnesium can suppress platelet aggregability
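The facts above can be read as pairwise links between terms, and the interesting step is noticing that two terms never linked directly share several intermediate terms. A minimal sketch of that step follows. The pair encoding is done by hand from the slide, and the function name `linking_terms` is an assumption made for illustration; Swanson's actual procedure was only partially automated.

```python
from collections import defaultdict

# Each fact from the slide, hand-encoded as an undirected link between terms.
facts = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("platelet aggregability", "migraine"),
    ("magnesium", "platelet aggregability"),
]

def linking_terms(facts, a, c):
    """Return the terms linked to both a and c, if a and c are not linked directly."""
    neighbors = defaultdict(set)
    for x, y in facts:
        neighbors[x].add(y)
        neighbors[y].add(x)
    if c in neighbors[a]:
        return []  # a direct link is already stated; nothing new to hypothesize
    return sorted(neighbors[a] & neighbors[c])

bridges = linking_terms(facts, "migraine", "magnesium")
```

No single title here links migraine and magnesium, yet four intermediate terms connect them, which is the kind of evidence that suggests a hypothesis worth testing.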
Swanson’s TDM
Two of his hypotheses have received some experimental verification.
His technique
Only partially automated
Required medical expertise
Some researchers are pursuing this further.
How to find functions of genes?
Important problem in molecular biology
Have the genetic sequence
Don’t know what it does
But …
– Know which genes it coexpresses with
– Some of these have known function
So … Infer function based on function of co-expressed genes
– This idea suggested to me by Michael Walker and others at Incyte Pharmaceuticals
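One simple way to read "infer function based on function of co-expressed genes" is as a vote over the known annotations. The sketch below assumes that reading. The gene names, function labels, and the function `infer_function` are hypothetical placeholders, not data or methods from the lecture.

```python
from collections import Counter

def infer_function(coexpressed, known_function):
    """Rank candidate functions by how many co-expressed genes carry them."""
    votes = Counter(
        known_function[gene] for gene in coexpressed if gene in known_function
    )
    return votes.most_common()

# Hypothetical gene names and function labels, for illustration only.
known_function = {
    "gene_a": "protease",
    "gene_b": "protease",
    "gene_c": "kinase",
}
coexpressed_with_mystery = ["gene_a", "gene_b", "gene_c", "gene_d"]
ranking = infer_function(coexpressed_with_mystery, known_function)
```

Genes with no known function (here `gene_d`) simply contribute no vote. The ranking is evidence for a hypothesis, not a conclusion.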
Gene Co-expression: Role in the genetic pathway
[Figure: pathway diagrams placing the unknown genes g? and h? relative to PSA, Kall., and PAP]
Other possibilities as well
Make use of the literature
Look up what is known about the other genes.
Different articles in different collections
Look for commonalities
Similar topics indicated by Subject Descriptors
Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
Developing Strategies
Different strategies seem needed for different situations
First: see what is known about Kallikrein.
7341 documents. Too many
AND the result with “disease” category
– If result is non-empty, this might be an interesting gene
Now get 803 documents
AND the result with PSA
– Get 11 documents. Better!
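The narrowing steps above amount to successive set intersections over search results. A minimal sketch under that assumption follows. The toy index and its document ids are invented for illustration and do not reproduce the counts on the slide, and the function name `narrow` is hypothetical.

```python
# Hypothetical toy index: search term or category -> set of document ids.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6, 7, 8},
    "disease": {2, 3, 4, 6, 9},
    "PSA": {3, 6, 10},
}

def narrow(index, terms):
    """AND the terms one at a time, recording the hit count after each step."""
    hits = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        hits = set(postings) if hits is None else hits & postings
        trace.append((term, len(hits)))
    return (hits if hits is not None else set()), trace

hits, trace = narrow(index, ["kallikrein", "disease", "PSA"])
```

Keeping the trace of hit counts per step makes it easy to see which restriction did the useful narrowing, which is the kind of strategy record a good UI could store for reuse.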
Developing Strategies
Look for commonalities among these documents
Manual scan through ~100 category labels
Would have been better if
– Automatically organized
– Intersections of “important” categories scanned for first
Try a new tack
Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
Hope they all talk about diagnostics and prostate cancer
Fortunately, 7 documents returned
Bingo! A relation to regulation of this cancer
Formulate a Hypothesis
Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
New tack: do some lab tests
See if mystery gene is similar in molecular structure to the others
If so, it might do some of the same things they do
Strategies again
In hindsight, combining all three genes was a good strategy.
Store this for later
Might not have worked
Need a suite of strategies
Build them up via experience and a good UI
Discovering Hypocritical Congresspersons
Feb 1, 1996
US House of Reps votes to pass Telecommunications Reform Act
This contains the CDA (Communications Decency Act)
– Sought to criminalize posting to the Internet any material deemed indecent and patently offensive, with no exception for socially redeeming material.
Violators subject to fines of $250,000 and 5 years in prison
Eventually struck down by courts
http://www.tbtf.com/resource/hypocrites.html
Discovering Hypocritical Congresspersons
Sept 11, 1998
US House of Reps votes to place the Starr report online
The content would (most likely) have violated the CDA
365 people were members for both votes
284 members voted aye both times
– 185 (94%) Republicans voted aye both times
– 96 (57%) Democrats voted aye both times
How to find Hypocritical Congresspersons?
This must have taken a lot of work
Hand cutting and pasting
Lots of picky details
– Some people voted on one but not the other bill
– Some people share the same name
Check for different county/state
Still messed up on “Bono”
Taking stats at the end on various attributes
– Which state
– Which party
Tools should help streamline, reuse results
The hardest part?
Knowing to compare these two sets of voting records in the first place.
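Once someone has decided to compare the two roll calls, the bookkeeping itself is easy to sketch. The example below assumes each roll call is keyed by (name, state), so that members who share a name stay distinct. The records and the function name `aye_both_times` are hypothetical placeholders, not the real voting data.

```python
from collections import Counter

# Hypothetical toy roll calls: (name, state) -> (party, vote).
cda_1996 = {
    ("Smith", "TX"): ("R", "aye"),
    ("Smith", "NJ"): ("R", "aye"),
    ("Jones", "CA"): ("D", "no"),
    ("Lee", "NY"): ("D", "aye"),
}
starr_1998 = {
    ("Smith", "TX"): ("R", "aye"),
    ("Smith", "NJ"): ("R", "no"),
    ("Lee", "NY"): ("D", "aye"),
    ("Kim", "WA"): ("D", "aye"),
}

def aye_both_times(first, second):
    """Count members present for both votes and those voting aye on both."""
    both = first.keys() & second.keys()  # drops people who voted on only one bill
    ayes = sorted(
        k for k in both if first[k][1] == "aye" and second[k][1] == "aye"
    )
    by_party = Counter(second[k][0] for k in ayes)
    return len(both), ayes, by_party

members, ayes, by_party = aye_both_times(cda_1996, starr_1998)
```

Keying on name alone would merge the two Smiths, which is the kind of picky detail the slide describes. A (name, state) key is itself only a heuristic and can still fail in cases the data does not anticipate.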
Summary
Text Data Mining:
Extracting heretofore undiscovered information from large text collections
Information Access ≠ TDM
IA: locating already known information that is currently of interest
Finding patterns across text is already done in CL
Tells us about the behavior of language
Helps build very useful tools!