
2002.05.13 Internet Technology 강혜원

Evaluating the novelty of text-mined rules using lexical knowledge

Sugato Basu, Raymond J. Mooney, Krupakar V. Pasupuleti, Joydeep Ghosh

Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, San Francisco, California


Index

Introduction
Background
Scoring the novelty of rules
Experimental results
Future work
Conclusion

Introduction – Evaluating the interestingness of mined rules

Measure simplicity (e.g. rule size)
Measure certainty (e.g. confidence)
Measure utility (e.g. support)
Measure novelty
• Assess whether a mined rule expresses a previously unknown relationship
• E.g. a text-mined rule from computer science job postings: SQL → Database, a relationship that is not interesting to computer scientists

Introduction – How novelty is measured

Measure the semantic distance between words in WordNet
Rule mined with DiscoTEX: w1 → w2
Semantic distance in WordNet: d(w1, w2)
The larger d(w1, w2), the more novel the rule
e.g. beer → diaper is more novel than beer → pretzels

Background – Text Mining

Discovers knowledge from unstructured or semi-structured textual data (e.g. web pages)
Lies at the intersection of natural language processing, machine learning and information retrieval
Text mining with DiscoTEX
• Extracts the content of web pages into structured templates according to predefined slots (e.g. title, author, subject, etc.)
• Extracts prediction rules from the resulting template database


Background – Text Mining

DiscoTEX rule mined from Amazon.com “romance” book description

Derived rule: daring love woman romance historical fiction story read wonderful

<title> daring, love

<synopses> woman

<subject> romance, historical, fiction

<comments> story, read, wonderful
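As a small illustration of the template idea, the slot fillers shown above can be held in a mapping from slot name to words. This is only a sketch: the names `rule_slots` and `rule_words` are hypothetical and are not part of DiscoTEX.

```python
from typing import Dict, List

# Slot fillers copied from the example rule above (slot name -> words).
rule_slots: Dict[str, List[str]] = {
    "title": ["daring", "love"],
    "synopses": ["woman"],
    "subject": ["romance", "historical", "fiction"],
    "comments": ["story", "read", "wonderful"],
}


def rule_words(slots: Dict[str, List[str]]) -> List[str]:
    """Flatten the slot fillers into one word list, keeping slot order."""
    return [word for words in slots.values() for word in words]


print(" ".join(rule_words(rule_slots)))
```

Flattening the slots reproduces the word sequence of the derived rule, which is the form the later word-to-word distance scoring works on.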

Background – WordNet

An online lexical knowledge base covering 130,000 English words
Nouns, adjectives, verbs and adverbs are grouped into sets of synonyms (synsets)
Defines relations such as synonymy, antonymy, hypernymy/hyponymy, part/whole and entailment
http://www.cogsci.princeton.edu/~wn/

Semantic Relation         Examples
Synonymy (similar)        pipe, tube; sad, unhappy; rapidly, speedily
Antonymy (opposite)       wet, dry; powerful, powerless; friendly, unfriendly
Hyponymy (subordinate)    sugar maple < maple; maple < tree; tree < plant
Meronymy (part)           brim, hat; gin, martini
Troponymy (manner)        march, walk; whisper, speak
Entailment                drive, ride; divorce, marry

[Figure: fragment of the WordNet noun network. Nodes: group, family, person, relative, brother, sister, natural object, body, arm, leg, substance, organic substance, flesh, bone. Edge types: hyponymy, antonymy, meronymy]


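Background – Measuring semantic similarity between words

Negative logarithm of the normalized shortest path length between two words
• Path length: measured in number of nodes
• Normalizing factor: maximum depth of the hierarchy
Conceptual distance: when two pairs have the same path length (number of nodes), the pair lower in the hierarchy is more similar than the pair higher up
Depth-relative scaling
Considering the number of direction changes as well as the path length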
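The first measure above can be sketched as follows. The slide only says the path length is normalized by the maximum depth; dividing by twice the maximum depth follows the commonly cited Leacock-Chodorow form and is an assumption here, as is the function name.

```python
import math


def neg_log_path_similarity(path_length_nodes: int, max_depth: int) -> float:
    """Similarity as the negative log of the normalized shortest path length.

    path_length_nodes: number of nodes on the shortest path (at least 1).
    max_depth: maximum depth of the hierarchy (the normalizing factor).
    The factor 2 in the denominator is an assumption, not stated on the slide.
    """
    if path_length_nodes < 1 or max_depth < 1:
        raise ValueError("path length and depth must be positive")
    return -math.log(path_length_nodes / (2.0 * max_depth))
```

A shorter path gives a larger similarity, so in a hierarchy of depth 16 a path of 1 node scores higher than a path of 4 nodes.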

Scoring the novelty of rules – Semantic Distance Measure

Semantic distance between two words: d(wi, wj) = Dist(P(wi, wj)) + K * Dir(P(wi, wj))

P(wi, wj): the weighted shortest path between the two words

Dist(p): the distance of path p under the weighting scheme
Dir(p): the number of direction changes along path p
K: a constant
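A minimal sketch of this formula, assuming the path is already known and is given as a list of (edge weight, direction) pairs. The representation and function names are illustrative, and the slides do not give a value for K, so it is left as a required argument.

```python
from typing import List, Tuple

# One edge of a path: (edge weight, direction), direction being
# "up", "down" or "horizontal". This representation is an assumption.
Edge = Tuple[float, str]


def dist(path: List[Edge]) -> float:
    """Dist(p): sum of the edge weights along the path."""
    return sum(weight for weight, _ in path)


def direction_changes(path: List[Edge]) -> int:
    """Dir(p): number of times consecutive edges differ in direction."""
    directions = [direction for _, direction in path]
    return sum(1 for prev, cur in zip(directions, directions[1:]) if prev != cur)


def semantic_distance(path: List[Edge], k: float) -> float:
    """d = Dist(p) + K * Dir(p) for a path p that has already been found."""
    return dist(path) + k * direction_changes(path)
```

For example, a path that climbs two edges of weight 1.5 and then descends one edge of weight 1.5 has Dist = 4.5 and Dir = 1, so with K = 1.0 its distance is 5.5.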

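Edge weighting (Depth-relative scaling)

$$ w(A, B) = \frac{w(A \rightarrow_r B) + w(B \rightarrow_{r_i} A)}{2d} $$

given

$$ w(X \rightarrow_r Y) = \max_r - \frac{\max_r - \min_r}{n_r(X)} $$

r: a relation of type r

r_i: the inverse relation of a type-r relation

d: the depth, within the whole tree, of the deeper of the two nodes
max_r, min_r: the maximum and minimum weights a relation of type r can take

n_r(X): the number of type-r relations leaving node X

[Figure: nodes A, B and C connected by relation r and its inverse r_i]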
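The two edge-weighting formulas can be written as a short sketch. The function names are illustrative and the numbers in the usage note are made up.

```python
def directed_weight(max_r: float, min_r: float, n_r: int) -> float:
    """w(X ->r Y) = max_r - (max_r - min_r) / n_r(X).

    n_r is the number of type-r relations leaving node X; the more such
    relations leave X, the closer the weight gets to max_r.
    """
    if n_r < 1:
        raise ValueError("n_r must be at least 1")
    return max_r - (max_r - min_r) / n_r


def edge_weight(w_a_to_b: float, w_b_to_a: float, depth: int) -> float:
    """w(A, B): the two directed weights averaged and scaled by depth d,
    where d is the depth of the deeper of the two nodes."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return (w_a_to_b + w_b_to_a) / (2.0 * depth)
```

With max_r = 2.0, min_r = 1.0 and four type-r relations leaving X, the directed weight is 1.75. The same pair of directed weights yields a smaller edge weight deeper in the tree, which is the depth-relative scaling effect.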

Classification of WordNet relations by direction

Relation      Direction
Also see      Horizontal
Antonymy      Horizontal
Attribute     Horizontal
Cause         Down
Entailment    Down
Holonymy      Down
Hypernymy     Up
Hyponymy      Down
Meronymy      Up
Pertinence    Horizontal
Similarity    Horizontal

[Figure: path patterns allowed between synsets]

Direction and weight information for the 15 WordNet relations used
Inverse relations are excluded from the weight calculation; only one direction is considered

Relations                                                      Direction     Weight
Synonym, Attribute, Similar, Pertainym                         Horizontal    0.5
Antonym                                                        Horizontal    2.5
Hypernym, (Member|Part|Substance) Meronym                      Up            1.5
Hyponym, (Member|Part|Substance) Holonym, Cause, Entailment    Down          1.5
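The table above can be used directly as a lookup for a weighted shortest-path search. The sketch below runs Dijkstra's algorithm over a tiny hand-made graph. The graph `TOY_GRAPH`, the relation key spellings and the function name are all hypothetical, and the flat weights from the table are used as-is.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# relation -> (direction, weight), transcribed from the table above.
RELATIONS: Dict[str, Tuple[str, float]] = {
    **{r: ("horizontal", 0.5) for r in ("synonym", "attribute", "similar", "pertainym")},
    "antonym": ("horizontal", 2.5),
    **{r: ("up", 1.5) for r in ("hypernym", "member_meronym", "part_meronym",
                                "substance_meronym")},
    **{r: ("down", 1.5) for r in ("hyponym", "member_holonym", "part_holonym",
                                  "substance_holonym", "cause", "entailment")},
}

# node -> [(relation, neighbour)]; a made-up fragment, not real WordNet data.
TOY_GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "brother": [("hypernym", "relative"), ("antonym", "sister")],
    "relative": [("hyponym", "sister")],
    "sugar maple": [("hypernym", "maple")],
    "maple": [("hypernym", "tree")],
}


def shortest_path(graph: Dict[str, List[Tuple[str, str]]], source: str,
                  target: str) -> Optional[Tuple[float, List[str]]]:
    """Dijkstra search; returns (total weight, relations on the path) or None."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, source, [])]
    settled = set()
    while queue:
        cost, node, relations = heapq.heappop(queue)
        if node == target:
            return cost, relations
        if node in settled:
            continue
        settled.add(node)
        for relation, neighbour in graph.get(node, []):
            if neighbour not in settled:
                step = RELATIONS[relation][1]
                heapq.heappush(queue, (cost + step, neighbour, relations + [relation]))
    return None
```

The directions along a returned path can be read back with `[RELATIONS[r][0] for r in relations]`, which is the input needed to count direction changes.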

Scoring the novelty of rules – Rule Scoring Algorithm

Rnoun: connects the 11 separate noun trees under a single root node
Rverb: connects the separate verb trees under a single root node

Rtop: a root node connecting Rnoun and Rverb

Adjectives and adverbs are each linked to their corresponding nouns

[Figure: Rtop at the top, with Rnoun and Rverb beneath it]

For each rule in a rule file
    Let A = set of antecedent words, C = set of consequent words
    For each word wi ∈ A and wj ∈ C
        If neither wi nor wj is a valid word in WordNet
            Score(wi, wj) ← PathViaRoot(davg, davg)
        ElseIf wj is not a valid word in WordNet
            Score(wi, wj) ← PathViaRoot(wi, davg)
        ElseIf wi is not a valid word in WordNet
            Score(wi, wj) ← PathViaRoot(davg, wj)
        ElseIf path not found between wi and wj (in user-specified time-limit)
            Score(wi, wj) ← PathViaRoot(wi, wj)
        Else
            Score(wi, wj) ← d(wi, wj)
    Score of rule = Average of all (wi, wj) scores
Sort scored rules in descending order
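The scoring loop above can be sketched in Python with the WordNet-specific parts passed in as callables. All three callables (`is_valid`, `distance`, `path_via_root`) and the `AVG_DEPTH` sentinel standing in for davg are assumptions made for illustration. `distance` is expected to return `None` when no path is found within the time limit.

```python
from typing import Callable, List, Optional, Sequence, Tuple

AVG_DEPTH = object()  # stands in for davg: a word that is not in WordNet

Rule = Tuple[Sequence[str], Sequence[str]]  # (antecedent words, consequent words)


def score_rule(antecedent: Sequence[str], consequent: Sequence[str],
               is_valid: Callable[[str], bool],
               distance: Callable[[str, str], Optional[float]],
               path_via_root: Callable[[object, object], float]) -> float:
    """Average pairwise score over all (antecedent, consequent) word pairs."""
    scores: List[float] = []
    for wi in antecedent:
        for wj in consequent:
            a = wi if is_valid(wi) else AVG_DEPTH
            b = wj if is_valid(wj) else AVG_DEPTH
            if a is AVG_DEPTH or b is AVG_DEPTH:
                # One or both words unknown: fall back to a path via the root.
                scores.append(path_via_root(a, b))
                continue
            d = distance(wi, wj)  # None means no path was found in time
            scores.append(path_via_root(wi, wj) if d is None else d)
    if not scores:
        raise ValueError("a rule needs at least one antecedent and one consequent word")
    return sum(scores) / len(scores)


def rank_rules(rules: Sequence[Rule], is_valid, distance,
               path_via_root) -> List[Tuple[float, Rule]]:
    """Score every rule and sort in descending order of score."""
    scored = [(score_rule(a, c, is_valid, distance, path_via_root), (a, c))
              for a, c in rules]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Passing the lookups in as functions keeps the sketch independent of any particular WordNet library and makes the four fallback cases easy to exercise with stubs.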

[Figure: words A and B joined through a common node at distances X and Y; word C attached at distance Z]

Assumption: A, B and C are all words in WordNet; A and B are nouns or verbs, and C is an adjective or adverb

PathViaRoot(A, B) = X + Y

PathViaRoot(B, C) = X + Y + Z

If the path passes through Rnoun or Rverb:

PathViaRoot(A, B) + POSRootPenalty (3.0)

If the path passes through Rtop:

PathViaRoot(A, B) + TOPRootPenalty (4.0)
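A minimal sketch of the PathViaRoot computation, assuming the distances X, Y and (for adjectives or adverbs) Z are already known. The function signature and the `via` labels are illustrative; only the two penalty values come from the slide.

```python
POS_ROOT_PENALTY = 3.0  # path passes through Rnoun or Rverb
TOP_ROOT_PENALTY = 4.0  # path passes through Rtop

_PENALTIES = {"common": 0.0, "pos_root": POS_ROOT_PENALTY, "top_root": TOP_ROOT_PENALTY}


def path_via_root(dist_x: float, dist_y: float, via: str = "common",
                  dist_z: float = 0.0) -> float:
    """Distance of a path joined through a shared ancestor.

    dist_x, dist_y: distances of the two words from the joining node.
    dist_z: extra distance for an adjective or adverb to its linked noun.
    via: "common" for an ordinary common node, "pos_root" when the join is
         Rnoun or Rverb, "top_root" when the join is Rtop.
    """
    if via not in _PENALTIES:
        raise ValueError(f"unknown join type: {via}")
    return dist_x + dist_y + dist_z + _PENALTIES[via]
```

For example, `path_via_root(2.0, 3.0)` gives 5.0, while the same distances joined through Rnoun give 8.0 because of the part-of-speech root penalty.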

[Figure: word A at distance X and word B at distance Y from the root node]

Assumption: B is a word that is not in WordNet (proper noun, technical term, ...)

Average depth of B estimated by a sampling technique ≈ 6

Experimental Results - Methodology

Novelty is a subjective notion
Two types of average correlation are computed
• A: the average correlation between human novelty ratings and the algorithm's novelty ratings
• B: the average correlation among the human novelty ratings themselves
If A and B do not differ much, the experiment is considered successful
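The two averages can be sketched as below. The slides do not name the correlation coefficients, so using Pearson correlation on raw scores and Spearman correlation (Pearson on ranks) is an assumption, and all function names are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_human_human(ratings: Sequence[Sequence[float]],
                     corr: Callable = pearson) -> float:
    """B: average correlation over all pairs of human raters (needs >= 2)."""
    pairs = list(combinations(ratings, 2))
    return sum(corr(a, b) for a, b in pairs) / len(pairs)


def mean_algorithm_human(algorithm: Sequence[float],
                         ratings: Sequence[Sequence[float]],
                         corr: Callable = pearson) -> float:
    """A: average correlation between the algorithm and each human rater."""
    return sum(corr(algorithm, r) for r in ratings) / len(ratings)
```

Calling either averaging function with `corr=spearman` gives the rank-based variant instead of the raw-score one.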

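Experimental Results - Methodology

[Figure: experimental flow]
Web pages from http://www.amazon.com in four categories: Literature, Science, Romance, Fantasy
DiscoTEX extracts 9000 rules
Random sampling: 25 * 4 sets
Rule set 1 to subject group 1, Rule set 2 to subject group 2, ...
Human novelty ratings
Compare two correlations: the average correlation among subjects' ratings vs. the average correlation between subjects' ratings and the algorithm's ratings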


Experimental Results – Results and Discussion

High score(>8) :

romance love heart midnight

space astrophysics science astronautics apollo

science computers literature war world

nonlinear science physics

Medium score(4-6) :

author romance -> characters love

applied science mathematics exercises

science physics theory

analysis science nature

love read romance fiction

Low score(<2) :

astronomy science space

geography nature world

mechanics physics science

fiction classics literature

sea geography ocean

Experimental Results – Results and Discussion

Results with all subjects

          Human-Human Correlation    Algorithm-Human Correlation
          Raw      Rank              Raw      Rank
Group1    0.284    0.269             0.158    0.113
Group2    0.299    0.282             0.357    0.330
Group3    0.217    0.223             0.303    0.297

Results after removing outliers

          Human-Human Correlation    Algorithm-Human Correlation
          Raw      Rank              Raw      Rank
Group1    0.350    0.338             0.187    0.137
Group2    0.412    0.393             0.386    0.363
Group3    0.337    0.339             0.339    0.338

Experimental Results – Results and Discussion

Except for Group1, the human-human and human-algorithm correlations are similar
Reasons for Group1's low correlation
• Proper nouns: not in WordNet (e.g. ieee society → science mathematics)
• Person names: may or may not be in WordNet (e.g. physics science nature → john wiley publisher sons)
• Relations: certain relations are missing from WordNet (e.g. sea → oceanography)

Future work

Parameters of the algorithm (e.g. the weights of the WordNet relations, the constant K, POSRootPenalty and TopRootPenalty): should be selected automatically with machine learning techniques
Relations missing from WordNet (e.g. "pencil" & "paper"): can be addressed with statistical methods based on word co-occurrence
• A semantic distance measure that combines WordNet with co-occurrence-based methods

Conclusion

Presented a new method that uses WordNet's lexical knowledge to measure the novelty of rules mined from text data
Successfully demonstrated how to evaluate the algorithm by comparing human-human correlation with human-algorithm correlation
