A Text Filtering Method For Digital Libraries. Mustafa Zafer BOLAT, Hayri SEVER.

Dec 21, 2015

Page 1:

A Text Filtering Method For Digital Libraries

Mustafa Zafer BOLAT

Hayri SEVER

Page 2:

Introduction

• Information filtering (IF)
– Incoming relevant documents are routed to profiles (queries).

• Information retrieval (IR)
– Provides a list of documents ordered by their similarity to the user query.

Page 3:

Introduction (continued)

• Linear separation: partitions relevant and non-relevant documents into distinct blocks.

• Optimal queries: all relevant documents are ranked ahead of non-relevant ones.

• Steepest Descent Algorithm (SDA)

Page 4:

Preliminaries

• An information retrieval system S can be defined as a 5-tuple:

• S = (T, D, Q, V, f)

– T: set of ordered index terms
– D: set of documents
– Q: set of queries
– V: set of real numbers
– $f: D \times Q \to V$: retrieval function

Page 5:

Preliminaries (continued)

• Vector Space Model
– Transforms raw text into more computationally useful forms.
– Documents and queries are represented as vectors of weighted terms.

• $d = (t_1, w_{d1}; t_2, w_{d2}; \ldots; t_n, w_{dn})$, $t_i \in T \cap d$

• $q = (q_1, w_{q1}; q_2, w_{q2}; \ldots; q_m, w_{qm})$, $q_i \in T \cap q$
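The weighted-term representation can be illustrated with a minimal sketch. It assumes sparse vectors stored as Python dicts and an inner-product retrieval function; the slides do not state which function f is used, and the names `retrieval_score`, `doc`, and `query` as well as the weights are hypothetical.

```python
def retrieval_score(doc, query):
    """Inner product of two sparse term-weight vectors (an assumed choice of f)."""
    # Only terms that occur in both vectors contribute to the score.
    return sum(w * query[t] for t, w in doc.items() if t in query)


# Hypothetical weighted-term vectors for a document d and a query q.
doc = {"meat": 0.5, "trade": 0.25, "petition": 0.125}
query = {"meat": 1.0, "petition": 2.0}
print(retrieval_score(doc, query))
```

Storing only non-zero weights keeps the vectors small when the index term set T is large.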

Page 6:

Preliminaries (continued)

• Rnorm value for effectiveness
– Measures how relevant documents are distributed over non-relevant ones.
– Rank matters.
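The slide does not spell out the Rnorm formula. The sketch below uses one common formulation, Rnorm = 1/2 (1 + (S+ − S−) / S+max), where S+ counts relevant/non-relevant pairs ranked in the right order, S− counts pairs in the wrong order, and S+max is the number of such pairs. Treating the ranking as strict (no ties) and the function name `rnorm` are assumptions made for illustration.

```python
def rnorm(ranked_relevance):
    """ranked_relevance: list of booleans, best-ranked document first."""
    s_plus = s_minus = 0
    rel_seen = nonrel_seen = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            # Non-relevant documents ranked above this relevant one are misordered pairs.
            s_minus += nonrel_seen
            rel_seen += 1
        else:
            # Relevant documents ranked above this non-relevant one are correct pairs.
            s_plus += rel_seen
            nonrel_seen += 1
    s_max = rel_seen * nonrel_seen
    if s_max == 0:
        return 1.0  # degenerate case: no relevant/non-relevant pair exists
    return 0.5 * (1 + (s_plus - s_minus) / s_max)
```

With this definition a ranking that places every relevant document first scores 1, and the reverse ranking scores 0.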

Page 7:

Preliminaries (continued)

Contingency Table

                          actual relevant   actual non-relevant
predicted relevant               a                   b
predicted non-relevant           c                   d

• Precision = a / (a + b)
• Recall = a / (a + c)

• Breakeven point: the point where precision and recall are equal
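A minimal sketch of these measures follows. The slides do not say how the breakeven point is located; the sketch assumes one simple approach, cutting the ranked list at as many documents as there are relevant ones, where precision and recall coincide by construction. The function names are illustrative.

```python
def precision_recall(a, b, c):
    """a, b, c are the contingency-table cells defined above."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall


def breakeven(scores, labels):
    """Precision and recall when exactly len(relevant) documents are retrieved."""
    n_rel = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    a = sum(label for _, label in ranked[:n_rel])  # relevant and retrieved
    b = n_rel - a                                  # retrieved but not relevant
    c = n_rel - a                                  # relevant but not retrieved
    return precision_recall(a, b, c)
```

Because b and c are equal at this cutoff, the two returned values are always the same number.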

Page 8:

Overview of experiment

[Flow diagram of the experiment. Stages shown: Reuters-21578 data set (train and test sets, category labels); Preprocessing (Parsing, Removing stop words, Stemming, Transform to Vectors, Reducing, Normalizing); Training with SDA; Optimal query; Effectiveness measures.]

Page 9:

Overview of experiment

Reuters-21578 data set

• Consists of 21578 economic news stories that originally appeared on the Reuters newswire in 1987.
• Each story has been manually assigned one or more indexing labels from a fixed list.
• There are 135 TOPIC labels for classification.
• To use the text corpus for machine learning research, it is split into sets of training and testing examples.

Page 10:

Overview of experiment

Preprocessing

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"

OLDID="9944" NEWID="5031"><DATE>13-MAR-1987 15:45:35.38</DATE>

<TOPICS><D>livestock</D><D>carcass</D></TOPICS><PLACES><D>usa</D></PLACES>

<PEOPLE></PEOPLE><ORGS><D>ec</D></ORGS>

<EXCHANGES></EXCHANGES><COMPANIES></COMPANIES>

<TEXT>&#2;<TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>

<DATELINE> WASHINGTON, March 13 - </DATELINE><BODY>The American Meat Institute, AME,said it intended to ask the U.S.

government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products.

Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards.

Reuter&#3;</BODY></TEXT>

</REUTERS>

Sample Reuters 21578 Document


Page 11:

Overview of experiment

Preprocessing: Parsing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

Page 12:

Overview of experiment

Preprocessing: After Parsing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
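The parsing step can be sketched as follows. This is an illustrative approximation built on regular expressions, not the authors' parser: it assumes that parsing extracts the TOPICS and LEWISSPLIT attributes, the topic list, and the title plus body, and then drops punctuation and digits while keeping apostrophes, as the before/after slides suggest. The name `parse_reuters` and the returned keys are hypothetical.

```python
import re


def parse_reuters(record):
    """Extract split, topics, and cleaned text from one <REUTERS> record."""
    def attr(name):
        m = re.search(name + r'="([^"]*)"', record)
        return m.group(1) if m else ""

    def field(tag):
        m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), record, re.S)
        return m.group(1) if m else ""

    topics = re.findall(r"<D>(.*?)</D>", field("TOPICS"))
    text = field("TITLE") + " " + field("BODY")
    text = re.sub(r"&#\d+;", " ", text).strip()  # control-character references
    text = re.sub(r"\bReuter$", "", text)         # trailing agency sign-off
    text = re.sub(r"[^A-Za-z'\s]", " ", text)     # punctuation and digits
    return {
        "has_topics": attr("TOPICS"),
        "split": attr("LEWISSPLIT"),
        "topics": topics,
        "body": " ".join(text.split()),
    }


# Shortened version of the sample record from the slides.
record = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN">'
    "<TOPICS><D>livestock</D><D>carcass</D></TOPICS>"
    "<TEXT>&#2;<TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>"
    "<BODY>AME President C. Manly Molpus also said the industry would file "
    "a petition challenging Korea's ban of U.S. meat products.\n"
    " Reuter&#3;</BODY></TEXT></REUTERS>"
)
parsed = parse_reuters(record)
print(parsed["split"], parsed["topics"])
print(parsed["body"])
```

Regular expressions are used because the Reuters-21578 files are SGML with control-character references, which a strict XML parser may reject.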

Page 13:

Overview of experiment

Preprocessing: Removing Stop Words

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

Page 14:

Overview of experiment

Preprocessing: After Removing Stop Words

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: . MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
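Stop-word removal can be sketched in a few lines. The stop list actually used in the experiment is not given on the slides, so `STOP_WORDS` below is a tiny hypothetical list chosen only to reproduce the opening of the example, and matching case-insensitively is likewise an assumption.

```python
# A tiny illustrative stop list; the list used in the experiment is not given.
STOP_WORDS = {
    "the", "to", "it", "said", "also", "would", "a", "of", "that",
    "and", "other", "under", "an", "against", "will", "with", "told",
}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token whose lowercase form is a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


print(remove_stop_words("The American Meat Institute AME said it intended to ask"))
```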

Page 15:

Overview of experiment

Preprocessing: Stemming

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard
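The slides do not name the stemmer. The toy suffix stripper below is not that stemmer and covers only a handful of rules (plural endings, "-ing", and a final "y"), enough to reproduce a few of the forms in the example such as "product", "challeng", "compli", and "fulli". It also lowercases its input, which the slide output does not do. The name `toy_stem` is hypothetical.

```python
VOWELS = set("aeiou")


def toy_stem(word):
    """Very small suffix stripper for illustration; not a complete stemmer."""
    w = word.lower()
    # Plural endings.
    if w.endswith("sses"):
        w = w[:-2]
    elif w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # "-ing" is stripped only when a vowel remains in the stem.
    if w.endswith("ing") and VOWELS & set(w[:-3]):
        w = w[:-3]
    # A final "y" becomes "i" when the rest of the word contains a vowel.
    if w.endswith("y") and VOWELS & set(w[:-1]):
        w = w[:-1] + "i"
    return w


print([toy_stem(w) for w in ["products", "challenging", "comply", "fully"]])
```

A full stemmer applies many more rules (for example, the slide maps "directive" to "direct"), so this sketch should not be used in place of one.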

Page 16:

Overview of experiment

Preprocessing: Transform To Vectors

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

meat      5
group     1
...       ...
Molpus    1
...       ...
standard  1
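Turning the stemmed text into a term-frequency vector can be sketched with `collections.Counter`. Counting case-insensitively is an assumption, suggested by the count of 5 for "meat" in the example; the token list below is a short excerpt, so its counts differ from the slide.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term (lowercased) to the number of times it occurs."""
    return Counter(t.lower() for t in tokens)


# Short excerpt of the stemmed example text.
tokens = "MEAT GROUP FILE TRADE COMPLAINT American Meat Institute".split()
tf = term_frequencies(tokens)
print(tf["meat"], tf["group"])
```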

Page 17:

Overview of experiment

Preprocessing: Create Dictionary (only in training)

approv    1236
chairman  1225
...       ...
ptd       5

Page 18:

Overview of experiment

Preprocessing: Reducing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...       ...
group     1
meat      5
Molpus    ...
...       ...
standard  1
...       ...

Page 19:

Overview of experiment

Preprocessing: After Reducing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...       ...
group     1
meat      5
...       ...
standard  1
...       ...
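The slides do not state the reduction criterion. The sketch below assumes that reducing means keeping only the terms present in the dictionary built from the training set, which would explain why "Molpus" disappears from the example vector. The function name and the dictionary excerpt are hypothetical.

```python
def reduce_vector(term_counts, dictionary):
    """Keep only the terms that occur in the training dictionary."""
    return {t: c for t, c in term_counts.items() if t in dictionary}


doc = {"group": 1, "meat": 5, "Molpus": 1, "standard": 1}
dictionary = {"group", "meat", "standard"}  # hypothetical excerpt of the dictionary
print(reduce_vector(doc, dictionary))
```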

Page 20:

Overview of experiment

Preprocessing: Normalizing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...       ...
group     1
meat      5
...       ...
standard  1
...       ...

$w_k = t_k \times \log(N_D / n_k)$

– $t_k$: term frequency
– $N_D$: number of documents in the collection
– $n_k$: number of documents containing term $k$

$w'_k = w_k / \sqrt{\sum_j w_j^2}$

– $w'_k$: normalized weight of term $k$
– $w_k$: unnormalized weight of term $k$

Page 21:

Overview of experiment

Preprocessing: After Normalizing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...       ...
group     0.127
meat      0.278
...       ...
standard  0.012
...       ...
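The weighting and normalization formulas can be sketched as below. The logarithm base is not given on the slides; the natural logarithm is assumed here, and because a change of base multiplies every weight by the same constant, it cancels out in the normalized result. The function name and the document-frequency figures in the usage lines are hypothetical.

```python
import math


def tfidf_normalized(term_counts, doc_freq, n_docs):
    """Weight each term by tf * log(N_D / n_k), then scale the vector to unit length."""
    weights = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # every weight is zero; nothing to scale
    return {t: w / norm for t, w in weights.items()}


# Hypothetical document frequencies for a 1000-document collection.
vec = tfidf_normalized({"meat": 5, "group": 1}, {"meat": 10, "group": 100}, 1000)
print(vec)
```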

Page 22:

Overview of experiment

Training with SDA

1. Choose a starting query vector $Q_0$; let $k = 0$.

2. Let $Q_k$ be the query vector at the start of the $(k+1)$th iteration; identify the following set of difference vectors:
   $\Gamma(Q_k) = \{ b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \}$;
   if $\Gamma(Q_k) = \emptyset$, then $Q_{opt} = Q_k$ is a solution and the algorithm exits; otherwise,

3. Let $Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$.

4. Let $k = k + 1$; go back to Step 2.

Here $d \succ d'$ means that document $d$ is preferred to document $d'$.
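The training steps can be sketched as follows, under stated assumptions: documents are dense lists of weights, f is taken to be the inner product, every relevant document is preferred to every non-relevant one, and an iteration cap (`max_iters`, set to 150 as in the training settings of this deck) stops the loop when the data are not linearly separable. The names `dot` and `sda` are illustrative, and this is not the authors' implementation.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sda(relevant, nonrelevant, q0, max_iters=150):
    """Return (query, converged) after running the update on difference vectors."""
    # One difference vector b = d - d' per (relevant, non-relevant) pair.
    diffs = [[x - y for x, y in zip(d, d2)] for d in relevant for d2 in nonrelevant]
    q = list(q0)
    for _ in range(max_iters):
        # Difference vectors that the current query fails to rank correctly.
        gamma = [b for b in diffs if dot(q, b) <= 0]
        if not gamma:
            return q, True
        # Add the sum of the violating difference vectors to the query.
        for b in gamma:
            q = [qi + bi for qi, bi in zip(q, b)]
    return q, not any(dot(q, b) <= 0 for b in diffs)


# Hypothetical two-term documents.
relevant = [[1.0, 0.0], [0.9, 0.1]]
nonrelevant = [[0.0, 1.0], [0.1, 0.9]]
q_opt, converged = sda(relevant, nonrelevant, [0.0, 0.0])
print(q_opt, converged)
```

When the loop converges, every relevant document scores strictly higher than every non-relevant one under the returned query.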

Page 23:

Overview of experiment

Training with SDA

• All examples of the category are used as positive examples.
• A random 60% of the examples from other topics are used as negative examples.
• If the maximum Rnorm value (1) is not reached within 150 iterations, the optimal query is set to the query that produced the highest Rnorm value found.
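Building the per-category training set can be sketched as below. The data layout (a list of `(vector, topics)` pairs), the fixed random seed, and the rounding of the 60% sample size are all assumptions made for illustration; the slides state only that all category examples are positives and a random 60% of the rest are negatives.

```python
import random


def build_training_set(docs, category, negative_fraction=0.6, seed=0):
    """docs: list of (vector, topics) pairs; returns (positives, negatives)."""
    positives = [v for v, topics in docs if category in topics]
    others = [v for v, topics in docs if category not in topics]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = round(negative_fraction * len(others))
    negatives = rng.sample(others, k)
    return positives, negatives
```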

Page 24:

Overview of experiment

Category labels

There are 135 categories. Number of positive examples (# of +) for ten topics:

Topic      # of + (train)   # of + (test)
earn            2877             1087
acq             1650              719
money-fx         538              179
grain            433              149
crude            389              189
trade            369              118
interest         347              131
wheat            212               71
ship             197               89
corn             182               56

Page 25:

Overview of experiment

Effectiveness measures

• Create contingency tables
• Find breakeven points

Page 26:

Results: breakeven points

Topic       Findsim  Nbayes  SDA    Bnets  Trees  SVM
earn        92.9     95.9    96.32  95.8   97.8   98.0
acq         64.7     87.8    85.26  88.3   89.7   93.6
money-fx    46.7     56.6    68.72  58.8   66.2   74.5
grain       67.5     78.8    71.81  81.4   85.0   94.6
crude       70.1     79.5    82.54  79.6   85.0   88.9
trade       65.1     63.5    65.25  69.0   72.5   75.9
interest    63.4     64.9    61.07  71.3   67.1   77.7
wheat       68.9     69.7    76.06  82.7   92.5   91.9
ship        49.2     85.4    65.17  84.4   74.2   85.6
corn        48.2     65.3    75.00  76.4   91.8   90.3
Avg.Top 10  64.6     81.5    84.54  85.0   88.4   92.0
Avg.All     61.7     75.2    76.37  80.0   N/A    87.0

Page 27:

Thank you!