Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University)
WWW 2008

Transcript
Page 1:

Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University)

WWW 2008

NLG Seminar, 2008/12/31
Reporter: Kai-Jie Ko

Page 2:

Motivation

Many classification tasks that work with short segments of text and Web data, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy because of data sparseness.

Page 3:

Previous work to overcome data sparseness

Employ search engines to expand and enrich the context of the data.

Page 4:

Previous work to overcome data sparseness

Employ search engines to expand and enrich the context of the data.

Time consuming!

Page 5:

Previous work to overcome data sparseness

Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources.

Page 6:

Previous work to overcome data sparseness

Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources.

These approaches use only the user-defined categories and concepts in those repositories, which is not general enough.

Page 7:

General framework

Page 8:

(a) Choose a universal dataset

• It must be large and rich enough to cover the words and concepts related to the classification problem.
• Wikipedia and MEDLINE are chosen in this paper.

Page 9:

(a) Choose a universal dataset

Use topic-oriented keywords to crawl Wikipedia with a maximum hyperlink depth of 4:
◦ 240 MB of text
◦ 71,968 documents
◦ 882,376 paragraphs
◦ vocabulary of 60,649 words
◦ 30,492,305 word occurrences
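
A minimal sketch of what such a depth-limited, keyword-seeded crawl could look like; the authors' actual crawler is not described in code, and the seed URLs and link-filtering rules below are illustrative assumptions:

```python
# Hypothetical sketch of a keyword-seeded Wikipedia crawl with a maximum
# hyperlink depth of 4, mirroring the collection step on this slide.
import collections
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SEEDS = ["https://en.wikipedia.org/wiki/Computer",   # topic-oriented seed pages
         "https://en.wikipedia.org/wiki/Business"]   # (illustrative choices)
MAX_DEPTH = 4

def crawl(seeds, max_depth):
    """Breadth-first crawl; returns a list of paragraph texts."""
    queue = collections.deque((url, 0) for url in seeds)
    visited, paragraphs = set(seeds), []
    while queue:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        paragraphs.extend(p.get_text(" ", strip=True) for p in soup.find_all("p"))
        if depth < max_depth:
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                # follow only ordinary article links, skip special pages
                if "/wiki/" in link and ":" not in link.rsplit("/", 1)[-1] and link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return paragraphs

corpus_paragraphs = crawl(SEEDS, MAX_DEPTH)
```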

Page 10:

(a) Choose a universal dataset

Ohsumed: a test collection of MEDLINE medical journal abstracts built to assist IR research:
◦ 156 MB of text
◦ 233,442 abstracts

Page 11:

(b) Doing topic analysis for the universal dataset

Page 12:

(b) Doing topic analysis for the universal dataset

Topic models are estimated with GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling.

The number of topics ranges from 10, 20, ... to 100, 150, and 200.

The hyperparameters alpha and beta were set to 0.5 and 0.1, respectively.
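
The paper runs this step with GibbsLDA++; below is a rough, non-authoritative sketch of the same estimation step using gensim's LdaModel as a stand-in (gensim fits LDA with variational inference rather than Gibbs sampling, and the toy corpus and file names are placeholders):

```python
# Sketch of hidden-topic estimation on the universal dataset.
# alpha = 0.5 and beta (gensim's `eta`) = 0.1 follow the slide.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `universal_docs`: tokenized universal dataset (e.g. Wikipedia paragraphs),
# one list of word tokens per document. Toy placeholder shown here.
universal_docs = [["cancer", "cell", "tumor", "therapy"],
                  ["market", "stock", "trade", "price"]]

dictionary = Dictionary(universal_docs)
corpus = [dictionary.doc2bow(doc) for doc in universal_docs]

# The slide sweeps K = 10, 20, ..., 100, 150, 200; estimate one model per K.
for num_topics in (10, 20, 50, 100, 150, 200):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   alpha=0.5, eta=0.1, passes=5)
    lda.save("wiki_lda_%dtopics.model" % num_topics)
```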

Page 13:

Hidden topic analysis for the Wikipedia data

Page 14:

Hidden topic analysis for the Ohsumed-MEDLINE data

Page 15:

(c) Building a moderate-size labeled training dataset

• Words/terms in this dataset should be relevant to as many hidden topics as possible.

Page 16:

(d) Doing topic inference for training and future data

• Transform the original data into a set of topics.
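
A minimal sketch of this inference-and-integration step, continuing the hypothetical gensim model from the estimation sketch above (the paper itself runs inference with GibbsLDA++; the "topic:k" pseudo-word encoding and the probability threshold are illustrative assumptions):

```python
# Sketch: infer a topic distribution for a short document and append the most
# probable hidden topics to its words (assumes `lda` and `dictionary` from the
# estimation sketch above).
def integrate_topics(tokens, lda, dictionary, threshold=0.05):
    """Return the original tokens plus pseudo-words such as 'topic:17' for
    every hidden topic whose inferred probability exceeds `threshold`."""
    bow = dictionary.doc2bow(tokens)
    topic_dist = lda.get_document_topics(bow, minimum_probability=threshold)
    return tokens + ["topic:%d" % k for k, _prob in topic_dist]

# Example: a sparse search snippet gains topic features after inference.
snippet = ["laptop", "notebook", "review"]
enriched = integrate_topics(snippet, lda, dictionary)
```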

Page 17:

Sample Google search snippets

Page 18:

Snippet word co-occurrence

This shows the sparseness of Web snippets: only a small fraction of words are shared by two or three different snippets.

Page 19:

Shared topics among snippets after inference

After inference and integration, the snippets are more closely related at the semantic level.
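
A toy, self-contained illustration of the effect this slide describes: two snippets that share no surface words can still share hidden-topic features once inference output is appended (the topic labels below are hard-coded stand-ins for real inference results, for illustration only):

```python
# Toy illustration: word overlap before vs. after adding hidden-topic features.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = ["laptop", "notebook", "review"]
s2 = ["macbook", "pro", "benchmark"]

# Pretend the topic model assigned both snippets to the same electronics-like
# hidden topics (hypothetical inference output).
s1_enriched = s1 + ["topic:17", "topic:42"]
s2_enriched = s2 + ["topic:17", "topic:42"]

print("shared words:", jaccard(s1, s2))                    # 0.0, no overlap
print("with topics :", jaccard(s1_enriched, s2_enriched))  # > 0, now related
```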

Page 20:

(e) Building the classifier

• Choose from different learning methods.
• Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique.
• Train the classifier on the integrated training data.
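
A hedged sketch of step (e) with scikit-learn: the paper's maximum-entropy classifier is approximated here by multinomial logistic regression over the enriched token strings (the training texts, labels, and "topic:k" features are illustrative placeholders, not the paper's data):

```python
# Sketch: train a MaxEnt-style classifier (logistic regression) on snippets
# whose tokens already include the integrated "topic:k" pseudo-words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["laptop notebook review topic:17 topic:42",   # computers
               "stock market trade topic:3 topic:8"]         # business
train_labels = ["computers", "business"]

clf = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),  # keep 'topic:17' as one feature
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["macbook pro benchmark topic:17"]))  # likely ['computers']
```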

Page 21:

Evaluation

Domain disambiguation for Web search results
◦ Classify Google search snippets into different domains, such as Business, Computers, Health, etc.

Disease classification for medical abstracts
◦ Classify each MEDLINE medical abstract into one of five disease categories (related to neoplasms, the digestive system, etc.).

Page 22:

Domain disambiguation for Web search results

Google snippets are collected as training and test data; the search phrases used for the two sets are completely disjoint.

Page 23:

Domain disambiguation for Web search results

Results of 5-fold cross-validation on the training data.

Classification error is reduced by 19% on average relative to the baseline.
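
For reference, a relative error reduction such as "19%" is normally computed from the baseline and improved error rates; a small sketch with illustrative numbers (not the paper's actual accuracies):

```python
# Relative error reduction = (baseline error - new error) / baseline error.
def relative_error_reduction(acc_baseline, acc_with_topics):
    err_base = 1.0 - acc_baseline
    err_new = 1.0 - acc_with_topics
    return (err_base - err_new) / err_base

# Illustrative numbers only: 80% -> 83.8% accuracy is a 19% error reduction.
print(relative_error_reduction(0.80, 0.838))  # ~0.19
```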

Page 24:

Domain disambiguation for Web search results

Page 25:

Domain disambiguation for Web search results

Page 26:

Disease classification for medical abstracts with MEDLINE topics

The proposed method needs only 4,500 training examples to reach the accuracy that the baseline achieves with 22,500 training examples!

Page 27:

Conclusion

Advantages of the proposed framework:
◦ A good method for classifying sparse and previously unseen data, thanks to utilizing the large universal dataset.
◦ Expanded classifier coverage: topics coming from the external data cover many terms/words that do not appear in the training dataset.
◦ Easy to implement: only a small set of labeled training examples has to be prepared to attain high accuracy.