Machine Learning Methods for Text / Web Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]
This material is available at http://scai.snu.ac.kr/~btzhang/

Overview

Introduction
- Web Information Retrieval
- Machine Learning (ML)
- ML Methods for Text/Web Data Mining

Text/Web Data Analysis
- Text Mining Using Helmholtz Machines
- Web Mining Using Bayesian Networks

Summary
- Current and Future Work
Other Machine Learning Methods
- Evolutionary Algorithms (EAs)
- Reinforcement Learning (RL)
- Boosting Algorithms
- Decision Trees (DTs)
ML for Text/Web Data Mining
- Bayesian Networks for Text Classification
- Helmholtz Machines for Text Clustering/Categorization
- Latent Variable Models for Topic Word Extraction
- Boosted Learning for TREC Filtering Task
- Evolutionary Learning for Web Document Retrieval
- Reinforcement Learning for Web Filtering Agents
- Bayesian Networks for Web Customer Data Mining
Preprocessing for Text Learning

From: [email protected]
Newsgroups: comp.graphics
Subject: Need specs on Apple QT

I need the specs, or at least a very verbose interpretation of the specs, for QuickTime. Technical articles from magazines and references to books would be nice, too.

I also need the specs in a format usable on a Unix or MS-Dos system. I can't do much with the QuickTime stuff they have on...
Word counts extracted from the document:
specs: 3, unix: 1, hockey: 0, computer: 0, space: 0, references: 1, quicktime: 2, graphics: 0, clinton: 0, car: 0, baseball: 0, ...
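The word-counting step above can be sketched as a small bag-of-words routine. The document text and vocabulary here are illustrative, not the actual 20-newsgroups data from the slide:

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in a document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in vocabulary}

doc = ("I need the specs, or at least a very verbose interpretation "
       "of the specs, for QuickTime. References to books would be nice, too.")
vocab = ["specs", "references", "quicktime", "graphics", "hockey"]
print(bag_of_words(doc, vocab))
```

Each document is thus mapped to a fixed-length count vector over the chosen vocabulary, which is the representation the learning methods below consume.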
Text Mining: Data Sets
Usenet Newsgroup Data
- 20 categories
- 1000 documents for each category
- 20000 documents in total

TDT2 Corpus
- Topic detection and tracking (TDT): NIST
- Used 6,169 documents in experiments
Text Mining: Helmholtz Machine Architecture

[Figure: two-layer Helmholtz machine with input nodes d1, d2, ..., dn and latent nodes h1, h2, ..., hm]

[Chang and Zhang, 2000]
- Input nodes
  • Binary values
  • Represent the existence or absence of words in documents.
- Latent nodes
  • Binary values
  • Extract the underlying causal structure in the document set.
  • Capture correlations of the words in documents.

Recognition model ($w_{ij}$: recognition weight):
$$P(h_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_{j=1}^{n} w_{ij} d_j\right)}$$

Generative model ($w'_{ij}$: generative weight):
$$P(d_i = 1) = \frac{1}{1 + \exp\left(-b'_i - \sum_{j=1}^{m} w'_{ij} h_j\right)}$$
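The two sigmoid unit probabilities above can be sketched in NumPy. The layer sizes, zero biases, and random weights are illustrative stand-ins, not trained values:

```python
import numpy as np

def unit_on_prob(b, w, inputs):
    """P(unit_i = 1) = sigmoid(b_i + sum_j w_ij * inputs_j)."""
    return 1.0 / (1.0 + np.exp(-(b + w @ inputs)))

rng = np.random.default_rng(0)
n_words, n_latent = 6, 3
d = rng.integers(0, 2, size=n_words)         # binary word-occurrence vector

# Bottom-up recognition pass: P(h_i = 1 | d)
b_rec = np.zeros(n_latent)                   # recognition biases
W_rec = rng.normal(scale=0.1, size=(n_latent, n_words))
p_h = unit_on_prob(b_rec, W_rec, d)
h = (rng.random(n_latent) < p_h).astype(int) # sample binary latent states

# Top-down generative pass: P(d_i = 1 | h)
b_gen = np.zeros(n_words)                    # generative biases
W_gen = rng.normal(scale=0.1, size=(n_words, n_latent))
p_d = unit_on_prob(b_gen, W_gen, h)
```

The recognition pass maps a document to latent causes; the generative pass maps sampled causes back to word probabilities.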
Text Mining: Learning Helmholtz Machines

- Introduce a recognition network for estimation of a generative network.
- Wake-Sleep Algorithm
  • Train the recognition and generative models alternately.
  • Update the weights in the network iteratively by a simple local delta rule.
Log-likelihood lower bound (by Jensen's inequality):
$$\log P(D \mid \theta) = \sum_{t=1}^{T} \log \sum_{\alpha^{(t)}} P(d^{(t)}, \alpha^{(t)} \mid \theta) = \sum_{t=1}^{T} \log \sum_{\alpha^{(t)}} Q(\alpha^{(t)}) \frac{P(d^{(t)}, \alpha^{(t)} \mid \theta)}{Q(\alpha^{(t)})} \ge \sum_{t=1}^{T} \sum_{\alpha^{(t)}} Q(\alpha^{(t)}) \log \frac{P(d^{(t)}, \alpha^{(t)} \mid \theta)}{Q(\alpha^{(t)})}$$

Local delta rule for the weight updates:
$$w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}, \qquad \Delta w_{ij} = \gamma\, s_j (s_i - p_i), \quad p_i = P(s_i = 1)$$
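A toy sketch of the wake-sleep loop under a local delta rule of the form dw_ij = gamma * s_j * (s_i - p_i). Biases are omitted and the training data are random, so this illustrates only the structure of the updates, not the slide's actual experiment:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    """Draw binary states with probability p of being on."""
    return (rng.random(p.shape) < p).astype(float)

n_d, n_h, gamma = 5, 2, 0.05
W_rec = np.zeros((n_h, n_d))   # recognition weights (d -> h)
W_gen = np.zeros((n_d, n_h))   # generative weights  (h -> d)
docs = rng.integers(0, 2, size=(20, n_d)).astype(float)

for d in docs:
    # Wake phase: drive latent units with the recognition model,
    # then nudge the generative weights by the local delta rule.
    h = sample(sigmoid(W_rec @ d))
    p_d = sigmoid(W_gen @ h)
    W_gen += gamma * np.outer(d - p_d, h)

    # Sleep phase: dream a document from the generative model,
    # then nudge the recognition weights the same way.
    h_dream = sample(sigmoid(np.zeros(n_h)))   # latent prior (biases omitted)
    d_dream = sample(sigmoid(W_gen @ h_dream))
    p_h = sigmoid(W_rec @ d_dream)
    W_rec += gamma * np.outer(h_dream - p_h, d_dream)
```

Both phases use only locally available quantities (a unit's sampled state, its predicted probability, and the presynaptic state), which is what makes the rule "simple and local".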
Text Mining: Methods
Text Categorization
- Train a Helmholtz machine for each category.
- Total N machines for N categories.
- Once the N machines have been estimated, classification of a test document proceeds by estimating the likelihood of the document for each machine:
$$\hat{c} = \arg\max_{c \in C} \log P(d \mid c)$$

Topic Words Extraction
- For the entire document set, train a Helmholtz machine.
- After training, examine the weights of connections from a latent node to input nodes, that is, words.
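The categorization rule ĉ = argmax_c log P(d|c) can be sketched as follows. For brevity the per-category Helmholtz machines are replaced by stand-in Bernoulli word models; the class names and probabilities are illustrative:

```python
import math

# Stand-in scoring models: on the slides each class has its own trained
# Helmholtz machine; here each class has a Bernoulli model over 3 words.
word_probs = {
    "graphics": [0.9, 0.8, 0.1],   # P(word_i present | class)
    "hockey":   [0.1, 0.2, 0.9],
}

def log_likelihood(doc, probs):
    """log P(d | c) for a binary word-occurrence vector d."""
    return sum(math.log(p if x else 1.0 - p) for x, p in zip(doc, probs))

def classify(doc):
    """c_hat = argmax_{c in C} log P(d | c)."""
    return max(word_probs, key=lambda c: log_likelihood(doc, word_probs[c]))

print(classify([1, 1, 0]))  # -> graphics
```

Only the scoring function changes between this sketch and the slide's method: a trained Helmholtz machine would supply the per-class likelihood estimate instead.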
Usenet Newsgroup Data
- 20 categories, 1000 documents for each category, 20000 documents in total
Bayesian Network
- DAG (Directed Acyclic Graph)
- Expresses dependence relations between variables
- Can use prior knowledge on the data (parameters)

[Figure: example network over nodes A, B, C, D, E]
$$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid B)\,P(D \mid A,B)\,P(E \mid B,C,D)$$

- Examples of conjugate priors: Dirichlet for multinomial data, Normal-Wishart for normal data
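A minimal check of the factorization above, with made-up conditional probability tables for five binary variables (the numbers are illustrative, not from the slide):

```python
from itertools import product

# Toy CPTs for binary variables, following the factorization
# P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D).
p_a = lambda a: 0.6 if a else 0.4
p_b = lambda b, a: (0.7 if b else 0.3) if a else (0.2 if b else 0.8)
p_c = lambda c, b: 0.5 if b else (0.1 if c else 0.9)
p_d = lambda d, a, b: 0.9 if d == (a and b) else 0.1
p_e = lambda e, b, c, d: 0.8 if e == (b or c or d) else 0.2

def joint(a, b, c, d, e):
    """Joint probability as the product of local conditionals."""
    return p_a(a) * p_b(b, a) * p_c(c, b) * p_d(d, a, b) * p_e(e, b, c, d)

# Since each CPT is normalized per parent configuration,
# the joint must sum to 1 over all 2^5 assignments.
total = sum(joint(*bits) for bits in product([0, 1], repeat=5))
print(round(total, 10))  # -> 1.0
```

The factorization is what makes the network compact: five small local tables replace one table of 2^5 joint entries.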
Web Mining: Results
A Bayesian net for KDD web data
- V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
- V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.
Summary

We study machine learning methods, such as
- Probabilistic neural networks
- Evolutionary algorithms
- Reinforcement learning

Application areas include
- Text mining
- Web mining
- Bioinformatics (not addressed in this talk)

Recent work focuses on probabilistic graphical models for web/text/bio data mining, including
- Bayesian networks
- Helmholtz machines
- Latent variable models
Bayesian Networks: Architecture

A Bayesian network represents the probabilistic relationships between the variables.

[Figure: example network with nodes B, G, M, L]

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{pa}_i)$$
where $\mathbf{pa}_i$ is the set of parent nodes of $X_i$.

$$P(L,B,G,M) = P(L)\,P(B \mid L)\,P(G \mid L,B)\,P(M \mid L,B,G) = P(L)\,P(B)\,P(G \mid B)\,P(M \mid B,L)$$
Bayesian Networks: Applications in IR – A Simple BN for Text Classification

[Hwang & Zhang, 2000]
- The network structure represents the naïve Bayes assumption.
- All nodes are binary.

[Figure: class node C connected to term nodes t1, t2, ..., t8754]
- C: document class
- ti: the ith term
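The naïve Bayes network above factorizes as P(C, t_1, ..., t_n) = P(C) ∏_i P(t_i | C), so the class posterior follows from Bayes' rule. A toy sketch with 2 classes and 3 binary terms (the slide's network has 8754 term nodes; all numbers here are illustrative):

```python
prior = {"acq": 0.3, "other": 0.7}   # P(C)
p_term = {                            # P(t_i = 1 | C)
    "acq":   [0.6, 0.5, 0.1],
    "other": [0.2, 0.3, 0.4],
}

def posterior(terms):
    """P(C | t_1..t_n) via the naive Bayes factorization, then normalize."""
    scores = {}
    for c in prior:
        s = prior[c]
        for t, p in zip(terms, p_term[c]):
            s *= p if t else 1.0 - p
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = posterior([1, 1, 0])
```

The conditional-independence assumption keeps the model tractable: each term node needs only one parameter per class, regardless of how many terms there are.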
Dataset
- The acq dataset from Reuters-21578
- 8754 terms were selected by TFIDF
- Training data: 8762 documents
- Test data: 3009 documents

Parametric Learning
- Dirichlet prior assumptions for the network parameter distributions:
$$P(\theta_{ij} \mid \xi) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ijr_i})$$
- Parameter distributions are updated with training data.
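The conjugate update behind "parameter distributions are updated with training data" can be sketched directly: with a Dirichlet prior Dir(α_1, ..., α_r) on a multinomial parameter, the posterior after observing counts N_1, ..., N_r is Dir(α_1 + N_1, ..., α_r + N_r). The prior and counts below are illustrative:

```python
alpha = [1.0, 1.0]    # uniform Dirichlet prior over a binary term
counts = [42, 8]      # term present in 42 of 50 training documents

# Conjugacy: posterior hyperparameters are prior plus observed counts.
posterior_alpha = [a + n for a, n in zip(alpha, counts)]

# Posterior mean of the parameter vector.
total = sum(posterior_alpha)
posterior_mean = [a / total for a in posterior_alpha]
print([round(p, 3) for p in posterior_mean])  # -> [0.827, 0.173]
```

This closed-form update is why conjugate priors are convenient: no iterative fitting is needed to incorporate the training data.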
Latent Variable Models: Applications in IR – Experimental Results

[Table: extracted topic words (top 35 words with highest P(wj|zk)) for each topic]
Boosting: Algorithms
A general method of converting rough rules into a highly accurate prediction rule.

Learning procedure
- Examine the training set
- Derive a rough rule (weak learner)
- Re-weight the examples in the training set, concentrating on the hard cases for previous rules
- Repeat T times

[Figure: weak learners h1, h2, h3, h4 trained in sequence and combined as f(h1, h2, h3, h4); importance weights of training documents are updated between rounds]
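The loop above can be sketched as follows. The slides do not pin down the exact boosting variant used, so this is a generic AdaBoost-style sketch with toy 1-D data and threshold stumps as the weak rules:

```python
import math

def weighted_error(h, w, xs, ys):
    """Weighted fraction of examples the rule h misclassifies."""
    return sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)

def boost(xs, ys, weak_learners, T):
    n = len(xs)
    w = [1.0 / n] * n                 # importance weights of examples
    ensemble = []
    for _ in range(T):
        # Derive a rough rule: pick the weak learner with lowest weighted error.
        h = min(weak_learners, key=lambda h: weighted_error(h, w, xs, ys))
        err = min(max(weighted_error(h, w, xs, ys), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # AdaBoost rule weight
        ensemble.append((alpha, h))
        # Re-weight: misclassified (hard) examples gain weight.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    # Final highly accurate rule: weighted vote of the rough rules.
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy 1-D data with labels in {-1, +1}; weak rules are threshold stumps.
xs = [0, 1, 2, 3, 4, 5]
ys = [-1, -1, -1, 1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in range(6)]
f = boost(xs, ys, stumps, T=3)
print([f(x) for x in xs])  # -> [-1, -1, -1, 1, 1, 1]
```

The re-weighting step is the heart of the method: each round, the distribution over training documents shifts toward the cases the previous rules got wrong.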
Boosting: Applied to Text Filtering
Naïve Bayes
- Traditional algorithm for text filtering