Data Mining: Concepts and Techniques
— Chapter 10 — 10.3.2 Mining Text and Web Data (II)
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Slides by students at CS512 (Spring 2009)
• Usage of a topic model:
  – Summarize themes/aspects
  – Navigate documents
  – Retrieve documents
  – Segment documents
  – Document classification
  – Document clustering
(Figure: example topics learned from hurricane news, each a word distribution, plus a passage annotated by topic.)
Topic 1: government 0.3, response 0.2, ...
Topic 2: donate 0.1, relief 0.05, help 0.02, ...
Topic k: city 0.2, new 0.1, orleans 0.05, ...
Background B: is 0.05, the 0.04, a 0.03, ...
Annotated text: [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]. ...
General Idea of Probabilistic Topic Models
• Cast the intuition into a generative probabilistic process (Generative Process)
  – Each document is a mixture of corpus-wide topics (each topic a multinomial distribution / unigram LM)
  – Each word is drawn from one of those topics
• Since we only observe the documents, we need to figure out (Estimation/Inference)
  – What are the topics?
  – How are the documents divided among those topics?
• Parameters: λB = noise level (manually set); the topic distributions (the θ's) and mixing weights (the π's) need to be estimated
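The generative process above can be sketched in a few lines of Python. The vocabulary, topic word distributions, mixing weights, and noise level below are made-up toy values for illustration, not taken from the slides:

```python
import random

random.seed(0)

# Toy background and topic word distributions (illustrative values only)
background = {"the": 0.5, "a": 0.3, "is": 0.2}
topics = {
    "gov":  {"government": 0.6, "response": 0.4},
    "help": {"donate": 0.5, "relief": 0.3, "help": 0.2},
}
pi_d = {"gov": 0.7, "help": 0.3}   # this document's topic mixture
lambda_b = 0.2                     # background noise level (manually set)

def draw(dist):
    """Draw one item from a dict mapping items to probabilities."""
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

def generate_document(length):
    doc = []
    for _ in range(length):
        if random.random() < lambda_b:   # background word
            doc.append(draw(background))
        else:                            # pick a topic, then a topical word
            topic = draw(pi_d)
            doc.append(draw(topics[topic]))
    return doc

print(generate_document(10))
```

Each word position independently chooses between the background model and a topic drawn from the document's mixture, exactly mirroring the two bullets above.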
PLSA: Estimation [Hofmann 99], [Zhai et al. 04]
(Figure: generating a word w in a document. With probability λB the word is drawn from the collection background model B (e.g., "is", "the", "a"); otherwise a topic θj, j = 1, ..., k, is selected according to the document-specific weights πd,1, ..., πd,k and w is drawn from θj (e.g., "battery", "life"; "design", "screen"; "price", "purchase").)
Log-likelihood of the collection:
  log p(C) = Σ_{d∈C} Σ_{w∈V} c(w,d) · log[ λB·p(w|B) + (1−λB)·Σ_{j=1}^{k} πd,j·p(w|θj) ]
Estimated with the Maximum Likelihood Estimator (MLE) through an EM algorithm
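As a concrete illustration, here is a minimal sketch of the EM procedure for PLSA with a fixed background model. The toy corpus, λB = 0.5, and k = 2 are illustrative assumptions, not from the slides:

```python
from collections import defaultdict

# Toy corpus as word-count dictionaries (made-up data)
docs = [
    {"battery": 4, "life": 3, "the": 2, "a": 1},
    {"screen": 4, "design": 3, "the": 2, "is": 1},
    {"battery": 2, "screen": 2, "the": 3},
]
vocab = sorted({w for d in docs for w in d})
total = sum(c for d in docs for c in d.values())
background = {w: sum(d.get(w, 0) for d in docs) / total for w in vocab}  # p(w|B)
lam, k = 0.5, 2   # lambda_B (manually set) and number of topics

# Initialize p(w|theta_j) (slightly asymmetric to break symmetry) and pi_{d,j}
theta = [{w: 1 + 0.1 * ((i + j) % 2) for i, w in enumerate(vocab)} for j in range(k)]
theta = [{w: v / sum(t.values()) for w, v in t.items()} for t in theta]
pi = [[1.0 / k] * k for _ in docs]

for _ in range(50):
    new_theta = [defaultdict(float) for _ in range(k)]
    new_pi = [[0.0] * k for _ in docs]
    for d, doc in enumerate(docs):
        for w, c in doc.items():
            # E-step: posterior over {background, topic 1..k} for this (d, w)
            pB = lam * background[w]
            pT = [(1 - lam) * pi[d][j] * theta[j][w] for j in range(k)]
            norm = pB + sum(pT)
            for j in range(k):
                r = c * pT[j] / norm   # fractional count assigned to topic j
                new_theta[j][w] += r   # M-step accumulators
                new_pi[d][j] += r
    # M-step: renormalize topic word distributions and document mixing weights
    theta = [{w: t[w] / sum(t.values()) for w in vocab} for t in new_theta]
    pi = [[p / sum(row) for p in row] for row in new_pi]

for j in range(k):
    print("topic", j, sorted(vocab, key=lambda w: -theta[j][w])[:2])
```

The E-step computes, for each word occurrence, the posterior probability of each generating component; the M-step pools those fractional counts and renormalizes, maximizing the collection log-likelihood shown above.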
Problems with PLSA
– "Documents have no generative probabilistic semantics"
  • i.e., a document is just a symbol
– Model has many parameters
  • linear in the number of documents
  • needs heuristic methods to prevent overfitting
– Cannot generalize to new documents
Latent Dirichlet Allocation [Blei et al. 03]
Basic Idea of LDA
• Add a Dirichlet prior α on the topic distribution of each document
• Add a Dirichlet prior β on the word distribution of each topic
• α, β can be vectors, but for convenience α = α1 = α2 = …; β = β1 = β2 = … (Smoothed LDA)
(Figure: LDA graphical model. A Dirichlet prior with parameter α generates each document's topic mixing weights πd,1, ..., πd,k; a Dirichlet prior with parameter β generates each topic's word distribution θ1, ..., θk; each word w of a document is then generated as before.)
[Blei et al. 03], [Griffiths & Steyvers 02, 03, 04]
Dirichlet Hyperparameters α, β
• Generally have a smoothing effect on the multinomial parameters
• Large α, β: smoother topic/word distributions
• Small α, β: more skewed topic/word distributions (e.g., biased towards a few words per topic)
• Common settings: α = 50/K, β = 0.01
• PLSA is maximum a posteriori estimated LDA under a uniform prior: α = 1, β = 1
Inference
• Exact inference is intractable
• Approximation techniques:– Mean field variational methods (Blei et al., 2001, 2003)
– Expectation propagation (Minka and Lafferty, 2002)
– Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
– Collapsed variational inference (Teh et al., 2006)
Would like to know more?
• “Parameter estimation for text analysis” by Gregor Heinrich
• “Probabilistic topic models” by Mark Steyvers
Opinion Mining
Hyun Duk Kim
04/24/23Data Mining: Principles and
Algorithms 18
Agenda Overview Opinion finding & sentiment classification Opinion Summarization Other works Discussion & Conclusion
04/24/23Data Mining: Principles and
Algorithms 19
Web 2.0 “ Web 2.0 is the business revolution in the
computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform.” [Wikipedia]
Users participate in content creation ex. Blog, review, Q&A forum
04/24/23Data Mining: Principles and
Algorithms 20
Opinion Mining Huge volume of
opinions on the Web Ex. Product
reviews, Blog posts about politic issues
Need a good technique to summarize them
Example of commercial system (MS live search)
04/24/23Data Mining: Principles and
Algorithms 21
Usefulness of opinion mining Individuals
Purchasing a product/ service Tracking political topics Other decision making tasks
Businesses and organizations product and service benchmarking survey on a topic
Ads placements Place an ad when one praises an product Place an ad from a competitor if one criticizes a
product[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
Rule based approach Context-dependent orientation finding using Pros
and Cons reviews.
04/24/23Data Mining: Principles and
Algorithms 33
Other works Opinion Integration [Lu & Zhai, WWW '08]
Integrate expert reviews with arbitrary text collection
Expert reviews: well structured, easy to find features, not often updated
Arbitrary: not structured, various & updated data
Semi-supervised topic model Extract structure aspects (features) data from the
expert review to cluster general documents Add supplementary opinions from general
documents04/24/23
Data Mining: Principles and Algorithms 34
Agenda Overview Opinion finding & sentiment classification Opinion Summarization Other works Discussion & Conclusion
04/24/23Data Mining: Principles and
Algorithms 35
Challenges in opinion mining Polarity terms are context sensitive
Ex. Small can be good for ipod size, but can be bad for LCD monitor size
Even in the same domain, use different words depending on target feature
Ex. Long ‘ipod’ battery life vs. long ‘ipod’ loading time Partially solved (query dependent sentiment classification)
Implicit and complex opinion expressions Rhetoric expression, metaphor, double negation Ex. The food was like a stone Need both good IR and NLP techniques for opinion mining.
Cannot divide into pos/neg clearly Not all opinions can be classified into two categories Interpretation can be changed based on conditions Ex. 1) The battery life is ‘long’ if you do not use LCD a lot (pos)
2) The battery life is ‘short’ if you use LCD a lot (neg)Current system classify the first one as positive and second one as negative. However, actually both are saying the same fact.
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
04/24/23Data Mining: Principles and
Algorithms 36
Discussion A difficult task Essential for many blog or review mining
techniques Current stage of opinion finding
Good performance in sentence level, specific domain, sub-problem.
Still low accuracy in general case MAP score of TREC ‘08 top performed system
References I. Ounis, C. Macdonald and I. Soboroff, Overview of the TREC 2008 Blog Track , TREC, 2008. Opinion Mining and Summarization: Sentiment Analysis. Tutorial given at WWW-2008, April
21, 2008 in Beijing, China. Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment
Mixture: Modeling Facets and Opinions in Weblogs, Proceedings of the 16th International World Wide Web Conference (WWW' 07), pages 171-180, 2007.
Minqing Hu and Bing Liu. "Mining and summarizing customer reviews". To appear in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004, full paper), Seattle, Washington, USA, Aug 22-25, 2004.
Minqing Hu and Bing Liu. "Mining Opinion Features in Customer Reviews." To appear in Proceedings of Nineteeth National Conference on Artificial Intellgience (AAAI-2004), San Jose, USA, July 2004.
Yue Lu and ChengXiang Zhai. "Opinion Integration Through Semisupervised Topic Modeling", In Proceedings of the 17th International World Wide Web Conference (WWW'08)
Kavita Ganesan, Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008 Hyun Duk Kim, Dae Hoon Park, V.G.Vinod Vydiswaran, and ChengXiang Zhai,Opinion
Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization: UIUC at TAC 2008 Opinion Summarization Pilot, Text Analysis Conference (TAC), Maryland, USA.
04/24/23Data Mining: Principles and
Algorithms 38
References Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung and J.-H. Lee , KLE at TREC 2008
Blog Track: Blog Post and Feed Retrieval , TREC, 2008. L. Jia, C. Yu and W. Zhang, UIC at TREC 208 Blog Track, TREC, 2008. Nitin Jindal and Bing Liu. "Identifying Comparative Sentences in Text
Documents" To appear in Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR-06), Seattle 2006.
Opinion Mining and Summarization (including review spam detection), tutorial given at WWW-2008, April 21, 2008 in Beijing, China.
Murthy Ganapathibhotla and Bing Liu, Mining opinions in comparative sentences, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 241–248, Manchester, August 2008
04/24/23Data Mining: Principles and
Algorithms 39
Thank you
04/24/23Data Mining: Principles and
Algorithms 40
Mining User Query Logs for Personalized Search
Yuanhua Lv
(Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation)
2. Long-term query logs: “car” occurs far more frequently than “Apple” in the user’s query logs of the recent 2 months. [Tan et al. 06]
43
Problem Definition
Q2
{C2,1 , C2,2 ,C2,3 ,… } C2
…
Q1 User Query{C1,1 , C1,2 ,C1,3 ,…} C1 User Clickthrough
? User Information Need
How to model and mine user query logs?Qk
e.g., Apple software
e.g., Apple - Mac OS X The Apple Mac OS X product page. Describes features in the current version of Mac OS X, a screenshot gallery, latest software downloads, and a directory of ...
( | ) ( | )k kp w p w Q 1 1 1 1,..., , ,...( | ) ,( | , )k kk kQ Qp Cw p w CQ
U
Mining query logs to update query model
'kQ
'( || )kQ DD
Query Logs
45
Mining Short-term User Query Logs [Shen et al. 05]
Qk
Q1
Qk-1
…
C1
Ck-1
…
Average user’s previous clickthrough
CH
QH
111
1
( | ) ( | )i k
Q iki
p w H p w Q
111
1
( | ) ( | )i k
C iki
p w H p w C
Average user’s previous queries
1 H
Combine previous clickthrough and previous queries
( | ) ( | ) (1 ) ( | )C Qp w H p w H p w H
k
1
Linearly interpolate current queryand history model
( | ) ( | ) (1 ) ( | )k kp w p w Q p w H
Four Heuristic Variants
• FixInt: fixed coefficient interpolation( | ) ( | ) (1 ) ( | )k kp w p w Q p w H
47
Mining Short-term User Query Logs [Shen et al. 05]
Qk
Q1
Qk-1
…
C1
Ck-1
…
Average user’s previous clickthrough
CH
QH
111
1
( | ) ( | )i k
Q iki
p w H p w Q
111
1
( | ) ( | )i k
C iki
p w H p w C
Average user’s previous queries
1 H
Combine previous clickthrough and previous queries
( | ) ( | ) (1 ) ( | )C Qp w H p w H p w H
k
1
Linearly interpolate current queryand history model
( | ) ( | ) (1 ) ( | )k kp w p w Q p w H
Fixed α?
Four Heuristic Variants
• FixInt: fixed coefficient interpolation• BayesInt: adapt the interpolation coefficient to
different query length – Intuition: if the current query Qk is longer, we
should trust Qk more
49
Mining Short-term User Query Logs [Shen et al. 05]
Qk
Q1
Qk-1
…
C1
Ck-1
…
Average user’s previous clickthrough
CH
QH
111
1
( | ) ( | )i k
Q iki
p w H p w Q
111
1
( | ) ( | )i k
C iki
p w H p w C
Average user’s previous queries
1 H
Combine previous clickthrough and previous queries
( | ) ( | ) (1 ) ( | )C Qp w H p w H p w H
k
1
Linearly interpolate current queryand history model
( | ) ( | ) (1 ) ( | )k kp w p w Q p w H
Fixed α?
Average?
Four Heuristic Variants• FixInt: fixed coefficient interpolation• BayesInt: adapt the interpolation coefficient to
different query length – Intuition: if the current query Qk is longer, we
should trust Qk more• OnlineUp: assign more weight to more recent
records.• BatchUp: the user becomes better and better at
query formulation as time goes on, but we do not need to “decay” the clickthrough.
51
Data Set of Evaluation
• Data collection: TREC AP88-90• Topics: 30 hard topics of TREC topics 1-150• System: search engine + RDBMS• Context: Query and clickthrough history of 3
Online Analytical Processing onMultidimensional Text Database
Motivation
Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases
Motivation• Industry and commercial applications often
collect huge amount of data containing both structured data records and unstructured text data in a multidimensional text database
• Incident reports• Job descriptions• Product reviews• Service feedback
• It is highly desirable and strategically important to support high-performance search and mining over such databases
04/24/23 62
Examples Aviation Safety Reporting System
How to organize the data to help experts efficiently explore and digest text information?
e.g. compare the reports in 1998 and reports in 1999? How to help experts analyze a specific type of anomaly
in different contexts? e.g. what did pilots say about anomaly “landing without
clearance” during daylight v.s. night?
Time Location Environment … Narrative
199801 TX Daylight … …… I TOLD HIM I WAS AT 2000 FT AND HE SAID OK……
199801 LA Daylight … ……WE STOPPED THE DSCNT AT CIRCLING MINIMUMS……
199801 LA Night … ……THE TAXI/LNDG LIGHTS VERY DIM. NO OTHER VISIBLE TFC IN SIGHT……
199902 FL Night … ……I FEEL WE SHOULD ALL EDUCATE OURSELVES ON CHKLISTS……
04/24/23 64
Online Analytical Processing onMultidimensional Text Database
Motivation
Text Cube: Computing IR Measures for Multidimensional Text Database AnalysisC. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao (ICDE’08)
Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases
Text Cube Text Cube
A novel data cube model integrating the power of traditional data cube and IR techniques for effective text mining
Computing IR measures for multidimensional text database analysis
Heterogeneous records to be examined Structured categorical attributes Unstructured free text
IR statistics are evaluated TF-IDF Inverted Index
04/24/23 65
Text Cube - Implementation Preprocessing
stemming, stop words elimination, TF-IDF weighting Concept hierarchy construction
A dimension hierarchy takes the form of a tree or a DAG. An attribute at a lower level reveals more details
Four operations are supported: roll-up, drill-down, slice and dice
Term hierarchy construction A term hierarchy represents semantic levels of
terms in the text and their correlations Infusion with expert knowledge Two novel operations: Pull-up & Push-down
04/24/23 66
Text Cube - Implementation Partial materialization: if a non-materialized cell is
retrieved, we compute it on-the-fly based on the partially materialized cuboids
A balance between time and space: given a time threshold δ, we minimize storage size within the query time bound δ for retrieving all cells to be interested in
04/24/23 67
Experiment – Efficiency and Effectiveness
68
Compare avgTF under different“Environment: Weather Elements”
Compare avgTF under different“Supplementary: Problem Areas”
04/24/23 69
Online Analytical Processing onMultidimensional Text Database
Motivation
Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
Topic Cube: Topic Modeling for OLAP on Multidimensional Text DatabasesD. Zhang, C. Zhai, and J. Han (SDM’09)
Motivation Aviation Safety Reporting System
How to organize the data to help experts efficiently explore and digest text information?
e.g. compare the reports in 1998 and reports in 1999? How to help experts analyze a specific type of anomaly
in different contexts? e.g. what did pilots say about anomaly “landing without
clearance” during daylight v.s. night?
Time Location Environment … Narrative
199801 TX Daylight … …… I TOLD HIM I WAS AT 2000 FT AND HE SAID OK……
199801 LA Daylight … ……WE STOPPED THE DSCNT AT CIRCLING MINIMUMS……
199801 LA Night … ……THE TAXI/LNDG LIGHTS VERY DIM. NO OTHER VISIBLE TFC IN SIGHT……
199902 FL Night … ……I FEEL WE SHOULD ALL EDUCATE OURSELVES ON CHKLISTS……
Solution: Topic Cube
Challenges: How to support operations along the topic dimension? How to quickly extract semantic topics?
98.0199.0299.01
98.02
LAX SJC MIA AUS
overshootundershootbirds
turbulence
Time
Location
Topic
CA FL TX
Location
19981999
Time
Deviation
Encounter
Topic
drill-down
roll-up
Constructing Topic Cube
Time Loc Env … Narrative
98.01 TX Daylight …
98.01 LA Daylight …
98.01 LA Night …
99.02 FL Night …
ALL
Anomaly Altitude Deviation
…… Anomaly Maintenance Problem
…… Anomaly Inflight Encounter
Undershoot
…… Overshoot
Improper Documentation
Improper Maintenance
Birds Turbulence
…… ……
Descent 0.06Cloud 0.03Ft 0.01… ….
Descent 0.05System 0.02View 0.01… ….
Altitude 0.03Ft 0.02Climb 0.01… ….
Altitude 0.04Ft 0.03Instruct 0.01… ….
drill-down
roll-up
Materialization
StandardDimension(Location)
Topic Dimension (Anomaly Event)
CLAX-overshoot CLAX-altitude CLAX-
all
CCA-overshoot CCA-altitude CCA-all
CUS-overshoot CUS-altitude CUS-all
Mtopic-agg
Mtopic-agg Mtopic-agg
Mtopic-
agg
Mtopic-
agg
Mstd-agg Mstd-agg Mstd-agg
Mstd-agg Mstd-agg Mstd-agg
Mtopic-agg
( 1) ( 1)
( 1) ( 1)
,' { , , }(0) ( )
, '' ' { , , }
( , ) ( ')
( | )( ', ) ( ')
L Ls ei i
L Ls ei i
d wdjL
c id w
w dj
c w d p z j
p wc w d p z j
,( )(0)
, ''
( , ) ( )( | )
( ', ) ( )i cin
a
i ci
d wc d DL
c jd w
w c d D
c w d p z jp w
c w d p z j
Mtopic-
agg:Mstd-
agg:
Experimental ResultsContex
t Word p(w|θ)
daylight
Tower 0.075Pattern 0.061Final 0.060
Runway 0.053Land 0.052
Downwind 0.039
night
Tower 0.035Runway 0.029
Light 0.027Instrument Landing System 0.015
Beacon 0.014
landing without clearance
ObjectiveFunction
Iterations
Time (sec.)
Closeness to the optimum point
…WINDS ALOFT AT PATTERN ALT OF 1000 FT MSL, WERE MUCH STRONGER AND A DIRECT XWIND. NEEDLESS TO SAY, THE PATTERNS AND LNDGS WERE DIFFICULT FOR MY STUDENT AND THERE WAS LIGHT TURB ON THE DOWNWIND…
…I LISTENED TO HWD ATIS AND FOUND THE TWR CLOSED AND AN ANNOUNCEMENT THAT THE HIGH INTENSITY LIGHTS FOR RWY 28L WERE INOP. BROADCASTING IN THE BLIND AND LOOKING FOR THE TWR BEACON AND LOW INTENSITY LIGHTS AGAINST A VERY BRIGHT BACKGROUND CLUTTER OF STREET LIGHTS, ETC…