Latent Topic-semantic Indexing based Automatic Text Summarization
Jiangsheng Yu, Xue-wen Chen
Presenter: Elaheh Barati
Futurewei Technologies - Wayne State University
December 18, 2016
An Introduction to Automatic summarization (AS)
• Automatic summarization (AS), or text summarization, is a challenging task of natural language processing (NLP) and machine learning.
• It transforms source text into summary text while retaining the most important information in the source.
• Many extraction methods have been proposed in the literature, and some of them are implemented as open-source tools or online services.
• In the last decade, topic-driven approaches became popular, and some work based on pLSI and LDA has achieved significantly better performance.
An Introduction to Latent Dirichlet Allocation
[Figure: the plate notation of LDA, a three-level hierarchical Bayesian (HB) model, in which θ_{1:M} ~ Dir(α), φ_{1:K} ~ Dir(β), z_{m,1:N_m} ~ ⟨θ_m⟩, and w_{mn} ~ ⟨φ_{z_mn}⟩.]
For the n-th word in the m-th document, denoted by w_{mn}, where m = 1, …, M and n = 1, …, N_m, its topic z_{mn} is a latent variable taking values in the set {1, …, K}, satisfying w_{mn} ~ ⟨φ_{z_mn}⟩.
The discrete distribution of words: w_{mn} ~ Multin(1; φ_{z_mn}).
The N_m latent topics in the m-th document: z_{m,1:N_m} ~ ⟨θ_m⟩, where θ_m, the (vector) parameter of the multinomial distribution of topics for the m-th document, is itself Dirichlet-distributed: θ_{1:M} ~ Dir(α).
LDA models adopt the conjugate prior of the multinomial distribution (the Dirichlet) to describe the priors of the multinomial parameters.
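The generative process above can be sketched as a short sampling routine; the corpus sizes, vocabulary size, and hyperparameter values below are toy illustrations, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 3, 4, 10          # documents, topics, vocabulary size (toy values)
alpha = np.full(K, 0.1)     # Dirichlet prior over topic mixtures
beta = np.full(V, 0.1)      # Dirichlet prior over word distributions
N = [5, 6, 4]               # words per document

phi = rng.dirichlet(beta, size=K)               # phi_{1:K} ~ Dir(beta)
corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)              # theta_m ~ Dir(alpha)
    z = rng.choice(K, size=N[m], p=theta_m)     # z_{m,1:N_m} ~ <theta_m>
    words = [rng.choice(V, p=phi[k]) for k in z]  # w_mn ~ <phi_{z_mn}>
    corpus.append(words)
```

Each document is thus a mixture of topics, and each word is drawn from the word distribution of its (latent) topic.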
Limitations on LDA
While topic models have been successfully applied to automatic summarization, they are limited in several aspects.
• In LDA-type models, the observation units are restricted to words.
• A topic is usually defined by a discrete distribution over many polysemous words.
• ...
These limitations leave the learned topics without practical significance in many cases and prevent topic models from further applications.
Each semantic category ψ_l is a discrete distribution over all words in the vocabulary, l = 1, …, L.
Latent Topic-Semantic Indexing (cont.)
[Figure panel (a): ψ_{1:L} are given.]
Assumption of the TSI model: in each window (m, n), the semantics of the words w_{mn}^{(1)}, …, w_{mn}^{(D_mn)} are drawn from the same but unknown topic z_{mn}.
For a TSI model, the m-th document is generated as follows:
(1) Choose θ_m ~ Dir(α), where θ_{1:M} ~ Dir(α).
(2) z_{mn}, the topic of window (m, n), is drawn from z_{mn} ~ ⟨θ_m⟩.
(3) The semantics in window (m, n) are generated via s_{mn}^{(1)}, …, s_{mn}^{(d_mn)}, …, s_{mn}^{(D_mn)} ~ ⟨φ_{z_mn}⟩.
(4) The word w_{mn}^{(d)} is drawn independently from the semantic category s_{mn}^{(d)}.
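The four generation steps above can be sketched as a sampling routine; the model sizes, the given semantic matrix ψ (drawn randomly here for illustration), and the fixed window layout are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

K, L, V = 3, 5, 12           # topics, semantic categories, vocabulary (toy values)
alpha = np.full(K, 0.1)      # Dirichlet prior over topic mixtures
beta = np.full(L, 0.1)       # Dirichlet prior over semantic distributions

psi = rng.dirichlet(np.ones(V), size=L)   # psi_l: distribution over words (given)
phi = rng.dirichlet(beta, size=K)         # phi_k: distribution over semantics

def generate_document(num_windows, words_per_window):
    theta_m = rng.dirichlet(alpha)                 # (1) theta_m ~ Dir(alpha)
    doc = []
    for _ in range(num_windows):
        z = rng.choice(K, p=theta_m)               # (2) window topic z_mn ~ <theta_m>
        s = rng.choice(L, size=words_per_window,
                       p=phi[z])                   # (3) semantics ~ <phi_{z_mn}>
        w = [rng.choice(V, p=psi[c]) for c in s]   # (4) words drawn from categories
        doc.append(w)
    return doc

doc = generate_document(num_windows=4, words_per_window=3)
```

Compared with LDA, the extra layer of semantic categories sits between topics and words, so a topic is a distribution over semantics rather than directly over (polysemous) words.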
TSI vs LDA
LDA is a special case of TSI when:
• the observation window is a word,
• the semantic labels are the words themselves, and
• the semantic matrix is an identity matrix.
Experiments Setup
• Topic-based summarizations are tested on the Brown corpus in the public dataset SemCor-3.0, which contains 186 documents classified into 15 categories.
• The semantic indexing is restricted to nouns and noun phrases; for this, all fourth-level noun SynSets in the hypernymy tree of WordNet-3.0 are used as semantic categories.
• A total of L = 2017 semantic categories are used in the TSI model.
• The prior semantic matrix is set by ψ_{lv} = n_{lv}/n_l, where n_l is the total number of SynSets in semantic category l, and n_{lv} is the number of SynSets of word v in category l.
• We set α = (0.1, …, 0.1)^T and β = (0.1, …, 0.1)^T, which are common default values in many applications.
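The prior ψ_{lv} = n_{lv}/n_l amounts to row-normalizing a category-by-word count matrix; the tiny count table below is a made-up illustration, not WordNet data.

```python
import numpy as np

# n[l, v]: number of SynSets of word v that fall in semantic category l
# (toy counts; real values would come from the WordNet hypernymy tree)
n = np.array([[2, 1, 0],
              [0, 3, 1]], dtype=float)

n_l = n.sum(axis=1, keepdims=True)   # n_l: total SynSets in category l
psi = n / n_l                        # psi_lv = n_lv / n_l; each row sums to 1
```

Each row of ψ is then a proper discrete distribution over the vocabulary, as required for a semantic category.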
Evaluation of Summarizers
• Suppose there are T summarizers under test, M documents to review, and a number of reviewers.
• Which summarizer has the best performance?
• We use one-way analysis of variance (ANOVA).
Evaluation of Summarizers
• A_{M×T}: the index matrix, where M and T are the numbers of documents and summarizers; the m-th row of A (i.e., (a_{m1}, …, a_{mt}, …, a_{mT})) indicates the ordering of the T summary results of document m.
• The results are scored 1, 2, …, T, from worst to best.
• B_{M×T}: one human review matrix, in which b_{mt} is the score of summarizer [a_{mt}].
• C_{M×T}: the feedback matrix, in which c_{mt} is the score of summarizer t on document m, recovered by c_{m,a_{mt}} = b_{mt}, where m = 1, …, M, t = 1, …, T.
Evaluation by One-way ANOVA
1: Input: data A_{M×T}, B_{M×T}, and significance level α.
2: Output: ranks of summarizers.
3: The comparison matrix H_{T×T} is initialized as zero.
4: Get the feedback matrix C and the mean scores of the T summarizers, then initialize s_1 ≼ … ≼ s_T.
5: for all pairs (i, j) satisfying i ≺ j do
6:   if H_0^{(i,j)} is rejected at the given level α then
7:     s_i ≺ s_j, where ≺ means "is worse than".
8:     Let h_{it} = 1 for all t ≥ j.
9:   end if
10: end for
11: The summarizer s_t is ranked by the sum of the t-th column of H, where t = 1, …, T.
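The pairwise test in the loop compares two summarizers' score columns from C with a one-way ANOVA F statistic; a minimal sketch with a hand-rolled F computation (the toy scores below are made up, and the critical-value comparison is left to an F table or a stats library):

```python
import numpy as np

def one_way_anova_f(*groups):
    """F statistic for one-way ANOVA over k groups of observations."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(2)
scores_i = rng.normal(1.5, 0.5, size=30)   # column of C for a weaker summarizer
scores_j = rng.normal(3.0, 0.5, size=30)   # column of C for a stronger summarizer

F = one_way_anova_f(scores_i, scores_j)
# If F exceeds the critical value F_alpha(k-1, n-k), H0 (equal means) is
# rejected and the summarizer with the lower mean is ranked as worse.
```

With two groups this F test is equivalent to a two-sample t test (F = t²), so the pairwise loop is a sequence of mean comparisons at level α.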
Latent Topic-semantic Indexing based Automatic Text Summarization
16/20
Introduction Method Experiments Conclusions
[Figure: mean scores of four summarizers on the testing Brown corpus.]
Conclusions
• We proposed a novel deep probabilistic approach to:
  • index the latent topics and semantics of words in a collection of documents, and
  • apply the topic-semantic indexing (TSI) model to automatic summarization.
• The topic-based summarizers, together with two other non-topic-driven summarizers, FS and OTS, are tested on the Brown corpus in the public dataset SemCor-3.0.
• The summaries are reviewed by humans.
• The performance of summarization is analyzed by a well-designed blind experiment; each summarizer is evaluated by ranks derived from hypothesis tests of one-way ANOVA.
• The experimental results show that TSI is a promising method for topic-driven summarization.
• In the present TSI-based summarization, each observation window is a word.
• Further work includes more experiments on several distinct sizes of observation windows, efficient extraction strategies, and their ensemble learning.