A probabilistic model for retrospective news event detection
Zhiwei Li, Bin Wang, Mingjing Li and Wei-Ying Ma. A probabilistic model for retrospective news event detection. In the 28th Annual International ACM SIGIR Conference (SIGIR'2005), 2005.
Presenter: Suhan Yu
• We assume that only the salient peaks correspond to events.
  – The initial estimate of the number of events can be set to the number of peaks.
• Use a hill-climbing approach to detect all peaks.
• Compute a salient score for each of them.
• The top 20% of peaks are defined as salient peaks.
• Splitting/merging initial peaks
To detect salient peaks, we define salient scores for peaks as:
$\mathrm{score}(peak) = \mathrm{left}(peak) + \mathrm{right}(peak)$
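The peak-detection steps on this slide can be sketched as follows. This is only an illustration: the slide does not define left(peak) and right(peak), so the sketch assumes they are the distances (in time steps) from the peak to the first strictly higher point on each side, or to the edge of the series if there is none. The function names and the `counts` input (articles per time step) are hypothetical.

```python
import math
from typing import List


def climb(counts: List[int], start: int) -> int:
    """Hill-climb from `start`: move to a strictly higher neighbor until none exists."""
    i = start
    while True:
        best = i
        for j in (i - 1, i + 1):
            if 0 <= j < len(counts) and counts[j] > counts[best]:
                best = j
        if best == i:
            return i
        i = best


def find_peaks(counts: List[int]) -> List[int]:
    """All indices where some hill-climbing run stops (local maxima)."""
    return sorted({climb(counts, s) for s in range(len(counts))})


def side_distance(counts: List[int], peak: int, step: int) -> int:
    """ASSUMED definition of left/right: steps to the first strictly higher point
    in direction `step` (-1 or +1), or to just past the series edge."""
    j = peak + step
    while 0 <= j < len(counts) and counts[j] <= counts[peak]:
        j += step
    return abs(j - peak)


def salient_score(counts: List[int], peak: int) -> int:
    """score(peak) = left(peak) + right(peak)."""
    return side_distance(counts, peak, -1) + side_distance(counts, peak, +1)


def salient_peaks(counts: List[int], top_fraction: float = 0.2) -> List[int]:
    """Keep the top `top_fraction` of peaks by salient score (at least one)."""
    peaks = find_peaks(counts)
    ranked = sorted(peaks, key=lambda p: salient_score(counts, p), reverse=True)
    keep = max(1, math.ceil(top_fraction * len(peaks)))
    return sorted(ranked[:keep])


counts = [1, 3, 1, 2, 9, 2, 1, 4, 1, 2, 1]  # made-up daily article counts
print(find_peaks(counts))
print(salient_peaks(counts))
```

The number of salient peaks returned here would then serve as the initial estimate of the number of events before splitting or merging.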
Splitting/merging initial salient peaks
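• MDL (Minimum Description Length)
$k = \arg\max_k \left( \log p(X; \theta_k) - \frac{m_k}{2} \log M \right)$
$m_k = 3k - 1 + k(N_p - 1) + k(N_l - 1) + k(N_n - 1)$
  – The term $\frac{m_k}{2} \log M$ is the penalty.
  – $M$ = number of all articles
  – $N_p$ = person vocabulary size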
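A minimal sketch of MDL-based selection of the number of events k: each candidate k is scored by its log-likelihood minus the penalty (m_k / 2) log M, with m_k = 3k - 1 + k(N_p - 1) + k(N_l - 1) + k(N_n - 1), and the best-scoring k is kept. The log-likelihood values below are made up for illustration; in practice they would come from fitting the model for each k. The slide defines only N_p, so treating `n_l` and `n_n` as the sizes of the other vocabularies is an assumption, and all function names are hypothetical.

```python
import math
from typing import Dict


def free_parameters(k: int, n_p: int, n_l: int, n_n: int) -> int:
    """m_k = 3k - 1 + k(N_p - 1) + k(N_l - 1) + k(N_n - 1)."""
    return 3 * k - 1 + k * (n_p - 1) + k * (n_l - 1) + k * (n_n - 1)


def mdl_score(log_likelihood: float, k: int, n_articles: int,
              n_p: int, n_l: int, n_n: int) -> float:
    """Log-likelihood minus the penalty (m_k / 2) * log(M)."""
    penalty = 0.5 * free_parameters(k, n_p, n_l, n_n) * math.log(n_articles)
    return log_likelihood - penalty


def select_k(log_likelihoods: Dict[int, float], n_articles: int,
             n_p: int, n_l: int, n_n: int) -> int:
    """Return the candidate k with the highest penalized score."""
    return max(
        log_likelihoods,
        key=lambda k: mdl_score(log_likelihoods[k], k, n_articles, n_p, n_l, n_n),
    )


# Made-up log-likelihoods for k = 1, 2, 3 (illustration only).
candidates = {1: -500.0, 2: -400.0, 3: -390.0}
print(select_k(candidates, n_articles=100, n_p=3, n_l=4, n_n=5))
```

In this toy input, going from k = 2 to k = 3 improves the likelihood only slightly, so the penalty on the extra parameters makes k = 2 the selected value.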
Event summarization
• Maximum a Posteriori (MAP)
$y_i = \arg\max_j \, p(e_j \mid x_i)$
  – $y_i$ is the label of news article $x_i$
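The MAP rule assigns each article to the event with the highest posterior probability. A small sketch, assuming the event priors p(e_j) and per-article likelihoods p(x_i | e_j) are already available as plain lists (the numbers below are invented, and the function names are hypothetical); the posterior follows from Bayes' rule.

```python
from typing import List


def posteriors(priors: List[float], likelihoods: List[float]) -> List[float]:
    """p(e_j | x_i) is proportional to p(e_j) * p(x_i | e_j); normalize over events j."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [v / total for v in joint]


def map_label(priors: List[float], likelihoods: List[float]) -> int:
    """y_i = argmax_j p(e_j | x_i)."""
    post = posteriors(priors, likelihoods)
    return max(range(len(post)), key=post.__getitem__)


priors = [0.5, 0.3, 0.2]          # made-up p(e_j) for three events
article_likelihoods = [
    [0.01, 0.20, 0.05],           # made-up p(x_0 | e_j)
    [0.30, 0.02, 0.10],           # made-up p(x_1 | e_j)
]
labels = [map_label(priors, row) for row in article_likelihoods]
print(labels)
```

Because the normalizing constant is the same for every event, taking the argmax over the unnormalized products would give the same labels.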
Algorithm summary
Multi-modal RED algorithm application
• HISCOVERY system
  – HISCOVERY (HIStory disCOVERY)
  – Two useful functions
    • Photo Story
    • Chronicle
  – News articles come from 12 news sites (such as CNN, MSNBC, BBC…)
HISCOVERY system
Experimental methods
• Data
  – TDT
    • Benchmarks for event detection.
  – TDT4
    • Run experiments
    • Contains 80 events annotated from 28500 news articles.
    • These articles were collected from the period 2000/10~2001/1.
• Each year’s reports can be regarded as an event.
• Extracting named entities.
  – Extracted by the BBN NLP tool, which can extract seven types of named entities.
Experimental design
• To compare the approach with other algorithms:
  – Group Average Clustering (GAC)
    • It is the best algorithm in TDT evaluations.
    • A hierarchical clustering method
• Baseline
  – kNN algorithm
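For readers unfamiliar with the comparison method, the idea behind group-average clustering can be sketched as plain agglomerative clustering with average linkage over cosine similarity. This is a generic textbook version, not the specific GAC variant used in TDT evaluations or in the paper; the `threshold` stopping rule, the term-weight dictionaries, and the function names are all assumptions made for the example.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def average_link(docs: List[Vector], c1: List[int], c2: List[int]) -> float:
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def group_average_clustering(docs: List[Vector], threshold: float) -> List[List[int]]:
    """Repeatedly merge the most similar pair of clusters.

    Stops when the best average-link similarity drops below `threshold`
    (an assumed stopping rule for this sketch).
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_sim, best_pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = average_link(docs, clusters[a], clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


docs = [
    {"election": 1.0, "vote": 1.0},
    {"election": 1.0, "vote": 2.0},
    {"storm": 1.0, "flood": 1.0},
    {"storm": 2.0, "flood": 1.0},
]
print(group_average_clustering(docs, threshold=0.5))
```

The pairwise search makes this version far too slow for a corpus the size of TDT4; it is meant only to show what "hierarchical" and "group average" refer to.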
Results
• The probabilistic model gains the best results, but the improvements are not significant.
Results
• Named entities
[Result figures; panel labels: 39 events, 46 events]
Conclusion
• Studied two characteristics of news articles and events.
• Proposed a multi-modal RED algorithm.
• Future work:
  – Use suitable dynamic models to model news events.