Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams Alexander Kotov, ChengXiang Zhai, Richard Sproat University of Illinois at Urbana-Champaign

Mar 29, 2015

Transcript
Page 1:

Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Alexander Kotov, ChengXiang Zhai, Richard Sproat

University of Illinois at Urbana-Champaign

Page 2:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 3:

Motivation

• Web data is generated by a large number of textual streams (news, blogs, tweets, etc.)
• Bursts of entity mentions (people, locations) correspond to a particular event
• Bursts of entity mentions are influenced by bursts of other entities

Intuition: bursts of semantically related entities should be temporally correlated

Page 4:

Problem definition

[Figure: mention counts of entity 1 and entity 2 plotted over time from t_0 to t_T, annotated with the three difficulties (sparsity, magnitude, time lag); the question posed is whether entity 1 = entity 2.]

Page 5:

Temporally correlated bursts

Problem: given a collection of textual streams, discover named entities with correlated bursts

• Provide multilingual summaries of real-life events
• Estimate the social impact of a particular event in different countries
• Differentiate between local and global events
• Discover transliterations of named entities

Page 6:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 7:

Previous work

Burst detection:
• infinite-state automaton (Kleinberg ’02)
• factorial HMMs (Krause ’06)
• wavelet transformation (Zhu ’03)

Stream correlation:
• distance-based measures: Pearson coefficient (Chien ’05)
• singular spectrum transformation (Ide ’05)
• topic-based (PLSA, LDA) (Wang ’09)

Page 8:

Previous work

• Smoothing is efficient for large amounts of data, but not precise
• Existing methods do not abstract away from the raw data
• Distance-based measures suffer from magnitude and sparsity problems
• Temporal lags are not considered
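
The temporal-lag limitation of a distance-based measure can be illustrated with a small sketch. The mention counts below are hypothetical (not taken from the dataset in this talk), and `pearson` is a plain textbook implementation written for this example. It shows only the lag issue: the same burst shifted by two days no longer correlates positively.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily mention counts for one entity in two streams.
entity_1 = [0, 0, 9, 8, 0, 0, 0, 0]
same_day = [0, 0, 9, 8, 0, 0, 0, 0]  # burst on the same days
lagged = [0, 0, 0, 0, 9, 8, 0, 0]    # identical burst, two days later

r_same = pearson(entity_1, same_day)
r_lag = pearson(entity_1, lagged)
print(f"same day: {r_same:.2f}, lagged: {r_lag:.2f}")
```

With these toy counts the lagged pair gets a negative coefficient even though the burst shapes are identical, which is the behaviour a lag-tolerant alignment is meant to avoid.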

Page 9:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 10:

Approach

Difference in magnitude: normalization with a Markov-Modulated Poisson Process (MMPP)

Temporal lag: flexible alignment of bursts using dynamic programming (DP)

Page 11:

Markov-Modulated Poisson Process

• Ergodic Markov chain over a finite number of states

• Each state is associated with a Poisson distribution

• The “burstiness” of a state is represented by the intensity parameter of its Poisson distribution

• States are labeled by the rank of their intensity parameter
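
As a rough illustration of how such a model can turn raw mention counts into rank-labeled states, here is a minimal sketch that treats the process in discrete time as a hidden Markov model with Poisson emissions and decodes the most likely state sequence with the Viterbi algorithm. The intensities, the `stay_prob` transition parameter, the uniform initial distribution, and the counts are all assumptions made for this example; the talk does not say how the parameters were obtained, and in practice they would be estimated from data.

```python
from math import lgamma, log

def poisson_logpmf(k, rate):
    """Log-probability of observing count k under Poisson(rate), rate > 0."""
    return k * log(rate) - rate - lgamma(k + 1)

def label_states(counts, rates, stay_prob=0.8):
    """Viterbi decoding for a Poisson-emission HMM with fixed parameters.

    Returns 1-based labels, where 1 is the state with the lowest intensity.
    Assumes at least two states and a symmetric transition matrix
    (hypothetical simplifications for this sketch).
    """
    rates = sorted(rates)  # label = rank of the intensity parameter
    n = len(rates)
    move = (1.0 - stay_prob) / (n - 1)
    log_trans = [[log(stay_prob if i == j else move) for j in range(n)]
                 for i in range(n)]
    # Uniform initial state distribution.
    score = [log(1.0 / n) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for c in counts[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j]
                         + poisson_logpmf(c, rates[j]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]

# Hypothetical daily mention counts and intensities.
counts = [1, 0, 2, 1, 40, 45, 38, 1, 0, 2]
print(label_states(counts, rates=[1.0, 10.0, 40.0]))
```

Because the output is a sequence of small integer ranks rather than raw counts, two streams with very different volumes become directly comparable.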

Page 12:

Normalization

[Figure: two streams of mention counts over time, each shown together with the sequence of MMPP state labels (1 to 3) assigned to it.]

Page 13:

Normalization

• MMPP consistently outperforms the baseline
• The optimal performance is achieved when the number of states is 3

Page 14:

Burst Alignment

Input:
- pair of normalized mention-count (MC) streams of the same length;
- threshold for “bursty” states;
- reward constant;
- penalty function.

Output: a table
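
The exact recurrence is not recoverable from this slide, so the following is only a sketch of one way a dynamic program over these inputs could look. The function name `burst_alignment_table`, the `max_gap` parameter, and the recurrence itself (an order-preserving alignment in which two bursty time points may be matched if they are at most `max_gap` steps apart, earning the reward minus a penalty on the gap) are assumptions for illustration, not the authors' algorithm.

```python
def burst_alignment_table(a, b, threshold, reward, penalty, max_gap):
    """Fill a DP table of alignment scores for two normalized streams.

    a, b      : sequences of state labels of equal length
    threshold : labels >= threshold count as bursty
    reward    : constant gained for each matched pair of bursty points
    penalty   : function of the temporal gap |i - j| between matched points
    max_gap   : largest gap at which a match is still allowed (assumed)

    table[i][j] is the best score using the first i points of a and the
    first j points of b; the overall score is table[-1][-1].
    """
    if len(a) != len(b):
        raise ValueError("streams must have the same length")
    t = len(a)
    table = [[0.0] * (t + 1) for _ in range(t + 1)]
    for i in range(1, t + 1):
        for j in range(1, t + 1):
            # Skip a point in either stream without matching it.
            best = max(table[i - 1][j], table[i][j - 1])
            gap = abs(i - j)
            both_bursty = a[i - 1] >= threshold and b[j - 1] >= threshold
            if both_bursty and gap <= max_gap:
                # Match the two bursty points, paying for the time lag.
                best = max(best, table[i - 1][j - 1] + reward - penalty(gap))
            table[i][j] = best
    return table

def quadratic(gap):
    """Quadratic penalty on the temporal gap."""
    return gap ** 2

# Hypothetical state sequences: the same burst, one day apart.
stream_a = [1, 1, 3, 3, 1, 1]
stream_b = [1, 1, 1, 3, 3, 1]
table = burst_alignment_table(stream_a, stream_b, threshold=2,
                              reward=2.0, penalty=quadratic, max_gap=1)
print(table[-1][-1])
```

The usage example picks a quadratic penalty, a reward constant of 2, and a maximum gap of one day because those are the settings this talk reports as optimal; any other penalty function can be passed in the same way.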

Page 15:

Burst alignment

perfect alignment

exponential penalty

logarithmic penalty

Page 16:

Burst alignment

• A quadratic penalty function in combination with a reward constant of 2 is optimal
• The maximum permitted temporal gap is 1 day

Page 17:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 18:

Dataset

• News data crawled from RSS feeds over 4 months
• Basic named entity recognition
• Basic stemming

Page 19:

Correlated Bursts

Real-life events:

• Pattern 1: World Economic Forum in Davos, Switzerland, and death of actor Heath Ledger
• Pattern 2: death of Bobby Fischer
• Pattern 3: assassination of Benazir Bhutto
• Pattern 4: major trading loss incident at a French bank and death of George Habash

Page 20:

Mining transliterations

Static aligned corpora:
+ identical or semantically related contents
+ temporal topical alignment
- limited coverage

Web:
+ covers almost any domain
- difference in burst magnitude
- temporal lag between bursts

Page 21:

Transliteration

• MMPP+DP outperforms one baseline (CS) in all entropy categories and the other baseline (PC) for low- and medium-entropy (more “bursty”) entities
• The combination MMPP+DP performs better than MMPP alone

Page 22:

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 23:

Summary

• Novel multi-stream text mining problem
• Our approach can effectively discover correlated bursts corresponding to major and minor real-life events
• Effective for unsupervised discovery of transliterations
• The method is data-independent and not limited to the textual domain

Page 24:

Contributions

First method to use MMPP for burst detection in textual streams

Algorithm for temporally flexible stream correlation based on bursts

Unsupervised method for language-independent transliteration without any linguistic knowledge

Page 25:

Future work

Applying the proposed method to non-textual data (e.g., sensor streams)

Burst correlations between entities in different types of Web 2.0 data (news and tweets, news and blogs, news and tags, etc.)