
Multi-step Classification Approaches to Cumulative Citation Recommendation



Presented at the Open research Areas in Information Retrieval (OAIR 2013) conference
Transcript
Page 1: Multi-step Classification Approaches to Cumulative Citation Recommendation

Multi-step Classification Approaches to Cumulative Citation Recommendation
Krisztian Balog, University of Stavanger

Open research Areas in Information Retrieval (OAIR) 2013 | Lisbon, Portugal, May 2013

Naimdjon Takhirov, Heri Ramampiaro, Kjetil Nørvåg
Norwegian University of Science and Technology

Page 2: Multi-step Classification Approaches to Cumulative Citation Recommendation

Motivation
- Maintain the accuracy and high quality of knowledge bases
- Develop automated methods to discover (and process) new information as it becomes available

Page 3: Multi-step Classification Approaches to Cumulative Citation Recommendation

TREC 2012 KBA track

Page 4: Multi-step Classification Approaches to Cumulative Citation Recommendation

Central?

Page 5: Multi-step Classification Approaches to Cumulative Citation Recommendation

Task: Cumulative Citation Recommendation
- Filter a time-ordered corpus for documents that are highly relevant to a predefined set of entities
- For each entity, provide a ranked list of documents based on their “citation-worthiness”

Page 6: Multi-step Classification Approaches to Cumulative Citation Recommendation

Collection and topics
- KBA stream corpus
  - Oct 2011 - Apr 2012
  - Split into training and testing periods
  - Three sources: news, social, linking
  - Raw data: 8.7TB; cleansed version: 1.2TB (270GB compressed)
  - Stream documents uniquely identified by stream_id
- Test topics (“target entities”)
  - 29 entities from Wikipedia (27 persons, 2 organizations)
  - Uniquely identified by urlname

Page 7: Multi-step Classification Approaches to Cumulative Citation Recommendation

Annotation matrix
[Figure: annotation matrix - rows: contains mention (yes / no); columns: garbage (G), neutral (N), relevant (R), central (C); G and N are non-relevant, R and C are relevant.]
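The matrix above groups the four annotation levels into non-relevant (G, N) and relevant (R, C) documents. Below is a minimal sketch of how these levels can be mapped onto the binary targets used by the later classification steps; the helper names are hypothetical, not taken from the paper.

```python
# Minimal sketch: mapping the annotation levels onto binary targets.
LEVELS = ["garbage", "neutral", "relevant", "central"]  # G, N, R, C

def is_relevant(level: str) -> bool:
    """Binary target for a 'Relevant?' step: R and C count as relevant."""
    return level in ("relevant", "central")

def is_central(level: str) -> bool:
    """Binary target for a 'Central?' step: only C counts."""
    return level == "central"

if __name__ == "__main__":
    for level in LEVELS:
        print(f"{level:8s} relevant={is_relevant(level)} central={is_central(level)}")
```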

Page 8: Multi-step Classification Approaches to Cumulative Citation Recommendation

Scoring
[Figure: scoring example for target entity Aharon Barak - documents, identified by urlname and stream_id, are listed with scores ranging from 1000 down to 263; a cutoff on the score separates positive from negative documents.]
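A minimal sketch of the cutoff idea illustrated above, using hypothetical (stream_id, score) pairs rather than actual corpus data; scores are on the 0-1000 scale used by the track.

```python
# Minimal sketch: splitting a ranked list for one entity at a score cutoff.
from typing import List, Tuple

def split_at_cutoff(ranked: List[Tuple[str, int]], cutoff: int):
    """Documents scoring at or above the cutoff are returned as positives."""
    positives = [(doc, s) for doc, s in ranked if s >= cutoff]
    negatives = [(doc, s) for doc, s in ranked if s < cutoff]
    return positives, negatives

if __name__ == "__main__":
    # Hypothetical (stream_id, score) pairs for urlname "Aharon_Barak".
    ranked = [("doc-a", 1000), ("doc-b", 500), ("doc-c", 430), ("doc-d", 263)]
    pos, neg = split_at_cutoff(ranked, cutoff=450)
    print("positive:", pos)
    print("negative:", neg)
```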

Page 9: Multi-step Classification Approaches to Cumulative Citation Recommendation

Approach

Page 10: Multi-step Classification Approaches to Cumulative Citation Recommendation

Overview
- “Is this document central for this entity?”
- Binary classification task
- Multi-step approach
  - Classifying every document-entity pair is not feasible
  - First step to decide whether the document contains the entity
  - Subsequent step(s) to decide centrality

Page 11: Multi-step Classification Approaches to Cumulative Citation Recommendation

2-step classification

[Figure: 2-step cascade - Mention? (Y/N), then Central? (Y/N); documents passing both steps are assigned a score between 0 and 1000.]

Page 12: Multi-step Classification Approaches to Cumulative Citation Recommendation

3-step classification

[Figure: 3-step cascade - Mention? (Y/N), then Relevant? (Y/N), then Central? (Y/N); documents passing all steps are assigned a score between 0 and 1000.]

Page 13: Multi-step Classification Approaches to Cumulative Citation Recommendation

Components

[Figure: the 2-step and 3-step cascades with their components labeled - the Mention? step uses mention detection, the Relevant? and Central? steps use supervised learning.]
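A minimal sketch of how the two cascades fit together, assuming hypothetical mentions/is_relevant/is_central/score callables; in the actual system the first step is the mention detector and the later steps are supervised classifiers (J48, Random Forests in the experiments).

```python
# Minimal sketch of the 2-step and 3-step cascades.
def two_step(doc, entity, mentions, is_central, score):
    """2-step cascade: mention detection, then a 'Central?' classifier."""
    if not mentions(doc, entity):
        return None                      # filtered out before classification
    return score(doc, entity) if is_central(doc, entity) else None

def three_step(doc, entity, mentions, is_relevant, is_central, score):
    """3-step cascade: mention detection, 'Relevant?', then 'Central?'."""
    if not mentions(doc, entity):
        return None
    if not is_relevant(doc, entity):
        return None
    return score(doc, entity) if is_central(doc, entity) else None

if __name__ == "__main__":
    # Toy stand-ins for the real components.
    mentions = lambda d, e: e.lower() in d.lower()
    central = lambda d, e: "ruling" in d.lower()
    score = lambda d, e: 1000
    print(two_step("Aharon Barak issued a ruling today.", "Aharon Barak",
                   mentions, central, score))
```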

Page 14: Multi-step Classification Approaches to Cumulative Citation Recommendation

Identifying entity mentions
- Goals
  - High recall
  - Keep false positive rate low
  - Efficiency
- Detection based on known surface forms of the entity
  - urlname (i.e., Wikipedia title)
  - name variants from DBpedia
  - DBpedia-loose: only last names for people
- No disambiguation
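A minimal sketch of surface-form based mention detection without disambiguation; the surface forms shown are hypothetical examples, while the real system derives them from the Wikipedia urlname and DBpedia name variants.

```python
# Minimal sketch: match known surface forms of an entity in document text.
import re

def build_surface_forms(urlname, dbpedia_variants, loose=False):
    """Collect surface forms; 'loose' adds last names for person entities."""
    forms = {urlname.replace("_", " ")} | set(dbpedia_variants)
    if loose:
        forms |= {name.split()[-1] for name in forms if " " in name}
    return forms

def mentions_entity(text, forms):
    """True if any known surface form occurs in the document text."""
    return any(re.search(r"\b" + re.escape(f) + r"\b", text, re.IGNORECASE)
               for f in forms)

if __name__ == "__main__":
    forms = build_surface_forms("Aharon_Barak", ["Aharon Barak"], loose=True)
    print(mentions_entity("Former chief justice Barak spoke today.", forms))
```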

Page 15: Multi-step Classification Approaches to Cumulative Citation Recommendation

Features
1. Document (5)
   - Length of document fields (body, title, anchors)
   - Type (news/social/linking)
2. Entity (1)
   - Number of related entities in KB
3. Document-entity (28)
   - Occurrences of entity in document
   - Number of related entity mentions
   - Similarity between doc and the entity’s WP page

Page 16: Multi-step Classification Approaches to Cumulative Citation Recommendation

Features
4. Temporal (38)
   - Wikipedia pageviews
     - Average pageviews
     - Change in pageviews
     - Bursts
   - Mentions in document stream
     - Average volume
     - Change in volume
     - Bursts
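A minimal sketch of feature extraction for one document-entity pair, with a few illustrative features per group; the feature names and definitions here are simplified stand-ins for the 5 + 1 + 28 + 38 features used in the paper.

```python
# Minimal sketch: a few representative features from each of the four groups.
def extract_features(doc, entity, pageview_history, stream_mention_history):
    feats = {}
    # 1. Document features: field lengths and source type.
    feats["len_body"] = len(doc["body"].split())
    feats["len_title"] = len(doc["title"].split())
    feats["source_is_news"] = int(doc["source"] == "news")
    # 2. Entity feature: number of related entities in the knowledge base.
    feats["num_related_entities"] = len(entity["related"])
    # 3. Document-entity features: entity and related-entity occurrences.
    body = doc["body"].lower()
    feats["entity_mentions"] = body.count(entity["name"].lower())
    feats["related_mentions"] = sum(body.count(r.lower()) for r in entity["related"])
    # 4. Temporal features: averages and recent change in pageviews / mentions.
    feats["avg_pageviews"] = sum(pageview_history) / max(len(pageview_history), 1)
    feats["pageview_change"] = (pageview_history[-1] - pageview_history[0]
                                if len(pageview_history) > 1 else 0)
    feats["avg_stream_mentions"] = (sum(stream_mention_history)
                                    / max(len(stream_mention_history), 1))
    return feats

if __name__ == "__main__":
    doc = {"body": "Aharon Barak met the Knesset today.",
           "title": "Barak news", "source": "news"}
    entity = {"name": "Aharon Barak", "related": ["Knesset"]}
    print(extract_features(doc, entity, [120, 150, 300], [2, 3, 8]))
```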

Page 17: Multi-step Classification Approaches to Cumulative Citation Recommendation

Results

Page 18: Multi-step Classification Approaches to Cumulative Citation Recommendation

Identifying entity mentions
Results on testing period

Identification   Document-entity pairs   Recall   False positive rate
urlname          41.2K                   0.842    0.559
DBpedia          70.4K                   0.974    0.701
DBpedia-loose    12.5M                   0.994    0.998

Page 19: Multi-step Classification Approaches to Cumulative Citation Recommendation

End-to-end task: F1 score, single cutoff
[Figure: two panels, “Central” (y-axis 0-0.45) and “Central+Relevant” (y-axis 0-0.75), comparing the 2-step and 3-step approaches with J48 and RF classifiers against TREC results.]

Page 20: Multi-step Classification Approaches to Cumulative Citation Recommendation

End-to-end task: F1 score, per-entity cutoff
[Figure: two panels, “Central” (y-axis 0-0.45) and “Central+Relevant” (y-axis 0-0.75), comparing the 2-step and 3-step approaches with J48 and RF classifiers against TREC results.]
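A minimal sketch of the two cutoff regimes behind these figures: a single cutoff shared by all entities versus a separate cutoff chosen per entity. The scores, labels, and second entity name are toy data for illustration only.

```python
# Minimal sketch: F1 under a single global cutoff vs. per-entity cutoffs.
def f1(scored, cutoff):
    """F1 when predicting positive for score >= cutoff."""
    tp = sum(1 for s, y in scored if s >= cutoff and y)
    fp = sum(1 for s, y in scored if s >= cutoff and not y)
    fn = sum(1 for s, y in scored if s < cutoff and y)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def best_cutoff(scored, cutoffs=range(0, 1001, 50)):
    return max(cutoffs, key=lambda c: f1(scored, c))

if __name__ == "__main__":
    per_entity = {  # entity -> [(score, is_positive), ...] (toy data)
        "Aharon_Barak": [(900, True), (600, True), (400, False), (100, False)],
        "Another_Entity": [(700, False), (500, True), (300, True)],
    }
    all_pairs = [p for pairs in per_entity.values() for p in pairs]
    single = best_cutoff(all_pairs)
    print("single cutoff:", single, "F1:", round(f1(all_pairs, single), 3))
    for e, pairs in per_entity.items():
        c = best_cutoff(pairs)
        print(e, "per-entity cutoff:", c, "F1:", round(f1(pairs, c), 3))
```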

Page 21: Multi-step Classification Approaches to Cumulative Citation Recommendation

Summary
- Cumulative Citation Recommendation task @TREC 2012 KBA
- Two multi-step classification approaches
- Four groups of features
- Differentiating between relevant and central is difficult

Page 22: Multi-step Classification Approaches to Cumulative Citation Recommendation

Classification vs. Ranking [Balog & Ramampiaro, SIGIR’13]
- Approach CCR as a ranking task
- Learning-to-rank: pointwise, pairwise, listwise
- Pointwise LTR outperforms classification approaches using the same set of features

http://krisztianbalog.com/files/sigir2013-kba.pdf
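A minimal sketch of the pointwise learning-to-rank idea: train a regressor on graded relevance labels and rank documents by predicted score. The choice of scikit-learn's RandomForestRegressor and the toy data are assumptions for illustration, not the exact model or configuration from the SIGIR'13 paper.

```python
# Minimal sketch: pointwise LTR as regression on graded relevance labels.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy feature vectors and graded labels (e.g. 0=garbage ... 3=central).
X_train = np.array([[12.0, 3, 1], [4.0, 0, 0], [20.0, 5, 1], [2.0, 1, 0]])
y_train = np.array([3, 0, 2, 1])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Rank unseen documents for an entity by predicted graded relevance.
X_test = np.array([[15.0, 4, 1], [3.0, 0, 0]])
scores = model.predict(X_test)
ranking = np.argsort(-scores)
print("predicted scores:", scores, "ranking:", ranking)
```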

Page 23: Multi-step Classification Approaches to Cumulative Citation Recommendation

KBAAA: Knowledge Base Acceleration Assessment & Analysis

http://research.idi.ntnu.no/wislab/kbaaa

Page 24: Multi-step Classification Approaches to Cumulative Citation Recommendation

Questions?

Contact | @krisztianbalog | krisztianbalog.com