COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj (1000663882)

Dec 27, 2015

Page 1

COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT

- Presented by Avinash S Bharadwaj (1000663882)

Page 2

ABSTRACT

The aim of the paper is to annotate open-domain, unstructured web text with uniquely identified entities from a social media resource such as Wikipedia.

The annotations are then used for search and mining tasks.

Page 3

WHAT IS ENTITY DISAMBIGUATION?

An entity is something that is real and has a distinct existence.

Wikipedia articles can be considered as entities.

Entity disambiguation is the art of resolving correspondence between mentions of entities in natural language and real world entities.

In this paper, disambiguation is carried out between mentions in web pages and Wikipedia articles.

Page 4

ENTITY DISAMBIGUATION EXAMPLE

Page 5

PREVIOUS WORK IN DISAMBIGUATION

SemTag:

First web-scale disambiguation system.

Annotated about 250 million web pages with IDs from the Stanford TAP.

Preferred high precision over recall, with an average of two annotations per page.

Wikify!:

Performed both keyword extraction and disambiguation.

Could not achieve collective disambiguation across spots.

Milne and Witten (M&W):

A form of collective disambiguation that gives better results than Wikify!.

M&W achieves an F1 measure of 0.83, whereas Wikify! has an F1 measure of 0.53.

Cucerzan’s algorithm:

Each entity is represented as a high-dimensional feature vector.

Cucerzan annotates sparingly: only about 4.5% of all possible tokens are annotated.

Page 6

TERMINOLOGIES

Spots: Occurrences of text on a page that can possibly be linked to a Wikipedia article.

Attachment: Possible entities in Wikipedia to which a spot can be linked.

Annotation: Process of making an attachment to spots on a page.

Gamma list: List of all possible annotations.
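The terminology above can be made concrete with a small data-structure sketch. Everything in it is hypothetical and only for illustration: the `Spot` class, the `find_spots` helper, the toy candidate table `GAMMA`, and the example sentence are assumptions, not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    text: str    # surface text that might link to a Wikipedia article
    offset: int  # character offset of the spot in the page


def find_spots(page, gamma):
    """Return every occurrence of a known surface form in the page."""
    spots = []
    lowered = page.lower()
    for surface in gamma:
        needle = surface.lower()
        start = lowered.find(needle)
        while start != -1:
            spots.append(Spot(page[start:start + len(surface)], start))
            start = lowered.find(needle, start + 1)
    return sorted(spots, key=lambda s: s.offset)


# Hypothetical gamma list: surface form -> candidate Wikipedia entities.
GAMMA = {
    "Michael Jordan": ["Michael Jordan (basketball)", "Michael I. Jordan"],
    "Bulls": ["Chicago Bulls", "Bull"],
}

page = "Michael Jordan played for the Bulls."
spots = find_spots(page, GAMMA)

# An annotation attaches one candidate entity to each spot.
annotation = {
    spots[0]: GAMMA["Michael Jordan"][0],
    spots[1]: GAMMA["Bulls"][0],
}
```

Here each spot keeps its offset so that two occurrences of the same surface form can receive different attachments.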

Page 7

TERMINOLOGIES ILLUSTRATED

Labels shown: Spots, Attachment, Gamma list

Page 8

COLLECTIVE ENTITY DISAMBIGUATION

Sometimes disambiguation cannot be carried out using a single spot on a page.

Multiple spots on a page are required to disambiguate an entity.

All spots in an article are assumed to be related.

Page 9

COLLECTIVE ENTITY DISAMBIGUATION EXAMPLE

Page 10

CALCULATING RELATEDNESS BETWEEN WIKIPEDIA ENTITIES

Relatedness between two entities γ and γ’ is defined as r(γ, γ’) = g(γ) · g(γ’).

Cucerzan’s proposal defines relatedness between entities using the cosine measure.

Milne et al. proposal: relatedness is computed from shared inlinks, where c = number of Wikipedia pages and g(γ)[p] = 1 if page p links to page γ, 0 otherwise.
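A minimal sketch of link-based relatedness, assuming g(γ) is stored as the set of pages that link to γ. The function name `mw_relatedness` is hypothetical, and the formula used is the commonly cited Milne and Witten form (one minus a normalized link distance), which is an assumption here because the formula itself is not reproduced on the slide.

```python
import math


def mw_relatedness(inlinks_a, inlinks_b, c):
    """Link-based relatedness of two entities from their inlink sets.

    inlinks_a, inlinks_b: sets of page ids linking to each entity.
    c: total number of Wikipedia pages.
    """
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0  # no shared inlinks: treat as unrelated
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    denom = math.log(c) - math.log(small)
    if denom == 0:
        return 1.0  # both sets cover every page
    distance = (math.log(big) - math.log(len(common))) / denom
    return max(0.0, 1.0 - distance)


# With 0/1 indicator vectors, the dot product g(a) . g(b) is simply
# the number of pages that link to both entities.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
dot_product = len(a & b)
score = mw_relatedness(a, b, c=100)
```

Storing inlinks as sets keeps the dot product and the intersection size identical, so both definitions on the slide can share one representation.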

Page 11

CONTRIBUTIONS OF THIS PAPER

The paper poses entity disambiguation as an optimization problem.

The paper provides a single optimization objective, solved in two ways:

Using integer linear programs

Using heuristics for approximate solutions

The paper also describes rich node features with systematic learning.

The paper also describes a back-off strategy for controlled annotation.

Page 12

MODELING COMPATIBILITY BETWEEN WIKIPEDIA ARTICLES

Entities are modeled using a feature vector defined as f_s(γ).

The feature vector expresses local textual compatibility between (the context of) spot s and candidate label γ.

Components of the feature vector:

Spot side: context of the spot

Wikipedia side: snippet, full text, anchor text, anchor text with context

Similarity measures: dot product, cosine similarity, Jaccard similarity
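The three similarity measures can be sketched over bag-of-words counts as follows. This is an illustration under assumptions: whitespace tokenization, raw term counts, and the helper names (`dot`, `cosine`, `jaccard`, `feature_vector`) are all hypothetical choices, not the paper's exact feature definitions.

```python
import math
from collections import Counter


def dot(a, b):
    """Dot product of two term-count vectors."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is empty."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity of the two vocabularies."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def feature_vector(spot_context, wiki_fields):
    """Compare the spot context with each Wikipedia-side field.

    wiki_fields: field name -> text (e.g. snippet, full text, anchor text).
    Returns three similarity values per field, in sorted field order.
    """
    ctx = Counter(spot_context.lower().split())
    features = []
    for name in sorted(wiki_fields):
        field = Counter(wiki_fields[name].lower().split())
        features += [dot(ctx, field), cosine(ctx, field), jaccard(ctx, field)]
    return features
```

For example, `feature_vector("a b b", {"snippet": "b c", "full_text": "a"})` returns six values: three for `full_text` followed by three for `snippet`.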

Page 13

METHODS FOR EVALUATING THE MODEL

The authors evaluate a candidate annotation with two scores: node score and clique score.

Node score: defined by a scoring function whose weight vector w is trained using a linear adaptation of rank SVM.

Clique score: uses the relatedness measure of Milne and Witten.

Total objective: combines the node score and the clique score.
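One way to picture how a node score and a clique score might combine into a single objective is sketched below. The slide's formulas are not reproduced in the transcript, so the linear node score, the averaging over spots and over spot pairs, and all names (`node_score`, `total_objective`) are assumptions made for illustration only.

```python
from itertools import combinations


def node_score(w, features):
    """Linear compatibility score: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def total_objective(assignment, features, w, relatedness):
    """Score a full assignment of entities to spots.

    assignment:  spot -> chosen entity
    features:    (spot, entity) -> feature list
    relatedness: function (entity, entity) -> float
    """
    spots = list(assignment)
    pairs = list(combinations(spots, 2))
    node_total = sum(
        node_score(w, features[(s, assignment[s])]) for s in spots
    )
    clique_total = sum(
        relatedness(assignment[s], assignment[t]) for s, t in pairs
    )
    # Assumed normalization: average each term so neither dominates
    # merely because a page has many spots.
    node_avg = node_total / len(spots) if spots else 0.0
    clique_avg = clique_total / len(pairs) if pairs else 0.0
    return node_avg + clique_avg
```

The node term rewards local textual compatibility per spot, while the clique term rewards assignments whose entities are related to each other.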

Page 14

BACK-OFF METHOD

Not all spots in a web page may be tagged.

Uses a special tag “NA” for spots that can’t be tagged.

Spots in the web page marked “NA” do not contribute to the clique potential.

A factor called “RNA” controls the aggressiveness of the tagging algorithm.
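A simplified, per-spot view of the back-off idea is sketched below. It is an assumption-laden illustration: the paper applies the “NA” option inside its joint objective, whereas this sketch only thresholds each spot's best local score against a hypothetical `rna` parameter.

```python
NA = "NA"  # special tag for spots left unannotated


def back_off(candidate_scores, rna):
    """Pick the best candidate for one spot, or NA if it is too weak.

    candidate_scores: entity -> local (node) score for this spot
    rna: threshold; a lower value tags more aggressively
    """
    if not candidate_scores:
        return NA
    best = max(candidate_scores, key=candidate_scores.get)
    return best if candidate_scores[best] > rna else NA


scores = {"Chicago Bulls": 0.7, "Bull": 0.4}
tagged = back_off(scores, rna=0.5)
skipped = back_off(scores, rna=0.9)
```

Raising `rna` trades recall for precision, because fewer spots clear the threshold.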

Page 15

IMPLEMENTATION

Integer linear program (ILP) based formulation:

Casting as a 0/1 integer linear program

Relaxing it to an LP

Simpler heuristics:

Hill climbing for optimization
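The hill-climbing heuristic can be sketched generically: start from some assignment and keep changing one spot's label whenever that raises the objective. The function name `hill_climb`, the choice of starting point, and the `max_rounds` cap are assumptions; the scoring function is passed in, so any objective can be plugged in.

```python
def hill_climb(candidates, score, max_rounds=100):
    """Greedy local search over label assignments.

    candidates: spot -> non-empty list of candidate labels
    score:      function(assignment dict) -> float, higher is better
    """
    # Assumed starting point: the first candidate of every spot.
    assignment = {s: labels[0] for s, labels in candidates.items()}
    best = score(assignment)
    for _ in range(max_rounds):
        improved = False
        for spot, labels in candidates.items():
            for label in labels:
                if label == assignment[spot]:
                    continue
                trial = {**assignment, spot: label}
                value = score(trial)
                if value > best:  # accept only strict improvements
                    assignment, best, improved = trial, value, True
        if not improved:
            break  # local optimum reached
    return assignment, best
```

Hill climbing is cheap but only finds a local optimum, which is why it is listed as a simpler alternative to the ILP and LP formulations.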

Page 16

EVALUATING THE ALGORITHM

Evaluation measures used:

Precision: number of spots tagged correctly out of the total number of spots tagged.

Recall: number of spots tagged correctly out of the total number of spots in the ground truth.

F1: the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall).
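The three measures can be computed as in the sketch below, which assumes predictions and ground truth are both given as spot-to-entity mappings. The function name `evaluate` and the example data are hypothetical.

```python
def evaluate(predicted, truth):
    """Return (precision, recall, f1) for spot -> entity mappings."""
    correct = sum(
        1 for spot, entity in predicted.items() if truth.get(spot) == entity
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = {1: "A", 2: "B"}
truth = {1: "A", 2: "C", 3: "D", 4: "E"}
precision, recall, f1 = evaluate(predicted, truth)
```

In this toy case one of two tagged spots is correct and one of four ground-truth spots is recovered, so precision is 0.5 and recall is 0.25.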

Page 17

DATASETS USED FOR EVALUATION

The authors use web pages crawled and stored in the IITB database.

Publicly available data from Cucerzan’s experiments (CZ)

Page 18

EXPERIMENTAL RESULTS

Page 19

NAMED ENTITY DISAMBIGUATION IN WIKIPEDIA

The name ambiguity problem has created a demand for efficient, high-quality disambiguation methods.

This is not a trivial task: the application must be able to decide whether a group of name occurrences refers to the same entity.

Traditional methods of named entity disambiguation use the bag-of-words (BOW) model.

Page 20

WIKIPEDIA AS A SEMANTIC NETWORK

Wikipedia is an open database covering most of the useful topics in the world.

The title of a Wikipedia article describes the content of the article.

Titles may sometimes be noisy; noisy titles are filtered using rules from Hu et al.

Page 21

SEMANTIC RELATIONS BETWEEN WIKIPEDIA CONCEPTS

Wikipedia contains rich relation structures within its pages.

Relatedness is represented by links between Wikipedia pages.

Page 22

WORKING OF NAMED ENTITY DISAMBIGUATION USING WIKIPEDIA

Uses vectors to represent a Wikipedia entity.

Similarity between vectors is measured to perform named entity disambiguation.

Page 23

MEASURING SIMILARITY BETWEEN TWO WIKIPEDIA ENTITIES

The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia.

The algorithm works in three steps, described as follows.

Page 24

STEP 1

To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined.

Semantic relations between articles are used to match the articles.

Page 25

STEP 2

Compute the semantic relatedness from one concept vector representation to another.

Using the alignments shown in the previous step, SR(MJ1→MJ2) is computed as

(0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65) / (0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62

and SR(MJ2→MJ1) is computed as

(0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66) / (0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60.
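The arithmetic above has the shape of a weighted average, which the sketch below reproduces from the slide's numbers. Reading each triple as (weight in the source vector, weight in the target vector, relatedness of the aligned concepts) is an assumption about what the three factors mean, and the function name `semantic_relatedness` is hypothetical.

```python
def semantic_relatedness(alignments):
    """Weighted average of aligned-concept relatedness values.

    alignments: list of (source_weight, target_weight, relatedness).
    """
    numerator = sum(ws * wt * r for ws, wt, r in alignments)
    denominator = sum(ws * wt for ws, wt, _ in alignments)
    return numerator / denominator if denominator else 0.0


# Triples copied from the worked example in this step.
mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
mj2_to_mj1 = [
    (0.47, 0.42, 0.54),
    (0.52, 0.54, 0.58),
    (0.52, 0.51, 0.60),
    (0.51, 0.54, 0.66),
]

sr_12 = round(semantic_relatedness(mj1_to_mj2), 2)  # 0.62
sr_21 = round(semantic_relatedness(mj2_to_mj1), 2)  # 0.6
```

The two directions differ because each concept in the source vector is aligned separately, so the measure is not symmetric on its own.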

Page 26

STEP 3

Compute the similarity between two concept vector representations.

Similarity SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10 and SIM(MJ1, MJ3) is computed as 0.0.

Page 27

RESULTS

Page 28

QUESTIONS?