
Web People Search via Connection Analysis

Authors : Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, and Rabia Nuray-Turan
From : IEEE Trans. on Knowledge and Data Engineering 2008
Presenter : 陳仲詠
Citation : 21 (Google Scholar)


Outline

• 1. Introduction
• 2. Overview of the approach
• 3. Generating a graph representation
• 4. Disambiguation algorithm
• 5. Interpreting clustering results
• 6. Related work
• 7. Experimental Results
• 8. Conclusions and Future work


Introduction (1/7)

• Searching for web pages related to a person accounts for more than 5 percent of the current Web searches [24].

• A search for a person such as, say, “Andrew McCallum” will return pages relevant to any person with the name Andrew McCallum.

[24] R. Guha and A. Garg, Disambiguating People in Search. Stanford Univ., 2004.


Introduction (2/7)

• Assume (for now) that for each such web page, the search-engine could determine which real entity (i.e., which Andrew McCallum) the page refers to.

• The goal is to provide a clustered person search capability, in which the returned results are clustered so that each cluster is associated with one real person.


Introduction (3/7)

• The user can home in on the cluster of interest to her and get all pages in that cluster, for example, only the pages associated with that particular Andrew McCallum.


Introduction (4/7)

• In reality, it is not obvious that clustered person search is indeed a better option than searching for people using keyword-based search.

• If each cluster identified by the search engine corresponded to a single person, then the cluster-based approach would be a good choice.


Introduction (5/7)

• The key issue is the quality of the clustering algorithm in disambiguating the web pages of different namesakes.


Introduction (6/7)

• 1. Develop a novel algorithm for disambiguating among people that have the same name.

• 2. Design a cluster-based people search approach based on the disambiguation algorithm.


Introduction (7/7)

• The main contributions of this paper are the following :

• A new approach for Web People Search that shows high-quality clustering.

• A thorough empirical evaluation of the proposed solution (Section 7), and

• A new study of the impact on search of the proposed approach (Section 7.3).


Overview of the approach (1/4)

• The processing of a user query consists of the following steps:

• 1. User input : A user submits a query.
• 2. Web page retrieval : Retrieves a fixed number (top K) of relevant web pages.


Overview of the approach (2/4)

• 3. Preprocessing :
– TF/IDF : noun phrase identification.
– Extraction : named entities (NEs) and Web-related information.
• 4. Graph creation : The entity-relationship (ER) graph is generated based on the data extracted.


Overview of the approach (3/4)

• 5. Clustering : The result is a set of clusters of these pages, with the aim that each cluster groups the web pages associated with one real person.


Overview of the approach (4/4)

• 6. Cluster processing :
– Sketches : A set of keywords that represent the web pages within a cluster.
– Cluster ranking.
– Web page ranking.

• 7. Visualization of results


Generating a graph representation (1/6)

• Extracted from the web pages :
– 1) the entities
– 2) relationships
– 3) hyperlinks
– 4) e-mail addresses


Generating a graph representation (2/6)

• For example, a person “John Smith” might be extracted from two different web pages (Doc1 and Doc2), regardless of whether the two pages refer to the same person or to two different people.


Generating a graph representation (3/6)



Generating a graph representation (4/6)

• The relationship edges are typed.
• Any hyperlinks and e-mail addresses extracted from the web page are handled in an analogous fashion.


Generating a graph representation (5/6)

• A hyperlink has the form :

• For example, for the URL www.cs.umass.edu/~mccallum/ we have d3 = cs, d2 = umass, d1 = edu, and p1 = ~mccallum.
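The decomposition in the example can be sketched in a few lines of Python. This is only an illustration: the function name `decompose` is hypothetical, and dropping a leading "www" label is an assumption made so that the output matches the d3 = cs numbering in the example above.

```python
from urllib.parse import urlsplit

def decompose(url):
    """Split a URL into numbered domain labels (d1 = top level) and path parts."""
    # urlsplit only recognizes the host if the string contains "//".
    parts = urlsplit(url if "//" in url else "//" + url)
    labels = parts.hostname.split(".")
    if labels[0] == "www":          # assumption: "www" is not counted as a level
        labels = labels[1:]
    domains = {f"d{i}": lab for i, lab in enumerate(reversed(labels), start=1)}
    segments = [seg for seg in parts.path.split("/") if seg]
    paths = {f"p{i}": seg for i, seg in enumerate(segments, start=1)}
    return domains, paths

domains, paths = decompose("www.cs.umass.edu/~mccallum/")
print(domains, paths)
```

Two pages whose hyperlinks share more of these components would then be connected through more shared nodes in the graph.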


Generating a graph representation (6/6)


Disambiguation algorithm

• 1. Input the entity-relationship graph.
• 2. Use a Correlation Clustering (CC) algorithm to cluster the pages.
• 3. The outcome is a set of clusters, with each cluster corresponding to a person.

Disambiguation algorithm: Correlation Clustering (1/3)

• CC has been applied in the past to group documents of the same topic and to other problems.

• It assumes that there is a similarity function s(u, v) learned on the past data.

• Each (u, v) edge is assigned a “+” (similar) or “-” (different) label, according to the similarity function s(u, v).

Disambiguation algorithm: Correlation Clustering (2/3)

• The goal is to find the partition of the graph into clusters that agrees the most with the assigned labels.

• The CC does not take k (the number of the resulting clusters) as its input parameter.

Disambiguation algorithm: Correlation Clustering (3/3)

• The goal of CC is formulated formally as either :
– maximizing the agreement, or
– minimizing the disagreement.

• The problem of CC is known to be NP-hard.
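To make "agreement" concrete, here is a minimal sketch that scores one candidate partition against the edge labels. The data layout (a dictionary of labeled edges and a page-to-cluster map) and the function name are illustrative assumptions, not taken from the paper; the sketch only evaluates a given partition and does not search for the best one, which is the NP-hard part.

```python
def agreements(labels, cluster_of):
    """Count edges whose '+'/'-' label agrees with the partition.

    labels:     {(u, v): '+' or '-'}
    cluster_of: {node: cluster id}
    """
    agree = 0
    for (u, v), sign in labels.items():
        same = cluster_of[u] == cluster_of[v]
        # '+' agrees when u and v are together, '-' when they are apart.
        if (sign == "+" and same) or (sign == "-" and not same):
            agree += 1
    return agree

labels = {("a", "b"): "+", ("a", "c"): "-", ("b", "c"): "-"}
good = {"a": 0, "b": 0, "c": 1}
bad = {"a": 0, "b": 0, "c": 0}
print(agreements(labels, good), agreements(labels, bad))
```

The number of disagreements is simply the number of labeled edges minus this count, so maximizing one is the same as minimizing the other for an exact solution.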

Disambiguation algorithm: Connection Strength (1/6)

• Use the notion of the Connection Strength c(u, v) between two objects u and v to define the similarity function s(u, v).

• The disambiguation algorithm is based on analyzing :
– object features, and
– the ER graph for the data set.

Disambiguation algorithm: Connection Strength (2/6)

• A path between u and v semantically captures interactions between them via intermediate entities.

• If the combined attraction of all these paths is sufficiently large, the objects are likely to be the same.

Disambiguation algorithm: Connection Strength (3/6)

• Analyzing paths :
• The assumption is that each path between two objects carries in itself a certain degree of attraction.

Disambiguation algorithm: Connection Strength (4/6)

• The attraction between two nodes u and v via paths is measured using the connection strength measure c(u, v).

• It is defined as the sum of the attractions contributed by each path: c(u, v) = \sum_{p \in P_{uv}} w_p

Disambiguation algorithm: Connection Strength (5/6)

• P_uv denotes the set of all L-short simple paths between u and v.
– A path is L-short if its length does not exceed L, and it is simple if it does not contain duplicate nodes.

• w_p denotes the weight contributed by path p.
– The weight that path p contributes is derived from the type of that path.

Disambiguation algorithm: Connection Strength (6/6)

• Let P_uv consist of c1 paths of type 1, c2 paths of type 2, ..., cn paths of type n. The connection strength can then be written as c(u, v) = c1 w1 + c2 w2 + ... + cn wn.
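The definition above can be sketched as a depth-first enumeration of L-short simple paths. The adjacency-list layout, the function name, and the choice to identify a path's type with the sequence of its edge types are assumptions made for illustration; the slides do not say how path types are encoded.

```python
def connection_strength(graph, u, v, L, type_weight):
    """Sum the type weights of all simple paths from u to v with at most L edges.

    graph:       {node: [(neighbor, edge_type), ...]}
    type_weight: {tuple of edge types: weight}  (hypothetical encoding)
    """
    total = 0.0

    def dfs(node, visited, edge_types):
        nonlocal total
        if node == v:
            # A complete path: add the weight of its type (0 if unknown).
            total += type_weight.get(tuple(edge_types), 0.0)
            return
        if len(edge_types) == L:    # path would exceed L edges
            return
        for nbr, etype in graph.get(node, []):
            if nbr not in visited:  # keep the path simple
                dfs(nbr, visited | {nbr}, edge_types + [etype])

    dfs(u, {u}, [])
    return total

graph = {
    "doc1": [("e1", "person"), ("e2", "org")],
    "doc2": [("e1", "person"), ("e2", "org")],
    "e1": [("doc1", "person"), ("doc2", "person")],
    "e2": [("doc1", "org"), ("doc2", "org")],
}
weights = {("person", "person"): 0.5, ("org", "org"): 0.25}
print(connection_strength(graph, "doc1", "doc2", 2, weights))
```

Here the two pages are linked by one path of each type, so the result is the sum of the two type weights.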

Disambiguation algorithm: Similarity Function (1/4)

• The goal is to design a powerful similarity function s(u, v) that would minimize mislabeling of the data.

• Design a flexible function s(u, v) that can automatically tune itself to the particular domain being processed.

Disambiguation algorithm: Similarity Function (2/4)

• The similarity function s(u, v) labels data by comparing the s(u, v) value against the threshold γ.

• Using the δ-band (“clear margin”) approach to label the edge (u, v) avoids committing to a “+” or “-” decision when there is not enough evidence for it.
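A minimal sketch of δ-band labeling is given below. The exact rule is not visible in the transcript, so the boundaries used here ("+" at or above γ + δ, "-" at or below γ − δ, no label in between) are an assumption, as is the function name.

```python
def label_edge(s, gamma, delta):
    """Label an edge from its similarity value s using a band around gamma."""
    if s >= gamma + delta:
        return "+"      # clearly similar
    if s <= gamma - delta:
        return "-"      # clearly different
    return "0"          # inside the band: no commitment

for value in (0.9, 0.55, 0.2):
    print(value, label_edge(value, gamma=0.5, delta=0.1))
```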

Disambiguation algorithm: Similarity Function (3/4)

• Employs the standard TF/IDF scheme to compute its feature-based similarity f(u, v).
– Noun phrases
– Larger terms

• The entire document corpus consists of K documents and N distinct terms T = {t1, t2, ..., tN}.

Disambiguation algorithm: Similarity Function (4/4)

• Each document u is represented as a vector of term weights : u = (w_u1, w_u2, ..., w_uN)

• w_ui is the weight of term t_i in document u.
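As a rough illustration of a TF/IDF feature similarity, the sketch below builds one weight vector per document and compares two of them with the cosine measure. The specific weighting (raw term frequency times log(K / document frequency)) and the use of cosine are common textbook choices assumed here; the slides do not show which variant the paper uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    K = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * math.log(K / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    ["machine", "learning", "umass"],
    ["machine", "learning", "crf"],
    ["customer", "support"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two toy documents share terms and get a positive similarity, while the third shares none and scores zero.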

Disambiguation algorithm: Training the Similarity Function (1/2)

• For each (u, v) edge, require that :

• In practice, s(u, v) is unlikely to be perfect, so the inequalities in (5) will be violated for some of the (u, v) edges.

• This can be resolved by adding a slack variable to each inequality in (5).

Disambiguation algorithm: Training the Similarity Function (2/2)

• The task then becomes solving the linear programming problem (6) to determine the optimal values for the path type weights w1, w2, ..., wn and the threshold γ.

Disambiguation algorithm: Choosing Negative Weight (1/7)

• A CC algorithm will assign an entity u to a cluster if the number of positive edges between u and the other entities in the cluster outnumbers that of the negative edges.

• That is, the positive edges must make up more than half (i.e., 50 percent) of those edges.

Disambiguation algorithm: Choosing Negative Weight (2/7)

• Suppose instead that having only 25 percent positive edges should be sufficient to keep an entity in a cluster.

• Using the w+=+1 weight for all positive edges and w-=-1/3 weight for all negative edges will achieve the desired effect.
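The 25 percent figure can be checked with a short calculation. If a fraction p of an entity's edges into a cluster is positive, the total evidence is p·w+ + (1 − p)·w-, which is zero at p = −w- / (w+ − w-). The helper below (a hypothetical name) evaluates that break-even fraction with exact rational arithmetic.

```python
from fractions import Fraction

def breakeven_positive_fraction(w_pos, w_neg):
    """Fraction p of positive edges at which p*w_pos + (1 - p)*w_neg == 0."""
    return -w_neg / (w_pos - w_neg)

# Equal weights: positive edges must reach one half.
print(breakeven_positive_fraction(Fraction(1), Fraction(-1)))
# w+ = +1, w- = -1/3: one quarter of positive edges is enough.
print(breakeven_positive_fraction(Fraction(1), Fraction(-1, 3)))
```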

Disambiguation algorithm: Choosing Negative Weight (3/7)

• One solution for choosing a good value for the weight of negative edges w- is to learn it on past data.

• Let n be the number of namesakes in the top k web pages.
– If n = 1, it is best to choose w- = 0.
– All the pairs connected via positive edges will then be merged.

Disambiguation algorithm: Choosing Negative Weight (4/7)

– If n = k, it is best to choose w- = -1.
– This produces the maximum negative evidence against merging pairs.

• In general, w- is a function of n : w- = w-(n)

Disambiguation algorithm: Choosing Negative Weight (5/7)

• This observation raises two issues :
– 1) n is not known to the algorithm beforehand.
– 2) how to choose the w-(n) function.

Disambiguation algorithm: Choosing Negative Weight (6/7)

• Since n is not known, its estimated value ^n is computed by running the disambiguation algorithm with a fixed value of w-.

• The algorithm outputs a certain number of clusters ^n, which can be employed as an estimate of n.

Disambiguation algorithm: Choosing Negative Weight (7/7)

• The value of w-(^n) :
– when ^n < threshold, w-(^n) = 0.
– when ^n > threshold, w-(^n) = -1.

• This threshold is learned from the data.
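The two-pass procedure described on these slides can be sketched as follows. Everything named here is hypothetical: `cluster_fn` stands in for the disambiguation algorithm, and treating the unspecified case ^n = threshold as -1 is an assumption.

```python
def negative_weight(n_hat, threshold):
    """Step function w-(^n): 0 below the threshold, -1 otherwise (assumed)."""
    return 0.0 if n_hat < threshold else -1.0

def two_pass(cluster_fn, pages, initial_w_neg, threshold):
    """Estimate the number of namesakes, then recluster with w-(^n).

    cluster_fn(pages, w_neg) -> list of clusters (a stand-in for the real
    disambiguation algorithm, which is not reproduced here).
    """
    first_pass = cluster_fn(pages, initial_w_neg)
    n_hat = len(first_pass)              # number of clusters as estimate of n
    return cluster_fn(pages, negative_weight(n_hat, threshold))
```

With a real clustering routine plugged in, the first call only serves to produce ^n; the clusters shown to the user come from the second call.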


A brief Summary


Interpreting Clustering Results (1/4)

• This part describes how these clusters are used to build people search.

• The goal is to provide the user with a set of clusters based on their association with a real person.
– 1. Rank the clusters.
– 2. Provide a summary description with each cluster.


Interpreting Clustering Results (2/4)

• Cluster rank :
– For each cluster, select its highest-ranked page.

• Cluster sketch :
– The set of terms above a certain threshold is selected and used as a summary for the cluster.


Interpreting Clustering Results (3/4)

• Web page rank :
– These pages are displayed according to their original search engine order.


Interpreting Clustering Results (4/4)

• Affinity to cluster :
– Defined as the sum of the similarity values between the page p and each page v in the cluster C : \sum_{v \in C} s(p, v)

• The remaining pages are also displayed, so the user has the option to get to these web pages too.
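The affinity sum is straightforward to compute once a similarity function is available. In this sketch the similarity values are a made-up lookup table and the function name is hypothetical.

```python
def affinity(p, cluster, s):
    """Sum of the similarity s(p, v) over every page v in the cluster."""
    return sum(s(p, v) for v in cluster)

# Toy similarity values, for illustration only.
table = {("p", "a"): 0.5, ("p", "b"): 0.25, ("q", "a"): 0.0, ("q", "b"): 0.125}
s = lambda x, y: table[(x, y)]
cluster = ["a", "b"]

# Order candidate pages by decreasing affinity to the cluster.
ranked = sorted(["q", "p"], key=lambda page: affinity(page, cluster, s), reverse=True)
print(ranked)
```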

Experimental Results: Experimental Setup (1/8)

• The three data sets :
– 1. WWW 2005 data set [8] : 12 different person names.
– 2. WEPS data set : from the SemEval workshop [3], consisting of :
• Trial : 9 person names.
• Training : 49 person names.
• Test : 30 person names.

[3] J. Artiles, J. Gonzalo, and S. Sekine, “The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task,” Proc. Int’l Workshop Semantic Evaluations (SemEval ’07), June 2007.
[8] R. Bekkerman and A. McCallum, “Disambiguating Web Appearances of People in a Social Network,” Proc. Int’l World Wide Web Conf. (WWW), 2005.

Experimental Results: Experimental Setup (2/8)

– 3. Context data set :
• Generated by issuing nine queries to Google, each in the form of a person name along with context keywords.

• The top 100 returned web pages of the Web search were gathered for each person.

Experimental Results: Experimental Setup (3/8)

• To get the “ground truth” for these data sets, the pages for each person name have then been assigned to distinct real persons by manual examination.

Experimental Results: Experimental Setup (4/8)

• Used the GATE [19] system for the extraction of NEs from the web pages in the data set.

• To train the free parameters of the algorithm, leave-one-out cross validation is applied on the WWW 2005, WEPS Trial, and Context data sets.

[19] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications,” Proc. Ann. Meeting of the Assoc. Computational Linguistics (ACL), 2002.

Experimental Results: Experimental Setup (5/8)

• Before the “ground truth” for the WEPS Test portion was released, the approach was tested on the WEPS Training set by twofold cross validation.

Experimental Results: Experimental Setup (6/8)

• After the “ground truth” of the WEPS Test portion became available, the algorithm was trained on the whole WEPS Training portion and tested on the WEPS Test portion.

Experimental Results: Experimental Setup (7/8)

• Quality evaluation measures :
– the B-cubed [6] and FP measures.

• Baseline methods :
– the Agglomerative Vector Space clustering algorithm with TF/IDF as the Baseline method.
– The threshold parameter for this method is trained the same way.
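For readers unfamiliar with B-cubed, the sketch below computes its per-item precision and recall, averaged over all pages. It follows the standard definition of the measure; how the paper combines the two values into the single B-cubed score it reports is not stated in the slides, so that step is left out. The function and variable names are illustrative.

```python
def b_cubed(predicted, truth):
    """Average per-item B-cubed precision and recall.

    predicted: {page: cluster id produced by the algorithm}
    truth:     {page: id of the real person}
    """
    items = list(truth)
    precision = recall = 0.0
    for i in items:
        same_cluster = [j for j in items if predicted[j] == predicted[i]]
        same_person = [j for j in items if truth[j] == truth[i]]
        # Pages that share both the cluster and the person with page i.
        correct = sum(1 for j in same_cluster if truth[j] == truth[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_person)
    n = len(items)
    return precision / n, recall / n

truth = {"a": "X", "b": "X", "c": "Y"}
merged = {"a": 0, "b": 0, "c": 0}      # everything in one cluster
print(b_cubed(merged, truth))
```

Merging everything into one cluster keeps recall perfect but lowers precision, which is the trade-off the measure is meant to expose.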

Experimental Results: Experimental Setup (8/8)

• Statistical significance test :
– 1-tailed paired t-test, with α = 0.05.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality : overall) (1/7)

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality : overall) (2/7)

* s(u, v) = c(u, v) represents the approach where only the connection strength is employed for disambiguation.
* It relies only on the extracted NEs and hyperlink information, and it does not use TF/IDF.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality : overall) (3/7)

* Adds the analysis of the features of web pages f(u, v), in the form of their TF/IDF similarity.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality : overall) (4/7)

The algorithm picks w- according to the function w-(^n) of the predicted number of namesakes.

It gains a 7.8 percent improvement in terms of B-cubed over the baseline (WWW 2005).

It gets a 6.1 percent improvement (WEPS Training) and a 10.7 percent improvement (WEPS Test).

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality : overall) (5/7)

The results are also compared with the top runners in the WEPS challenge [3]. The first runner in the challenge reports 0.78 for the FP measure and 0.70 for the B-cubed measure.

[3] J. Artiles, J. Gonzalo, and S. Sekine, “The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task,” Proc. Int’l Workshop Semantic Evaluations (SemEval ’07), June 2007.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality per namesake) (6/7)

The “#” field shows the number of namesakes for a particular name in the corresponding 100 web pages.

[4] J. Artiles, J. Gonzalo, and F. Verdejo, “A Testbed for People Searching Strategies in the WWW,” Proc. SIGIR, 2005. (C : 39)

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality per namesake) (7/7)

The table shows that the proposed approach outperforms that in [4] by 9.5 percent in terms of the FP measure.

[4] J. Artiles, J. Gonzalo, and F. Verdejo, “A Testbed for People Searching Strategies in the WWW,” Proc. SIGIR, 2005. (C : 39)

Testing Disambiguation Quality: Experiment 2 (Disambiguation quality : group identification)

This experiment uses the 1,085 web pages of the WWW 2005 data set.
The task is to find the web pages related to the N intended people.

Testing Disambiguation Quality: Experiment 2 (Disambiguation quality : group identification)

The field “#W” in Table 3 is the number of the to-be-found web pages related to the namesake of interest.

The field “#C” is the number of web pages found correctly.

The field “#I” is the number of pages found incorrectly in the resulting groups.

Testing Disambiguation Quality: Experiment 3 (Disambiguation quality: queries with context)

A data set was generated by querying Google with a person name and context keyword(s) related to that person.

Nine different queries were used.

Testing Disambiguation Quality: Experiment 4 (Quality of generating cluster sketches)

The set of terms above a certain threshold (or top N terms) is selected and used as a summary for the cluster.

If the search is for UMass professor Andrew McCallum, his cluster can easily be identified with terms like “machine learning” and “artificial intelligence.”


Impact on Search

In the case of a traditional search interface, at each observation i, where i = 1, 2, ..., K, the user looks at the sketch provided for the i-th returned web page.


Impact on Search

For the new interface, supported by a cluster-based people search, the user first looks at the “people search” interface.

1. The user sequentially reads cluster sketches/descriptions until, on the m-th observation, the user finds the cluster of interest.
2. Clicks on that cluster.
3. Shows the original set of K web pages returned by the search engine.


Impact on Search

Measures :
Compare the quality of the new and standard interface using Precision, Recall, and F-measure.
In general, the fewer observations are needed in a given interface, the faster the user can find the related pages.

Experiment 5 (Impact on search): Case 1 : First-dominant cluster

Observations needed                             Standard   New interface
To discover 50 percent of the relevant pages    44         33
To discover 90 percent of the relevant pages    92         55

Experiment 5 (Impact on search): Case 2 : Regular cluster

Observations needed                             Standard   New interface
To discover 50 percent of the relevant pages    51         16
To discover 90 percent of the relevant pages    79         17

Andrew McCallum the Customer Support person. His cluster consists of three pages.

Experiment 5 (Impact on search): Case 3 : Average

The average of the Recall, Precision, and F-measure values for search impact on the WWW 2005 data set.

Some of the person names have many namesakes.

The results show that, even with imperfect clustering, the curves for the new interface largely dominate those for the standard interface.

Experiment 5 (Impact on search): Impact on search with context

In that case, one can expect to see no difference between the new and the standard interface.

The query is “Andrew McCallum” music.
The number of namesakes for that query is surprisingly large: 23.
Andrew McCallum the UMass professor, who is interested in music.

Experiment 5 (Impact on search): Impact on search with context

Andrew McCallum the DJ/musician.
In both cases, the new interface performs better than the standard one.

Observations needed                                    Standard   New interface
To discover 90 percent of the professor's pages        90         60
To discover 90 percent of the DJ's pages               90         20


Experiment 6 (efficiency)

Downloading and preprocessing take 3.82 seconds per web page.

The clustering algorithm itself executes in 4.7 seconds on average per queried name.
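As a rough back-of-the-envelope sketch, the two figures above can be combined into an end-to-end estimate per query. The estimate assumes pages are downloaded and preprocessed one after another, with no parallelism or caching; that assumption, the function name, and the page count of 100 used in the example are illustrative and not from the paper.

```python
SECONDS_PER_PAGE = 3.82    # download + preprocessing per page (from the slide)
CLUSTERING_SECONDS = 4.7   # average clustering time per queried name (from the slide)


def estimated_seconds(num_pages):
    """Estimated time per query, assuming strictly sequential page processing."""
    return SECONDS_PER_PAGE * num_pages + CLUSTERING_SECONDS


# Hypothetical example: a query for which 100 pages are processed.
estimate = estimated_seconds(100)
```

Under these assumptions the per-page preprocessing cost dominates the clustering time.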


CONCLUSIONS AND FUTURE WORK

• The paper attempts to answer the question of what maximum quality the approach can achieve if it uses only the information stored in the top-k web pages being processed.

• Future work:
1. Employ external data sources for disambiguation.
2. Use more advanced extraction capabilities.
3. Work on algorithms for generic entity search, where entities are not limited to people.


Related Work

• Disambiguation and entity resolution techniques are key to any Web people search application.


The differences between the disambiguation methodology in this paper and the most closely related existing work are multilevel (see Table 1).


Level 1: Problem type. Two common types of the disambiguation challenge are (fuzzy) Lookup [27], [28] and (fuzzy) Grouping [10], [13].
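The contrast between the two problem types can be sketched in a few lines. In Lookup a list of candidate entities is given and each reference is matched against it; in Grouping no such list exists and references are clustered by whether they appear to co-refer. The token-overlap similarity, the thresholds, and the single-pass greedy grouping below are assumptions made for the illustration; none of this is the method of the cited papers.

```python
def lookup(references, known_entities, similarity, threshold=0.5):
    """(Fuzzy) Lookup: the entity list is given; match each reference to one."""
    result = {}
    for ref in references:
        best = max(known_entities, key=lambda e: similarity(ref, e))
        # Leave the reference unmatched if even the best candidate is too weak.
        result[ref] = best if similarity(ref, best) >= threshold else None
    return result


def grouping(references, similarity, threshold=0.5):
    """(Fuzzy) Grouping: no entity list; cluster references that co-refer.

    Greedy single pass (an assumption): a reference joins the first group
    containing a sufficiently similar member, else starts a new group.
    """
    groups = []
    for ref in references:
        for group in groups:
            if any(similarity(ref, other) >= threshold for other in group):
                group.append(ref)
                break
        else:
            groups.append([ref])
    return groups


def token_overlap(a, b):
    """Toy Jaccard similarity on lowercase tokens (hypothetical stand-in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


# Made-up references and entities for the demo.
refs = ["A. McCallum UMass", "Andrew McCallum UMass", "McCallum DJ"]
entities = ["Andrew McCallum UMass professor", "Andrew McCallum DJ"]
```

Calling `lookup(refs[:1], entities, token_overlap, 0.3)` maps the first reference onto the given entity list, while `grouping(refs, token_overlap, 0.4)` has to discover the groups on its own.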


Level 2: Data with respect to GLuv.

* The methodology in this paper is based on analyzing GLuv.

* The majority of the existing techniques do not analyze GLuv.


Name co-occurrence. [12] analyzes only co-occurrences of names of authors via publications for a publication data set.

[12] I. Bhattacharya and L. Getoor, “Collective Entity Resolution in Relational Data,” Bull. IEEE CS Technical Committee Data Eng., vol. 29, no. 2, pp. 4-12, June 2006.


Name co-occurrence. When analyzing authors A1 and A5, the approach in [10], [11], and [13] would only be interested in author A3, which is a co-occurring author in publications P1 and P2, which are connected to A1 and A5, respectively.

[10] I. Bhattacharya and L. Getoor, “Iterative Record Linkage for Cleaning and Integration,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD), 2004.
[11] I. Bhattacharya and L. Getoor, “Relational Clustering for Multi-Type Entity Resolution,” Proc. Multi-Relational Data Mining Workshop (MRDM), 2005.
[13] I. Bhattacharya and L. Getoor, “A Latent Dirichlet Model for Unsupervised Entity Resolution,” Proc. SIAM Data Mining Conf. (SDM), 2006.


Name co-occurrence. [12] would be interested only in the subgraph shown in Fig. 5. The methodology in this paper instead analyzes the whole GLuv.



Restrictions on types. [12] understands only one type of relationship. The approach proposed here can analyze all of the types of relationships and entities.


* [26], [31], and [33] often still analyze just portions of GLuv.

* The adaptive approach in [33] analyzes G2uv (see Fig. 7).

[26] R. Holzer, B. Malin, and L. Sweeney, “Email Alias Detection Using Social Network Analysis,” Proc. ACM SIGKDD, 2005.
[31] B. Malin, “Unsupervised Name Disambiguation via Social Network Similarity,” Proc. Workshop Link Analysis, Counterterrorism, and Security, 2005.
[33] E. Minkov, W. Cohen, and A. Ng, “Contextual Search and Name Disambiguation in Email Using Graphs,” Proc. SIGIR, 2006.


* [31] simply looks at people and connects them via “are-related” relationships.



* Level 3: Analysis of GLuv.

* The methodology in this paper is based on analyzing paths in Puv and building mathematical models for c(u, v).

* The existing work (e.g., [27], [28]) analyzes the direct neighbors, and [26] analyzes the shortest u-v path.

[26] R. Holzer, B. Malin, and L. Sweeney, “Email Alias Detection Using Social Network Analysis,” Proc. ACM SIGKDD, 2005.
[27] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, June 2006.
[28] D.V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting Relationships for Domain-Independent Data Cleaning,” Proc. SIAM Int’l Conf. Data Mining (SDM ’05), Apr. 2005.
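The three views of the graph mentioned at this level (direct neighbors, the shortest u-v path, and the set of u-v paths) can be contrasted on a toy graph. The sketch below reads Puv as the set of simple u-v paths of bounded length, which is an assumption about the slide's notation. The nodes A1, P1, A3, P2, and A5 echo the co-authorship example earlier in the deck; the extra "Org" node, all function names, and the length bounds are hypothetical. The code only enumerates paths and does not implement the paper's model for c(u, v).

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def common_neighbors(graph, u, v):
    """Direct-neighbor view: nodes adjacent to both u and v."""
    return graph[u] & graph[v]


def shortest_path_length(graph, u, v):
    """Shortest-path view: number of edges on a shortest u-v path (BFS)."""
    seen, queue = {u}, deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == v:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # u and v are not connected


def simple_paths(graph, u, v, max_len):
    """Path-set view: all simple u-v paths with at most max_len edges (DFS)."""
    paths = []

    def extend(path):
        node = path[-1]
        if node == v:
            paths.append(list(path))
            return
        if len(path) - 1 == max_len:
            return
        for nxt in sorted(graph[node]):
            if nxt not in path:  # keep the path simple (no repeated nodes)
                path.append(nxt)
                extend(path)
                path.pop()

    extend([u])
    return paths


# Toy graph: the A1-P1-A3-P2-A5 chain plus a hypothetical shared "Org" node.
toy_graph = build_graph([
    ("A1", "P1"), ("P1", "A3"), ("A3", "P2"), ("P2", "A5"),
    ("A1", "Org"), ("Org", "A5"),
])
```

On this graph the neighbor and shortest-path views see only the connection through "Org", whereas `simple_paths(toy_graph, "A1", "A5", 4)` also surfaces the longer chain through the shared co-author A3.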


* Level 4: Way to use c(u, v).

* [10] and [11] employ agglomerative clustering.

* In [27] and [28], the disambiguation problem is converted into an optimization problem, which is then solved iteratively.
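For readers unfamiliar with the term, agglomerative clustering driven by a pairwise score such as c(u, v) can be sketched as below. Average linkage between clusters and the threshold-based stopping rule are assumptions chosen for the illustration; they are not details taken from [10] or [11], and the toy scores are made up.

```python
def agglomerative(items, c, threshold):
    """Repeatedly merge the two most strongly connected clusters.

    c(u, v) is a symmetric connection strength. Cluster-to-cluster strength
    is the average over cross pairs (average linkage, an assumption).
    Merging stops when the best available strength falls below threshold.
    """
    clusters = [[item] for item in items]

    def strength(a, b):
        return sum(c(u, v) for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (i, j) = max(
            (strength(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up pairwise strengths between four hypothetical pages.
toy_scores = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p2", "p3")): 0.7,
    frozenset(("p1", "p3")): 0.5,
    frozenset(("p3", "p4")): 0.1,
}


def toy_c(u, v):
    """Look up the toy strength; unlisted pairs count as unconnected."""
    return toy_scores.get(frozenset((u, v)), 0.0)
```

With these toy scores and a threshold of 0.5, `agglomerative(["p1", "p2", "p3", "p4"], toy_c, 0.5)` merges the first three pages and leaves the weakly connected fourth page on its own.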



* Level 5: Domain independence.

* Some of the existing techniques are applicable to only certain types of domains or just one domain.


Related Work: WSD (1/3)

• Word Sense Disambiguation:
– determine the exact sense of an ambiguous word given a list of word senses.
• Word Sense Discrimination:
– determine which instances of the ambiguous word can be clustered as sharing the same meaning.


Related Work: WSD (2/3)

• External knowledge sources:
– Using lexical knowledge associated with a dictionary and WordNet.
• Approach:
– supervised
– unsupervised


Related Work: WSD (3/3)

• If we view the ambiguous word as a reference and the word sense as an entity, the two instances of the WSD problem are similar to the Lookup and Grouping instances of Entity Resolution/WePS.


Related Work: WePS (1/4)

• WePS can be implemented in two different settings.
– Server-side setting: the disambiguation mechanism is integrated into the search engine directly.
– Middleware approach: build people search capabilities on top of an existing search engine such as Google by “wrapping” the original engine.
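The middleware setting can be pictured as a thin wrapper around an existing engine. In the sketch below, `search_engine` and `cluster_pages` are hypothetical callables supplied by the caller, and the default `top_k` of 100 is an arbitrary choice for the example; no real search API is being described.

```python
def people_search(query, search_engine, cluster_pages, top_k=100):
    """Middleware sketch: wrap an existing engine and cluster its results.

    search_engine(query) -> ranked list of pages (hypothetical callable)
    cluster_pages(pages) -> list of clusters, one per person (hypothetical)
    """
    results = search_engine(query)[:top_k]
    clusters = cluster_pages(results)
    # Present the largest clusters first.
    return sorted(clusters, key=len, reverse=True)


def fake_engine(query):
    """Stand-in for a real engine; returns canned, pre-labeled pages."""
    return [
        {"url": "u1", "person": "professor"},
        {"url": "u2", "person": "dj"},
        {"url": "u3", "person": "professor"},
    ]


def cluster_by_label(pages):
    """Stand-in for the disambiguation step; groups by the canned label."""
    groups = {}
    for page in pages:
        groups.setdefault(page["person"], []).append(page["url"])
    return list(groups.values())
```

In a server-side setting the same clustering step would run inside the engine instead of being called from an outer wrapper like `people_search("Andrew McCallum", fake_engine, cluster_by_label)`.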


Related Work: WePS (2/4)

• Clusty (http://www.clusty.com)
• Grokker (http://www.grokker.com)
• Kartoo (http://www.kartoo.com)


Related Work: WePS (3/4)

• ZoomInfo (http://www.zoominfo.com)


Related Work: WePS (4/4)

• However, this system has a high cost and low scalability, because the person information in the system is collected primarily manually.

• The approach in this paper does not rely on any such precompiled knowledge and thus will scale to person search for any person on the Web.