08/22/2004 MRDM 2004 Workshop 1
Link Mining
Lise Getoor
University of Maryland, College Park
joint work with Indrajit Bhattacharya, Qing Lu and Prithviraj Sen
Roadmap
• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • Entity detection
    • Group detection
• Conclusion
Link Mining
• Traditional machine learning and data mining approaches assume:
  – A random sample of homogeneous objects from a single relation
• Real-world data sets:
  – Multi-relational, heterogeneous and semi-structured
• Link Mining
  – A newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming
Linked Data
• Heterogeneous, multi-relational data represented as a graph or network
  – Nodes are objects
    • May have different kinds of objects
    • Objects have attributes
    • Objects may have labels or classes
  – Edges are links
    • May have different kinds of links
    • Links may have attributes
    • Links may be directed, are not required to be binary
Sample Domains
• web data (web)
• bibliographic data (cite)
• epidemiological data (epi)
• communication data (comm)
• customer networks (cust)
• collaborative filtering problems (cf)
• trust networks (trust)
• biological data (bio)
Link Mining Tasks
• Link-based Object Classification
• Object Type Prediction
• Link Type Prediction
• Predicting Link Existence
• Link Cardinality Estimation
• Object Consolidation
• Group Detection
• Subgraph Discovery
• Metadata Mining
Link-based Object Classification
• Predicting the category of an object based on its attributes and its links and attributes of linked objects
• web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.
• cite: Predict the topic of a paper, based on word occurrence, citations, co-citations
• epi: Predict disease type based on characteristics of the patients infected by the disease
Object Class Prediction
• Predicting the type of an object based on its attributes and its links and attributes of linked objects
• comm: Predict whether a communication contact is by email, phone call or mail.
• cite: Predict the venue type of a publication (conference, journal, workshop)
Link Type Classification
• Predicting type or purpose of link based on properties of the participating objects
• web: predict advertising link or navigational link; predict an advisor-advisee relationship
• epi: predicting whether contact is familial, co-worker or acquaintance
Predicting Link Existence
• Predicting whether a link exists between two objects
• web: predict whether there will be a link between two pages
• cite: predicting whether a paper will cite another paper
• epi: predicting who a patient’s contacts are
Link Cardinality Estimation I
• Predicting the number of links to an object
• web: predict the authoritativeness of a page based on the number of in-links; identifying hubs based on the number of out-links
• cite: predicting the impact of a paper based on the number of citations
• epi: predicting the number of people that will be infected based on the infectiousness of a disease.
Link Cardinality Estimation II
• Predicting the number of objects reached along a path from an object
• Important for estimating the number of objects that will be returned by a query
• web: predicting number of pages retrieved by crawling a site
• cite: predicting the number of citations of a particular author in a specific journal
Object Consolidation
• Predicting when two objects are the same, based on their attributes and their links
• aka: record linkage, duplicate elimination, identity uncertainty
• web: predict when two sites are mirrors of each other
• cite: predicting when two citations are referring to the same paper
• epi: predicting when two disease strains are the same
• bio: learning when two names refer to the same protein
Group Detection
• Predicting when a set of entities belong to the same group based on clustering both object attribute values and link structure
• web: identifying communities
• cite: identifying research communities
Subgraph Identification
• Find characteristic subgraphs
• Focus of graph-based data mining (Cook & Holder, Inokuchi, Washio & Motoda, Kuramochi & Karypis, Yan & Han)
• bio: protein structure discovery
• comm: legitimate vs. illegitimate groups
• chem: chemical substructure discovery
Metadata Mining
• Schema mapping, schema discovery, schema reformulation
• cite – matching between two bibliographic sources
• web - discovering schema from unstructured or semi-structured data
• bio – mapping between two medical ontologies
Link Mining Challenges
• Logical vs. Statistical dependencies
• Feature construction
• Instances vs. Classes
• Collective Classification
• Collective Consolidation
• Effective Use of Labeled & Unlabeled Data
• Link Prediction
• Closed vs. Open World
Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming, to name a few)
Logical vs. Statistical Dependence
• Coherently handling two types of dependence structures:
  – Link structure: the logical relationships between objects
  – Probabilistic dependence: statistical relationships between attributes
• Challenge: statistical models that support rich logical relationships
• Model search is complicated by the fact that attributes can depend on arbitrarily linked attributes. Issue: how to search this huge space
Model Search
[Figure: graph over nodes P1, P2, P3, A1 and I1, with the label of node P marked unknown (?).]
Feature Construction
• In many cases, objects are linked to a set of objects. To construct a single feature from this set of objects, we may either use:
  – Aggregation
  – Selection
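The two options above can be sketched in a few lines. This is an illustrative sketch only: the function names `aggregate_mode` and `select_first`, the tie-breaking rule and the toy labels are assumptions, not part of the original slides.

```python
from collections import Counter

def aggregate_mode(labels):
    """Aggregation: collapse the labels of all linked objects into one value (the mode)."""
    if not labels:
        return None
    counts = Counter(labels)
    best = max(counts.values())
    # Assumed tie-break: the alphabetically smallest label among the most frequent ones.
    return min(label for label, n in counts.items() if n == best)

def select_first(linked_objects, predicate):
    """Selection: keep a single linked object, here the first one matching a predicate."""
    for obj in linked_objects:
        if predicate(obj):
            return obj
    return None

# Hypothetical labels of three linked papers.
feature = aggregate_mode(["theory", "ai", "ai"])
```

Aggregation uses every linked object but loses their identities; selection keeps one object and ignores the rest.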
Aggregation
[Figure: two examples (I1 with A1, I2 with A2) in which the unknown label (?) of a node P is derived by taking the mode over a set of linked nodes (P1 to P9).]
Selection
[Figure: example (I1, A1) in which the unknown label (?) of a node P is derived from a node selected among the linked nodes P1, P2 and P3.]
Individuals vs. Classes
• Does the model refer
  – explicitly to individuals, or
  – to classes or generic categories of individuals?
• On one hand, we’d like to be able to model that a connection to a particular individual may be highly predictive
• On the other hand, we’d like our models to generalize to new situations, with different individuals
Instance-based Dependencies
[Figure: graph with nodes A1, P3 and I1, captioned "Papers that cite P3 are likely to be" followed by a category marker.]
Class-based Dependencies
[Figure: graph with nodes A1, I1 and an unknown node (?), captioned "Papers that cite" a category marker "are likely to be" a category marker.]
Collective Classification
• Using a link-based statistical model for classification
• Inference using learned model is complicated by the fact that there is correlation between the object labels
Collective Consolidation
• Using a link-based statistical model for object consolidation
• Consolidation decisions should not be made independently
Labeled & Unlabeled Data
• In link-based domains, unlabeled data provide three sources of information:
  – Helps us infer object attribute distribution
  – Links between unlabeled data allow us to make use of attributes of linked objects
  – Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences
Link Prior Probability
• The prior probability of any particular link is typically extraordinarily low
• For medium-sized data sets, we have had success with building explicit models of link existence
• It may be more effective to model links at a higher level; this is required for large data sets!
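To see how low the prior can be, the sketch below divides the number of observed links by the number of possible ordered pairs. It uses the CoraI counts reported in this deck's Data Sets table (3181 papers, 6185 citations); the function name `link_prior` and the uniform-pair assumption are illustrative, not from the slides.

```python
def link_prior(num_objects, num_links, directed=True):
    """Fraction of possible object pairs that are actually linked."""
    pairs = num_objects * (num_objects - 1)  # ordered pairs, no self-links
    if not directed:
        pairs //= 2  # unordered pairs
    return num_links / pairs

# CoraI: 3181 papers, 6185 citations (treated as directed links).
prior = link_prior(3181, 6185)
print(f"{prior:.6f}")
```

Under these assumptions fewer than one ordered pair in a thousand is linked, which is why a model of link existence faces an extremely skewed class distribution.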
Closed World vs. Open World
• The majority of SRL approaches make a closed world assumption, which assumes that we know all the potential entities in the domain
• In many cases, this is unrealistic
• Work by Milch, Marthi, Russell on BLOG
Link Mining Summary
• Link Mining Tasks
  – Link-based Object Classification
  – Object Type Prediction
  – Link Type Prediction
  – Predicting Link Existence
  – Link Cardinality Estimation
  – Object Consolidation
  – Group Detection
  – Subgraph Discovery
  – Metadata Mining
• Link Mining Challenges
  – Logical vs. Statistical dependencies
  – Feature construction
  – Instances vs. Classes
  – Collective Classification
  – Collective Consolidation
  – Effective Use of Labeled & Unlabeled Data
  – Link Prediction
  – Closed vs. Open World
Roadmap
• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • work with Qing Lu and Prithviraj Sen
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • Entity detection
    • Group detection
• Conclusion
Object Classification
• Traditional Object Classification
  – Assume objects sampled from a single relation
  – Object Attributes (OA)
[Figure: objects X1 to X5.]
Object Classification with Linked Data
• Traditional Object Classification
  – Assume objects sampled from a single relation
  – Object Attributes (OA)
• Linked Data
  – Links among objects
  – Represented as a graph
[Figure: objects X1 to X5 connected by links.]
Link-based Object Classification
• Predicting the category of an object based on its attributes and its links and attributes of linked objects
• Citation domain: Predict the topic of a paper, based on word occurrence, citations and co-citations
Related Work: Link-based Classification
• Hypertext Classification using Links
  – Class labels of linked objects
    • Soumen Chakrabarti (1998)
    • Oh, et al. (1999)
  – Unique document ID, Popescul et al. (2002)
  – Regularities, Yang et al. (2002)
• Use of Unlabeled Data
  – Co-training, Blum and Mitchell (1998)
  – EM algorithm, Nigam et al. (2000)
  – Systematic investigation of EM and co-training, Ghani (2001)
  – TSVM, Joachims (1999)
Our Approach
• Link-based models
  – Integrate link features with object attributes using logistic regression
  – Investigate use of labeled and unlabeled data for link-based classification
Features
• Object Attributes
  – Notation: OA(X)
• Link Descriptions
  – Notation: LD(X)
  – Statistics computed from linked objects
  – Computed separately for each of:
    • In-Links(X)
    • Out-Links(X)
    • Co-In(X)
    • Co-Out(X)
  – Three types of Link Descriptions:
    • Mode, Binary, Count
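A minimal sketch of the three link descriptions for a single neighborhood (for example In-Links(X)). The function name `link_descriptions`, the category names and the tie-breaking rule for the mode are assumptions made for illustration.

```python
from collections import Counter

def link_descriptions(neighbor_categories, categories):
    """Summarize the categories of linked objects as mode, binary and count features."""
    counts = Counter(neighbor_categories)
    # Count: how many linked objects fall in each category.
    count_vec = tuple(counts.get(c, 0) for c in categories)
    # Binary: whether each category occurs at all among the linked objects.
    binary_vec = tuple(1 if n > 0 else 0 for n in count_vec)
    # Mode: the most frequent category (ties go to the earliest category listed).
    mode = None
    if neighbor_categories:
        mode = max(categories, key=lambda c: counts.get(c, 0))
    return {"mode": mode, "binary": binary_vec, "count": count_vec}

# Hypothetical neighborhood of five linked objects over three categories.
desc = link_descriptions(["A", "A", "A", "B", "C"], ("A", "B", "C"))
```

The count vector carries the most information, the binary vector drops frequencies, and the mode keeps only a single category.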
Link Descriptions
[Figure: object X and its In-Links(X), Out-Links(X), CI(X) and CO(X) neighborhoods, drawn over three categories. Each neighborhood is summarized by a mode, a binary vector and a count vector; the vectors shown are binary (1,1,1), (1,1,0), (1,1,0), (1,0,0) and count (3,1,1), (1,2,0), (2,1,0), (2,0,0).]
Predictive Model for Classification
• A structured logistic regression
  – Compute P(c | OA(X)) and P(c | LD(X)) separately using separate logistic regression models

    $$\hat{c}(X) = \arg\max_{c} \; P(c \mid OA(X)) \prod_{f \in \{In,\, Out,\, CI,\, CO\}} \frac{P(c \mid LD_f(X))}{P(c)}$$

  – where OA(X) are the object attributes and LD_f(X) are the link features
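The combination step can be sketched as follows, assuming one reading of the slide's formula: the attribute posterior is multiplied by each link-description posterior divided by the class prior, and the highest-scoring category wins. The per-model probabilities are taken as given (in the deck they come from separate logistic regression models); the function name `predict_category` and all numbers are hypothetical.

```python
import math

def predict_category(p_oa, p_ld, prior):
    """Pick argmax_c P(c|OA) * prod_f P(c|LD_f) / P(c), computed in log space.

    p_oa:  dict category -> P(c | OA(X))
    p_ld:  dict link type f -> dict category -> P(c | LD_f(X))
    prior: dict category -> P(c)
    All probabilities are assumed to be strictly positive.
    """
    def log_score(c):
        score = math.log(p_oa[c])
        for probs in p_ld.values():
            score += math.log(probs[c]) - math.log(prior[c])
        return score
    return max(p_oa, key=log_score)

# Hypothetical outputs of the separate models for one object.
p_oa = {"AI": 0.6, "DB": 0.4}
p_ld = {"In": {"AI": 0.3, "DB": 0.7}, "Out": {"AI": 0.4, "DB": 0.6}}
prior = {"AI": 0.5, "DB": 0.5}
best = predict_category(p_oa, p_ld, prior)
```

Here the content model alone prefers "AI", but both link models favor "DB" strongly enough to change the combined decision.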
Prediction
[Figure: papers P1 to P5, each assigned a category from the category set.]
• Step 1: Bootstrap using object attributes only
• Step 2: Iteratively update the category of each object, based on linked objects’ categories
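The two prediction steps can be sketched as a simple loop. This is a toy illustration: the graph, the attribute-only labels and the majority-vote update rule stand in for the learned models and are assumptions, not the authors' implementation.

```python
from collections import Counter

def iterative_classification(nodes, neighbors, attr_classifier, link_classifier, max_iters=10):
    # Step 1: bootstrap every label from object attributes only.
    labels = {n: attr_classifier(n) for n in nodes}
    # Step 2: repeatedly re-label each object from its neighbors' current labels.
    for _ in range(max_iters):
        changed = False
        for n in nodes:
            new = link_classifier(n, [labels[m] for m in neighbors.get(n, [])])
            if new != labels[n]:
                labels[n] = new
                changed = True
        if not changed:  # stop once a full pass changes nothing
            break
    return labels

# Hypothetical attribute-only predictions and links for five papers.
attr_label = {"P1": "AI", "P2": "AI", "P3": "DB", "P4": "DB", "P5": "DB"}
neighbors = {"P1": ["P2", "P3"], "P2": ["P1", "P3"], "P3": ["P1", "P2"],
             "P4": ["P5"], "P5": ["P4"]}

def majority_vote(node, neighbor_labels):
    """Stand-in link model: neighbors vote, plus one vote for the attribute label."""
    votes = Counter(neighbor_labels)
    votes[attr_label[node]] += 1
    top = max(votes.values())
    winners = sorted(c for c, v in votes.items() if v == top)
    return attr_label[node] if attr_label[node] in winners else winners[0]

labels = iterative_classification(list(attr_label), neighbors, attr_label.get, majority_vote)
```

In this toy run P3 starts as "DB" from its attributes and flips to "AI" because both of its neighbors are "AI".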
Data Sets
Data Set   Papers  Citations  Categories  Vocabulary
CoraI      3181    6185       7           1400
CoraII     3300    11794      10          3174
CiteSeer   3600    7522       6           3000
Experiment I
[Figure: Content and Link Effectiveness. Bar chart of average F1 measure (0 to 90) on the CoraI, CoraII and CiteSeer data sets for three models: Content Only, Links Only, Content + Links.]
Experiment II
[Figure: Effectiveness of Different Link Types and Models. Bar chart of average F1 measure (0 to 90) for the Mode, Binary and Count models on CoraI, CoraII and CiteSeer, with bars for IN-links, OUT-links, CI-links, CO-links and ALL.]
Experiment III
• Setup
  – 20% of the data as test data
  – Remaining data: 20%, 40%, 60%, 80% labeled data
• Link-based classification using labeled and unlabeled data
  – Labeled-only: learn model using only labeled data
  – Labeled and Unlabeled: learn model using both labeled and unlabeled data
Learning with Labeled and Unlabeled Data
[Figure: accuracy (0.62 to 0.82) versus percentage of labeled data (20, 40, 60, 80) for three settings: labeled & unlabeled, labeled only, content only.]
Ordering Strategies
[Figure: average accuracy (0.65 to 0.85) versus iteration (0 to 1100) for the ordering strategies RAND 1, DEC PP 1, DEC OUTLINKS 1, INC OUTLINKS 1, DEC PP 2, DEC OUTLINKS 2, INC OUTLINKS 2, RAND (HARD CLASSIF.) and DEC PP (HARD CLASSIF.).]
LBC: Summary
• Variety of ways of describing link neighborhoods
  – Mode, Binary, Count
  – In-links, Out-links, CI-links and CO-links
• In link-based classification, unlabeled data provide useful information:
  – Helps us infer object attribute distribution
  – Links between unlabeled data allow us to make use of attributes of linked objects
  – Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences
• Link-based Challenges addressed:
  – Feature construction
  – Collective classification
  – Use of labeled and unlabeled data
Roadmap
• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • work with Indrajit Bhattacharya
    • Entity detection
    • Group detection
• Conclusion
Deduplication and Group Detection
• Object Consolidation
  – Observations come with noise or multiple representations
    • Multiple entries for the same person in a customer database
• Group Detection
  – Identify groups of similar entities
    • Group authors by research interest
Terminology
• Entities: Alfred V Aho
• References: Alfred Aho; AV Aho; Aho, A. V.
• Links:
  – Alfred Aho, John Hopcroft, Jeffrey Ullman
  – AV Aho, BW Kernighan, PJ Weinberger
• Entity Groups: G1 (Programming Languages), G2 (Databases), G3 (Algorithms)
Deduplication and Group Detection
• The two problems need to be addressed together
  – Goldberg and Senator, KDD 95
[Figure: DB → Consolidation & Link Formation → DB´ → KDD Tools → Knowledge.]
Related Work: Deduplication
• Statistics
  – Blocking, Newcombe
  – “Match/non-match”, Fellegi & Sunter
  – EM with match variable, Winkler
• AI, Machine Learning
  – String similarity measures, Monge & Elkan; Cohen
  – Object consolidation, Tejada et al.
  – Learning string distances, Bilenko & Mooney
  – Active learning, Sarawagi et al.
  – Coreference resolution, McCallum & Wellner
  – Identity uncertainty, Pasula et al.
• Databases
  – Efficient record linkage, Hernandez & Stolfo; Monge & Elkan
  – Use of co-occurrence, Chaudhuri et al.; Ananthakrishna et al.
Related Work: Group Detection
• Hypertext Mining
  – Eigen decomposition for ranking, Brin & Page; Kleinberg
  – Finding web communities, Gibson et al.
  – “Missing link”, Cohn & Hofmann
• Probabilistic Link Modeling
  – Generative model for links, Getoor et al.; Kubica et al.
• Text Retrieval
  – Spectral techniques, Ng, Jordan & Weiss; Dhillon et al.; Ding et al.
  – Probabilistic modeling with latent variables, Hofmann; Blei, Ng & Jordan; Rosen-Zvi et al.
Paper Resolution Problem
• Example
– R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, 1994.
– Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
• Traditionally, string similarity
Author Resolution Problem
• Given a set of papers, determine the set of authors
• First and middle names vary
  – Check common name transforms
• Problems remain
  – How about ‘J. Smith’ and ‘John Smith’?
  – Do two instances of ‘J. Smith’ refer to the same author?
• Use co-author relationships
Author Deduplication: Example
• Author references: Alfred V Aho, Jeffrey D Ullman, S C Johnson, A V Aho, J D Ullman
• Papers:
  – P1: Code generation for machines with multiregister operations
  – P2: The universality of database languages
  – P3: Optimal partial-match retrieval when fields are independently specified
  – P4: Code generation for expressions with common subexpressions
[Figure: the per-paper references (Aho P1, Aho P2, Aho P3, Aho P4; Ullman P1, Ullman P2, Ullman P3, Ullman P4; Johnson P1, Johnson P4) are consolidated into the entities Aho (P1, P2, P3, P4), Ullman (P1, P2, P3, P4) and Johnson (P1, P4).]
Deduplication Using Links
• Cluster similar author references into duplicates
  – Problem: define appropriate distance measure
• Weighted combination of attribute and link distances
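One simple way to realize such a weighted combination is a convex mix controlled by a single weight. The slides do not give the exact form, so the formula, the function name `combined_distance` and the parameter name `alpha` are assumptions for illustration.

```python
def combined_distance(attr_dist, link_dist, alpha):
    """Mix attribute and link distances; alpha=0 uses attributes only, alpha=1 links only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * attr_dist + alpha * link_dist

# Hypothetical case: two references with similar names (small attribute distance)
# but largely different co-authors (large link distance).
d = combined_distance(attr_dist=0.2, link_dist=0.8, alpha=0.5)
```

Raising `alpha` makes the co-author evidence count for more, so references that merely share a name are pushed further apart.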
Link Distances for Deduplication
• To compare two author references, compare all their links/relations
• Distance between two links
  – How many duplicates do they share?
  – $d(l_1, l_2) = 1 - |\mathrm{duplicates}(l_1, l_2)| \,/\, \max(|l_1|, |l_2|)$
• Distance between link sets: link detail distance
  – $d(l, L) = \min_{l' \in L} d(l, l')$
  – $d_{detail}(L_1, L_2) = \mathrm{avg}\big[\, \mathrm{avg}_{l \in L_1} d(l, L_2),\; \mathrm{avg}_{l \in L_2} d(l, L_1) \,\big]$
• Distance between link sets: link summary distance
  – Group detail distance is costly because of pair-wise comparison between group sets
  – Maintain group summary: all unique references in the group set
  – $d_{summ}(L_1, L_2) = d(sum_1, sum_2)$
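The link distances can be sketched directly from these definitions. The sketch assumes each link is a set of current cluster identifiers for its references, so that "shared duplicates" reduces to set intersection, and that the detail distance compares each link with the other link set; both the representation and the function names are illustrative choices, and links and link sets are assumed non-empty.

```python
def link_distance(l1, l2):
    """d(l1, l2) = 1 - |shared duplicates| / max(|l1|, |l2|)."""
    return 1.0 - len(l1 & l2) / max(len(l1), len(l2))

def link_to_set_distance(l, links):
    """d(l, L): distance from a link to its closest link in the set L."""
    return min(link_distance(l, other) for other in links)

def detail_distance(L1, L2):
    """Average of the two directed averages of link-to-set distances."""
    a = sum(link_to_set_distance(l, L2) for l in L1) / len(L1)
    b = sum(link_to_set_distance(l, L1) for l in L2) / len(L2)
    return (a + b) / 2.0

def summary_distance(L1, L2):
    """Compare only the summaries: all unique references across each link set."""
    return link_distance(frozenset().union(*L1), frozenset().union(*L2))

# Hypothetical link sets of two author references.
L1 = [{"aho", "ullman"}, {"aho", "hopcroft", "ullman"}]
L2 = [{"aho", "ullman", "johnson"}]
d_detail = detail_distance(L1, L2)
d_summ = summary_distance(L1, L2)
```

The detail distance needs |L1| × |L2| link comparisons, while the summary distance needs one set comparison, which is the cost argument made on the slide.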
Group Detection: Example
• Entities: A. Aho, J. Hopcroft, J. Ullman, R. Sethi
• Links:
  – Alfred Aho, John Hopcroft, Jeffrey Ullman, Data Structures and Algorithms
  – AV Aho, R Sethi, J D Ullman, Compilers: Principles, Techniques and Tools
• Groups: PL, Databases, Algorithms
• Problem: Discover the hidden set of groups and mapping from entities to groups
Group Detection Using Links
• Cluster similar links into groups
  – Algorithms
    • Alfred Aho, John Hopcroft, Jeffrey Ullman, Design and Analysis of Computer Algorithms
    • Alfred Aho, John Hopcroft, Jeffrey Ullman, Data Structures and Algorithms
  – PL & Compilers
    • AV Aho, R Sethi, J D Ullman, Compilers: Principles, Techniques and Tools
    • Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, The AWK Programming Language
    • Alfred V. Aho, Jeffrey D. Ullman, Principles of Compiler Design
Link Distances for Group Detection
• Define cluster distance considering generation probability of observed links given model M
• Probabilistic distance dLP of two clusters
  – What is the change in probability of observed data if the two clusters are merged?
• Group summary distance dsumm
  – dLP is lower for higher overlap in entity sets of two clusters
  – But entity sets are the link summaries
  – Use summary distance as approximation of dLP
Deduplication / Group Detection Using Clustering
• Each cluster contains currently known duplicates / links from the same group
• Initialize clusters
  – Deduplication: using attribute distance only
• Greedily pick best candidate using d(ci,cj)
  – Compare different distance measures
• Update link sets / summaries and cluster distances and continue
• Select candidates for merge using threshold
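The greedy loop can be sketched as threshold-bounded agglomerative clustering. For brevity the sketch starts from singleton clusters rather than an attribute-based initialization, and recomputes all pairwise cluster distances on every pass; the function names and the one-dimensional toy data are assumptions for illustration.

```python
def greedy_merge(items, distance, threshold):
    """Repeatedly merge the closest pair of clusters until none is within threshold."""
    clusters = [[x] for x in items]  # simplification: one cluster per item
    while len(clusters) > 1:
        best = None
        # Greedily find the closest pair under the supplied cluster distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # no candidate is close enough to merge
            break
        # Merge and continue; distances are recomputed on the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def single_link(c1, c2):
    """Toy cluster distance: smallest gap between any two members."""
    return min(abs(a - b) for a in c1 for b in c2)

clusters = greedy_merge([1.0, 1.1, 5.0, 5.2, 9.0], single_link, threshold=0.5)
```

Swapping `single_link` for an attribute, summary or detail distance is how the different measures can be compared under the same loop.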
Evaluation: Data Generator Parameters
• Structural parameters
  – Number of author-entities and groups
  – Degree of overlap among the groups
• Noise parameters
  – For deduplication data, generate noisy attributes for each entity in the links
• Generative parameters
  – Number of links
  – Mean size of links
Evaluation: Metric
• Diversity of clusters
  – How many different entities does a cluster contain?
  – Links from how many different groups does a cluster contain?
• Dispersion of entities/groups
  – How many different clusters is an entity spread over?
  – How many different clusters are links from the same group spread over?
• Measure average dispersion and diversity
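Both metrics can be computed from a clustering and a ground-truth mapping. In this sketch each reference maps to a cluster id and to a true entity; the dictionary representation and function names are assumptions for illustration.

```python
def average_diversity(cluster_of, truth_of):
    """Average number of distinct true entities found in each cluster."""
    by_cluster = {}
    for ref, cluster in cluster_of.items():
        by_cluster.setdefault(cluster, set()).add(truth_of[ref])
    return sum(len(s) for s in by_cluster.values()) / len(by_cluster)

def average_dispersion(cluster_of, truth_of):
    """Average number of distinct clusters each true entity is spread over."""
    by_entity = {}
    for ref, cluster in cluster_of.items():
        by_entity.setdefault(truth_of[ref], set()).add(cluster)
    return sum(len(s) for s in by_entity.values()) / len(by_entity)

# Hypothetical result: reference r2 (Aho) was wrongly clustered with Ullman's references.
truth_of = {"r1": "aho", "r2": "aho", "r3": "ullman", "r4": "ullman"}
cluster_of = {"r1": 0, "r2": 1, "r3": 1, "r4": 1}
diversity = average_diversity(cluster_of, truth_of)
dispersion = average_dispersion(cluster_of, truth_of)
```

A perfect clustering scores 1.0 on both metrics; over-merging raises diversity and over-splitting raises dispersion.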
Results: Deduplication
• Algorithm parameters: mixing weight and threshold
  – Superior results with link distances
[Figure: Dispersion vs diversity for varying alpha. Cluster diversity (0 to 1000) against entity dispersion (1 to 2.4) for the attribute and summary distances.]
[Figure: Dispersion vs diversity for varying threshold. Cluster diversity (0 to 1000) against entity dispersion (1 to 2.8) for the attribute and summary distances.]
• Detailed results in DMKD ’04 paper, Iterative Record Linkage for Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor.
Results: Distance Measures
• Comparison of attribute distance, group summary distance and group detail distance
[Figure: Cluster Diversity. Cluster diversity (1 to 3.5) against number of clusters (800 to 2200) for the attribute, group summary and group detail distances.]
[Figure: Entity Dispersion. Entity dispersion (1.3 to 2.3) against number of clusters (800 to 2200) for the attribute, group summary and group detail distances.]
Deduplicating Real Data
• Machine learning papers from Citeseer
  – Citations hand-matched by Steve Lawrence et al.
  – Author references hand-matched by Culotta & McCallum
  – 1504 paper citations
  – 2892 author references
  – 1167 author entities identified
• Initial results
  – Link summary improves entity dispersion over attribute clustering
  – Discovered labeling errors that are hard to identify considering attributes only
Deduplicating Real Data
• 174 | 610 | barron_a_r | A.R. Barron
• 175 | 610 | barron_r_l | R.L. Barron
• Not the same entity
  – Barron, A.R., Barron, R.L., 1988. Statistical learning networks: a unifying view. In: 1988 Symposium on the Interface: Statistics and Computer Science, pp. 192-203.
Deduplicating Real Data
• 2097 | 8460 | ramakrishnan_c_r | C. R. Ramakrishnan
• 2098 | 8460 | ramakrishnan_i_v | I. V. Ramakrishnan
• Not the same entity
  – A Symbolic Constraint Solving Framework for Analysis of Logic Programs, C.R. Ramakrishnan, I.V. Ramakrishnan and R. Sekar, ACM Conference on Partial Evaluation and Semantics based Program Manipulation (PEPM), June 1995
Deduplicating Real Data
• Parse error
  – 1734 | 7010 | minton_andrew_b_philips_steven | Andrew B. Philips Steven Minton
• Same entity as
  – 1735 | 7020 | minton_s | Minton , S.
Deduplicating Real Data
• Parse error
  – 2083 | 8370 | raedt_2_l_de | 2. L. De Raedt
• Same entity as
  – 2085 | 8380 | raedt_l_de | L. De Raedt
Deduplication and Group Detection Summary
• Study of novel distance measures for clustering similar entities in linked environments
• Unified generative model for evaluating the related problems
• Link-based clustering shows superior performance over attribute clustering for both tasks on synthetic data
• Link-based Challenges addressed:
  – Collective consolidation
Roadmap
• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • work with Qing Lu and Prithviraj Sen
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • Entity detection
    • Group detection
• Conclusion
Link Mining Summary
• Link Mining Tasks
  – Link-based Object Classification
  – Object Type Prediction
  – Link Type Prediction
  – Predicting Link Existence
  – Link Cardinality Estimation
  – Object Consolidation
  – Group Detection
  – Subgraph Discovery
  – Metadata Mining
• Link Mining Challenges
  – Logical vs. Statistical dependencies
  – Feature construction
  – Instances vs. Classes
  – Collective Classification
  – Collective Consolidation
  – Effective Use of Labeled & Unlabeled Data
  – Link Prediction
  – Closed vs. Open World
References
• Deduplication and Group Detection Using Links, Indrajit Bhattacharya and Lise Getoor. 10th ACM SIGKDD Workshop on Link Analysis and Group Detection, Seattle, WA, August 2004.
• Word Sense Disambiguation using Probabilistic Models, Indrajit Bhattacharya, Lise Getoor and Yoshua Bengio. 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
• Iterative Record Linkage for Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor. 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Paris, France, June 2004.
• Using the Structure of Web Sites for Automatic Segmentation of Tables, Kristina Lerman, Lise Getoor, Steve Minton and Craig Knoblock. Proceedings of ACM-SIGMOD 2004 International Conference on Management of Data, Paris, France, June 2004.
• Structure Discovery using Statistical Relational Learning, Lise Getoor. Data Engineering Bulletin, vol. 26, no. 3, 2003.
• Link Mining: A New Data Mining Challenge, Lise Getoor. SIGKDD Explorations, volume 5, issue 1, 2003.
• Iterative Deduplication, I. Bhattacharya and L. Getoor.
• Link-based Classification, Q. Lu and L. Getoor. International Conference on Machine Learning, August 2003.
• Labeled and Unlabeled Data for Link-based Classification, Q. Lu and L. Getoor. ICML workshop on The Continuum from Labeled to Unlabeled Data, August 2003.
• Link-based Classification for Text Classification and Mining, Q. Lu and L. Getoor. IJCAI workshop on Text Mining and Link Analysis.
• IJCAI 03 Workshop: Learning Statistical Models from Relational Data, SRL 2003, http://kdl.cs.umass.edu/srl2003
• ICML 04 Workshop: Statistical Relational Learning and Connections to Other Fields, SRL 2004, http://www.cs.umd.edu/srl2004