Heterogeneous Social and Information Networks
Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao
Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
July 3, 2022
Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han
In today’s interconnected world, social and informational entities are linked together, forming gigantic, integrated social and information networks. By structuring these data objects into multiple types, such networks become semi-structured heterogeneous social and information networks. Most real-world applications that handle big data, including social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, and medications, and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming a heterogeneous information network. Effective analysis of large-scale heterogeneous social and information networks poses an interesting but critical challenge.
In this talk, we present a set of data mining scenarios in heterogeneous social and information networks and show that mining typed, heterogeneous networks is a new and promising frontier in data mining research. However, such mining raises serious challenges for scalable computation. We identify a set of problems in scalable computation and call for serious study of them. These include how to compute efficiently (1) meta path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta path-based link/relationship prediction, and (5) topical hierarchies from heterogeneous information networks. We introduce some recent efforts, discuss the trade-offs between query-independent pre-computation and query-dependent online computation, and point out some promising research directions.
Transcript
Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks
Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao
Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
April 13, 2023
Outline
Why Is Mining Heterogeneous Social and Info. Networks Promising?
Homogeneous vs. Heterogeneous Social and Info. Networks
On the Power of Mining Structured, Heterogeneous Social and Info. Networks
Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
Power of heterogeneous network modeling: Treat Author, Venue, Term, and Paper all as first-class citizens!
Outline
Why Is Mining Heterogeneous Social and Info. Networks Promising?
Homogeneous vs. Heterogeneous Social and Info. Networks
On the Power of Mining Structured, Heterogeneous Social and Info. Networks
Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
PathPredict: Query-Based Prediction Using Meta-Path
Efficient Hidden Network Discovery: A Scalability Challenge
Conclusions
On the Power of Mining Structured, Heterogeneous Networks
Links carry a lot of hidden information in structured, heterogeneous social and information networks
Effectiveness of mining
Clustering in heterogeneous networks: rank-based clustering (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]
Knowledge propagation through heterogeneous links (GNetMine [ECMLPKDD’10]) and rank-based classification (RankClass [KDD’11])
Meta-path-based similarity search (PathSim [VLDB’11])
Meta-path-based prediction in heterogeneous networks
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849
Go one level deeper: Authors in the XML, XQuery cluster
[Figure labels: Term, Venue, Author]
Rank-Based Clustering for Others
RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINE
Classification in Heterogeneous Networks
GNetMine [ECMLPKDD’10]: Knowledge propagation across heterogeneous links
RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis
Highly ranked objects play a larger role in classification
An object can be ranked high only in some focused classes
Class membership and ranking are statistical distributions
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects within each class
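The idea of propagating knowledge across typed links can be illustrated with a small sketch. This is plain iterative label propagation with clamped seeds, not the GNetMine or RankClass formulation; the toy network, the names `EDGES`, `SEEDS`, and `propagate`, and the number of rounds are all assumptions made for illustration.

```python
# Hypothetical toy network: venues, papers, and authors joined by typed links.
EDGES = [
    ("VLDB", "p1"), ("VLDB", "p2"), ("KDD", "p3"), ("KDD", "p4"),
    ("p1", "a1"), ("p2", "a1"), ("p2", "a3"),
    ("p3", "a2"), ("p3", "a3"), ("p4", "a2"), ("p4", "a3"),
]
SEEDS = {"VLDB": "DB", "KDD": "DM"}  # the only labeled objects (assumed)


def propagate(edges, seeds, rounds=100):
    """Spread class scores from labeled seeds over the links; return node -> class."""
    classes = sorted(set(seeds.values()))
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    # Seeds keep a fixed one-hot score; every other object starts at zero.
    clamp = {n: [1.0 if c == seeds[n] else 0.0 for c in classes] for n in seeds}
    scores = {n: clamp.get(n, [0.0] * len(classes)) for n in neighbors}
    for _ in range(rounds):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = clamp[node]
            else:
                # An unlabeled object takes the average score of its neighbors.
                updated[node] = [
                    sum(scores[m][i] for m in nbrs) / len(nbrs)
                    for i in range(len(classes))
                ]
        scores = updated
    return {
        n: classes[max(range(len(classes)), key=lambda i: s[i])]
        for n, s in scores.items()
    }


labels = propagate(EDGES, SEEDS)
```

With only two labeled venues, the authors and papers receive labels through their links; a ranking-aware method would additionally weight highly ranked objects more heavily.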
Experiments with Very Small Training Set
DBLP: 4-field data set (DB, DM, AI, IR) forming a heterogeneous information network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent rankings within each class
Database    Data Mining       AI           IR
Top-5 ranked conferences
VLDB        KDD               IJCAI        SIGIR
SIGMOD      SDM               AAAI         ECIR
ICDE        ICDM              ICML         CIKM
PODS        PKDD              CVPR         WWW
EDBT        PAKDD             ECML         WSDM
Top-5 ranked terms
data        mining            learning     retrieval
database    data              knowledge    information
query       clustering        reasoning    web
system      classification    logic        search
xml         frequent          cognition    text
Similarity Search: Find Similar Objects in Networks
Who are most similar to Christos Faloutsos?
Meta-Path: Meta-level description of a path between two objects
Christos’s students or close collaborators
Similar reputation at similar venues
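A minimal sketch of meta-path-based similarity, using the PathSim measure s(x, y) = 2·paths(x, y) / (paths(x, x) + paths(y, y)) along a symmetric meta-path. The toy author-to-venue paper counts and the function names are assumptions for illustration, and the author-venue-author path stands in for Author-Paper-Venue-Paper-Author.

```python
# Hypothetical toy data: author -> {venue: number of papers published there}.
AUTHOR_VENUE = {
    "alice": {"VLDB": 3, "KDD": 1},
    "bob": {"VLDB": 2, "KDD": 1},
    "carol": {"SIGIR": 4},
    "dave": {"VLDB": 30, "KDD": 10},
}


def path_count(x, y, counts=AUTHOR_VENUE):
    """Number of author-venue-author path instances between x and y."""
    vx, vy = counts[x], counts[y]
    return sum(n * vy[v] for v, n in vx.items() if v in vy)


def pathsim(x, y, counts=AUTHOR_VENUE):
    """PathSim: path count between x and y, normalized by their self-path counts."""
    denom = path_count(x, x, counts) + path_count(y, y, counts)
    return 2 * path_count(x, y, counts) / denom if denom else 0.0


def most_similar(query, k=3, counts=AUTHOR_VENUE):
    """Return the k authors with the highest PathSim score to the query author."""
    others = [a for a in counts if a != query]
    return sorted(others, key=lambda a: pathsim(query, a, counts), reverse=True)[:k]


ranking = most_similar("alice")
```

In this toy data "dave" shares far more path instances with "alice" than "bob" does, yet "bob" scores higher because the normalization favors peers with comparable visibility, which is the "similar reputation at similar venues" behavior.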
PathPredict: Meta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]
Co-authorship prediction: Whether two authors are going to collaborate for the first time
Co-authorship encoded in meta-path Author-Paper-Author (A-P-A)
Topological features encoded in meta-paths as below:
Meta-paths between authors under length 4
Meta-Path Semantic Meaning
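One way to turn meta-paths into topological features is to count path instances between a candidate author pair. The sketch below does this for three illustrative meta-paths (shared co-authors, shared venues, shared terms); the choice of meta-paths, the toy records in `PAPERS`, and all function names are assumptions, not the feature set used in PathPredict.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; real input would be DBLP publication entries.
PAPERS = [
    {"authors": ["ann", "bob"], "venue": "KDD", "terms": ["mining", "network"]},
    {"authors": ["bob", "cat"], "venue": "VLDB", "terms": ["query", "network"]},
    {"authors": ["cat"], "venue": "KDD", "terms": ["mining"]},
]


def author_profile(papers, field):
    """Count, per author, how often each value of `field` occurs on their papers."""
    profile = defaultdict(Counter)
    for paper in papers:
        values = paper[field] if isinstance(paper[field], list) else [paper[field]]
        for author in paper["authors"]:
            profile[author].update(values)
    return profile


def path_count(profile, x, y):
    """Path instances x -> shared middle object -> y (dot product of two profiles)."""
    return sum(n * profile[y][mid] for mid, n in profile[x].items())


def pair_features(papers, x, y):
    """Path-count features for the author pair (x, y), one per meta-path."""
    return {
        "A-P-A-P-A": path_count(author_profile(papers, "authors"), x, y),
        "A-P-V-P-A": path_count(author_profile(papers, "venue"), x, y),
        "A-P-T-P-A": path_count(author_profile(papers, "terms"), x, y),
    }


features = pair_features(PAPERS, "ann", "cat")
```

Each feature value could then feed a classifier that predicts whether the pair will co-author for the first time.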
The Success of PathPredict: Exploring Meta-Paths
Explain the prediction power of each meta-path: Wald test for logistic regression
Higher prediction accuracy than using the projected homogeneous network: 11% higher in prediction accuracy
Citation prediction
The selected meta-paths could be rather different
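A Wald test on a logistic regression coefficient can be sketched as follows, assuming a single path-count feature plus an intercept. The toy data, the Newton-method fit, and the function names are illustrative assumptions; a real study would fit all meta-path features jointly with a statistics package.

```python
import math

# Hypothetical toy data: one meta-path count per candidate pair, and whether
# the pair later became co-authors (1) or not (0).
PATH_COUNTS = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6]
BECAME_COAUTHORS = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def score_and_information(b0, b1, xs, ys):
    """Gradient (score) and Fisher information of the logistic log-likelihood."""
    ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
    ws = [p * (1.0 - p) for p in ps]
    i00 = sum(ws)
    i01 = sum(w * x for w, x in zip(ws, xs))
    i11 = sum(w * x * x for w, x in zip(ws, xs))
    return g0, g1, i00, i01, i11


def fit_logistic(xs, ys, steps=25):
    """Fit y ~ sigmoid(b0 + b1*x) by Newton's method; return (b0, b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0, g1, i00, i01, i11 = score_and_information(b0, b1, xs, ys)
        det = i00 * i11 - i01 * i01
        # Newton step: add inverse(information) times gradient.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    _, _, i00, i01, i11 = score_and_information(b0, b1, xs, ys)
    # Standard error of b1 from the inverse information matrix at the estimate.
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return b0, b1, se_b1


def wald_test(coef, se):
    """Wald z statistic and two-sided p-value for the hypothesis coef == 0."""
    z = coef / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))


b0, b1, se_b1 = fit_logistic(PATH_COUNTS, BECAME_COAUTHORS)
z, p_value = wald_test(b1, se_b1)
```

A small p-value for a meta-path's coefficient suggests that the meta-path carries real predictive power rather than noise.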
Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Features collected in [1996, 2002]; test period in [2003, 2009])
Outline
Why Is Mining Heterogeneous Social and Info. Networks Promising?
Homogeneous vs. Heterogeneous Social and Info. Networks
On the Power of Mining Structured, Heterogeneous Social and Info. Networks
Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
PathPredict: Query-Based Prediction Using Meta-Path
Efficient Hidden Network Discovery: A Scalability Challenge
Conclusions
Challenges on BigMine
Scalable mining of massive information networks: Necessity
Many such networks are gigantic: News, PubMed, …
DBLP is a small one: 2M papers and 0.8M authors, …
Meta-path: Potentially long chains of matrix multiplication over such networks
APVPA: AP × PV × VP × PA
Comparative analysis of multiple meta-paths is costly
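The APVPA chain can be sketched with small dense matrices to show one way the cost might be reduced: because the meta-path is symmetric, the product equals H × Hᵀ with H = AP × PV, so only half the chain needs to be multiplied. The toy matrices and helper names are assumptions, and the cost function counts dense scalar multiplications only; real networks are sparse, so the figures are rough upper bounds rather than measured costs.

```python
def matmul(x, y):
    """Dense matrix product of two lists-of-rows."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(x):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*x)]


def dense_chain_cost(shapes):
    """Scalar multiplications for a left-to-right dense product of these shapes."""
    rows, inner = shapes[0]
    cost = 0
    for r, c in shapes[1:]:
        assert r == inner, "inner dimensions must agree"
        cost += rows * inner * c
        inner = c
    return cost


# Hypothetical toy network: 3 authors, 4 papers, 2 venues.
AP = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]  # author wrote paper
PV = [[1, 0], [1, 0], [0, 1], [0, 1]]            # paper appeared in venue

# Full chain AP x PV x VP x PA, multiplied left to right.
full = matmul(matmul(matmul(AP, PV), transpose(PV)), transpose(AP))

# Symmetric shortcut: compute H = AP x PV once, then H x H^T.
half = matmul(AP, PV)
folded = matmul(half, transpose(half))

# Dense cost estimate at the DBLP scale quoted above, with about 10K venues.
a, p, v = 800_000, 2_000_000, 10_000
naive_cost = dense_chain_cost([(a, p), (p, v), (v, p), (p, a)])
folded_cost = dense_chain_cost([(a, p), (p, v)]) + dense_chain_cost([(a, v), (v, a)])
```

The shortcut pays off because the number of venues is far smaller than the number of papers, so the intermediate author-by-venue matrix stays narrow.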
Scalable mining of massive information networks: Possibility
Many functions do not need to compute eigenvalues
Top-k computation may save computation cost substantially
Precomputation may save online computation substantially
Clustering-based precomputation:
Computing Eigenvalues: When Do We Need Them?