Mining Advisor-Advisee Relationships from Research Publication Networks

Jan 08, 2016

Mining Advisor-Advisee Relationships from Research Publication Networks. KDD2010. Presenter: 徐晓旻. INTRODUCTION. Conduct a systematic investigation of mining advisor-advisee relationships between authors in a research publication network.
Transcript
Page 1: Mining Advisor-Advisee Relationships from Research Publication Networks

Mining Advisor-Advisee Relationships from Research Publication Networks

KDD2010

Presenter: 徐晓旻

Page 2: Mining Advisor-Advisee Relationships from Research Publication Networks

INTRODUCTION

Conduct a systematic investigation of mining advisor-advisee relationships between authors in a research publication network.

Gain better insight into the research community.

Provide additional semantic information on the links.

Page 3: Mining Advisor-Advisee Relationships from Research Publication Networks
Page 4: Mining Advisor-Advisee Relationships from Research Publication Networks

INTRODUCTION (cont.)

The left figure shows the input: a temporal collaboration network, which consists of authors and papers.

The middle figure shows the output of our analysis: an author network with solid arrows indicating the advising relationships.

The right figure gives an example of visualized chronological hierarchies.

Page 5: Mining Advisor-Advisee Relationships from Research Publication Networks

PROBLEM FORMULATION

$G = (V = V^p \cup V^a, E)$, where $V^p = \{p_1, \ldots, p_{n_p}\}$ is the set of publications, with $p_i$ published at time $t_i$, $V^a = \{a_1, \ldots, a_{n_a}\}$ is the set of authors, and $E$ is the set of edges.

Each edge $e_{ij} \in E$ associates the paper $p_i$ with the author $a_j$, meaning that $a_j$ is one author of $p_i$.

Page 6: Mining Advisor-Advisee Relationships from Research Publication Networks

original network transformed

The original network can be transformed into a network containing only authors. Let $G' = (V', E', \{py_{ij}\}_{e_{ij} \in E'}, \{pn_{ij}\}_{e_{ij} \in E'})$, where $V' = \{a_0, \ldots, a_{n_a}\}$ is the set of authors (including a virtual node $a_0$).

Each edge $e'_{ij} = (i, j) \in E'$ connects authors $a_i$ and $a_j$ if they have a publication together.

Two vectors are associated with each edge: the Pub_Year_vector $py_{ij}$ and the Pub_Num_vector $pn_{ij}$.

Page 7: Mining Advisor-Advisee Relationships from Research Publication Networks

network transformed cont.

Each author $a_i$ is also associated with two vectors, $py_i$ and $pn_i$, which hold respectively the publication years and the number of papers $a_i$ published in each of those years. The two vectors $py_i$ and $pn_i$ can be derived from $py_{ij}$ and $pn_{ij}$.
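The transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input format (a list of `(year, authors)` tuples) and the function names `build_author_network` and `to_vectors` are hypothetical, and the per-author vectors are counted directly from the papers rather than derived from the edge vectors.

```python
from collections import defaultdict
from itertools import combinations


def build_author_network(papers):
    """Collapse a paper-author network into per-edge and per-author yearly counts.

    papers: iterable of (year, [author, ...]) tuples (assumed input format).
    Returns (edge_counts, author_counts), both mapping to {year: paper count}.
    Edge keys are (a, b) tuples with a < b.
    """
    edge_counts = defaultdict(lambda: defaultdict(int))
    author_counts = defaultdict(lambda: defaultdict(int))
    for year, authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            author_counts[a][year] += 1
        # One edge per pair of co-authors of this paper.
        for a, b in combinations(unique, 2):
            edge_counts[(a, b)][year] += 1
    return edge_counts, author_counts


def to_vectors(counts):
    """Split a {year: count} mapping into aligned (py, pn) lists sorted by year."""
    years = sorted(counts)
    return years, [counts[y] for y in years]


papers = [
    (1999, ["ada", "bob"]),
    (2000, ["ada", "bob"]),
    (2000, ["ada", "bob", "cy"]),
    (2001, ["ada"]),
]
edges, authors = build_author_network(papers)
py_ab, pn_ab = to_vectors(edges[("ada", "bob")])  # edge vectors py_ij, pn_ij
py_a, pn_a = to_vectors(authors["ada"])           # author vectors py_i, pn_i
```

Storing the counts per year keeps both vectors aligned, so later steps can sum any time window without going back to the papers.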

Page 8: Mining Advisor-Advisee Relationships from Research Publication Networks

this problem is more complicated

(i) one could have multiple advisors like master advisors, PhD co-advisors

(ii) some mentors from industry behave similarly to academic advisors if judged only by the collaboration history;

(iii) one’s advisor could be missing in the data set

Page 9: Mining Advisor-Advisee Relationships from Research Publication Networks

construct subgraph H′

Formally, we denote by $r_{ij}$ the probability of $a_j$ being the advisor of $a_i$.

Construct a subgraph $H' \subset G'$ by removing some edges from $G'$ and directing the remaining edges from advisee to potential advisor.

Page 10: Mining Advisor-Advisee Relationships from Research Publication Networks

construct subgraph H′cont.

A simple way to predict is to fetch the top $k$ potential advisors of $a_i$ and check whether $a_j$ is one of them while $r_{ij} > r_{i0}$ or $r_{ij} > \theta$, where $\theta$ is a threshold such as 0.5. We use $P@(k, \theta)$ to denote this method.
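A minimal sketch of this prediction rule follows. The function name `predict_advisor`, the dictionary layout, and the use of key `0` for the virtual node $a_0$ are assumptions made for illustration; the slide gives only the rule itself.

```python
def predict_advisor(scores, candidate, k=1, theta=0.5, virtual=0):
    """Return True if `candidate` is predicted as an advisor under P@(k, theta).

    scores: {advisor_id: r_ij} for a single advisee a_i, including the
    virtual node (key `virtual`) whose score plays the role of r_i0.
    """
    r_ij = scores.get(candidate, 0.0)
    r_i0 = scores.get(virtual, 0.0)
    # Rank the real candidates by score, best first.
    ranked = sorted(
        (j for j in scores if j != virtual),
        key=lambda j: scores[j],
        reverse=True,
    )
    in_top_k = candidate in ranked[:k]
    return in_top_k and (r_ij > r_i0 or r_ij > theta)


scores_for_ai = {0: 0.1, 1: 0.6, 2: 0.3}
best_only = predict_advisor(scores_for_ai, 2, k=1)  # not in the top 1
top_two = predict_advisor(scores_for_ai, 2, k=2)    # in the top 2 and above r_i0
```

With `k=1` only the highest-ranked candidate can be accepted; raising `k` allows several advisors per author, which matches the multiple-advisor case listed among the complications.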

Page 11: Mining Advisor-Advisee Relationships from Research Publication Networks
Page 12: Mining Advisor-Advisee Relationships from Research Publication Networks

4. APPROACH

The main idea is to leverage a time-constrained probabilistic factor graph (TPFG) model to decompose the joint probability of the unknown advisors of all authors.

By maximizing the joint probability of the factor graph, we can infer the relationships and compute a ranking score for each relation edge on the candidate graph.

Page 13: Mining Advisor-Advisee Relationships from Research Publication Networks

4.1 Assumptions and Framework

Page 14: Mining Advisor-Advisee Relationships from Research Publication Networks

two-stage framework solution

In stage 1, we preprocess the heterogeneous collaboration network to generate the candidate graph $H'$. This includes the transformation from $G$ to a homogeneous network $G'$, the construction of $H'$ from $G'$, and the estimation of the local likelihood on each edge of $H'$.

In stage 2, these potential relations are further modeled with a probabilistic model. Local likelihood and time constraints are combined in the global joint probability of all the hidden variables. The joint probability is maximized and the ranking score of all the potential relations is computed together. The construction of H is finished in this stage.

Page 15: Mining Advisor-Advisee Relationships from Research Publication Networks

4.2 Stage 1: Preprocessing

The purpose of preprocessing is to generate the candidate graph $H'$ and reduce the search space.

First, for each paper $p_i \in V^p$, we construct an edge between every pair of its authors and update the vectors $py$ and $pn$.

Then a filtering process is performed to remove unlikely advisor-advisee relations.

For each edge $e_{ij}$ on $G'$, the following conditions are checked to decide whether $a_j$ is $a_i$'s potential advisor.

First, Assumption 2 is checked: $a_j$ is considered only if $a_j$ started to publish earlier than $a_i$.

Second, some heuristic rules are applied. We list the rules here and test them in the experiment part.
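The first filtering condition can be illustrated as follows. This is a sketch only: the function name `candidate_advisors` and both input dictionaries are hypothetical, and the heuristic rules of the second condition are not applied here.

```python
def candidate_advisors(first_year, coauthors):
    """Keep, for every author a_i, the co-authors a_j who started publishing earlier.

    first_year: {author: year of first publication}
    coauthors:  {author: set of co-authors}, i.e. the neighbours in G'
    Returns {advisee: sorted list of potential advisors}.
    """
    return {
        a_i: sorted(a_j for a_j in others if first_year[a_j] < first_year[a_i])
        for a_i, others in coauthors.items()
    }


first_year = {"ada": 1990, "bob": 1998, "cy": 1998}
coauthors = {
    "ada": {"bob", "cy"},
    "bob": {"ada", "cy"},
    "cy": {"ada", "bob"},
}
candidates = candidate_advisors(first_year, coauthors)
```

Because the comparison is strict, two authors who started in the same year never become candidates for each other, so the resulting directed graph has no cycles.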

Page 16: Mining Advisor-Advisee Relationships from Research Publication Networks

Rule to detect advisor

The Kulczynski measure reflects the correlation of the two authors' publications.

The imbalance ratio (IR) measures the imbalance between the occurrence of $a_j$ given $a_i$ and the occurrence of $a_i$ given $a_j$.
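The slide does not reproduce the formulas for the two measures. The sketch below uses the definitions common in association-pattern mining, applied to paper counts: Kulczynski as the mean of the two conditional probabilities, and IR as the difference in support divided by the support of the union. The sign convention (positive when $a_j$ has published more) and the function names are assumptions.

```python
def kulczynski(n_i, n_j, n_ij):
    """Mean of P(a_j | a_i) and P(a_i | a_j), estimated from paper counts.

    n_i, n_j: papers by a_i and a_j; n_ij: papers they co-authored.
    """
    return 0.5 * (n_ij / n_i + n_ij / n_j)


def imbalance_ratio(n_i, n_j, n_ij):
    """Signed imbalance (assumed convention): positive when a_j has more papers."""
    return (n_j - n_i) / (n_i + n_j - n_ij)


# A student with 4 papers, all written with an advisor who has 20 papers.
kulc = kulczynski(4, 20, 4)
ir = imbalance_ratio(4, 20, 4)
```

For this pair the Kulczynski value is roughly 0.6 and the imbalance ratio roughly 0.8: the two authors are clearly correlated, and the publication record is heavily tilted toward the senior author, which is the pattern one would expect of an advisor.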

Page 17: Mining Advisor-Advisee Relationships from Research Publication Networks

Rule to detect advisor

Page 18: Mining Advisor-Advisee Relationships from Research Publication Networks

When a pair of authors passes the selected rules, we construct a directed edge from $a_i$ to $a_j$ in $H'$.

We then estimate the starting time and ending time of the advising, as well as the local likelihood $l_{ij}$ of $a_j$ being $a_i$'s advisor.

The starting time $st_{ij}$ is estimated as the time they started to collaborate.

Page 19: Mining Advisor-Advisee Relationships from Research Publication Networks

The ending time $ed_{ij}$ can be estimated either as the time point when the Kulczynski measure starts to decrease, or as the year that makes the largest difference between the Kulczynski measure before it and after it.
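Both estimates can be sketched on aligned yearly count lists. The details here are assumptions, not taken from the slides: the Kulczynski measure is computed over a window of years from summed paper counts, "starts to decrease" is read as the last year before the cumulative measure first drops, and the names `kulc` and `estimate_period` are hypothetical.

```python
def kulc(n_i, n_j, n_ij):
    """Kulczynski measure from paper counts; 0 when either author has no papers."""
    if n_i == 0 or n_j == 0:
        return 0.0
    return 0.5 * (n_ij / n_i + n_ij / n_j)


def estimate_period(years, pn_i, pn_j, pn_ij):
    """Return (start, end_by_decrease, end_by_gap), or None if no joint paper.

    years, pn_i, pn_j, pn_ij are aligned lists: per-year paper counts of the
    advisee, the candidate advisor, and their joint papers.
    """
    joint = [t for t, n in enumerate(pn_ij) if n > 0]
    if not joint:
        return None
    s, last = joint[0], len(years) - 1  # start: first year of collaboration

    def window(lo, hi):
        """Kulczynski measure over years[lo..hi], both ends included."""
        return kulc(sum(pn_i[lo:hi + 1]), sum(pn_j[lo:hi + 1]), sum(pn_ij[lo:hi + 1]))

    # Option 1: last year before the cumulative measure first drops.
    end_decrease = years[last]
    for t in range(s + 1, last + 1):
        if window(s, t) < window(s, t - 1):
            end_decrease = years[t - 1]
            break

    # Option 2: split year with the largest (before - after) difference.
    end_gap, best = years[last], float("-inf")
    for t in range(s, last):
        gap = window(s, t) - window(t + 1, last)
        if gap > best:
            best, end_gap = gap, years[t]
    return years[s], end_decrease, end_gap


years = [2000, 2001, 2002, 2003, 2004, 2005]
period = estimate_period(
    years,
    pn_i=[0, 2, 2, 2, 3, 3],   # advisee
    pn_j=[5, 5, 5, 5, 5, 5],   # candidate advisor
    pn_ij=[0, 2, 2, 2, 0, 0],  # joint papers
)
```

In this toy history the two options agree: collaboration starts in 2001 and both estimates place the end of advising in 2003, the last year with joint papers.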

Local likelihood $l_{ij}$ of $a_j$ being $a_i$'s advisor

Page 20: Mining Advisor-Advisee Relationships from Research Publication Networks

Stage 2: TPFG Model

Define the TPFG model.

For each node $a_i$, there are three variables to decide: $y_i$, $st_i$, and $ed_i$.

Local feature function $g(y_i, st_i, ed_i)$

joint probability of all the variables in the network

Page 21: Mining Advisor-Advisee Relationships from Research Publication Networks

Stage 2: TPFG Model

To find the most probable values of all the hidden variables, we need to maximize the joint probability of all of them.

An exhaustive search is intractable.

Page 22: Mining Advisor-Advisee Relationships from Research Publication Networks

Decomposition of variables dependency

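Eliminate the variables $st_i$ and $ed_i$.

Compute the likelihood that $j$ is the advisor of $i$, together with the conditions that must be satisfied (given by the indicator function $I$).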

Page 23: Mining Advisor-Advisee Relationships from Research Publication Networks

Decomposition of variables dependency

Page 24: Mining Advisor-Advisee Relationships from Research Publication Networks

In this figure, the nodes related to $f_1(\cdot)$ are $y_1$ and all possible advisee nodes of node 1, which the figure shows to be nodes 2 and 3.

Page 25: Mining Advisor-Advisee Relationships from Research Publication Networks

4.4 Model Learning

To maximize the objective function and compute the ranking score for each edge in the candidate graph $H'$, we need to infer the marginal maximal joint probability on TPFG.

Old method: sum-product, a general algorithm that computes marginal functions on a factor graph without cycles, based on message passing.

Page 26: Mining Advisor-Advisee Relationships from Research Publication Networks

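Sum-product

The sum-product algorithm inherits the message-passing mechanism, but by introducing a factor graph it decomposes the global probability density function into a product of several local probability density functions.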
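A tiny cycle-free example may make the decomposition concrete. The sketch below is not the TPFG model: it uses a hypothetical chain $x_1 - f_a - x_2 - f_b - x_3$ of binary variables with made-up factor tables, computes the marginal of $x_3$ by passing messages along the chain, and checks it against brute-force summation of the global function.

```python
from itertools import product

# Local factors of the global function f(x1, x2, x3) = f_a(x1, x2) * f_b(x2, x3).
f_a = [[1.0, 2.0], [3.0, 4.0]]  # indexed as f_a[x1][x2]
f_b = [[5.0, 1.0], [2.0, 2.0]]  # indexed as f_b[x2][x3]
VALUES = (0, 1)


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def marginal_x3_by_messages():
    """Marginal of x3 via sum-product messages along the chain."""
    # Message f_a -> x2: sum out x1 (the leaf x1 sends the constant message 1).
    a_to_x2 = [sum(f_a[x1][x2] for x1 in VALUES) for x2 in VALUES]
    # x2 forwards that message to f_b; f_b -> x3 then sums out x2.
    b_to_x3 = [sum(f_b[x2][x3] * a_to_x2[x2] for x2 in VALUES) for x3 in VALUES]
    return normalize(b_to_x3)


def marginal_x3_brute_force():
    """Marginal of x3 by summing the global function over all assignments."""
    totals = [0.0, 0.0]
    for x1, x2, x3 in product(VALUES, repeat=3):
        totals[x3] += f_a[x1][x2] * f_b[x2][x3]
    return normalize(totals)


by_messages = marginal_x3_by_messages()
by_enumeration = marginal_x3_brute_force()
```

Both routes give the same marginal, but the message-passing route only ever sums over one variable at a time, which is what makes the approach scale on graphs without cycles.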

Page 27: Mining Advisor-Advisee Relationships from Research Publication Networks

single-i sum-product algorithm

Page 28: Mining Advisor-Advisee Relationships from Research Publication Networks

Sum-product algorithm

Note that $g_i(x_i)$ is a function of $x_i$ only, that is, $g_i(x_i) = \mu_{x \to g_i}(x_i)$, so $g_i(x_i)$ can be obtained from formula (5).

Page 29: Mining Advisor-Advisee Relationships from Research Publication Networks

single-i sum-product algorithm

Page 30: Mining Advisor-Advisee Relationships from Research Publication Networks

New TPFG Inference Algorithm

The original sum-product algorithm runs into difficulty because it requires each node to wait for all-but-one messages to arrive. In TPFG, some nodes would therefore wait forever due to the existence of cycles.

Instead, we arrange the message passing in a mode based on the strict order determined by $H'$. Each node $a_i$ has a descendant set $Y_i^{-1}$ and an ascendant set $Y_i$.

Page 31: Mining Advisor-Advisee Relationships from Research Publication Networks

Message Passing two-phase schema

In the first phase, messages are passed from advisees to possible advisors, and in the second, messages are passed back from advisors to possible advisees.

The first phase: the message from $f_i()$ to $y_i$ is generated and sent only when all the messages from its descendants have arrived. Then $y_i$ immediately sends it to all its ascendants $f_j()$, $j \in Y_i$.

Page 32: Mining Advisor-Advisee Relationships from Research Publication Networks

two-phase schema cont.

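The second phase: each message travels along an edge in the reverse direction of phase 1.

Why compute $r_{ij}$ when $l_{ij}$ is already available? Because $l_{ij}$ is only the local support for $j$ being the advisor of $i$, whereas $r_{ij}$ is by definition the global support: it takes the other dependencies in the graph into account, and it does so through this propagation model.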

Page 33: Mining Advisor-Advisee Relationships from Research Publication Networks

two-phase schema cont.

After the two phases of message propagation, we can collect the two messages on any edge and obtain the marginal function.
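The ordering of the two phases can be sketched independently of the message contents. The code below only computes a valid schedule, that is, who sends to whom and in what order; it does not compute TPFG messages. It assumes $H'$ is acyclic with each advisor listed once per advisee, and the function name `two_phase_schedule` is hypothetical.

```python
from collections import deque


def two_phase_schedule(candidates):
    """Order the messages of both phases on the candidate graph H'.

    candidates: {advisee: [possible advisors]}, the directed edges of H'.
    Returns (phase1, phase2), two lists of (sender, receiver) pairs.
    """
    nodes = set(candidates)
    for advisors in candidates.values():
        nodes.update(advisors)

    # Number of descendants (possible advisees) each node still waits for.
    waiting = {n: 0 for n in nodes}
    for advisors in candidates.values():
        for a in advisors:
            waiting[a] += 1

    # Phase 1: a node sends to its ascendants once all descendants have reported.
    ready = deque(sorted(n for n in nodes if waiting[n] == 0))
    phase1 = []
    while ready:
        n = ready.popleft()
        for a in candidates.get(n, []):
            phase1.append((n, a))
            waiting[a] -= 1
            if waiting[a] == 0:
                ready.append(a)

    # Phase 2: the same edges, reversed in direction and in order.
    phase2 = [(a, n) for n, a in reversed(phase1)]
    return phase1, phase2


# Node 3 may be advised by 1 or 2; node 2 may be advised by 1.
phase1, phase2 = two_phase_schedule({3: [1, 2], 2: [1]})
```

Every edge carries exactly one message per phase, which is why the two messages on an edge can be combined afterwards to obtain the marginal.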

Page 34: Mining Advisor-Advisee Relationships from Research Publication Networks

simplify the message propagation

Eliminate the function nodes and the internal messages between a function node and a variable node.

The improved message propagation is still separated into two phases. In the first phase, the messages $sent_i$, which are passed from each node to its ascendants, are generated in a similar order as before.

In the second phase, the messages returned from the ascendants, $recv_i$, are stored in each node.

Page 35: Mining Advisor-Advisee Relationships from Research Publication Networks

simplify the message propagation

Page 36: Mining Advisor-Advisee Relationships from Research Publication Networks

simplify the message propagation

Page 37: Mining Advisor-Advisee Relationships from Research Publication Networks
Page 38: Mining Advisor-Advisee Relationships from Research Publication Networks

5. EXPERIMENTAL RESULTS

Data sets: DBLP Computer Science Bibliography Database

To test the accuracy of the discovered advisor-advisee relationships, we adopt three data sets: one is manually labeled by looking into the home pages of the advisors, and the other two are crawled from the Mathematics Genealogy project and the AI Genealogy project.

Page 39: Mining Advisor-Advisee Relationships from Research Publication Networks

Compare TPFG with baseline methods.

Evaluation aspects: two performance measurements, accuracy and scalability.

Page 40: Mining Advisor-Advisee Relationships from Research Publication Networks

5.2 Accuracy

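Effect of rules in TPFG

From Figure 5(a) we can see that R2/R3 has the highest suitability on the tested data.

ROC curve: the known advisor-advisee pairs in the test data are compared with the pairs computed by the algorithm. The computed pairs are sorted by rank score in descending order. The horizontal axis is the top a% of the computed pairs, and the vertical axis is the size of the intersection between the top a% and the pairs in the test data, divided by the size of the test data.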
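The curve described above can be computed directly from a scored list of pairs. This is a sketch with made-up data; the function name `coverage_curve`, the `(advisee, advisor)` tuple layout, and the choice of percentages are assumptions.

```python
def coverage_curve(scored_pairs, true_pairs, percents=(25, 50, 100)):
    """Return [(a, covered fraction of the test data)] for each top-a% cut.

    scored_pairs: {(advisee, advisor): rank score} computed by the algorithm
    true_pairs:   known (advisee, advisor) pairs in the test data
    """
    truth = set(true_pairs)
    # Sort the computed pairs by rank score, highest first.
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    points = []
    for a in percents:
        top = ranked[: int(len(ranked) * a / 100)]
        hits = len(truth.intersection(top))
        points.append((a, hits / len(truth)))
    return points


scored = {
    ("bob", "ada"): 0.9,
    ("cy", "ada"): 0.8,
    ("cy", "bob"): 0.4,
    ("dee", "bob"): 0.3,
}
truth = [("bob", "ada"), ("dee", "bob")]
curve = coverage_curve(scored, truth)
```

A curve that rises quickly at small percentages means the true pairs are concentrated at the top of the ranking.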

Page 41: Mining Advisor-Advisee Relationships from Research Publication Networks

Effect of network structure

From Figure 5(c) we see that, for closures with different depths, TPFG achieves better accuracy when the depth increases.

It is compared with the exact maximal joint probability and with other approximate algorithms, JuncT and LBP.

Page 42: Mining Advisor-Advisee Relationships from Research Publication Networks

Effect of training data

Support Vector Machines (SVMs) are accurate supervised learning approaches.

They reduce advisor mining to a classification problem.

we combined Kulczynski and IR measures with as features.

TPFG can achieve comparable or even better accuracy compared with a supervised method

Page 43: Mining Advisor-Advisee Relationships from Research Publication Networks

Effect of training data

Page 44: Mining Advisor-Advisee Relationships from Research Publication Networks

5.3 Scalability Performance

Page 45: Mining Advisor-Advisee Relationships from Research Publication Networks

5.4 Applications

Visualization of genealogy

Visualized hierarchies of the research community, based on the mined relationships, help us gain better insight into the community.

Page 46: Mining Advisor-Advisee Relationships from Research Publication Networks

5.4 Applications

Expert finding and Bole search

Bole search is a specific expert finding task that aims to identify the best supervisors.

Page 47: Mining Advisor-Advisee Relationships from Research Publication Networks