Semi-Supervised Classification of Network Data Using Very Few Labels
Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University
ASONAM 2010, 2010-08-11, Odense, Denmark

Transcript
Page 1

Semi-Supervised Classification of Network Data Using Very Few Labels

Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University

ASONAM 2010, 2010-08-11, Odense, Denmark

Page 2

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 3

Preview

• Classification labels are expensive to obtain
• Semi-supervised learning (SSL) learns from labeled and unlabeled data for classification

Page 4

Preview

[Figure: the political blog network from Adamic & Glance 2005]

Page 5

Preview

• When it comes to network data, what is a general, simple, and effective method that requires very few labels?

• One that researchers could use as a strong baseline when developing more complex and domain-specific methods?

Our Answer:

MultiRankWalk (MRW), and label high-PageRank nodes first (authoritative seeding)

Page 6

Preview

• MRW (red) vs. a popular method (blue)
[Figure: classification accuracy (y-axis) vs. number of training labels (x-axis). Only 1 training label per class!]

Page 7

Preview

• The popular method using authoritative seeding (red & green) vs. random seeding (blue)
[Figure: same blue line as before; the other curves label “authoritative seeds” first.]

Page 8

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 9

Random Walk with Restart

• Imagine a network, and starting at a specific node, you follow the edges randomly.

• But (perhaps you’re afraid of wandering too far) with some probability, you “jump” back to the starting node (restart!).

If you record the number of times you land on each node, what would that distribution look like?
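To make the thought experiment concrete, here is a minimal Monte Carlo sketch (illustrative code, not from the slides): it simulates such a walk on a small toy graph and records how often each node is visited. The adjacency-list representation, function name, and parameter values are all assumptions made for the example.

```python
import random
from collections import defaultdict

def simulate_rwr(graph, start, restart_prob=0.15, steps=100_000):
    """Simulate a random walk with restart and return the fraction of steps spent at each node."""
    counts = defaultdict(int)
    node = start
    for _ in range(steps):
        counts[node] += 1
        # With probability `restart_prob` (or at a dead end), jump back to the start node.
        if random.random() < restart_prob or not graph[node]:
            node = start
        else:
            node = random.choice(graph[node])  # otherwise follow a random outgoing edge
    return {n: c / steps for n, c in counts.items()}

# Toy network given as adjacency lists; the resulting distribution concentrates
# around the start node and its neighborhood.
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(simulate_rwr(toy_graph, start="a"))
```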

Page 10

Random Walk with Restart

What if we start at a different node?
[Figure: the walk distribution with the new start node marked.]

Page 11

Random Walk with Restart

• The walk distribution r satisfies a simple equation:

r = dWr + (1 − d)u

where u is the distribution over the start node(s), W is the transition matrix of the network, d is the “keep-going” probability (damping factor), and (1 − d) is the restart probability.

Equivalent to the well-known PageRank ranking if all nodes are start nodes (u is uniform)!

Page 12

Random Walk with Restart

• Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:

r(t) = dWr(t−1) + (1 − d)u
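A minimal NumPy sketch of this iteration (an illustration under stated assumptions, not the authors’ implementation): it assumes a column-stochastic transition matrix W and a restart vector u that puts all of its mass on the start node(s).

```python
import numpy as np

def rwr(W, u, d=0.85, tol=1e-8, max_iter=1000):
    """Iterate r(t) = d * W * r(t-1) + (1 - d) * u until the walk distribution converges.

    W : column-stochastic transition matrix of the network (W[i, j] = prob. of stepping j -> i)
    u : restart distribution over the start node(s), summing to 1
    d : "keep-going" probability (damping factor); 1 - d is the restart probability
    """
    r = np.array(u, dtype=float)
    for _ in range(max_iter):
        r_next = d * W.dot(r) + (1 - d) * u
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Example: a 4-node chain 0-1-2-3; restart at node 0. With u uniform over all
# nodes, this same iteration computes ordinary PageRank.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)                 # column-normalize the adjacency matrix
u = np.array([1.0, 0.0, 0.0, 0.0])
print(rwr(W, u))
```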

Page 13

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 14

RWR for Classification

• Simple idea: use RWR for classification
  – Run one RWR with the start nodes being the labeled points in class A, and another RWR with the start nodes being the labeled points in class B
  – Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B

Page 15

RWR for Classification

We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
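A hypothetical end-to-end sketch of the idea on a toy graph (the function names and the 4-node example are assumptions, not the paper’s code): run one RWR per class, seeded on that class’s labeled nodes, and give each node the class whose walk visits it most.

```python
import numpy as np

def rwr(W, u, d=0.85, iters=500):
    """Power iteration for random walk with restart (same update as the earlier sketch)."""
    r = np.array(u, dtype=float)
    for _ in range(iters):
        r = d * W.dot(r) + (1 - d) * u
    return r

def multirankwalk(W, seeds_by_class, d=0.85):
    """Sketch of MRW classification: one RWR ranking per class, argmax across classes per node."""
    n = W.shape[0]
    classes, scores = [], []
    for label, seeds in seeds_by_class.items():
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)        # restart uniformly over this class's seed nodes
        classes.append(label)
        scores.append(rwr(W, u, d=d))
    scores = np.vstack(scores)             # shape: (num_classes, num_nodes)
    return [classes[i] for i in scores.argmax(axis=0)]

# Toy example: a 4-node chain, node 0 labeled "A" and node 3 labeled "B".
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)
print(multirankwalk(W, {"A": [0], "B": [3]}))   # expected: ['A', 'A', 'B', 'B']
```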

Page 16

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 17

Seed Preference

• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
  – Some labels are inherently more useful than others
  – Some labels are easier to obtain than others

Question: “Authoritative” or “popular” nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

Page 18

Seed Preference

• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label

• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data

• We consider 3 preferences:
  – Random
  – Link Count (nodes with the highest link counts make the list)
  – PageRank (nodes with the highest PageRank scores make the list)
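One way such a seed list could be generated (a sketch only; the use of networkx and its pagerank() here is an assumption for illustration, not something the slides prescribe):

```python
import random
import networkx as nx

def seed_order(G, preference="pagerank"):
    """Order candidate nodes for labeling under the three seed preferences above."""
    nodes = list(G.nodes())
    if preference == "random":
        random.shuffle(nodes)                                   # uniformly random order
        return nodes
    if preference == "linkcount":
        return sorted(nodes, key=G.degree, reverse=True)        # highest link counts first
    if preference == "pagerank":
        pr = nx.pagerank(G, alpha=0.85)
        return sorted(nodes, key=pr.get, reverse=True)          # highest PageRank scores first
    raise ValueError(f"unknown preference: {preference}")

# The top of the returned list is what gets handed to the human expert
# (or a Mechanical Turk job) to label first.
G = nx.karate_club_graph()
print(seed_order(G, "linkcount")[:5])
```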

Page 19

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 20

Experiments

• Test effectiveness of MRW and compare seed preferences on five real network datasets:

  – Political Blogs (Liberal vs. Conservative)
  – Citation Networks (7 and 6 academic fields, respectively)

Page 21

Experiments

• We compare MRW against a currently very popular network SSL method: wvRN, the “weighted-vote relational neighbor classifier”
  – You may know wvRN as the harmonic functions method, adsorption, random walk with sink nodes, …
  – Recommended as a strong network SSL baseline in (Macskassy & Provost 2007)
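For reference, a sketch of a wvRN-style baseline in its harmonic-function / iterative-averaging formulation (one common way to write it; not necessarily the exact implementation used in these experiments):

```python
import numpy as np

def wvrn(A, seed_labels, num_classes, iters=200):
    """Iteratively average neighbors' class scores, clamping the labeled seeds each round.

    A           : symmetric adjacency matrix (NumPy array)
    seed_labels : dict mapping node index -> class index for the labeled seeds
    Returns a (num_nodes, num_classes) matrix of class scores.
    """
    n = A.shape[0]
    F = np.full((n, num_classes), 1.0 / num_classes)   # unlabeled nodes start out uniform
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                                # guard against isolated nodes
    for _ in range(iters):
        F = A.dot(F) / deg                             # each node takes the weighted vote of its neighbors
        for i, c in seed_labels.items():
            F[i] = np.eye(num_classes)[c]              # re-clamp seeds to their known labels
    return F

# Toy example: a 4-node chain with node 0 in class 0 and node 3 in class 1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(wvrn(A, {0: 0, 3: 1}, num_classes=2).argmax(axis=1))   # expected: [0 0 1 1]
```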

Page 22

Experiments

• To simulate a human expert labeling data, we use the “ranked-at-least-n-per-class” method

Political blog example with n = 2:
  blogsforbush.com: conservative
  dailykos.com: liberal
  moorewatch.com: conservative
  right-thinking.com: conservative
  talkingpointsmemo.com: liberal
  (We have at least 2 labels per class. Stop.)
  instapundit.com
  michellemalkin.com
  atrios.blogspot.com
  littlegreenfootballs.com
  washingtonmonthly.com
  powerlineblog.com
  drudgereport.com
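A small sketch of this labeling simulation (hypothetical helper; the gold labels stand in for the human expert, as in the experiments):

```python
def ranked_at_least_n_per_class(ordered_nodes, gold_label, n=2):
    """Label nodes in seed-preference order until every class has at least n labels, then stop."""
    seeds, counts = {}, {}
    classes = set(gold_label.values())
    for node in ordered_nodes:
        c = gold_label[node]               # the simulated expert reveals this node's label
        seeds[node] = c
        counts[c] = counts.get(c, 0) + 1
        if all(counts.get(cls, 0) >= n for cls in classes):
            break                          # at least n labels per class: stop labeling
    return seeds

# Mirrors the political-blog example above with n = 2: labeling stops after the fifth blog,
# so the later entries in the list are never shown to the expert.
order = ["blogsforbush.com", "dailykos.com", "moorewatch.com",
         "right-thinking.com", "talkingpointsmemo.com", "instapundit.com"]
gold = {"blogsforbush.com": "conservative", "dailykos.com": "liberal",
        "moorewatch.com": "conservative", "right-thinking.com": "conservative",
        "talkingpointsmemo.com": "liberal"}
print(ranked_at_least_n_per_class(order, gold, n=2))
```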

Page 23

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 24

Results

• MRW vs. wvRN with random seed preference

  – MRW is drastically better with a small number of seed labels; performance is not significantly different with larger numbers of seeds
  – MRW does extremely well with just one randomly selected label per class!
  – Results averaged over 20 runs

Page 25

Results

• wvRN with different seed preferences

  – LinkCount or PageRank is much better than Random with smaller numbers of seed labels
  – PageRank is slightly better than LinkCount, but in general not significantly so

Page 26

Results

• Does MRW benefit from seed preference?

  – Yes, on certain datasets with a small number of seed labels; note the already very high F1 on most datasets
  – There is a rare instance where authoritative seeds hurt performance, but the difference is not statistically significant

Page 27

Results

• How much better is MRW using authoritative seed preference?
[Figure: y-axis is the MRW F1 score minus the wvRN F1 score; x-axis is the number of seed labels per class.]
  – The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with small numbers of seed labels

Page 28

Results

• Summary
  – MRW is much better than wvRN with a small number of seed labels
  – MRW is more robust than wvRN to varying quality of seed labels
  – Authoritative seed preference boosts algorithm effectiveness with a small number of seed labels

We recommend MRW and authoritative seed preference as a strong baseline for semi-supervised classification on network data

Page 29

Overview

• Preview
• MultiRankWalk
  – Random Walk with Restart
  – RWR for Classification
• Seed Preference
• Experiments
• Results
• The Question

Page 30

The Question

• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:

  MRW                          | wvRN
  Random walk with restart     | Reverse random walk
  Regularized random walk      | Random walk with sink nodes
  Personalized PageRank        | Hitting time
  Local & global consistency   | Harmonic functions on graphs
                               | Iterative averaging of neighbors

Great… but we still don’t know why their behavior differs on these network datasets!

Page 31

The Question

• It’s difficult to answer exactly why MRW does better with a smaller number of seeds.

• But we can gather probable factors from their propagation models:

    | MRW                                                  | wvRN
  1 | Centrality-sensitive                                 | Centrality-insensitive
  2 | Exponential drop-off / damping factor                | No drop-off / damping
  3 | Propagation of different classes done independently  | Propagation of different classes interact
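A compact toy comparison that makes the three contrasts visible (self-contained illustrative code on a 4-node chain, using the same updates as the sketches earlier in the transcript):

```python
import numpy as np

# 4-node chain 0-1-2-3, node 0 labeled class A, node 3 labeled class B.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)
d = 0.85

# MRW-style: one independent RWR per class (difference 3), with damping (difference 2).
def rwr(u, iters=500):
    r = u.copy()
    for _ in range(iters):
        r = d * W.dot(r) + (1 - d) * u
    return r

r_A = rwr(np.array([1.0, 0.0, 0.0, 0.0]))
r_B = rwr(np.array([0.0, 0.0, 0.0, 1.0]))
print("MRW class-A walk:", r_A.round(3))   # seed scores are shaped by the graph, not fixed (difference 1)
print("MRW class-B walk:", r_B.round(3))   # and mass decays with distance from the seed (difference 2)

# wvRN-style: iterative averaging with seeds clamped; the classes share one score vector per node.
F = np.full((4, 2), 0.5)
for _ in range(200):
    F = A.dot(F) / A.sum(axis=1, keepdims=True)   # average the neighbors' class scores
    F[0], F[3] = [1.0, 0.0], [0.0, 1.0]           # seeds clamped to exactly 1.0 (centrality-insensitive)
print("wvRN class scores:\n", F.round(3))         # each node's scores sum to 1, so the classes interact
```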

Page 32

The Question

• An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative (seed labels marked with *):

wvRN scores:
  1.000 neoconservatives.blogspot.com *
  1.000 strangedoctrines.typepad.com *
  1.000 jmbzine.com *
  0.593 presidentboxer.blogspot.com
  0.585 rooksrant.com
  0.568 purplestates.blogspot.com
  0.553 ikilledcheguevara.blogspot.com
  0.540 restoreamerica.blogspot.com
  0.539 billrice.org
  0.529 kalblog.com
  0.517 right-thinking.com
  0.517 tom-hanna.org
  0.514 crankylittleblog.blogspot.com
  0.510 hasidicgentile.org
  0.509 stealthebandwagon.blogspot.com
  0.509 carpetblogger.com
  0.497 politicalvicesquad.blogspot.com
  0.496 nerepublican.blogspot.com
  0.494 centinel.blogspot.com
  0.494 scrawlville.com
  0.493 allspinzone.blogspot.com
  0.492 littlegreenfootballs.com
  0.492 wehavesomeplanes.blogspot.com
  0.491 rittenhouse.blogspot.com
  0.490 secureliberty.org
  0.488 decision08.blogspot.com
  0.488 larsonreport.com

MRW scores:
  0.020 firstdownpolitics.com
  0.019 neoconservatives.blogspot.com *
  0.017 jmbzine.com *
  0.017 strangedoctrines.typepad.com *
  0.013 millers_time.typepad.com
  0.011 decision08.blogspot.com
  0.010 gopandcollege.blogspot.com
  0.010 charlineandjamie.com
  0.008 marksteyn.com
  0.007 blackmanforbush.blogspot.com
  0.007 reggiescorner.blogspot.com
  0.007 fearfulsymmetry.blogspot.com
  0.006 quibbles-n-bits.com
  0.006 undercaffeinated.com
  0.005 samizdata.net
  0.005 pennywit.com
  0.005 pajamahadin.com
  0.005 mixtersmix.blogspot.com
  0.005 stillfighting.blogspot.com
  0.005 shakespearessister.blogspot.com
  0.005 jadbury.com
  0.005 thefulcrum.blogspot.com
  0.005 watchandwait.blogspot.com
  0.005 gindy.blogspot.com
  0.005 cecile.squarespace.com
  0.005 usliberals.about.com
  0.005 twentyfirstcenturyrepublican.blogspot.com

1. Centrality-sensitive: the seeds have different scores and are not necessarily the highest
2. Exponential drop-off: much less sure about nodes further away from the seeds
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)

We still don’t completely understand it.

Page 33

Questions?

Page 34

Related Work

• MRW is very much related to:
  – “Local and global consistency” (Zhou et al. 2004): similar formulation, different view
  – “Web content categorization using link information” (Gyongyi et al. 2006): RWR ranking as features to an SVM
  – “Graph-based semi-supervised learning as a generative model” (He et al. 2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning
  – Active learning chooses which data point to label next based on previous labels; the labeling is interactive
  – Seed preference is a batch labeling method
  – Authoritative seed preference is a good baseline for active learning on network data!