
Confidence-based Graph Convolutional Networks for Semi-Supervised Learning

Shikhar Vashishth*  Prateek Yadav*  Manik Bhandari*  Partha Talukdar
Indian Institute of Science

{shikhar,prateekyadav,manikb,ppt}@iisc.ac.in

Abstract

Predicting properties of nodes in a graph is an important problem with applications in a variety of domains. Graph-based Semi-Supervised Learning (SSL) methods aim to address this problem by labeling a small subset of the nodes as seeds and then utilizing the graph structure to predict label scores for the rest of the nodes in the graph. Recently, Graph Convolutional Networks (GCNs) have achieved impressive performance on the graph-based SSL task. In addition to label scores, it is also desirable to have confidence scores associated with them. Unfortunately, confidence estimation in the context of GCN has not been previously explored. We fill this important gap in this paper and propose ConfGCN, which estimates label scores along with their confidences jointly in a GCN-based setting. ConfGCN uses these estimated confidences to determine the influence of one node on another during neighborhood aggregation, thereby acquiring anisotropic¹ capabilities. Through extensive analysis and experiments on standard benchmarks, we find that ConfGCN is able to outperform state-of-the-art baselines. We have made ConfGCN's source code available to encourage reproducible research.

* Equal contribution
¹ anisotropic (adjective): varying in magnitude according to the direction of measurement (Oxford English Dictionary)

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

1 Introduction

Graphs are all around us, ranging from citation and social networks to knowledge graphs. Predicting properties of nodes in such graphs is often desirable. For example, given a citation network, we may want to predict the research area of an author. Making such predictions, especially in the semi-supervised setting, has been the focus of graph-based semi-supervised learning (SSL) (Subramanya and Talukdar, 2014). In graph-based SSL, a small set of nodes are initially labeled. Starting with such supervision and while utilizing the rest of the graph structure, the initially unlabeled nodes are labeled. Conventionally, the graph structure has been incorporated as an explicit regularizer which enforces a smoothness constraint on the labels estimated on nodes (Zhu et al., 2003; Belkin et al., 2006; Weston et al., 2008). Recently proposed Graph Convolutional Networks (GCN) (Defferrard et al., 2016; Kipf and Welling, 2016) provide a framework to apply deep neural networks to graph-structured data. GCNs have been employed successfully for improving performance on tasks such as semantic role labeling (Marcheggiani and Titov, 2017), machine translation (Bastings et al., 2017), relation extraction (Vashishth et al., 2018b; Zhang et al., 2018), document dating (Vashishth et al., 2018a), shape segmentation (Yi et al., 2016), and action recognition (Huang et al., 2017). GCN formulations for graph-based SSL have also attained state-of-the-art performance (Kipf and Welling, 2016; Liao et al., 2018; Veličković et al., 2018). In this paper, we also focus on the task of graph-based SSL using GCNs.

GCN iteratively estimates embeddings of nodes in the graph by aggregating embeddings of neighborhood nodes, while backpropagating errors from a target loss function. Finally, the learned node embeddings are used to estimate label scores on the nodes. In addition to the label scores, it is desirable to also have confidence estimates associated with them. Such confidence scores may be used to determine how much to trust



the label scores estimated on a given node. While methods to estimate label score confidence in non-deep graph-based SSL have been previously proposed (Orbach and Crammer, 2012), confidence-based GCN is still unexplored.

In order to fill this important gap, we propose ConfGCN, a GCN framework for graph-based SSL. ConfGCN jointly estimates label scores on nodes, along with confidences over them. One of the added benefits of confidence over a node's label scores is that they may be used to subdue irrelevant nodes in a node's neighborhood, thereby controlling the number of effective neighbors for each node. In other words, this enables anisotropic behavior in GCNs. Let us explain this through the example shown in Figure 1. In this figure, while node a has true label L0 (white), it is incorrectly classified as L1 (black) by Kipf-GCN (Kipf and Welling, 2016)². This is because Kipf-GCN suffers from limitations of its neighborhood aggregation scheme (Xu et al., 2018). For example, Kipf-GCN has no constraints on the number of nodes that can influence the representation of a given target node. In a k-layer Kipf-GCN model, each node is influenced by all the nodes in its k-hop neighborhood. However, in real-world graphs, nodes are often present in heterogeneous neighborhoods, i.e., a node is often surrounded by nodes of other labels. For example, in Figure 1, node a is surrounded by three nodes (d, e, and f) which are predominantly labeled L1, while two nodes (b and c) are labeled L0. Please note that all of these are estimated label scores during GCN learning. In this case, it is desirable that node a is more influenced by nodes b and c than the other three nodes. However, since Kipf-GCN doesn't discriminate among the neighboring nodes, it is swayed by the majority, thereby estimating the wrong label L1 for node a.

ConfGCN is able to overcome this problem by estimating confidences on each node's label scores. In Figure 1, such estimated confidences are shown by bars, with white and black bars denoting confidences in scores of labels L0 and L1, respectively. ConfGCN uses these label confidences to subdue nodes d, e, f since they have low confidence for their label L1 (shorter black bars), whereas nodes b and c are highly confident about their labels being L0 (taller white bars). This leads to higher influence of b and c during aggregation, and thereby ConfGCN correctly predicts the true label of node a as L0 with high confidence. This clearly demonstrates the benefit of label confidences and their utility in estimating node influences. Graph Attention Networks (GAT) (Veličković et al., 2018), a recently proposed method, also provides a mechanism to estimate influences by allowing nodes to attend to their neighborhood. However, as we shall see in Section 6, ConfGCN, through its use of label confidences, is considerably more effective.

² In this paper, unless otherwise stated, we refer to Kipf-GCN whenever we mention GCN.

Our contributions in this paper are as follows.

• We propose ConfGCN, a Graph Convolutional Network (GCN) framework for semi-supervised learning which models label distributions and their confidences for each node in the graph. To the best of our knowledge, this is the first confidence-enabled formulation of GCNs.

• ConfGCN utilizes label confidences to estimate the influence of one node on another in a label-specific manner during neighborhood aggregation of GCN learning.

• Through extensive evaluation on multiple real-world datasets, we demonstrate ConfGCN's effectiveness over state-of-the-art baselines.

ConfGCN's source code and datasets used in the paper are available at http://github.com/malllabiisc/ConfGCN.

2 Related Work

Semi-Supervised Learning (SSL) on graphs: SSL on graphs is the problem of classifying nodes in a graph, where labels are available only for a small fraction of nodes. Conventionally, the graph structure is imposed by adding an explicit graph-based regularization term in the loss function (Zhu et al., 2003; Weston et al., 2008; Belkin et al., 2006). Recently, implicit graph regularization via learned node representations has proven to be more effective. This can be done either sequentially or in an end-to-end fashion. Methods like DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), and LINE (Tang et al., 2015) first learn graph representations via sampled random walks on the graph or breadth-first search traversal and then use the learned representations for node classification. On the contrary, Planetoid (Yang et al., 2016) learns node embeddings by jointly predicting the class labels and the neighborhood context in the graph. Recently, Kipf and Welling (2016) employ Graph Convolutional Networks (GCNs) to learn node representations.

Graph Convolutional Networks (GCNs): The generalization of Convolutional Neural Networks to non-Euclidean domains was proposed by Bruna et al. (2013), which formulates the spectral and spatial construction of GCNs. This was later improved through an efficient localized filter approximation (Defferrard et al., 2016). Kipf and Welling (2016) provide a first-order


[Figure 1: node a and its neighbors b–f under Kipf-GCN (left: true label L0, predicted label L1) and ConfGCN (right: true label L0, predicted label L0); bars next to each node indicate label confidences for L0 (white) and L1 (black).]

Figure 1: Label prediction on node a by Kipf-GCN and ConfGCN (this paper). L0 is a's true label. Shade intensity of a node reflects the estimated score of label L1 assigned to that node. Since Kipf-GCN is not capable of estimating influence of one node on another, it is misled by the dominant label L1 in node a's neighborhood, thereby making the wrong assignment. ConfGCN, on the other hand, estimates confidences (shown by bars) over the label scores, and uses them to increase the influence of nodes b and c to estimate the right label on a. Please see Section 1 for details.

formulation of GCNs and show its effectiveness for SSL on graphs. Marcheggiani and Titov (2017) propose GCNs for directed graphs and provide a mechanism for edge-wise gating to discard noisy edges during aggregation. This is further improved by Veličković et al. (2018), which allows nodes to attend to their neighboring nodes, implicitly providing different weights to different nodes. Liao et al. (2018) propose Graph Partition Neural Network (GPNN), an extension of GCNs to learn node representations on large graphs. GPNN first partitions the graph into subgraphs and then alternates between locally and globally propagating information across subgraphs. Recently, Lovász Convolutional Networks (Yadav et al., 2019) have been proposed for incorporating global graph properties in GCNs. An extensive survey of GCNs and their applications can be found in Bronstein et al. (2017).

Confidence Based Methods: The natural idea of incorporating confidence in predictions has been explored by Li and Sethi (2006) for the task of active learning. Lei (2014) proposes a confidence-based framework for classification problems, where the classifier consists of two regions in the predictor space, one for confident classifications and the other for ambiguous ones. In representation learning, uncertainty (inverse of confidence) was first utilized for word embeddings by Vilnis and McCallum (2014). Athiwaratkun and Wilson (2018) further extend this idea to learn hierarchical word representations through encapsulation of probability distributions. Orbach and Crammer (2012) propose TACO (Transduction Algorithm with COnfidence), the first graph-based method which learns label distributions along with their

uncertainty for semi-supervised node classification. Bojchevski and Günnemann (2018) embed graph nodes as Gaussian distributions using a ranking-based framework, which allows capturing the uncertainty of the representation. They update node embeddings to maintain neighborhood ordering, i.e., 1-hop neighbors are more similar than 2-hop neighbors, and so on. Gaussian embeddings have been used for collaborative filtering (Dos Santos et al., 2017) and topic modelling (Das et al., 2015) as well.

3 Notation & Problem Statement

Let G = (V, E, X) be an undirected graph, where V = V_l ∪ V_u is the union of labeled (V_l) and unlabeled (V_u) nodes in the graph with cardinalities n_l and n_u, E is the set of edges, and X ∈ R^{(n_l+n_u)×d} is the input node feature matrix. The actual label of a node v is denoted by a one-hot vector Y_v ∈ R^m, where m is the number of classes. Given G and seed labels Y ∈ R^{n_l×m}, the goal is to predict the labels of the unlabeled nodes. To incorporate confidence, we additionally estimate a label distribution µ_v ∈ R^m and a diagonal co-variance matrix Σ_v ∈ R^{m×m}, ∀v ∈ V. Here, µ_{v,i} denotes the score of label i on node v, while (Σ_v)_{ii} denotes the variance in the estimation of µ_{v,i}. In other words, (Σ_v^{-1})_{ii} is ConfGCN's confidence in µ_{v,i}.

4 Background: Graph Convolutional Networks

In this section, we give a brief overview of Graph Convolutional Networks (GCNs) for undirected graphs as


proposed by Kipf and Welling (2016). Given a graph G = (V, E, X) as defined in Section 3, the node representation after a single layer of GCN can be defined as

$$H = f\!\left( D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}} X W \right) \qquad (1)$$

where W ∈ R^{d×d} denotes the model parameters, A is the adjacency matrix, and D_ii = Σ_j (A + I)_ij. f is any activation function; in this paper we use ReLU, f(x) = max(0, x). Equation 1 can also be written as

$$h_v = f\!\left( \sum_{u \in \mathcal{N}(v)} \left( W h_u + b \right) \right), \quad \forall v \in V. \qquad (2)$$

Here, b ∈ R^d denotes the bias, N(v) corresponds to the immediate neighbors of v in graph G (including v itself), and h_v is the obtained representation of node v.

For capturing multi-hop dependencies between nodes, multiple GCN layers can be stacked on top of one another. The representation of node v after the k-th GCN layer is given as

$$h_v^{k+1} = f\!\left( \sum_{u \in \mathcal{N}(v)} \left( W^k h_u^k + b^k \right) \right), \quad \forall v \in V, \qquad (3)$$

where W^k and b^k denote the layer-specific parameters of GCN.
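To make the propagation rule concrete, the following is a minimal NumPy sketch of a single GCN layer implementing Equation 1. It is an illustration only (the dense matrices and the names gcn_layer, adj, feats are ours), not the implementation used in the experiments; for large graphs the normalized adjacency would normally be kept sparse.

```python
import numpy as np

def gcn_layer(adj, feats, weight, bias):
    """One GCN layer as in Eq. 1: H = f(D^{-1/2} (A + I) D^{-1/2} X W).

    adj:    (n, n) symmetric adjacency matrix, without self-loops
    feats:  (n, d) input node features X
    weight: (d, d') layer parameters W
    bias:   (d',)  bias b
    """
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops: A + I
    deg = a_hat.sum(axis=1)                        # D_ii = sum_j (A + I)_ij
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt       # symmetric normalization
    return np.maximum(a_norm @ feats @ weight + bias, 0.0)  # ReLU activation f
```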

5 Confidence Based Graph Convolution (ConfGCN)

Following Orbach and Crammer (2012), ConfGCN uses a co-variance-matrix-based symmetric Mahalanobis distance for defining the distance between two nodes in the graph. Formally, for any two given nodes u and v, with label distributions µ_u and µ_v and co-variance matrices Σ_u and Σ_v, the distance between them is defined as follows.

$$d_M(u, v) = (\mu_u - \mu_v)^T \left( \Sigma_u^{-1} + \Sigma_v^{-1} \right) (\mu_u - \mu_v)$$

A characteristic of the above distance metric is that if either Σ_u or Σ_v has large eigenvalues, then the distance will be low irrespective of the closeness of µ_u and µ_v. On the other hand, if Σ_u and Σ_v both have low eigenvalues, then it requires µ_u and µ_v to be close for their distance to be low. Given the above properties, we define r_{uv}, the influence score of node u on its neighboring node v during GCN aggregation, as follows.

$$r_{uv} = \frac{1}{d_M(u, v)}$$

This influence score gives more relevance to neighboring nodes with highly confident, similar labels, while reducing the importance of nodes with low-confidence label scores. This results in ConfGCN acquiring anisotropic capability during neighborhood aggregation. For a node v, ConfGCN's equation for updating the embedding at the k-th layer is thus defined as follows.

$$h_v^{k+1} = f\!\left( \sum_{u \in \mathcal{N}(v)} r_{uv} \times \left( W^k h_u^k + b^k \right) \right), \quad \forall v \in V. \qquad (4)$$
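As an illustration of Equation 4, the sketch below computes the influence scores r_uv from the diagonal co-variance matrices and uses them to weight the neighborhood aggregation. It is a simplified reading of the update, not the released implementation; the helper names, the dense Python loops, and the small epsilon guarding against division by zero (e.g. when u = v) are our assumptions.

```python
import numpy as np

def influence(mu_u, mu_v, sig_inv_u, sig_inv_v, eps=1e-8):
    """r_uv = 1 / d_M(u, v), with d_M the symmetric Mahalanobis distance.

    mu_*:      (m,) label distributions of nodes u and v
    sig_inv_*: (m,) diagonals of the inverse co-variance matrices
    eps:       assumed numerical guard, since d_M can be 0 when u == v
    """
    diff = mu_u - mu_v
    d_m = diff @ ((sig_inv_u + sig_inv_v) * diff)
    return 1.0 / (d_m + eps)

def confgcn_layer(neighbors, h, mu, sig_inv, weight, bias):
    """Confidence-weighted aggregation of Eq. 4 for every node v."""
    out = np.zeros((h.shape[0], weight.shape[1]))
    for v, nbrs in enumerate(neighbors):           # nbrs = N(v), including v itself
        agg = np.zeros(weight.shape[1])
        for u in nbrs:
            r_uv = influence(mu[u], mu[v], sig_inv[u], sig_inv[v])
            agg += r_uv * (h[u] @ weight + bias)   # r_uv * (W^k h_u^k + b^k)
        out[v] = np.maximum(agg, 0.0)              # ReLU activation f
    return out
```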

The final node representation obtained from ConfGCN is used for predicting labels of the nodes in the graph as follows.

$$\hat{Y}_v = \mathrm{softmax}\left( W^K h_v^K + b^K \right), \quad \forall v \in V$$

where K denotes the number of ConfGCN's layers. Finally, in order to learn the label scores {µ_v} and co-variance matrices {Σ_v} jointly with the other parameters {W^k, b^k}, following Orbach and Crammer (2012), we include the following two terms in ConfGCN's objective function.

For enforcing neighboring nodes to be close to each other, we include L_smooth, defined as

$$\mathcal{L}_{smooth} = \sum_{(u,v) \in E} (\mu_u - \mu_v)^T \left( \Sigma_u^{-1} + \Sigma_v^{-1} \right) (\mu_u - \mu_v).$$

To impose the desirable property that the label distribution of nodes in V_l should be close to their input label distribution, we incorporate L_label, defined as

$$\mathcal{L}_{label} = \sum_{v \in V_l} (\mu_v - Y_v)^T \left( \Sigma_v^{-1} + \frac{1}{\gamma} I \right) (\mu_v - Y_v).$$

Here, for input labels, we assume a fixed uncertainty of (1/γ) I ∈ R^{m×m}, where γ > 0. We also include the following regularization term, L_reg, to constrain the co-variance matrices to be finite and positive.

$$\mathcal{L}_{reg} = \sum_{v \in V} \mathrm{Tr}\, \max(-\Sigma_v, 0)$$

This regularization term enforces a soft positivity constraint on the co-variance matrices. Additionally, in ConfGCN we include L_const in the objective to push the label distribution (µ) close to the final model prediction (Ŷ).

$$\mathcal{L}_{const} = \sum_{v \in V} (\mu_v - \hat{Y}_v)^T (\mu_v - \hat{Y}_v).$$


Dataset    Nodes    Edges    Classes   Features   Label Mismatch   |Vl| / |V|
Cora        2,708    5,429      7        1,433        0.002           0.052
Cora-ML     2,995    8,416      7        2,879        0.018           0.166
Citeseer    3,327    4,372      6        3,703        0.003           0.036
Pubmed     19,717   44,338      3          500        0.0             0.003

Table 1: Details of the datasets used in the paper. Please refer to Section 6.1 for more details.

Finally, we include the standard cross-entropy loss for semi-supervised multi-class classification over all the labeled nodes (V_l).

$$\mathcal{L}_{cross} = -\sum_{v \in V_l} \sum_{j=1}^{m} Y_{vj} \log \hat{Y}_{vj}.$$

The final objective for optimization is the linear combination of the above-defined terms:

$$\begin{aligned}
\mathcal{L}(\theta) = & -\sum_{v \in V_l} \sum_{j=1}^{m} Y_{vj} \log \hat{Y}_{vj} \\
& + \lambda_1 \sum_{(u,v) \in E} (\mu_u - \mu_v)^T \left( \Sigma_u^{-1} + \Sigma_v^{-1} \right) (\mu_u - \mu_v) \\
& + \lambda_2 \sum_{v \in V_l} (\mu_v - Y_v)^T \left( \Sigma_v^{-1} + \tfrac{1}{\gamma} I \right) (\mu_v - Y_v) \\
& + \lambda_3 \sum_{v \in V} (\mu_v - \hat{Y}_v)^T (\mu_v - \hat{Y}_v) \\
& + \lambda_4 \sum_{v \in V} \mathrm{Tr}\, \max(-\Sigma_v, 0)
\end{aligned} \qquad (5)$$

where θ = {W^k, b^k, µ_v, Σ_v} and the λ_i ∈ R are the weights of the terms in the objective. We optimize L(θ) using stochastic gradient descent. We hypothesize that all the terms help in improving ConfGCN's performance, and we validate this in Section 7.4.
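The objective above can be summarised in code. The sketch below assembles the terms of Equation 5 for diagonal co-variance matrices; it is a schematic restatement (the function and variable names are ours, gradients and the optimizer are omitted, and Σ⁻¹ is assumed strictly positive where inverted), not the actual training code.

```python
import numpy as np

def confgcn_loss(edges, labeled, y_true, y_pred, mu, sigma_inv,
                 gamma=1.0, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the terms in Eq. 5 (diagonal Sigma assumed).

    edges:     list of (u, v) pairs in E
    labeled:   indices of nodes in V_l
    y_true:    (n, m) one-hot gold labels Y
    y_pred:    (n, m) softmax predictions Y_hat
    mu:        (n, m) estimated label distributions
    sigma_inv: (n, m) diagonals of Sigma_v^{-1} (modeled directly)
    """
    l1, l2, l3, l4 = lambdas
    # standard cross-entropy over the labeled nodes
    cross = -np.sum(y_true[labeled] * np.log(y_pred[labeled] + 1e-12))
    # smoothness: neighboring nodes should have close label distributions
    smooth = sum(((mu[u] - mu[v]) ** 2 * (sigma_inv[u] + sigma_inv[v])).sum()
                 for u, v in edges)
    # labeled nodes should stay close to their seed labels
    label = sum(((mu[v] - y_true[v]) ** 2 * (sigma_inv[v] + 1.0 / gamma)).sum()
                for v in labeled)
    # label distributions should agree with the model's predictions
    const = np.sum((mu - y_pred) ** 2)
    # soft positivity of the co-variance: Tr max(-Sigma, 0)
    reg = np.sum(np.maximum(-1.0 / sigma_inv, 0.0))
    return cross + l1 * smooth + l2 * label + l3 * const + l4 * reg
```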

6 Experiments

6.1 Datasets

To evaluate the effectiveness of ConfGCN, we experiment on several semi-supervised classification benchmarks. Following the experimental setup of (Kipf and Welling, 2016; Liao et al., 2018), we evaluate on the Cora, Citeseer, and Pubmed datasets (Sen et al., 2008). The dataset statistics are summarized in Table 1. Label mismatch denotes the fraction of edges between nodes with different labels in the training data. The benchmark datasets commonly used for the semi-supervised classification task have a substantially low label mismatch rate. In order to examine models on datasets with more heterogeneous neighborhoods, we also evaluate on the Cora-ML dataset (Bojchevski and Günnemann, 2018).

All four datasets are citation networks, where each document is represented using bag-of-words features in the graph, with undirected citation links between documents. The goal is to classify documents into one of the predefined classes. We use the data splits of Yang et al. (2016) and follow a similar setup for the Cora-ML dataset. Following Kipf and Welling (2016), an additional 500 labeled nodes are used for hyperparameter tuning.

Hyperparameters: We use the same data splits as described in Yang et al. (2016), with a test set of 1000 labeled nodes for testing the prediction accuracy of ConfGCN and a validation set of 500 labeled nodes for optimizing the hyperparameters. The ranges of hyperparameters were adapted from previous literature (Orbach and Crammer, 2012; Kipf and Welling, 2016). The model is trained using Adam (Kingma and Ba, 2014) with a learning rate of 0.01. The weight matrices along with µ are initialized using Xavier initialization (Glorot and Bengio, 2010), and the Σ matrices are initialized with identity. To avoid numerical instability, we model Σ^{-1} directly and compute Σ wherever required. Following Kipf and Welling (2016), we use two layers of GCN (K = 2) for all the experiments in this paper.
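For concreteness, the snippet below sketches this parameter set-up (Xavier initialization for the weight matrices and µ, identity initialization for Σ⁻¹, two GCN layers, Adam with learning rate 0.01). The hidden-layer size and the variable names are illustrative assumptions, not values reported in the paper; the node and feature counts follow Cora in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(shape):
    """Xavier/Glorot uniform initialization (Glorot and Bengio, 2010)."""
    bound = np.sqrt(6.0 / sum(shape))
    return rng.uniform(-bound, bound, size=shape)

# Cora-sized example; the hidden size (hid) is a hypothetical choice
n, d, hid, m = 2708, 1433, 16, 7
params = {
    "W0": xavier((d, hid)), "b0": np.zeros(hid),   # first of K = 2 GCN layers
    "W1": xavier((hid, m)), "b1": np.zeros(m),     # second (output) layer
    "mu": xavier((n, m)),                          # label distributions, Xavier init
    "sigma_inv": np.ones((n, m)),                  # Sigma^{-1} modeled directly, identity init
}
learning_rate = 0.01                               # Adam optimizer, lr = 0.01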

6.2 Baselines

For evaluating ConfGCN, we compare against the following baselines:

• Feat (Yang et al., 2016) takes only node features as input and ignores the graph structure.

• ManiReg (Belkin et al., 2006) is a framework for providing data-dependent geometric regularization.

• SemiEmb (Weston et al., 2008) augments deep architectures with semi-supervised regularizers to improve their training.

• LP (Zhu et al., 2003) is an iterative label propagation algorithm which propagates a node's labels to its neighboring unlabeled nodes according to their proximity.


Method                                  Citeseer     Cora         Pubmed       Cora-ML
LP (Zhu et al., 2003)                   45.3         68.0         63.0         -
ManiReg (Belkin et al., 2006)           60.1         59.5         70.7         -
SemiEmb (Weston et al., 2008)           59.6         59.0         71.1         -
Feat (Yang et al., 2016)                57.2         57.4         69.8         -
DeepWalk (Perozzi et al., 2014)         43.2         67.2         65.3         -
GGNN (Li et al., 2015)                  68.1         77.9         77.2         -
Planetoid (Yang et al., 2016)           64.9         75.7         75.7         -
Kipf-GCN (Kipf and Welling, 2016)       69.4 ± 0.4   80.9 ± 0.4   76.8 ± 0.2   85.7 ± 0.3
G-GCN (Marcheggiani and Titov, 2017)    69.6 ± 0.5   81.2 ± 0.4   77.0 ± 0.3   86.0 ± 0.2
GPNN (Liao et al., 2018)                68.1 ± 1.8   79.0 ± 1.7   73.6 ± 0.5   69.4 ± 2.3
GAT (Veličković et al., 2018)           72.5 ± 0.7   83.0 ± 0.7   79.0 ± 0.3   83.0 ± 0.8
ConfGCN (this paper)                    72.7 ± 0.8   82.0 ± 0.3   79.5 ± 0.5   86.5 ± 0.3

Table 2: Performance comparison of several methods for semi-supervised node classification on multiple benchmark datasets. ConfGCN performs consistently better across all the datasets. Baseline method performances on the Citeseer, Cora, and Pubmed datasets are taken from Liao et al. (2018); Veličković et al. (2018). We consider only the top-performing baseline methods on these datasets for evaluation on the Cora-ML dataset. Please refer to Section 7.1 for details.

• DeepWalk (Perozzi et al., 2014) learns node features by treating random walks in a graph as the equivalent of sentences.

• Planetoid (Yang et al., 2016) provides a transductive and inductive framework for jointly predicting the class label and neighborhood context of a node in the graph.

• GCN (Kipf and Welling, 2016) is a variant of convolutional neural networks used for semi-supervised learning on graph-structured data.

• G-GCN (Marcheggiani and Titov, 2017) is a variant of GCN with edge-wise gating to discard noisy edges during aggregation.

• GGNN (Li et al., 2015) is a generalization of the RNN framework which can be used for graph-structured data.

• GPNN (Liao et al., 2018) is a graph-partition-based algorithm which propagates information after partitioning large graphs into smaller subgraphs.

• GAT (Veličković et al., 2018) is a graph-attention-based method which provides different weights to different nodes by allowing nodes to attend to their neighborhood.

7 Results

In this section, we attempt to answer the following questions:

Q1. How does ConfGCN compare against existing methods for the semi-supervised node classification task? (Section 7.1)

Q2. How does the performance of the methods vary with increasing node degree and neighborhood label mismatch? (Section 7.2)

Q3. How does increasing the number of layers affect ConfGCN's performance? (Section 7.3)

Q4. What is the effect of ablating different terms in ConfGCN's loss function? (Section 7.4)

7.1 Node Classification

The evaluation results for semi-supervised node classification are summarized in Table 2. Results of all other baseline methods on the Cora, Citeseer, and Pubmed datasets are taken from Liao et al. (2018); Veličković et al. (2018) directly. For evaluation on the Cora-ML dataset, only the top-performing baselines from the other three datasets are considered. Overall, we find that ConfGCN outperforms all existing approaches consistently across all the datasets.

This may be attributed to ConfGCN's ability to model nodes' label distributions along with the confidence scores, which subdues the effect of noisy nodes during neighborhood aggregation. The lower performance of GAT (Veličković et al., 2018) compared to Kipf-GCN on Cora-ML shows that computing attention based on the hidden representation of nodes is not very helpful in suppressing noisy neighborhood nodes. We also observe that the performance of GPNN (Liao et al., 2018) suffers on the Cora-ML dataset. This is due to the fact that while propagating information between


[Figure 2: node classification accuracy of Kipf-GCN, GAT, and ConfGCN vs. (a) neighborhood label entropy (bins 0–0.3, 0.3–0.6, 0.6–0.85, 0.85–1.5) and (b) node degree (bins 1–2, 2–4, 4–7, 7–79).]

Figure 2: Plots of node classification accuracy vs. (a) neighborhood label entropy and (b) node degree. On the x-axis, we plot quartiles of (a) neighborhood label entropy and (b) degree, i.e., each bin has 25% of the samples in sorted order. Overall, we observe that ConfGCN performs better than Kipf-GCN and GAT at all levels of node entropy and degree. Please see Section 7.2 for details.

small subgraphs, the high label mismatch rate in Cora-ML (please see Table 1) leads to wrong information propagation. Hence, during the global propagation step, this error is further magnified.

7.2 Effect of Node Entropy and Degree on Performance

In this section, we provide an analysis of the performance of Kipf-GCN, GAT, and ConfGCN for node classification on the Cora-ML dataset, which has a higher label mismatch rate. We use neighborhood label entropy to quantify label mismatch, which for a node u is defined as follows.

$$\mathrm{NeighborLabelEntropy}(u) = -\sum_{l=1}^{m} p_{ul} \log p_{ul}, \qquad \text{where } p_{ul} = \frac{|\{ v \in \mathcal{N}(u) \mid \mathrm{label}(v) = l \}|}{|\mathcal{N}(u)|}.$$

Here, label(v) is the true label of node v. The results for neighborhood label entropy and node degree are summarized in Figures 2a and 2b, respectively. On the x-axis of these figures, we plot quartiles of label entropy and degree, i.e., each bin has 25% of the instances in sorted order. With increasing neighborhood label entropy, the node classification task is expected to become more challenging. We indeed see this trend in Figure 2a, where the performance of all the methods degrades with increasing neighborhood label entropy. However, ConfGCN performs comparatively better than the existing state-of-the-art approaches at all levels of node entropy.
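A small sketch of the neighborhood label entropy computation used for this analysis is given below. It is a plain NumPy restatement of the definition above; the data-structure choices (a dictionary of neighbor lists, integer label ids) are our assumptions.

```python
import numpy as np

def neighbor_label_entropy(neighbors, labels, num_classes):
    """Neighborhood label entropy of Section 7.2 for every node u.

    neighbors:   dict mapping node id -> iterable of neighbor ids N(u)
    labels:      (n,) integer array with the true label of each node
    num_classes: number of classes m
    """
    entropy = np.zeros(len(labels))
    for u, nbrs in neighbors.items():
        nbrs = list(nbrs)
        counts = np.bincount(labels[nbrs], minlength=num_classes)
        p = counts / max(len(nbrs), 1)             # p_ul = fraction of neighbors with label l
        nz = p > 0                                 # 0 log 0 treated as 0
        entropy[u] = -(p[nz] * np.log(p[nz])).sum()
    return entropy
```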

In the case of node degree also (Figure 2b), we find that ConfGCN performs better than Kipf-GCN and GAT at all quartiles of node degrees. Classifying sparsely connected nodes (first and second bins) is challenging as very little information is present in the node neighborhood. Performance improves with the availability of a moderate number of neighboring nodes (third bin), but a further increase in degree (fourth bin) results in the introduction of many potentially noisy neighbors, thereby affecting the performance of all the methods. For higher-degree nodes, ConfGCN gives an improvement of around 3% over GAT and Kipf-GCN. This shows that ConfGCN, through its use of label confidences, is able to give higher influence scores to relevant nodes in the neighborhood during aggregation while reducing the importance of the noisy ones.

7.3 Effect of Increasing Convolutional Layers

Recently, Xu et al. (2018) highlighted an unusual behavior of Kipf-GCN where its performance degrades significantly with an increasing number of layers. This is because of the increase in the number of influencing nodes with increasing layers, resulting in "averaging out" of information during aggregation. For comparison, we evaluate the performance of Kipf-GCN and ConfGCN on the Citeseer dataset with an increasing number of convolutional layers. The results are summarized in Figure 3. We observe that Kipf-GCN's performance degrades drastically with an increasing number of layers, whereas ConfGCN's decrease in performance is more gradual. This shows that confidence-based GCN helps in alleviating this problem. We also note that ConfGCN



Figure 3: Evaluation of Kipf-GCN and ConfGCN on the Citeseer dataset with an increasing number of GCN layers. Overall, ConfGCN outperforms Kipf-GCN, and while both methods' performance degrades with increasing layers, ConfGCN's degradation is more gradual than Kipf-GCN's abrupt drop. Please see Section 7.3 for details.


Figure 4: Performance comparison of different ablated versions of ConfGCN on the Citeseer dataset. These results justify the inclusion of the different terms in ConfGCN's objective function. Please see Section 7.4 for details.

outperforms Kipf-GCN at all layer levels.

7.4 Ablation Results

In this section, we evaluate the different ablated versions of ConfGCN by cumulatively eliminating terms from its objective function as defined in Section 5. The results on the Citeseer dataset are summarized in Figure 4. Overall, we find that each term in ConfGCN's loss function (Equation 5) helps in improving its performance, and the method performs best when all the terms are included.

8 Conclusion

In this paper, we present ConfGCN, a confidence-based Graph Convolutional Network which estimates label scores along with their confidences jointly in a GCN-based setting. In ConfGCN, the influence of one node on another during aggregation is determined using the estimated confidences and label scores, thus inducing anisotropic behavior in GCNs. We demonstrate the effectiveness of ConfGCN against state-of-the-art methods for the semi-supervised node classification task and analyze its performance in different settings. We make ConfGCN's source code available.

References

Athiwaratkun, B. and Wilson, A. G. (2018). On modeling hierarchical data via probabilistic order embeddings. In International Conference on Learning Representations.

Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., and Simaan, K. (2017). Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967. Association for Computational Linguistics.

Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434.

Bojchevski, A. and Günnemann, S. (2018). Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. In International Conference on Learning Representations.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2013). Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203.

Das, R., Zaheer, M., and Dyer, C. (2015). Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 795–804. Association for Computational Linguistics.

Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375.

Dos Santos, L., Piwowarski, B., and Gallinari, P. (2017). Gaussian embeddings for collaborative filtering. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, pages 1065–1068, New York, NY, USA. ACM.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Huang, Z., Wan, C., Probst, T., and Van Gool, L. (2017). Deep learning on Lie groups for skeleton-based action recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1243–1252. IEEE Computer Society.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907.

Lei, J. (2014). Classification with confidence. Biometrika, 101(4):755–769.

Li, M. and Sethi, I. K. (2006). Confidence-based active learning. IEEE Trans. Pattern Anal. Mach. Intell., 28(8):1251–1261.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. (2015). Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.

Liao, R., Brockschmidt, M., Tarlow, D., Gaunt, A., Urtasun, R., and Zemel, R. S. (2018). Graph partition neural networks for semi-supervised classification.

Marcheggiani, D. and Titov, I. (2017). Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515. Association for Computational Linguistics.

Orbach, M. and Crammer, K. (2012). Graph-based transduction with confidence. In ECML/PKDD.

Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 701–710, New York, NY, USA. ACM.

Sen, P., Namata, G. M., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3):93–106.

Subramanya, A. and Talukdar, P. P. (2014). Graph-based semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(4):1–125.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. (2015). LINE: Large-scale information network embedding. In WWW. ACM.

Vashishth, S., Dasgupta, S. S., Ray, S. N., and Talukdar, P. (2018a). Dating documents using graph convolution networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1605–1615. Association for Computational Linguistics.

Vashishth, S., Joshi, R., Prayaga, S. S., Bhattacharyya, C., and Talukdar, P. (2018b). RESIDE: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266. Association for Computational Linguistics.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks. International Conference on Learning Representations. Accepted as poster.

Vilnis, L. and McCallum, A. (2014). Word representations via Gaussian embedding. CoRR, abs/1412.6623.

Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1168–1175, New York, NY, USA. ACM.

Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. (2018). Representation learning on graphs with jumping knowledge networks. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5453–5462, Stockholmsmässan, Stockholm, Sweden. PMLR.

Yadav, P., Nimishakavi, M., Yadati, N., Rajkumar, A., and Talukdar, P. (2019). Lovász convolutional networks. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Yang, Z., Cohen, W. W., and Salakhutdinov, R. (2016). Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pages 40–48. JMLR.org.


Yi, L., Su, H., Guo, X., and Guibas, L. (2016). SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. arXiv preprint arXiv:1612.00606.

Zhang, Y., Qi, P., and Manning, C. D. (2018). Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '18), Brussels, Belgium.

Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML'03, pages 912–919. AAAI Press.