DeepInf: Social Influence Prediction with Deep Learning
Jiezhong Qiu†, Jian Tang♯♮, Hao Ma‡, Yuxiao Dong‡, Kuansan Wang‡, and Jie Tang†∗
†Department of Computer Science and Technology, Tsinghua University
‡Microsoft Research, Redmond
♯HEC Montreal, Canada
♮Montreal Institute for Learning Algorithms, Canada
Keywords: Graph Attention; Social Influence; Social Networks
ACM Reference Format:
Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. 2018. DeepInf: Social Influence Prediction with Deep Learning. In KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3219819.3220077
1 INTRODUCTION
Social influence is everywhere around us, not only in our daily
physical life but also on the virtual Web space. The term social in-
fluence typically refers to the phenomenon that a person’s emotions,
opinions, or behaviors are affected by others. With the global pene-
tration of online and mobile social platforms, people have witnessed
the impact of social influence in every field, such as presidential elec-
tions [7], advertising [3, 24], and innovation adoption [42]. To date,
there is little doubt that social influence has become a prevalent,
yet complex force that drives our social decisions, making a clear
need for methodologies to characterize, understand, and quantify
the underlying mechanisms and dynamics of social influence.
Indeed, extensive work has been done on social influence predic-
tion in the literature [26, 32, 42, 43]. For example, Matsubara et al.
[32] studied the dynamics of social influence by carefully design-
ing differential equations extended from the classic ‘Susceptible-
Infected’ (SI) model. Most recently, Li et al. [26] proposed an end-to-
end predictor for inferring cascade size by incorporating recurrent
neural network (RNN) and representation learning. All these ap-
proaches mainly aim to predict the global or aggregated patterns
of social influence such as the cascade size within a time-frame.
However, in many online applications such as advertising and rec-
ommendation, it is critical to effectively predict the social influence
for each individual, i.e., user-level social influence prediction.
In this paper, we focus on the prediction of user-level social
influence. We aim to predict the action status of a user given the
action statuses of her near neighbors and her local structural infor-
mation. For example, in Figure 1, for the central user v , if some of
her friends (black circles) bought a product, will she buy the same
product in the future? The problem mentioned above is prevalent
in real-world applications whereas its complexity and non-linearity
have frequently been observed, such as the “S-shaped” curve in [2]
and the celebrated “structural diversity” in [46].

Figure 1: A motivating example of social influence locality prediction. The goal is to predict v’s action status, given 1) the observed action statuses (black and gray circles are used to indicate “active” and “inactive”, respectively) of her near neighbors and 2) the local network she is embedded in.

The above obser-
vations inspire a lot of user-level influence prediction models, most
of which [27, 53, 54] consider complicated hand-crafted features,
which require extensive knowledge of specific domains and are
usually difficult to generalize to different domains.
Inspired by the recent success of neural networks in representa-
tion learning, we design an end-to-end approach to discover hidden
and predictive signals in social influence automatically. By archi-
tecting network embedding [37], graph convolution [25], and graph
attention mechanism [49] into a unified framework, we expect that
the end-to-end model can achieve better performance than conven-
tional methods with feature engineering. Specifically, we propose
a deep learning based framework, DeepInf, to represent both in-
fluence dynamics and network structures into a latent space. To
predict the action status of a user v , we first sample her local neigh-
bors through random walks with restart. After obtaining a local
network as shown in Figure 1, we leverage both graph convolution
and attention techniques to learn latent predictive signals.
We demonstrate the effectiveness and efficiency of our proposed
framework on four social and information networks from different
domains—Open Academic Graph (OAG), Digg, Twitter, and Weibo.
We compare DeepInf with several conventional methods such as
linear models with hand-crafted features [54] as well as state-of-
the-art approaches. Experimental results show that the DeepInf
model can significantly improve the prediction
performance, demonstrating the promise of representation learning
for social and information network mining tasks.
Organization The rest of this paper is organized as follows: Sec-
tion 2 formulates the social influence prediction problem. Section 3
introduces the proposed framework in detail. In Sections 4 and 5, we
conduct extensive experiments and case studies. Finally, Section 6
summarizes related work and Section 7 concludes this work.
2 PROBLEM FORMULATION
In this section, we introduce necessary definitions and then formu-
late the problem of predicting social influence.
Definition 2.1. r-neighbors and r-ego network. Let $G = (V, E)$
be a static social network, where $V$ denotes the set of users and
$E \subseteq V \times V$ denotes the set of relationships². For a user $v$, its $r$-neighbors
are defined as $\Gamma_v^r = \{u : d(u, v) \le r\}$, where $d(u, v)$ is the
shortest path distance (in terms of the number of hops) between
$u$ and $v$ in the network $G$. The $r$-ego network of user $v$ is the sub-network
induced by $\Gamma_v^r$, denoted by $G_v^r$.
Definition 2.2. Social Action. Users in social networks perform
social actions, such as retweeting. At each timestamp $t$, we observe a
binary action status of user $u$, $s_u^t \in \{0, 1\}$, where $s_u^t = 1$ indicates that
user $u$ has performed this action at or before timestamp $t$, and
$s_u^t = 0$ indicates that the user has not performed this action yet.
Such action logs are available from many social networks, e.g.,
the “retweet” action on Twitter and the citation action in academic
social networks.
Given the above definitions, we introduce social influence local-
ity, which amounts to a kind of closed world assumption: users’
social decisions and actions are influenced only by their near neigh-
bors within the network, while external sources are assumed to be
absent.
Problem 1. Social Influence Locality [53]. Social influence locality
models the probability of $v$'s action status conditioned on her $r$-ego
network $G_v^r$ and the action statuses of her $r$-neighbors. More formally,
given $G_v^r$ and $S_v^t = \{s_u^t : u \in \Gamma_v^r \setminus \{v\}\}$, social influence locality aims
to quantify the activation probability of $v$ after a given time interval $\Delta t$:

$$P\left(s_v^{t+\Delta t} \,\middle|\, G_v^r, S_v^t\right).$$
Practically, suppose we have $N$ instances, where each instance is a
3-tuple $(v, a, t)$: $v$ is a user, $a$ is a social action, and $t$ is a
timestamp. For such a 3-tuple $(v, a, t)$, we also know $v$'s $r$-ego
network $G_v^r$, the action statuses of $v$'s $r$-neighbors $S_v^t$, and $v$'s
future action status at $t + \Delta t$, i.e., $s_v^{t+\Delta t}$. We then formulate social
influence prediction as a binary graph classification problem, which
can be solved by minimizing the following negative log-likelihood
objective w.r.t. model parameters $\Theta$:

$$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \log P_\Theta\left(s_{v_i}^{t_i+\Delta t} \,\middle|\, G_{v_i}^r, S_{v_i}^{t_i}\right). \quad (1)$$

In particular, in this work we assume $\Delta t$ is sufficiently large; that is,
we want to predict the action status of the ego user $v$ at the end of
our observation window.
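As a concrete reading of Eq. 1, the following minimal Python sketch (our own illustration, not code from the paper) evaluates the negative log-likelihood over a batch of instances, given the model-predicted activation probabilities and the observed future action statuses.

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Eq. 1: sum of -log P_Theta(s | G, S) over N instances.

    probs  -- predicted activation probabilities P_Theta, shape (N,)
    labels -- observed future action statuses s in {0, 1}, shape (N,)
    """
    probs = np.clip(probs, eps, 1.0 - eps)           # numerical stability
    # probability the model assigned to the label that was actually observed
    p_observed = np.where(labels == 1, probs, 1.0 - probs)
    return -np.sum(np.log(p_observed))

# toy usage: three instances, two of which were eventually activated
print(negative_log_likelihood(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1])))
```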
3 MODEL FRAMEWORK
In this section, we formally propose DeepInf, a deep learning based
model, to parameterize the probability in Eq. 1 and automatically
detect the mechanisms and dynamics of social influence. The frame-
work first samples a fixed-size sub-network as a proxy for each
r -ego network (see Section 3.1). The sampled sub-networks are
then fed into a deep neural network with mini-batch learning (see
Section 3.2). Finally, the model output is compared with ground
truth to minimize the negative log-likelihood loss.
3.1 Sampling Near Neighbors
Given a user $v$, a straightforward way to extract her $r$-ego network
$G_v^r$ is to perform Breadth-First Search (BFS) starting from user $v$.
² In this work, we consider undirected relationships.
Figure 2: Model framework of DeepInf. (a) Raw input, which consists of a mini-batch of B instances; each instance is a sub-network comprised of n users sampled using random walk with restart as described in Section 3.1. In this example, we focus on the ego user v (marked in blue) and one of her active neighbors u (marked in orange). (b) An embedding layer which maps each user to her D-dimensional representation. (c) An Instance Normalization layer [47]. For each instance, this layer normalizes users' embeddings $x_u$ according to Eq. 3; the output embeddings $y_u$ have zero mean and unit variance within each instance. (d) The formal input layer, which concatenates the network embedding, two dummy features (one indicates whether the user is active, the other indicates whether the user is the ego), and other customized vertex features (see Table 2 for an example). (e) A GCN or GAT layer. $a_{vv}$ and $a_{vu}$ indicate the attention coefficients along the self-loop (v, v) and the edge (v, u), respectively; the value of these attention coefficients is given by Eq. 5 or Eq. 7, according to the choice between GCN and GAT. (f) and (g) Comparing the model output with the ground truth, we obtain the negative log-likelihood loss. In this example, the ego user v was finally activated (marked in black).
However, for different users, $G_v^r$'s may have different sizes. Mean-
while, the size (in terms of the number of vertices) of $G_v^r$ can be
very large due to the small-world property in social networks [50].
Such variably sized data is unsuited to most deep learning mod-
els. To remedy these issues, we sample a fixed-size sub-network
from $v$'s $r$-ego network, instead of directly dealing with the $r$-ego
network.
A natural choice of the sampling method is to perform random
walk with restart (RWR) [45]. Inspired by [2, 46], which suggest
that people are more likely to be influenced by active neighbors
than by inactive ones, we start random walks from either the ego user
$v$ or one of her active neighbors, chosen at random. Next, the random walk
iteratively travels to a neighbor with probability proportional
to the weight of each edge. Besides, at each step,
the walk is assigned a probability to return to the starting node,
that is, either the ego user $v$ or one of $v$'s active neighbors. The
RWR runs until it successfully collects a fixed number of vertices,
denoted by $\hat{\Gamma}_v^r$ with $|\hat{\Gamma}_v^r| = n$. We then regard the sub-network $\hat{G}_v^r$
induced by $\hat{\Gamma}_v^r$ as a proxy of the $r$-ego network $G_v^r$, and denote
$\hat{S}_v^t = \{s_u^t : u \in \hat{\Gamma}_v^r \setminus \{v\}\}$ to be the action statuses of $v$'s sampled
neighbors. Therefore, we re-define the optimization objective in
Eq. 1 to be:

$$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \log P_\Theta\left(s_{v_i}^{t_i+\Delta t} \,\middle|\, \hat{G}_{v_i}^r, \hat{S}_{v_i}^{t_i}\right). \quad (2)$$
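To make the sampling procedure concrete, here is a minimal sketch of random walk with restart over an adjacency-list graph. The function name, the uniform step over neighbors (the paper weights steps by edge weight), and the default return probability are our own assumptions for illustration.

```python
import random

def sample_rwr_subnetwork(adj, ego, active_neighbors, n, restart_prob=0.8):
    """Collect n distinct vertices around `ego` via random walk with restart.

    adj              -- dict: vertex -> list of neighbors (undirected, unweighted here)
    ego              -- the ego user v
    active_neighbors -- v's neighbors that already performed the action
    n                -- number of vertices to collect
    """
    starts = [ego] + list(active_neighbors)
    collected = {ego}
    current = random.choice(starts)
    while len(collected) < n:
        if random.random() < restart_prob or not adj[current]:
            current = random.choice(starts)        # restart at ego or an active neighbor
        else:
            current = random.choice(adj[current])  # uniform step (weighted in general)
        collected.add(current)
    return collected

# toy usage on a small graph
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
print(sample_rwr_subnetwork(adj, ego=0, active_neighbors=[1], n=4))
```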
3.2 Neural Network Model
With the retrieved $\hat{G}_v^r$ and $\hat{S}_v^t$ for each user, we design an effective
neural network model to incorporate both the structural properties
in $\hat{G}_v^r$ and the action statuses in $\hat{S}_v^t$. The output of the neural network
model is a hidden representation for the ego user $v$, which is then
used to predict her action status $s_v^{t+\Delta t}$. As shown in Figure 2, the
proposed neural network model consists of a network embedding
layer, an instance normalization layer, an input layer, several graph
convolutional or graph attention layers, and an output layer. In this
section, we introduce these layers one by one and build the model
step by step.
Embedding Layer With the recent emergence of representation
learning [5], the network embedding technique has been exten-
sively studied to discover and encode network structural properties
into a low-dimensional latent space. More formally, network embed-
ding learns an embedding matrix $X \in \mathbb{R}^{D \times |V|}$, with each column
corresponding to the representation of a vertex (user) in the net-
work G. In the proposed model, we use a pre-trained embedding
layer which maps a user $u$ to her $D$-dimensional representation
$x_u \in \mathbb{R}^D$, as shown in Figure 2(b).
Instance Normalization [47] Instance normalization is a re-
cently proposed technique in image style transfer [47]. We adopt
this technique in our social influence prediction task. As shown in
Figure 2(c), for each user $u \in \hat{\Gamma}_v^r$, after retrieving her representa-
tion $x_u$ from the embedding layer, the instance-normalized embedding $y_u$ is
given by

$$y_{ud} = \frac{x_{ud} - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}} \quad (3)$$

for each embedding dimension $d = 1, \cdots, D$, where

$$\mu_d = \frac{1}{n} \sum_{u \in \hat{\Gamma}_v^r} x_{ud}, \qquad \sigma_d^2 = \frac{1}{n} \sum_{u \in \hat{\Gamma}_v^r} (x_{ud} - \mu_d)^2. \quad (4)$$

Here $\mu_d$ and $\sigma_d^2$ are the mean and variance, and $\epsilon$ is a small number
for numerical stability. Intuitively, such normalization can remove
instance-specific mean and variance, which encourages the down-
stream model to focus on users’ relative positions in latent em-
bedding space rather than their absolute positions. As we will see
later in Section 5, instance normalization can help avoid overfitting
during training.
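The following minimal NumPy sketch applies Eq. 3 and Eq. 4 to the embeddings of one sampled instance; the function name and the value of ϵ are our own choices.

```python
import numpy as np

def instance_normalize(X, eps=1e-5):
    """Eq. 3-4: per-instance normalization of user embeddings.

    X -- embeddings of the n sampled users of ONE instance, shape (n, D)
    Returns Y with zero mean and (near) unit variance per dimension d.
    """
    mu = X.mean(axis=0, keepdims=True)                  # Eq. 4: mean over the n users
    var = ((X - mu) ** 2).mean(axis=0, keepdims=True)   # Eq. 4: variance over the n users
    return (X - mu) / np.sqrt(var + eps)                # Eq. 3

# toy usage: 4 sampled users with 3-dimensional embeddings
Y = instance_normalize(np.random.randn(4, 3) * 5 + 10)
print(Y.mean(axis=0), Y.std(axis=0))   # roughly 0 mean, 1 std per dimension
```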
Input Layer As illustrated in Figure 2(d), the input layer con-
structs a feature vector for each user. Besides the normalized low-
dimensional embedding from the upstream instance normalization
layer, it also considers two binary variables. The first variable
indicates users’ action statuses, and the other indicates whether the
user is the ego user. Also, the input layer covers all other customized
vertex features such as structural features, content features, and
demographic features.
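As an illustration (not the paper's code), the sketch below assembles the per-user input vector described above by concatenating the normalized embedding, the two binary indicators, and any customized vertex features.

```python
import numpy as np

def build_input_features(norm_emb, is_active, is_ego, vertex_feats):
    """Concatenate, per user: normalized embedding | active flag | ego flag | other features.

    norm_emb     -- (n, D) output of instance normalization
    is_active    -- (n,)   1 if the user already performed the action, else 0
    is_ego       -- (n,)   1 for the ego user v, else 0
    vertex_feats -- (n, F) customized features (e.g., structural, content, demographic)
    """
    return np.concatenate(
        [norm_emb, is_active[:, None], is_ego[:, None], vertex_feats], axis=1
    )

# toy usage: 4 users, 3-dim embedding, 2 extra vertex features
x = build_input_features(np.zeros((4, 3)), np.array([0, 1, 1, 0]),
                         np.array([1, 0, 0, 0]), np.ones((4, 2)))
print(x.shape)  # (4, 7)
```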
GCN [25] Based Network Encoding Graph Convolutional Net-
work (GCN) is a semi-supervised learning algorithm for graph-
structured data. The GCN model is built by stacking multiple GCN
layers. The input to each GCN layer is a vertex feature matrix
$H \in \mathbb{R}^{n \times F}$, where $n$ is the number of vertices and $F$ is the number
of features. Each row of $H$, denoted by $h_i^\top$, is associated with a ver-
tex. Generally speaking, the essence of the GCN layer is a nonlinear
transformation that outputs $H' \in \mathbb{R}^{n \times F'}$ as follows:

$$H' = \mathrm{GCN}(H) = g\left(A(G)\, H\, W^\top + b\right), \quad (5)$$

where $W \in \mathbb{R}^{F' \times F}$ and $b \in \mathbb{R}^{F'}$ are model parameters, $g$ is a non-linear
activation function, and $A(G)$ is an $n \times n$ matrix that captures the structural
information of graph $G$. GCN instantiates $A(G)$ to be a static matrix
closely related to the normalized graph Laplacian [10]:

$$A_{\mathrm{GCN}}(G) = D^{-1/2} A D^{-1/2}, \quad (6)$$

where $A$ is the adjacency matrix³ of $G$, and $D = \mathrm{diag}(A\mathbf{1})$ is the
degree matrix.
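A minimal NumPy sketch of a single GCN layer following Eq. 5 and Eq. 6, with the self-loop trick from footnote 3; the ReLU activation and the toy initialization are our own assumptions.

```python
import numpy as np

def gcn_layer(A, H, W, b):
    """One GCN layer: H' = g(A_GCN(G) H W^T + b), Eq. 5-6.

    A -- (n, n) binary adjacency matrix of the sampled sub-network
    H -- (n, F) input vertex features
    W -- (F', F) weight matrix, b -- (F',) bias
    """
    A_hat = A + np.eye(A.shape[0])                     # footnote 3: add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_gcn = D_inv_sqrt @ A_hat @ D_inv_sqrt            # Eq. 6: D^{-1/2} A D^{-1/2}
    return np.maximum(A_gcn @ H @ W.T + b, 0.0)        # Eq. 5 with g = ReLU (our choice)

# toy usage: 3-vertex path graph, 2 input features, 4 output features
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.randn(3, 2)
print(gcn_layer(A, H, np.random.randn(4, 2) * 0.1, np.zeros(4)).shape)  # (3, 4)
```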
Multi-head Graph Attention [49] Graph Attention (GAT) is a
recently proposed technique that introduces the attention mechanism
into GCN. GAT defines the matrix $A_{\mathrm{GAT}}(G) = [a_{ij}]_{n \times n}$ through a self-attention
mechanism. More formally, an attention coefficient $e_{ij}$ is
first computed by an attention function $\mathrm{attn} : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$,
which measures the importance of vertex $j$ to vertex $i$:

$$e_{ij} = \mathrm{attn}\left(W h_i, W h_j\right).$$

Different from traditional self-attention mechanisms, where the attention
coefficients between all pairs of instances are computed,
GAT only evaluates $e_{ij}$ for $(i, j) \in E(\hat{G}_v^r)$ or $i = j$, i.e., $(i, j)$ is either
an edge or a self-loop. In doing so, it is able to better leverage and
capture the graph structural information. After that, to make coefficients
comparable among vertices, a softmax function is adopted
to normalize the attention coefficients:

$$a_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \Gamma_i^1} \exp(e_{ik})}. \quad (7)$$

Following Velickovic et al. [49], the attention function is in-
stantiated with a dot product and a LeakyReLU [31, 51] nonlin-
earity. For an edge or a self-loop $(i, j)$, the dot product is per-
formed between a parameter vector $c$ and the concatenation of the
feature vectors of the two end points, $W h_i$ and $W h_j$, i.e.,
$e_{ij} = \mathrm{LeakyReLU}\left(c^\top \left[W h_i \,\|\, W h_j\right]\right)$, where the LeakyReLU
has a small negative input slope.

³ GCN applies the self-loop trick on graph $G$ by adding a self-loop to each vertex, i.e., $A \leftarrow A + I$.
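To make Eq. 7 concrete, the sketch below computes the masked attention coefficients of one GAT head on a small graph; the LeakyReLU negative slope of 0.2 and the single-head setup are our own assumptions.

```python
import numpy as np

def gat_attention(A, H, W, c, neg_slope=0.2):
    """Attention coefficients a_ij of one GAT head (Eq. 7).

    A -- (n, n) adjacency matrix; attention is restricted to edges and self-loops
    H -- (n, F) input vertex features; W -- (F', F); c -- (2F',) attention parameters
    """
    n = A.shape[0]
    Z = H @ W.T                                    # W h_i for every vertex, shape (n, F')
    p, q = c[: Z.shape[1]], c[Z.shape[1]:]         # split c^T = [p^T q^T]
    e = Z @ p[:, None] + (Z @ q[:, None]).T        # e_ij = p^T W h_i + q^T W h_j
    e = np.where(e > 0, e, neg_slope * e)          # LeakyReLU
    mask = (A + np.eye(n)) > 0                     # keep only edges and self-loops
    e = np.where(mask, e, -np.inf)
    a = np.exp(e - e.max(axis=1, keepdims=True))   # row-wise softmax, Eq. 7
    return a / a.sum(axis=1, keepdims=True)

# toy usage: triangle graph with a pendant vertex
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
a = gat_attention(A, np.random.randn(4, 3), np.random.randn(5, 3), np.random.randn(10))
print(a.round(2))   # each row sums to 1 over neighbors plus the self-loop
```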
Table 1: Summary of datasets. |V| and |E| indicate the number of vertices and edges in graph G = (V, E), while N is the number of social influence locality instances (observations) as described in Section 2.
Table 6: Prediction performance of DeepInf-GAT (%) with/without vertex features as introduced in Table 2.

Data      Features   AUC     Prec.   Rec.    F1
OAG       ×          68.07   34.77   66.87   45.78
OAG       √          71.79   40.77   60.97   48.86
Digg      ×          89.39   68.52   72.85   70.62
Digg      √          90.65   66.82   78.49   72.19
Twitter   ×          78.30   47.24   65.36   54.84
Twitter   √          80.22   48.41   69.08   56.93
Weibo     ×          81.47   46.90   75.02   57.71
Weibo     √          82.72   48.53   76.09   59.27
methods. We attribute its inferiority to the homophily assumption
of GCN—similar vertices are more likely to link with each other
than dissimilar ones. Under such an assumption, for a specific vertex,
GCN computes its hidden representation by taking an unweighted
average over its neighbors’ representations. However, in our appli-
cation, the homophily assumption may not be true. By averaging
over neighbors, GCN may mix predictive signals with noise. On
the other hand, as pointed out by [2, 46], active neighbors are more
important than inactive neighbors, which also encourages us to use
graph attention which treats neighbors differently.
(4) In the experiments shown in Tables 3, 4, and 5, we still rely on
several vertex features, such as the PageRank score and the clustering
coefficient. However, we want to avoid using any hand-crafted fea-
tures and make DeepInf a “pure” end-to-end learning framework.
Quite surprisingly, we can still achieve comparable performance (as
shown in Table 6), even when we do not consider any hand-crafted fea-
tures except the pre-trained network embedding.
5.1 Parameter Analysis
In this section, we investigate how the prediction performance
varies with the hyper-parameters in sampling near neighbors and
the neural network model. We conduct the parameter analyses on
the Weibo dataset unless otherwise stated.
Return Probability of Random Walk with Restart When sam-
pling near neighbors, the return probability of random walk with
restart (RWR) controls the “shape” of the sampled r -ego network.
Figure 3(a) shows the prediction performance (in terms of AUC
and F1) by varying the return probability from 10% to 90%. As the
return probability increases, the prediction performance also
increases slightly, illustrating the locality pattern of social influence.
Size of Sampled Networks Another parameter that controls the
sampled r -ego network is the size of sampled networks. Figure 3(b)
shows the prediction performance (in terms of AUC and F1) by
varying the size from 10 to 100. We can observe a slow increase
of prediction performance when we sample more near neighbors.
This is not surprising because we have more information as the
size of sampled networks increases.
Negative Positive Ratio As we mentioned in Section 5, the pos-
itive and negative observations are imbalanced in our datasets. To
investigate how such imbalance influences the prediction perfor-
mance, we vary the ratio between negative and positive instances
from 1 to 10, and show the performance in Figure 3(c). We can
observe a decreasing trend w.r.t. the F1 measure, while the AUC
score stays stable.
#Head for Multi-head Attention Another hyper-parameter we
analyze is the number of heads used for multi-head attention. For a
fair comparison, we fix the total number of hidden units to 128.
We vary the number of heads to be 1, 2, 4, 8, 16, 32, 64, 128, i.e., each
head has 128, 64, 32, 16, 8, 4, 2, 1 hidden units, respectively. As shown
in Figure 3(d), we can see that DeepInf benefits from the multi-head
mechanism. However, as the number of hidden units associated
with each head decreases, the prediction performance drops.
Effect of Instance Normalization As claimed in Section 3, we
use an Instance Normalization (IN) layer to avoid overfitting, es-
pecially when the training set is small, e.g., Digg. Figure 4(a) and Fig-
ure 4(b) illustrate the training loss and test AUC of DeepInf-GAT on
the Digg dataset trained with and without IN layer. We can see that
IN significantly avoids overfitting and makes the training process
more robust.
5.2 Discussion on GAT and Case Study
Besides the concatenation-based attention used in GAT (Eq. 7), we
also try other popular attention mechanisms, e.g., the dot product
attention or the bilinear attention as summarized in [28]. How-
ever, those attention mechanisms do not perform as well as the
concatenation-based one. In this section, we introduce the order-
preserving property of GAT [49]. Based on the property, we attempt
to explain the effectiveness of DeepInf-GAT through case studies.
Observation 1. Order-preserving of Graph Attention. Suppose
$(i, j)$, $(i, k)$, $(i', j)$ and $(i', k)$ are either edges or self-loops, and $a_{ij}$,
$a_{ik}$, $a_{i'j}$, $a_{i'k}$ are the attention coefficients associated with them. If
$a_{ij} > a_{ik}$, then $a_{i'j} > a_{i'k}$.
Proof. As introduced in Eq. 7, the graph attention coefficient
for an edge (or self-loop) $(i, j)$ is defined as $a_{ij} = \mathrm{softmax}(e_{ij})$, where

$$e_{ij} = \mathrm{LeakyReLU}\left(c^\top \left[W h_i \,\|\, W h_j\right]\right).$$

If we rewrite $c^\top = \left[p^\top \; q^\top\right]$, we have

$$e_{ij} = \mathrm{LeakyReLU}\left(p^\top W h_i + q^\top W h_j\right).$$

Due to the strict monotonicity of softmax and LeakyReLU, $a_{ij} > a_{ik}$
implies $q^\top W h_j > q^\top W h_k$. Applying the strict monotonicity of
LeakyReLU and softmax again, we get $a_{i'j} > a_{i'k}$. □
The above observation shows the following fact: although each
vertex only pays attention to its neighbors in GAT (local attention),
the attention coefficients have a global ranking, which is determined
by $q^\top W h_j$ only. Thus we can define a score function $\mathrm{score}(j) = q^\top W h_j$.
Each vertex then pays attention to its neighbors according
to this score function; a higher score function value indicates a
higher attention coefficient. Thus, plotting the value of the scoring
function can illustrate where the “popular areas” or “important
areas” of the network are. Furthermore, multi-head attention provides
a multi-view mechanism: for $K$ heads, we have $K$ score functions,
$\mathrm{score}_k(j) = q_k^\top W_k h_j$, $k = 1, \cdots, K$, each highlighting a different area
of the network.
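As a small illustration of this global ranking, the sketch below (function names are ours) computes $\mathrm{score}_k(j) = q_k^\top W_k h_j$ for every head and vertex, which is exactly the quantity that determines each head's attention ordering.

```python
import numpy as np

def head_scores(H, W_heads, q_heads):
    """score_k(j) = q_k^T W_k h_j for every head k and vertex j.

    H       -- (n, F) vertex features
    W_heads -- list of K matrices, each (F', F)
    q_heads -- list of K vectors, each (F',)
    Returns a (K, n) array; row k gives head k's global ranking of vertices.
    """
    return np.stack([(H @ Wk.T) @ qk for Wk, qk in zip(W_heads, q_heads)])

# toy usage: 5 vertices, 3 features, 2 heads with 4 hidden units each
H = np.random.randn(5, 3)
scores = head_scores(H, [np.random.randn(4, 3)] * 2, [np.random.randn(4)] * 2)
print(scores.argsort(axis=1))  # per-head ordering of "important areas"
```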
Figure 3: Parameter analysis. (a) Return probability of random walk with restart. (b) Size of sampled networks. (c) Negative-positive ratio. (d) The number of heads used for multi-head attention.

Figure 4: The (a) training loss and (b) test AUC of DeepInf-GAT on the Digg dataset trained with and without Instance Normalization, vs. the number of epochs. Instance Normalization helps avoid overfitting.
To better illustrate this mechanism, we perform a few case studies.
As shown in Figure 5, we choose four instances from the Digg dataset
(each row corresponding to one instance) and select three representative
attention heads from the first GAT layer. Quite interestingly, we observe
explainable and heterogeneous patterns discovered by different attention
heads. For example, the first attention head tends to focus on the
ego user, while the second and the third highlight active users and
inactive users, respectively. However, this property does not hold
for other attention mechanisms. Due to the page limit, we do not
discuss them here.
6 RELATED WORK
Our study is closely related to a large body of literature on social
influence analysis [42] and graph representation learning [22, 37].
Social Influence Analysis Most existing work has focused on
social influence modeled as a macro-social process (a.k.a., cascade),
with a few that have explored the alternative user-level mecha-
nism that considers the locality of social influence in practice. At
the macro level, researchers are interested in global patterns of
social influence. Such global patterns include various aspects of
a cascade and their correlation with the final cascade size, e.g.,
the rise-and-fall patterns [32], external influence sources [33], and
conformity phenomenon [43]. Recently, there have been efforts to
detect those global patterns automatically using deep learning, e.g.,
the DeepCas model [26], which formulates cascade prediction as a
sequence problem and solves it with a recurrent neural network.
Figure 5: Case study. How different graph attention heads highlight different areas of the network. (a) Four selected cases from the Digg dataset. Active and inactive users are marked as black and gray, respectively. User v is the ego user that we are interested in. (b)(c)(d) Three representative attention heads.
Another line of studies focuses on the user-level mechanism in so-
cial influence where each user is only influenced by her near neigh-
bors. Examples of such work include pairwise influence [19, 39],
topic-level influence [42], group formation [2, 38] and structural
diversity [14, 29, 46]. Such user-level models act as fundamental
building blocks of many real-world problems and applications. For
example, in the influence maximization problem [24], both inde-
pendent cascade and linear threshold models assume a pairwise
influence model; In social recommendation [30], a key assumption
is social influence—the ratings and reviews of existing users will
influence future customers’ decisions through social interaction.
Another example is the large-scale field experiment on Facebook by Bond
et al. [7] during the 2010 US congressional elections, whose results
showed how online social influence changes offline voting behavior.
[17] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. JMLR 9, Aug (2008), 1871–1874.
[18] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS ’10. 249–256.
[19] Amit Goyal, Francesco Bonchi, and Laks VS Lakshmanan. 2010. Learning influence probabilities in social networks. In WSDM ’10. ACM, 241–250.
[20] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD ’16. ACM, 855–864.
[21] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS ’17. 1025–1035.
[22] William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. arXiv preprint arXiv:1709.05584 (2017).
[23] Tad Hogg and Kristina Lerman. 2012. Social dynamics of Digg. EPJ Data Science 1, 1 (2012), 5.
[24] David Kempe, Jon Kleinberg, and Éva Tardos. 2003. Maximizing the spread of influence through a social network. In KDD ’03. 137–146.
[25] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR ’17 (2017).
[26] Cheng Li, Jiaqi Ma, Xiaoxiao Guo, and Qiaozhu Mei. 2017. DeepCas: An end-to-end predictor of information cascades. In WWW ’17. 577–586.
[27] Huijie Lin, Jia Jia, Jiezhong Qiu, Yongfeng Zhang, Guangyao Shen, Lexing Xie, Jie Tang, Ling Feng, and Tat-Seng Chua. 2017. Detecting stress based on social interactions in social networks. TKDE 29, 9 (2017), 1820–1833.
[28] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. EMNLP ’15 (2015).
[29] Hao Ma. 2013. An experimental study on implicit social recommendation. In SIGIR ’13. ACM, 73–82.
[30] Hao Ma, Irwin King, and Michael R Lyu. 2009. Learning to recommend with social trust ensemble. In SIGIR ’09. ACM, 203–210.
[31] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML ’13. 3.
[32] Yasuko Matsubara, Yasushi Sakurai, B Aditya Prakash, Lei Li, and Christos Faloutsos. 2012. Rise and fall patterns of information diffusion: model and implications. In KDD ’12. 6–14.
[33] Seth A Myers, Chenguang Zhu, and Jure Leskovec. 2012. Information diffusion and external influence in networks. In KDD ’12. ACM, 33–41.
[34] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In ICML ’16. 2014–2023.
[35] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
[36] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In KDD ’14. ACM, 701–710.
[37] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM ’18. ACM, 459–467.
[38] Jiezhong Qiu, Yixuan Li, Jie Tang, Zheng Lu, Hao Ye, Bo Chen, Qiang Yang, and John E Hopcroft. 2016. The lifecycle and cascade of WeChat social messaging groups. In WWW ’16. 311–320.
[39] Kazumi Saito, Ryohei Nakano, and Masahiro Kimura. 2008. Prediction of information diffusion probabilities for independent cascade model. In KES ’08. Springer, 67–75.
[40] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In AISTATS ’09. 488–495.
[41] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW ’15. 1067–1077.
[42] Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. 2009. Social influence analysis in large-scale networks. In KDD ’09. ACM, 807–816.
[43] Jie Tang, Sen Wu, and Jimeng Sun. 2013. Confluence: Conformity influence in large social networks. In KDD ’13. ACM, 347–355.
[44] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: extraction and mining of academic social networks. In KDD ’08. 990–998.
[45] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast Random Walk with Restart and Its Applications. In ICDM ’06. 613–622.
[46] Johan Ugander, Lars Backstrom, Cameron Marlow, and Jon Kleinberg. 2012. Structural diversity in social contagion. PNAS 109, 16 (2012), 5962–5966.
[47] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022 (2016).
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS ’17. 6000–6010.
[49] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph Attention Networks. ICLR ’18 (2018).
[50] Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 6684 (1998), 440–442.
[51] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015).
[52] Pinar Yanardag and SVN Vishwanathan. 2015. Deep graph kernels. In KDD ’15. 1365–1374.
[53] Jing Zhang, Biao Liu, Jie Tang, Ting Chen, and Juanzi Li. 2013. Social Influence Locality for Modeling Retweeting Behaviors. In IJCAI ’13.
[54] Jing Zhang, Jie Tang, Juanzi Li, Yang Liu, and Chunxiao Xing. 2015. Who influenced you? Predicting retweet via social influence locality. TKDD 9, 3 (2015).