Research Article
SVD-CNN: A Convolutional Neural Network Model with Orthogonal Constraints Based on SVD for Context-Aware Citation Recommendation
Shaoyu Tao, Chaoyuan Shen, Li Zhu, and Tao Dai
School of Software Engineering, Xi'an Jiaotong University, Xi'an, Shanxi 710049, China
Correspondence should be addressed to Li Zhu; zhuli@xjtu.edu.cn
Received 27 November 2019; Revised 28 September 2020; Accepted 5 October 2020; Published 23 October 2020
Academic Editor: Giosue Lo Bosco
Copyright © 2020 Shaoyu Tao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Context-aware citation recommendation aims to automatically predict suitable citations for a given citation context, which is helpful for researchers when writing scientific papers. In existing neural network-based approaches, overcorrelation in the weight matrix influences semantic similarity, which is a difficult problem to solve. In this paper, we propose a novel context-aware citation recommendation approach that improves the orthogonality of the weight matrix and explores more accurate citation patterns. We quantitatively show that the various reference patterns in a paper have interactional features that can significantly affect link prediction. We conduct experiments on the CiteSeer datasets. The results show that our model is superior to baseline models in all metrics.
1. Introduction
Citation recommendation, which helps researchers quickly find appropriate relevant literature, is a rapidly developing research area [1]. Within this area, context-aware citation recommendation is a particular type that predicts citations for a citation context [2]. The citation context is usually a few sentences before and after the placeholder, such as "[]". The key problem for context-aware citation recommendation is how to measure the similarity between the citation context and a specific scientific paper.
Similar to other NLP tasks (e.g., information retrieval (IR) and text mining), the simplest solution for context-aware citation recommendation calculates the relevance score between a citation context and candidate papers via Euclidean distance [3] and then selects the salient citations. However, simple text similarity is too coarse to be a good measurement. In recent years, neural network models have been widely used to recommend documents due to their efficiency and effectiveness [4–7]. Neural network models can be regarded as better solutions than traditional machine learning methods because they simplify feature engineering and can deal with large-scale data. However, the weight vectors in existing neural network-based models are usually strongly correlated. In fact, a critical assumption of using similarity measurements such as Euclidean distance or cosine distance is that the entries in the feature vectors should be as independent as possible [8]. When the weight vectors are overcorrelated, some entries of the descriptor will dominate the measurement and cause poor ranking results. These problems seriously affect the performance of citation recommendation because citing activity appears to have strong orthogonality.
Assume there are three types of citations in a paper: "field-reference" (red color), "method-reference" (purple color), and "math-reference" (blue color). "Field-reference" usually appears in the introduction and cites scientific articles that use the same techniques in other research fields. "Method-reference" usually appears in related work and cites scientific articles solving the same task. "Math-reference" usually appears in the main part of the paper, which describes the researcher's method in detail, and its citations are more related to mathematical theorems. These three types of citations have strong orthogonality. In the neural network model, these three citation types are usually mapped into a matrix and can be seen as base vectors for inputs. As shown in Figure 1, vectors in the mapping matrix learned by traditional neural network models are not orthogonal. When a sample is mapped by $\vec{w}_1$, $\vec{w}_2$, and $\vec{w}_3$, $\vec{w}_1$ and $\vec{w}_3$ will dominate the output and consequently create low discriminative ability. A more satisfactory $\vec{w}_2'$ (yellow color) imposes orthogonality.
Hindawi, Computational Intelligence and Neuroscience, Volume 2020, Article ID 5343214, 12 pages, https://doi.org/10.1155/2020/5343214
To address the aforementioned problems, we propose a neural network model with orthogonal regularization for context-aware citation recommendation. Our model uses a CNN to extract the semantic features of the citation context and candidate papers. We then add an orthogonal constraint based on SVD to weaken the correlation of weight vectors in the FC layer, which allows the model to learn good interpretable features for citation contexts and papers. To the best of our knowledge, this is the first work that addresses context-aware citation recommendation with a CNN and orthogonal constraint framework. Experimental results show that our model significantly outperforms other baseline methods.
2. Related Work
2.1. Citation Recommendation. A variety of citation recommendation approaches have been proposed in the literature, including text similarity-based [9, 10], topic model-based [11, 12], probabilistic model-based [13], translation model-based [7], and collaborative filtering-based [14] approaches. Sun et al. [15] proposed a method for recommending appropriate papers to academic reviewers by using a similarity-based algorithm. Their method builds preference vectors for reviewers based on publication history and calculates the similarity between the preference vector and the candidate document vector. The literature with high similarity is recommended to the corresponding reviewers. Shaparenko and Joachims [16] considered the relevance of the citation context and the paper content and applied a language model to the recommendation task. Strohman et al. [17] showed that using text similarity alone was not ideal for recommending citations because scholars tend to coin new words to describe their own achievements, while two scholars who study the same topic may use different expressions for the same concept and method. To address this problem, Strohman et al. [17] regarded each document as a node in a directed graph to perform citation recommendation. They argued that a similarity measurement with reference information can reflect the reference situation of a node more faithfully. Livne et al. [18] proposed a citation recommendation method by coupling the enriched citation context of the literature and adopted various techniques, including machine learning, when making recommendations. Some works addressed the language gap between cited papers and citation contexts and attempted to use translation models or distributed semantic representations. Lu et al. [19] assumed that the languages used in the citation contexts and in the cited papers were different and used a translation model to solve this problem. He et al. [3] combined a language model, topic model, and feature model to find the appropriate citation context. Huang et al. [20] assumed that the appearance of cited papers was a particular language and represented the cited papers by unique IDs regarded as new "words." The probability of citing a paper given a citation context is directly estimated by using a translation model. Tang et al. [21] proposed a joint embedding model to learn a low-dimensional embedding space for both contexts and citations.
In recent years, neural networks have shown better performance in many fields. Some researchers have attempted to recommend citations by using neural networks. Huang et al. [4] learned a distributed word representation for the citation context and an associated document embedding via a feedforward neural network and then estimated the probability of citing a paper given a citation context. Tan et al. [5] proposed a neural network method based on LSTM to solve quote recommendation tasks. They focused on the characteristics of quotes and trained neural networks to bridge the language gap. A neural network model learned the semantic representations of arbitrary-length texts from a large corpus.
2.2. Orthogonal Constraint in Deep Learning. One of the greatest advantages of orthogonal matrices is that the norm of a matrix is unchanged when it is multiplied by an orthogonal matrix. This property is useful in gradient backpropagation, especially for dealing with exploding and vanishing gradient problems. Orthogonal regularization is widely used in many fields. Brock et al. [22] used orthogonal regularization to improve the generalization performance of image generation and editing tasks by using generative adversarial networks (GANs) [23]. They further expanded their work into BigGAN [24]. The results in their work showed that, by applying orthogonal regularization, the generator allows fine-tuning the tradeoff between fidelity and diversity of samples by truncating the hidden space, which enables the model to achieve the best performance in class-conditional image synthesis. Another advantage of orthogonal matrices is that they benefit deep representation learning. If the weight vectors of the fully connected layer in a convolutional neural network are highly correlated, the individual entries of each fully connected descriptor will also be highly correlated, which will greatly reduce retrieval performance. Sun et al. [25] proposed SVD-Net to show that constraining the weight vectors of the FC layer to be orthogonal can improve accuracy. Zheng et al. [26] reported that regularization is an efficient method for improving the generalization ability of deep CNNs because it makes it possible to train more complex models while keeping overfitting low. Zheng et al. [26] proposed a method for optimizing the feature boundary of a deep CNN through a two-stage training step to reduce the overfitting problem. However, the mixed features learned from a CNN potentially reduce the robustness of network models for identification or classification. To address this problem, Wang et al. [27] decomposed deep face features into two orthogonal components that represent age-related and identity-related features in order to learn age-invariant deep face features. In that model, age-invariant deep features can be effectively obtained to improve AIFR performance. Chen et al. [28] proposed a group orthogonal convolutional neural network (GoCNN) model based on the idea of learning different groups of convolutional functions that are "orthogonal" to those in other groups, i.e., with no significant correlation among the produced features. Optimizing orthogonality among convolutional functions reduces the redundancy and increases the diversity within the architecture. Moreover, it yields a single CNN model with sufficient inherent diversity, such that the model learns more diverse representations and has stronger generalization ability than vanilla CNNs.
Figure 1: Distribution of the weight vectors of the reference types in geometric space.
3. Proposed Method
3.1. Problem Formulation. Context-aware citation recommendation is defined as the matching task between a citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is a convolutional neural network with two inputs and orthogonal constraints. It consists of the following main steps:
(1) We adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple-granularity semantic features.
(2) Orthogonality is then imposed on the multiple-granularity semantic features by an SVD-FC layer.
(3) We use fully connected layers to obtain the final vector representation. The logistic function or an SVM is used to obtain the recommendation result.
3.2. Network Structure
3.2.1. Input Layer. Word2vec [29] is used to embed the input of our model. Each word is represented as a $d_0$-dimensional precomputed vector, where $d_0 = 300$. As a result, each sentence is represented as a feature matrix with dimension $d_0 \times s$. Through this layer, we can obtain the raw representation of citation context $c$ and candidate document $d$.
We also calculate the weight of common words according to the inputs. Then we can obtain the basic input features $\mathrm{TF\text{-}IDF}(c, d)$ for our model, which is the product of $\mathrm{TF}(w_c, d)$ and $\mathrm{IDF}$, to reflect how important a word in citation context $c$ is for a candidate document $d$ in the corpus [30]; $w_c$ is a word in citation context $c$. These two variables are calculated as follows:

$$ \mathrm{TF}(w_c, d) = \frac{\mathrm{count}(w_c, d)}{\mathrm{top}(w^*, d)}, \qquad \mathrm{IDF} = \log \frac{N}{\mathrm{docs}(w_c, D)} \tag{1} $$

where $\mathrm{count}(w_c, d)$ is the number of times the word $w_c$ appears in document $d$, $\mathrm{top}(w^*, d)$ is the occurrence count of the word $w^*$ that appears most frequently in the candidate document $d$, $\mathrm{docs}(w_c, D)$ is the number of documents containing the word $w_c$ among all candidate citations $D$, and $N$ is the total number of candidate citations.
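Equation (1) can be illustrated with a minimal sketch. The function names, the tokenized toy corpus, the natural logarithm, and the choice to return 0 for a word that occurs in no candidate document are assumptions made for illustration; they are not taken from the original implementation.

```python
import math
from collections import Counter


def tf(word, doc_tokens):
    """TF(w_c, d): count of `word` in d divided by the count of d's most frequent word."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    top = max(counts.values())
    return counts[word] / top


def idf(word, corpus):
    """IDF: log(N / docs(w_c, D)), where N is the number of candidate documents."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        return 0.0  # assumption: unseen words get zero weight
    return math.log(len(corpus) / containing)


def tf_idf(context_tokens, doc_tokens, corpus):
    """TF-IDF weight of every distinct context word with respect to one candidate document."""
    return {w: tf(w, doc_tokens) * idf(w, corpus) for w in set(context_tokens)}


corpus = [["svd", "cnn", "svd"], ["cnn", "lstm"], ["topic", "model"]]
weights = tf_idf(["svd", "cnn"], corpus[0], corpus)
```

Here `weights` maps each context word to its weight for the first candidate document.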
3.2.2. Convolution Layer. The inputs of the convolution layer are the feature matrices of citation context $c$ and document $d$. The process of this layer is demonstrated in Figure 3. We first pad the two inputs with zero vectors so that they have the same length $s = \max(|c|, |d|)$. For every input, let $v_1, v_2, \ldots, v_s$ be the words in a sentence. We define $g_i \in \mathbb{R}^{w \cdot d_0}$, $0 < i < s + w - 1$, as the concatenation of $v_{i-w}, \ldots, v_i$. Then this layer generates the feature $P_i \in \mathbb{R}^{d_1}$ for the phrase $v_{i-w}, \ldots, v_i$ as follows:

$$ P_i = \tanh(W \cdot g_i + b) \tag{2} $$

where $W \in \mathbb{R}^{d_1 \times w d_0}$ is a convolution kernel and $b \in \mathbb{R}^{d_1}$ is the bias.
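The convolution of equation (2) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: it assumes each window holds exactly $w$ consecutive word vectors and that the sentence is zero-padded by $w - 1$ columns on each side, which yields the $s + w - 1$ output columns mentioned in the pooling section. The function name `wide_convolution` is hypothetical.

```python
import numpy as np


def wide_convolution(X, W, b, w):
    """Apply P_i = tanh(W @ g_i + b) over all windows of w consecutive words.

    X: (d0, s) sentence matrix, one column per word.
    W: (d1, w * d0) convolution kernel; b: (d1,) bias.
    Returns a (d1, s + w - 1) feature map.
    """
    d0, s = X.shape
    # Zero-pad w - 1 columns on both sides (assumed padding scheme).
    padded = np.pad(X, ((0, 0), (w - 1, w - 1)))
    columns = []
    for i in range(s + w - 1):
        # g_i: concatenation of the w word vectors in the current window.
        g = padded[:, i:i + w].T.reshape(-1)
        columns.append(np.tanh(W @ g + b))
    return np.stack(columns, axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))           # d0 = 4, s = 5
W = rng.normal(size=(2, 3 * 4))       # d1 = 2, w = 3
b = np.zeros(2)
feature_map = wide_convolution(X, W, b, w=3)
```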
3.2.3. Average Pooling Layer. The pooling layer is usually used for feature compression. In our model, we choose average pooling because whole sentences or paragraphs can express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is "w-ap," which takes the column average over each window of $w$ consecutive columns. After the convolution layer, an $s$-column feature map is converted into a new $(s + w - 1)$-column feature map. By using "w-ap," the new feature map is restored to $s$ columns. This architecture facilitates the extraction of more useful abstract features.
The second one is "all-ap," which averages over all columns. As shown in Figure 5, "all-ap" generates a representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document.
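The two pooling operations can be sketched as follows. This is a minimal illustration assuming a feature map stored as a NumPy array with one column per position; the function names `w_ap` and `all_ap` simply mirror the layer names in the text.

```python
import numpy as np


def w_ap(F, w):
    """Window average pooling: (d1, s + w - 1) -> (d1, s).

    Output column i is the mean of input columns i, ..., i + w - 1.
    """
    n_cols = F.shape[1] - w + 1
    return np.stack([F[:, i:i + w].mean(axis=1) for i in range(n_cols)], axis=1)


def all_ap(F):
    """All average pooling: mean over every column, giving one vector per feature map."""
    return F.mean(axis=1)


F = np.arange(12, dtype=float).reshape(2, 6)  # toy feature map with s + w - 1 = 6
pooled = w_ap(F, w=3)                         # back to s = 4 columns
summary = all_ap(F)
```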
Now we can obtain the features of the citation context and, independently, the features of the cited document. The next step is to obtain the semantic relationship between the citation context and the candidate paper. We use cosine similarity to measure the semantic relation:

$$ \mathrm{sim}_j = \frac{\sum_{i=0}^{d_j} \left( C_{ji} \times D_{ji} \right)}{\sqrt{\sum_{i=0}^{d_j} \left( C_{ji} \right)^2 \times \sum_{i=0}^{d_j} \left( D_{ji} \right)^2}}, \quad j \in [1, 10] \tag{3} $$
Figure 2: An overview of our model (components shown: citation context and document inputs, word2vec, stacked convolution and w-ap blocks with all-ap outputs from the 1st to the 10th layer, splice of basic features and all-ap features, SVD-FC, FC, and logistic/SVM output).
Figure 3: Convolution extraction generates phrases.
Figure 4: "W-ap" structure.
Figure 5: "All-ap" structure.
where $C_j$ and $D_j$ are the distributed representations of the citation context and the candidate document after the $j$-th "all-ap" layer, respectively. A total of ten "all-ap" layers are used in our model; therefore, $j$ belongs to $[1, 10]$. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all $\mathrm{sim}_j$ and the basic features. It is then fed into the SVD-FC layer.
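The similarity features of equation (3) can be computed as in the sketch below. The helper names and the small `eps` guard against zero vectors are assumptions added for illustration.

```python
import numpy as np


def cosine_similarity(c, d, eps=1e-12):
    """sim_j for one pair of all-ap vectors (C_j, D_j), following equation (3)."""
    denom = np.sqrt(np.sum(c ** 2) * np.sum(d ** 2))
    # eps avoids division by zero when either vector is all zeros (assumption).
    return float(np.dot(c, d) / max(denom, eps))


def multi_granularity_similarities(context_vecs, doc_vecs):
    """One similarity per all-ap layer; the text uses ten layers, j in [1, 10]."""
    return np.array([cosine_similarity(c, d) for c, d in zip(context_vecs, doc_vecs)])


sims = multi_granularity_similarities(
    [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 0.0])],
    [np.array([2.0, 0.0]), np.array([0.0, 3.0]), np.array([-1.0, 0.0])],
)
```

The resulting vector would then be concatenated with the basic TF-IDF features before the SVD-FC layer.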
In most cases, we find that using the outputs of all pooling layers as the input of the SVD-FC layer improves performance. The reason is that features from different layers represent different levels of semantics. Neglecting any layer will cause information loss.
Next, we use the SVD-FC layer to learn the nonlinear combination features of citation relationships. This layer can force the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer can also reduce the negative impact of excessive parameters.
3.2.4. SVD-FC Layer. In this layer, we use SVD to factorize the weight matrix $W$ ($W = USV^T$) and replace it with $US$. Our experimental results show that this replacement can reduce the negative impact on the sample space.
The Euclidean distance between samples can be used to measure whether their feature expression changes in a sample space. Denoting $e_m$ and $e_n$ as the feature maps of two different samples, we can obtain two different outputs of the fully connected operation by using the weight matrix $W$ or $US$ as follows:
$$ p = e \times W \tag{4} $$

$$ q = e \times US \tag{5} $$

As seen in the above equations, $q$ is the orthogonalized output, while $p$ is unorthogonalized. Then we can obtain the following theorem.
Theorem 1. $p$ and $q$ in equations (4) and (5) generate the same Euclidean distance for samples $e_m$ and $e_n$.

Proof. The Euclidean distance $L$ between $p_m$ and $p_n$ is calculated as follows:

$$ L = \left\| \vec{p}_m - \vec{p}_n \right\|_2 = \sqrt{\left( \vec{e}_m - \vec{e}_n \right)^T W W^T \left( \vec{e}_m - \vec{e}_n \right)} = \sqrt{\left( \vec{e}_m - \vec{e}_n \right)^T U S V^T V S^T U^T \left( \vec{e}_m - \vec{e}_n \right)} \tag{6} $$

Since $V$ is an orthogonal matrix, equation (6) is equivalent to

$$ L = \sqrt{\left( \vec{e}_m - \vec{e}_n \right)^T U S S^T U^T \left( \vec{e}_m - \vec{e}_n \right)} = \sqrt{\left( \vec{q}_m - \vec{q}_n \right)^T \left( \vec{q}_m - \vec{q}_n \right)} = \left\| \vec{q}_m - \vec{q}_n \right\|_2 \tag{7} $$

It can be seen that $\left\| \vec{p}_m - \vec{p}_n \right\|_2 = \left\| \vec{q}_m - \vec{q}_n \right\|_2$.
It should be noted that replacing the weight has no negative impact on the entire sample space and does not change its discrimination ability. As shown in Figure 7, we use the SVD of the weight matrix $W$ to map the feature map to an orthogonal linear space.
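Theorem 1 can be checked numerically with a short NumPy sketch. The matrix sizes and random inputs are arbitrary assumptions, and samples are treated as row vectors so that $p = e \times W$ and $q = e \times US$ as in equations (4) and (5).

```python
import numpy as np


def replace_with_us(W):
    """Return US from the reduced SVD W = U S V^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U * s  # scales column i of U by s[i], i.e. U @ diag(s)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))            # toy FC weight matrix
US = replace_with_us(W)

e_m, e_n = rng.normal(size=8), rng.normal(size=8)
dist_p = np.linalg.norm(e_m @ W - e_n @ W)     # distance with the original weights
dist_q = np.linalg.norm(e_m @ US - e_n @ US)   # distance after replacing W with US

gram = US.T @ US                       # off-diagonal entries vanish: columns of US are orthogonal
```

The two distances agree up to floating-point error, while the columns of `US` are mutually orthogonal, which is the property the SVD-FC layer relies on.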
3.2.5. Output Layer. The citation recommendation problem is regarded as a classification task in our model. In this layer, logistic regression or an SVM handles the binary classification task and predicts the final citation relationship.
3.3. Training Details
3.3.1. Embeddings. In our model, words are initialized by 300-dimensional word2vec embeddings, which are not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from $[-0.01, 0.01]$. We employ AdaGrad [31] and L2 regularization. We introduce adversarial training [32] for embeddings to make the model more robust. The process is achieved by replacing the word vector $v$ obtained from word2vec with the perturbed word vector $v^*$:

$$ v^* = v \times r_{adv} \tag{8} $$

where $r_{adv}$ is the worst-case perturbation on the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function $\log p(y \mid x; \hat{\theta})$ around $x$, where $\hat{\theta}$ is a constant set to the current parameters of our model; it only participates in the calculation of $r_{adv}$, without backpropagation through it. With the linear approximation and an L2 norm constraint, the adversarial perturbation is

$$ r_{adv} = -\epsilon \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_x \log p(y \mid x; \hat{\theta}) \tag{9} $$

This perturbation can be easily computed by using backpropagation in neural networks.
Figure 6: Generating the feature map.
Figure 7: SVD-FC layer.
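As a rough illustration of equation (9), the sketch below computes $r_{adv}$ for a toy logistic model whose gradient is available in closed form. The logistic model, the labels $y \in \{-1, +1\}$, and all function names are assumptions for illustration; in the actual network, $g$ would come from backpropagation. The sketch only computes the perturbation and does not apply it to the word vector.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_log_likelihood(x, y, theta):
    """g = grad_x log p(y | x; theta) for p(y | x) = sigmoid(y * theta.x), y in {-1, +1}."""
    return (1.0 - sigmoid(y * (theta @ x))) * y * theta


def adversarial_perturbation(x, y, theta, epsilon):
    """r_adv = -epsilon * g / ||g||_2, with theta treated as a constant."""
    g = grad_log_likelihood(x, y, theta)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)  # assumption: no perturbation when the gradient vanishes
    return -epsilon * g / norm


x = np.array([0.5, -1.0, 2.0])         # toy word vector
theta = np.array([1.0, 0.5, -0.25])    # toy model parameters
r_adv = adversarial_perturbation(x, y=1, theta=theta, epsilon=0.1)
```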
3.3.2. Layerwise Training. In our training steps, we define a conv-pooling block $b_t$ ($t \geq 2$), which consists of a convolution layer and a pooling layer. Our network model is then assembled from the initialization block $b_1$, which is initialized using word2vec, and $(n - 1)$ conv-pooling blocks.
First, we train the conv-pooling block $b_2$ after $b_1$ is trained. On this basis, the next conv-pooling block $b_3$ is created while keeping the previous block fixed. We repeat this procedure until all $(n - 1)$ conv-pooling blocks are trained.
Second, the following semiorthogonal training procedure is used to train the whole network.
Semiorthogonal training (SOT) is crucial for training SVD-CNN and consists of the following three steps:
Step 1. Decompose the weight matrix by SVD, i.e., $W = USV^T$, where $W$ is the weight matrix of the linear layer, $U$ is the left-unitary matrix, $S$ is the singular value matrix, and $V$ is the right-unitary matrix. After that, we replace $W$ with $US$. Next, we take all eigenvectors of $US(US)^T$ as weight vectors.
Step 2. The backbone model is fine-tuned with the SVD-FC layer fixed.
Step 3. The model continues fine-tuning with the SVD-FC layer unfixed.
Step 1 can generate orthogonal weights, but the prediction performance cannot be guaranteed. The reason is that excessive orthogonality will overly punish synonymous sentences, which is inappropriate. Therefore, we introduce Steps 2 and 3 to solve this problem.
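The inputs of SVD-FC are defined as $Y = (y_1, y_2, \ldots, y_m)^T$. The outputs are defined as $O = (o_1, o_2, \ldots, o_m)^T$. The weight matrix is defined as $W = (w_1, w_2, \ldots, w_m)^T$. The expected outputs are defined as $A = (a_1, a_2, \ldots, a_m)^T$. The error function is defined as

$$ E = \frac{1}{2} \sum_{k=1}^{l} \left( a_k - o_k \right)^2 \tag{10} $$

where $o_k = f\left( \sum_{j=0}^{m} w_{kj} y_j \right)$, $k = 1, 2, \ldots, l$. The derivative of $E$ with respect to $o_k$ is

$$ \frac{\partial E}{\partial o_k} = -\left( a_k - o_k \right) \tag{11} $$

We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is as follows:

$$ \Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} \tag{12} $$

We define an error signal $\delta_k^o = \partial E / \partial \mathrm{net}_k$, with $o_k = f(\mathrm{net}_k)$; equation (12) is equivalent to

$$ \Delta w_{kj} = -\eta \frac{\partial E}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -\eta \delta_k^o \frac{\partial \mathrm{net}_k}{\partial w_{kj}} \tag{13} $$

According to equation (11), $\delta_k^o = \partial E / \partial \mathrm{net}_k$ is equivalent to

$$ \delta_k^o = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial \mathrm{net}_k} = \frac{\partial E}{\partial o_k} f'(\mathrm{net}_k) = \frac{\partial E}{\partial o_k} o_k' = -\left( a_k - o_k \right) o_k' \tag{14} $$

We use the sigmoid $f(x) = 1/(1 + e^{-x})$ as the nonlinear function, for which $o_k' = o_k (1 - o_k)$ and $\partial \mathrm{net}_k / \partial w_{kj} = y_j$, so equation (13) is equivalent to

$$ \Delta w_{kj} = -\eta \delta_k^o y_j = \eta \left( a_k - o_k \right) o_k \left( 1 - o_k \right) y_j \tag{15} $$

In Step 1, the weight matrix $W$ is decomposed by SVD and replaced with $US$, where $U = (q_1, q_2, \ldots, q_m)^T$ and $S = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$. Since $a_k - o_k$ is given, we define $\mathrm{Loss} = a_k - o_k$. As a result, equation (15) is equivalent to

$$ \Delta w_{kj} = \eta \cdot \mathrm{Loss} \cdot \left[ o_k - \mathrm{sigmoid}\left( y_j \sum q_i \lambda_i + B \right)^2 \right] y_j \tag{16} $$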
Since $q_i \cdot q_j = 0$ for $i \neq j$ in the left-unitary matrix $U$, the model operation is not affected by nonorthogonal eigenvectors $q_i$. This is the reason for excessively punishing synonymous sentences in Step 1. However, orthogonality has a positive effect on $\Delta w_{kj}$ in Step 2.
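The sigmoid update rule of equation (15) can be sketched as below for a single fully connected layer. This is an illustrative sketch under the assumption of one input vector and the squared error of equation (10); the function name `delta_rule_update` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def delta_rule_update(W, y, a, eta):
    """Weight update for a sigmoid layer with squared error.

    W: (l, m) weights, y: (m,) inputs, a: (l,) expected outputs.
    Returns Delta W with entries eta * (a_k - o_k) * o_k * (1 - o_k) * y_j.
    """
    o = sigmoid(W @ y)                      # o_k = f(net_k)
    delta = -(a - o) * o * (1.0 - o)        # error signal dE/dnet_k
    return -eta * np.outer(delta, y)        # Delta w_kj = -eta * delta_k * y_j


def squared_error(W, y, a):
    return 0.5 * np.sum((a - sigmoid(W @ y)) ** 2)


rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
y = rng.normal(size=4)
a = rng.uniform(size=3)
dW = delta_rule_update(W, y, a, eta=1.0)
```

With `eta=1.0`, `dW` equals the negative gradient of the squared error, which can be confirmed by finite differences.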
The purpose of SVD is to maintain the orthogonality of each weight vector in geometric space. When weight vectors are conditioned by orthogonal regularization, the relevancy between weight vectors decreases. We use the following method in Step 3 to measure relevance:

$$ H = W^T W = \begin{bmatrix} \vec{w}_1^T \vec{w}_1 & \cdots & \vec{w}_1^T \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^T \vec{w}_1 & \cdots & \vec{w}_k^T \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix} \tag{17} $$

where $W$ is a weight matrix that contains $k$ weight vectors $w_i$ ($i = 1, \ldots, k$), and $h_{ij}$ ($i, j = 1, \ldots, k$) is the dot product of $w_i$ and $w_j$. Let us define $S(W)$ as the correlation measurement of all column vectors in $W$:

$$ S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} \left| h_{ij} \right|} \tag{18} $$
When $W$ is an orthogonal matrix, i.e., $h_{ij} = 0$ for $i \neq j$, the value of $S(W)$ is 1. $S(W)$ attains its minimum value $1/k$ when all weight vectors are identical. Therefore, the value of $S(W)$ falls into $[1/k, 1]$. As a result, when $S(W)$ is close to $1/k$, or very small, the weight vectors are highly correlated.
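The measurement $S(W)$ of equations (17) and (18) takes only a few lines in NumPy. The sketch below is illustrative; the function name `correlation_measure` and the toy matrices are assumptions.

```python
import numpy as np


def correlation_measure(W):
    """S(W) = sum_i h_ii / sum_ij |h_ij| with H = W^T W (weight vectors are columns of W)."""
    H = W.T @ W
    return float(np.trace(H) / np.sum(np.abs(H)))


s_orthogonal = correlation_measure(np.eye(3))          # mutually orthogonal columns
s_identical = correlation_measure(np.ones((4, 3)))     # three identical columns, k = 3

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
s_before = correlation_measure(W)
s_after = correlation_measure(U * s)                   # W replaced with US
```

Replacing a random `W` with `US` moves the measure to 1, up to floating-point error, while identical columns give the lower bound $1/k$.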
3.4. Complexity Analysis. Assume that the training sample size is $|C|$, the average number of words in each citation context is $|c|$, $C_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$. The training complexity of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. Computing the eigenvalues for the SVD of a matrix of size $K$, as improved by C. F. Van Loan [12], takes $O(K)$ using the Jacobi method. Assume that the size of the weight matrix in the SVD-FC layer is $K$ and the number of channels of the input matrix is $C_{in}$. The computational cost for the SVD-FC layer is $O(2K^2 \cdot C_{in} + K)$.
4. Experiment
4.1. Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and the abstracts of cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and the 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural forms of named entities, no stemming is done, but all words are converted to lower case. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (NDCG).
4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$ and the correct recommendations are $R_g \cap R_r$. Recall is defined as

$$ \mathrm{recall} = \frac{\left| R_g \cap R_r \right|}{\left| R_g \right|} \tag{19} $$

In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of recommended references. To address this problem, we select the following additional metrics.
For a query $q$, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q} \tag{20} $$

where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
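Recall (equation (19)) and MRR (equation (20)) can be computed as in the following sketch. The function names and the convention that a query with no correct recommendation contributes 0 to MRR are assumptions made for illustration.

```python
def recall_at_k(recommended, ground_truth, k):
    """|R_g ∩ R_r| / |R_g|, where R_r is the top-k recommended list."""
    top_k = set(recommended[:k])
    return len(top_k & set(ground_truth)) / len(ground_truth)


def mean_reciprocal_rank(ranked_lists, ground_truths):
    """Average of 1 / rank_q over all queries, rank_q being the first correct hit."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, ground_truths):
        truth = set(truth)
        for rank, paper in enumerate(ranked, start=1):
            if paper in truth:
                total += 1.0 / rank
                break  # only the first correct recommendation counts
    return total / len(ranked_lists)


example_recall = recall_at_k(["a", "b", "c"], ["b", "z"], k=2)
example_mrr = mean_reciprocal_rank([["a", "b"], ["c", "d"]], [["b"], ["c"]])
```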
For each citation placeholder, we search for the papers that may be referenced at this citation placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric,
where $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position $i$ is calculated as

$$ \mathrm{NDCG}(d_1, \ldots, d_N) = \sum_i \frac{2^{rel(d_i)} - 1}{\ln(i + 1)} \tag{22} $$

where $rel(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^* \rangle$ to weigh the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average NDCG score over all testing documents.
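The gain sum of equation (22) can be sketched as follows. Equation (22) as printed shows no explicit normalizer, so dividing by the score of the ideal ordering, as done in `ndcg` below, follows the conventional NDCG definition and is an assumption, as are the function names and the toy relevance grades.

```python
import math


def dcg(relevances):
    """Sum over positions i = 1, 2, ... of (2^rel(d_i) - 1) / ln(i + 1)."""
    return sum((2 ** rel - 1) / math.log(i + 1)
               for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering (assumed normalization)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


score_sorted = ndcg([3, 2, 1, 0])    # already in ideal order
score_swapped = ndcg([0, 3])         # relevant paper ranked second
```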
4.3. Baseline Comparison. We choose the following methods for comparison:
(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.
(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.
(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to $n = 300$.
(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to $n = 600$. For negative sampling, we set the number of negative samples $k = 10$, where $k$ is the number of noise words in the citation context. For noise contrastive estimation, we set the number of noise samples $k = 1000$.
(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and the number of recurrent layers is 2. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. The SVD-CNN model leads the performance in most cases. More detailed analyses are given as follows.
First, we perform a comparison among CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN significantly exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation of our network. Due to the lack of citation context information, we find that W2V is worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of the citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.
Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for the decoder, which is not sufficient for learning good embeddings.
4.4. The Influence on the Link Prediction of Reference Pattern Interactional Features. According to the section positions of citation contexts in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs of citations, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested in a ratio of 3:1. In Tables 1 and 2, we show the results on the abovementioned datasets.
From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of citation recommendation tasks.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
To better explore why mixed datasets are more complexthan unmixed datasets in Figure 10 we show the change in
S(W) during the training process of SVD-CNN amongvarious datasets
As shown in Figure 10 the increase in S(W) on themixed datasets indicates that SVD-CNN is good at decor-relation We can also see in Tables 1 and 2 that the CNNmodel has pretty performance on unmixed datasets whileachieving poor performance on mixed datasets HoweverSVD-CNN achieves almost the same performance on thetwo types of datasets 0is proves that the correlation fromvarious reference patterns can significantly affect the linkprediction
The reason why the change in S(W) is not large on the unmixed datasets is that the reference patterns of unmixed datasets have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.
Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.
4.5 Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several variants as follows:
(1) Using the originally learned W
(2) Replacing W with US
(3) Replacing W with U
(4) Replacing W with $UV^T$
(5) Replacing W with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix in the QR decomposition
(6) Replacing W with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix W after dimension reduction by PCA
After convergence of training, different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, which is equivalent to the proportion of the added time to the original time. As shown in Table 3, the other types of decorrelation degrade the performance, except for $W \to US$ and $W \to W_{PCA}$. However, the time cost of $W \to W_{PCA}$ is more than that of $W \to US$.
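The replacement candidates (1) to (5) can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the weight matrix is a small random tall matrix, the function name `decorrelation_variants` is hypothetical, and the PCA-based variant (6) is omitted because the text does not specify it in enough detail.

```python
import numpy as np

def decorrelation_variants(W):
    """Return candidate replacements for W (variants 1 to 5; names are illustrative)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(s) @ Vt
    Q, R = np.linalg.qr(W)                            # W = Q @ R, R upper triangular
    D = np.diag(np.diag(R))                           # diagonal part of R
    return {
        "W": W,
        "US": U * s,      # scales column i of U by the singular value s[i]
        "U": U,
        "UVT": U @ Vt,
        "QD": Q @ D,
    }

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # hypothetical FC weight: 6 inputs, 4 outputs
variants = decorrelation_variants(W)

# Largest off-diagonal entry of M^T M: close to zero means orthogonal columns.
for name, M in variants.items():
    gram = M.T @ M
    off_diagonal = np.abs(gram - np.diag(np.diag(gram))).max()
    print(name, off_diagonal)
```

For a tall matrix, every variant except the original W has mutually orthogonal columns; what differs between them is how much of the scale information in W is retained.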
4.6 Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations, and a biased parameter $d_0$. In this section, we conduct an ablation study of these parameters.
We first evaluate the effectiveness of sot by empirically fixing $d_0 = 300$. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the weight matrix in the FC layer is highly correlated and S(W) has the lowest value. The recommendation performance then increases as sot increases, which indicates that reducing the degree of correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

In our model, $d_0$ is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d_0$ for the same sot. When $d_0$ is small, the information content of the citation context is very small, which produces worse performance. The recommendation performance increases to a maximum point as $d_0$ reaches 300. It should be noted that although a larger $d_0$ is better, a larger $d_0$ will significantly increase the training time. Therefore, we choose $d_0 = 300$.

Table 1: MRR metric on various datasets.

Model      Introduction   Related   Main     Introduction + related   Introduction + main   Related + main
CNN        0.3312         0.3294    0.3478   0.2773                   0.2815                0.2978
SVD-CNN    0.3995         0.4078    0.3989   0.3878                   0.3889                0.3845

Table 2: MAP metric on various datasets.

Model      Introduction   Related   Main     Introduction + related   Introduction + main   Related + main
CNN        0.3001         0.2909    0.3107   0.2572                   0.2601                0.2637
SVD-CNN    0.3701         0.3655    0.3693   0.3498                   0.3511                0.3539

Table 3: The comparison of related methods in Step 1.

Figure 8: Comparison of recall with different methods on CiteSeer (x-axis: number of recommended citations; y-axis: recall; methods: W2V, NPM, RBM, CP-LDA, SVD-CNN, and NCN).

Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer (scores for top 10 recommendations; methods: CP-LDA, RBM, W2V, NPM, NCN, and SVD-CNN).

Figure 10: The change in S(W) during training on unmixed datasets and mixed datasets (x-axis: sot; y-axis: S(W); datasets: introduction, related, main, introduction + related, introduction + main, and related + main).

Figure 11: The performance impact of sot on CiteSeer.

8 Computational Intelligence and Neuroscience
5 Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model by using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at [https://psu.app.box.com/v/refseer]. These prior datasets are cited at relevant places within the text as reference [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, “Conditional random field based sentence context identification: enhancing citation services for the research community,” in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., “Context-aware citation recommendation,” in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., “Citation recommendation without author supervision,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, “A neural probabilistic model for context based citation recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, “A neural network approach to quote recommendation in writings,” in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., “Cluscite: effective citation recommendation by information network-based clustering,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, “Neural citation network for context-aware citation recommendation,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, “Reference directed indexing: redeeming relevance for subject search in citation indexes,” Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinsk, “CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central,” 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, “Comparing citation contexts for information retrieval,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., “Content-based citation recommendation,” 2018, httpsarxivorgpdf18020830201v1pdf.
[14] H. Jia and E. Saule, “Local is good: a fast citation recommendation approach,” Lecture Notes in Computer Science, Vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, “A personalized paper recommendation approach based on web paper mining and reviewer’s interest modelling,” in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, “Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, “Recommending citations for academic papers,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., “CiteSight: supporting contextual citation recommendation using differential search,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., “Recommending citations with translation model,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., “Recommending citations: translating papers into references,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., “Cross-language context-aware citation recommendation in scientific articles,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., “Neural photo editing with introspective adversarial networks,” in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., “Large scale GAN training for high fidelity natural image synthesis,” 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., “SVDNet for pedestrian retrieval,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, “Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process,” IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., “Orthogonal deep features decomposition for age-invariant face recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., “Training group orthogonal neural networks with privileged information,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., “Distributed representations of words and phrases and their compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, “Data mining,” Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., “Concept-based document recommendations for CiteSeer authors,” in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, “The trec-8 question answering track report,” in Proceedings of the TREC’00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, “Utilizing context in generative bayesian models for linked corpus,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, “A discriminative approach to topic-based citation recommendation,” in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
neural network model, these three citation types are usually mapped into a matrix and can be seen as base vectors for inputs. As shown in Figure 1, vectors in the mapping matrix learned by traditional neural network models are not orthogonal. When a sample is mapped by $\vec{w_1}$, $\vec{w_2}$, and $\vec{w_3}$, apparently $\vec{w_1}$ and $\vec{w_3}$ will dominate the output and consequently create low discriminative ability. A more satisfactory $\vec{w_2}'$ (yellow color) imposes orthogonality.
To address the aforementioned problems, we propose a neural network model with orthogonal regularization for context-aware citation recommendation. Our model uses a CNN to extract the semantic features of the citation context and candidate papers. We then add the orthogonal constraint based on SVD to our model to weaken the correlation of weight vectors in the FC layer, which helps learn interpretable features for citation contexts and papers. To the best of our knowledge, this is the first work that addresses context-aware citation recommendation with a CNN and orthogonal constraint framework. Experimental results show that our model significantly outperforms other baseline methods.
2 Related Work
2.1 Citation Recommendation. A variety of citation recommendation approaches have been proposed in the literature, including text similarity-based [9, 10], topic model-based [11, 12], probabilistic model-based [13], translation model-based [7], and collaborative filtering-based [14] approaches. Sun et al. [15] proposed a method for recommending appropriate papers to academic reviewers by using a similarity-based algorithm. Their method builds preference vectors for reviewers based on published history information and calculates the similarity between the preference vector and the candidate document vector. The literature with high similarity is recommended to the corresponding reviewers. Shaparenko and Joachims [16] considered the relevance of the citation context and the paper content and applied a language model to the recommendation task. Strohman et al. [17] showed that using text similarity alone was not ideal for recommending citations because scholars tend to construct new words to describe their own achievements, while two scholars who study the same topic may use different expressions for the same concept and method. To address this problem, Strohman et al. [17] regarded the document as a node in a directed graph to perform citation recommendation. They believe that the similarity measurement with reference information can reflect the reference situation of a node more authentically. Livne et al. [18] proposed a citation recommendation method by coupling the enriched citation context of the literature and adopted various techniques, including machine learning, when making recommendations. Some works addressed the language gap between cited papers and citation contexts and attempted to use translation models or distributed semantic representations. Lu et al. [19] assumed that the languages used in the citation contexts and in the cited papers were different and used a translation model to solve this problem. He et al. [3] combined a language model, topic model, and feature model to find the appropriate citation context. Huang et al. [20] assumed that the appearance of cited papers was a particular language and represented the cited papers by unique IDs regarded as new “words.” The probability of citing a paper given a citation context is directly estimated by using a translation model. Tang et al. [21] proposed a joint embedding model to learn a low-dimensional embedding space for both contexts and citations.
In recent years, neural networks have shown better performance in many fields. Some researchers have attempted to recommend citations by using neural networks. Huang et al. [4] learned a distributed word representation for the citation context and an associated document embedding via a feedforward neural network and then estimated the probability of citing a paper given a citation context. Tan et al. [5] proposed a neural network method based on LSTM to solve the quote recommendation task. They focused on the characteristics of quotes and trained neural networks to bridge the language gap. A neural network model learned the semantic representations of arbitrary-length texts from a large corpus.
2.2 Orthogonal Constraint in Deep Learning. One of the greatest advantages of orthogonal matrices is that the norm of a matrix is unchanged when it is multiplied by an orthogonal matrix. This property is useful in gradient backpropagation, especially for dealing with the gradient explosion and gradient dissipation problems. Orthogonal regularization is widely used in many fields. Brock et al. [22] used orthogonal regularization to improve the generalization performance of image generation and editing tasks by using generative adversarial networks (GANs) [23]. They further expanded their work into BigGAN [24]. The results in their work showed that, by applying orthogonal regularization, the generator allows fine-tuning the tradeoff between fidelity and diversity of samples by truncating hidden spaces, which can make the model achieve the best performance in class-conditional image synthesis. Another advantage of orthogonal matrices is that they benefit deep representation learning. If the weight vectors of the fully connected layer in the convolutional neural network are highly correlated, the individual entries of each fully connected descriptor will also be highly correlated, which will greatly reduce retrieval performance. Sun et al. [25] proposed SVD-Net to show that constraining the weights of the FC layer to be orthogonal can improve the accuracy. Zheng et al. [26] reported that regularization was an efficient method for improving the generalization ability of deep CNNs because it makes it possible to train more complex models while maintaining lower overfitting. Zheng et al. [26] proposed a method for optimizing the feature boundary of a deep CNN through a two-stage training step to reduce the overfitting problem. However, the mixed features learned from a CNN potentially reduce the robustness of network models for identification or classification. To address this problem, Wang et al. [27] decomposed deep face features into two orthogonal components to represent age-related and identity-related features to learn age-invariant deep face features. In the above model, age-invariant deep features can be effectively obtained to improve AIFR performance. Chen et al. [28] proposed a group orthogonal convolutional neural network (GoCNN) model based on the idea of learning different groups of convolutional functions that are “orthogonal” to those in other groups, i.e., with no significant correlation among the produced features. Optimizing orthogonality among convolutional functions reduces the redundancy and increases the diversity within the architecture. Moreover, it can also obtain a single CNN model with sufficient inherent diversity such that the model learns more diverse representations and has stronger generalization ability than vanilla CNNs.

Figure 1: Distribution of the weight vector of the reference type in geometric space.
3 Proposed Method
3.1 Problem Formulation. The context-aware citation recommendation is defined as the matching task between the citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is a convolutional neural network with two inputs and orthogonal constraints. It consists of the following main steps:
(1) We adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple granularity semantic features
(2) The multiple granularity semantic features are then orthogonalized by an SVD-FC layer
(3) We use fully connected layers to obtain the final vector representation. The logistic function or SVM is used to obtain the recommendation result
3.2 Network Structure
3.2.1 Input Layer. Word2vec [29] is used to embed the input of our model. Each word is represented as a $d_0$-dimensional precomputed vector, where $d_0 = 300$. As a result, each sentence is represented as a feature matrix with dimension $d_0 \times s$. Through this layer, we can obtain the raw representation of citation context c and candidate document d.
We also calculate the weight of common words according to the inputs. Then, we can obtain the basic input features TF-IDF(c, d) for our model, which is the product of $TF(w_c, d)$ and IDF, to reflect how important a word in citation context c is for a candidate document d in the corpus [30]. $w_c$ is a word in citation context c. These two variables are calculated as follows:

$$TF(w_c, d) = \frac{\mathrm{count}(w_c, d)}{\mathrm{top}(w^{*}, d)}, \qquad IDF = \log \frac{N}{\mathrm{docs}(w_c, D)}, \tag{1}$$

where $\mathrm{count}(w_c, d)$ is the number of times the word $w_c$ appears in document d, $\mathrm{top}(w^{*}, d)$ is the occurrence number of the word $w^{*}$ that appears most frequently in the candidate document d, $\mathrm{docs}(w_c, D)$ is the number of documents containing the word $w_c$ among all candidate citations D, and N is the total number of candidate citations.
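As a rough illustration of equation (1), the following sketch computes the TF and IDF terms for tokenized documents. The function names, the toy corpus, and the choice to return 0 when no document contains the word are assumptions made for the example; they are not taken from the paper.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """TF(w_c, d): count of `word` in d divided by the count of d's most frequent word."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[word] / max(counts.values())

def idf(word, corpus):
    """IDF = log(N / docs(w_c, D)); returns 0.0 if no document contains the word
    (an assumption, since equation (1) leaves this case undefined)."""
    n_docs = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_docs) if n_docs else 0.0

def tf_idf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

# Toy candidate set D with N = 3 documents (illustrative only).
corpus = [
    ["citation", "recommendation", "neural", "network"],
    ["neural", "network", "network", "training"],
    ["topic", "model", "citation"],
]
score = tf_idf("network", corpus[1], corpus)
print(score)
```

Here "network" is the most frequent word of the second document, so its TF is 1, and it occurs in two of the three documents, so the score reduces to log(3/2).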
3.2.2 Convolution Layer. The inputs of the convolution layer are the feature matrices of citation context c and document d. The process of this layer is demonstrated in Figure 3. We first pad the two inputs with zero vectors to the same length $s = \max(|c|, |d|)$. For every input, let $v_1, v_2, \ldots, v_s$ be the words in a sentence. We define $g_i \in \mathbb{R}^{w d_0}$, $0 < i < s + w - 1$, as the concatenation of $v_{i-w}, \ldots, v_i$. Then, this layer generates the feature $P_i \in \mathbb{R}^{d_1}$ for the phrase $v_{i-w}, \ldots, v_i$ as follows:

$$P_i = \tanh(W \cdot g_i + b), \tag{2}$$

where $W \in \mathbb{R}^{d_1 \times w d_0}$ is a convolution kernel and $b \in \mathbb{R}^{d_1}$ is the bias.
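A minimal sketch of the wide convolution in equation (2) is given below in plain Python. It assumes that each window concatenates exactly w consecutive word vectors and that w - 1 zero vectors are padded on each side, which yields the s + w - 1 output columns mentioned in the text; the exact indexing convention and the toy sizes are assumptions of this example.

```python
import math

def wide_convolution(V, W, b, w):
    """V: list of s word vectors (each of length d0).
    W: kernel given as d1 rows of length w * d0; b: bias of length d1.
    Returns s + w - 1 phrase features P_i, each of length d1."""
    d0 = len(V[0])
    zero = [0.0] * d0
    padded = [zero] * (w - 1) + V + [zero] * (w - 1)   # zero padding on both sides
    features = []
    for i in range(len(V) + w - 1):
        # g_i: concatenation of the w word vectors in the current window
        g = [x for vec in padded[i:i + w] for x in vec]
        features.append([
            math.tanh(sum(wk * gk for wk, gk in zip(row, g)) + bk)
            for row, bk in zip(W, b)
        ])
    return features

# Toy input: s = 3 words, d0 = 2, window w = 2, d1 = 1 output feature.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, 0.5, 0.5, 0.5]]
b = [0.0]
phrases = wide_convolution(V, W, b, w=2)
print(len(phrases))
```

With s = 3 and w = 2, the result has 4 columns, one per (partially padded) phrase window.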
3.2.3 Average Pooling Layer. The pooling layer is usually used for feature compression. In our model, we choose average pooling. The reason is that whole sentences or paragraphs can express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is “w-ap,” which is the column average over a window of w consecutive columns. After the convolution layer, an s-column feature map is converted into a new feature map with $s + w - 1$ columns. By using “w-ap,” the new feature map is restored to s columns. This architecture facilitates the extraction of more useful abstract features.

The second one is “all-ap,” which averages over all columns. As shown in Figure 5, “all-ap” generates a representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document.
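The two pooling operations can be sketched as follows, with a feature map stored as a list of columns. The function names `w_ap` and `all_ap` mirror the layer names in the text, but the data layout and the toy values are assumptions of this example.

```python
def w_ap(feature_map, w):
    """Average every window of w consecutive columns.
    A map with s + w - 1 columns is reduced to s columns."""
    n = len(feature_map)          # number of columns, s + w - 1
    d = len(feature_map[0])       # rows per column
    return [
        [sum(feature_map[i + k][r] for k in range(w)) / w for r in range(d)]
        for i in range(n - w + 1)
    ]

def all_ap(feature_map):
    """Average over all columns, giving one representation vector."""
    n = len(feature_map)
    d = len(feature_map[0])
    return [sum(col[r] for col in feature_map) / n for r in range(d)]

# Toy feature map with 4 columns of height 2 (s = 3, w = 2).
fm = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(w_ap(fm, 2))
print(all_ap(fm))
```

Applying `w_ap` with w = 2 to the 4-column map returns 3 columns, and `all_ap` collapses the map to a single vector.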
Figure 2: An overview of our model.

Figure 3: Convolution extraction generates phrases.

Figure 4: “W-ap” structure.

Figure 5: “All-ap” structure.

Now, we can obtain the features of the citation context and, independently, the features of the cited document. The next step is to obtain the semantic relationships between the citation context and the candidate paper. We use cosine similarity to measure the semantic relations:

$$sim_j = \frac{\sum_{i=0}^{d_j} \left(C_{ji} \times D_{ji}\right)}{\sqrt{\sum_{i=0}^{d_j} \left(C_{ji}\right)^2 \times \sum_{i=0}^{d_j} \left(D_{ji}\right)^2}}, \quad j \in [1, 10], \tag{3}$$

where $C_j$ and $D_j$ are the distributed representations of the citation context and candidate document after the j-th “all-ap” layer, respectively. A total of ten “all-ap” layers are used in our model; therefore, j belongs to [1, 10]. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all $sim_j$ and the basic features. Then, it is fed into the SVD-FC layer.
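The similarity of equation (3) can be sketched as below. The helper `similarity_features`, which pairs the j-th context representation with the j-th document representation, is a hypothetical name, and returning 0 for a zero vector is an assumption of this example.

```python
import math

def cosine_similarity(c, d):
    """Cosine similarity between two equal-length vectors, as in equation (3)."""
    dot = sum(ci * di for ci, di in zip(c, d))
    norm = math.sqrt(sum(ci * ci for ci in c) * sum(di * di for di in d))
    return dot / norm if norm else 0.0   # zero vectors: similarity set to 0 (assumption)

def similarity_features(context_reprs, document_reprs):
    """One sim_j per pair (C_j, D_j) of all-ap outputs."""
    return [cosine_similarity(c, d) for c, d in zip(context_reprs, document_reprs)]

# Two toy granularities instead of the ten used in the model.
C = [[1.0, 0.0], [1.0, 2.0]]
D = [[0.0, 1.0], [2.0, 4.0]]
sims = similarity_features(C, D)
print(sims)
```

The first pair is orthogonal and the second is parallel, so the two features sit at the two ends of the nonnegative similarity range.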
In most cases, we find that using the outputs of all pooling layers as the input of the SVD-FC layer improves the performance. The reason is that features from different layers represent different levels of semantics. Neglecting any layer will cause information loss.

Next, we use the SVD-FC layer to learn the nonlinear combination features of citation relationships. This layer can force the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer can also reduce the negative impact of excessive parameters.
3.2.4 SVD-FC Layer. In this layer, we use SVD to factorize the weight matrix W ($W = USV^T$) and replace it with US. Our experimental results show that the replacing operation can reduce the negative impact on the sample space.

The Euclidean distance between samples can be used to measure whether their feature expression changes in a sample space. Denoting $e_m$ and $e_n$ as the feature maps of two different samples, we can obtain two different outputs of the fully connected operation by using the weight matrix W or US as follows:

$$p = e \times W, \tag{4}$$

$$q = e \times US. \tag{5}$$

As seen in the above equations, q is the orthogonalized output, while p is unorthogonalized. Then, we can obtain the following theorem.

Theorem 1. p and q in equations (4) and (5) generate the same Euclidean distance for samples $e_m$ and $e_n$.
Proof. The Euclidean distance L between $p_m$ and $p_n$ is calculated as follows:

$$L = \left\| \vec{p_m} - \vec{p_n} \right\|_2 = \sqrt{(\vec{e_m} - \vec{e_n})^T W W^T (\vec{e_m} - \vec{e_n})} = \sqrt{(\vec{e_m} - \vec{e_n})^T U S V^T V S^T U^T (\vec{e_m} - \vec{e_n})}. \tag{6}$$

Since V is an orthogonal matrix, equation (6) is equivalent to

$$L = \sqrt{(\vec{e_m} - \vec{e_n})^T U S S^T U^T (\vec{e_m} - \vec{e_n})} = \sqrt{(\vec{q_m} - \vec{q_n})^T (\vec{q_m} - \vec{q_n})} = \left\| \vec{q_m} - \vec{q_n} \right\|_2. \tag{7}$$

It can be seen that $\left\| \vec{p_m} - \vec{p_n} \right\|_2 = \left\| \vec{q_m} - \vec{q_n} \right\|_2$.
It should be noted that replacing the weight has no negative impact on the entire sample space and does not change its discrimination ability. As shown in Figure 7, we use the SVD of the weight matrix W to map the feature map to an orthogonal linear space.
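Theorem 1 can be checked numerically with a few lines of NumPy. The matrix sizes and random inputs below are arbitrary choices for illustration, and feature maps are treated as row vectors so that the outputs are `e @ W` and `e @ US`, following equations (4) and (5).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))        # hypothetical FC weight matrix
e_m = rng.normal(size=8)           # feature maps of two samples
e_n = rng.normal(size=8)

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U @ diag(s) @ Vt
US = U * s                                         # same as U @ np.diag(s)

dist_p = np.linalg.norm(e_m @ W - e_n @ W)     # distance with the original weights
dist_q = np.linalg.norm(e_m @ US - e_n @ US)   # distance after replacing W with US

print(dist_p, dist_q)
```

The two distances agree up to floating-point error because multiplying by the matrix of right singular vectors does not change the Euclidean norm.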
3.2.5 Output Layer. The citation recommendation problem is regarded as a classification task in our model. In this layer, logistic regression or an SVM handles the binary classification task and predicts the final citation relationship.

Figure 6: Generating the feature map.

Figure 7: SVD-FC layer.

3.3 Training Details

3.3.1 Embeddings. In our model, words are initialized by 300-dimensional word2vec embeddings, which do not change during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [-0.01, 0.01]. We employ AdaGrad [31] and L2 regularization. We introduce adversarial training [32] for embeddings to make the model more robust. The process is achieved by replacing the word vector v obtained from the word2vec embeddings with the perturbed word vector $v^{*}$:

$$v^{*} = v \times r_{adv}, \tag{8}$$

where $r_{adv}$ is the worst-case perturbation on the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function $\log p(y \mid x; \hat{\theta})$ around x, where $\hat{\theta}$ is a constant set to the current parameters of our model and only participates in the calculation of $r_{adv}$ without backpropagation. With the linear approximation and an L2 norm constraint, the adversarial perturbation is

$$r_{adv} = -\epsilon \frac{g}{\left\| g \right\|_2}, \quad \text{where } g = \nabla_x \log p(y \mid x; \hat{\theta}). \tag{9}$$

This perturbation can be easily computed by using backpropagation in neural networks.
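Given a gradient g that has already been obtained by backpropagation, the perturbation of equation (9) is a rescaled, sign-flipped copy of g. The sketch below assumes g is available as a plain list of floats; the function name and the handling of a zero gradient are choices made for this example.

```python
import math

def adversarial_perturbation(grad, epsilon):
    """r_adv = -epsilon * g / ||g||_2 for a gradient g given as a list of floats.
    Returns a zero vector when the gradient is zero (an assumption of this sketch)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [-epsilon * g / norm for g in grad]

r_adv = adversarial_perturbation([3.0, 4.0], epsilon=0.5)
print(r_adv)
```

The result always has L2 norm equal to epsilon, which is what the norm constraint in the text requires.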
3.3.2 Layerwise Training. In our training steps, we define a conv-pooling block $b_t$ ($t \ge 2$), which consists of a convolution layer and a pooling layer. Our network model is then assembled from the initialization block $b_1$, which initializes using word2vec, and $(n - 1)$ conv-pooling blocks.

First, we train the conv-pooling block $b_2$ after $b_1$ is trained. On this basis, the next conv-pooling block $b_3$ is created by keeping the previous block fixed. We repeat this procedure until all $(n - 1)$ conv-pooling blocks are trained.

Second, the following semiorthogonal training procedure is used to train the whole network.

Semiorthogonal training (SOT): it is crucial for training SVD-CNN and consists of the following three steps:

Step 1: Decompose the weight matrix by SVD, i.e., $W = USV^T$, where W is the weight matrix of the linear layer, U is the left-unitary matrix, S is the singular value matrix, and V is the right-unitary matrix. After that, we replace W with US. Next, we take all eigenvectors of $US(US)^T$ as weight vectors.
Step 2: The backbone model is fine-tuned with the SVD-FC layer fixed.
Step 3: The model keeps fine-tuning with the SVD-FC layer unfixed.
Step 1 can generate orthogonal weights, but the prediction performance cannot be guaranteed. The reason is that excessive orthogonality will overly punish synonymous sentences, which is apparently inappropriate. Therefore, we introduce Steps 2 and 3 to solve the above problem.
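The control flow of the three steps might be organized as in the sketch below. It is only a skeleton under stated assumptions: `finetune_backbone` and `finetune_all` are hypothetical callables standing in for the two fine-tuning stages, and only the SVD replacement of Step 1 is implemented concretely.

```python
import numpy as np

def semiorthogonal_training(W, finetune_backbone, finetune_all, sot):
    """Run `sot` rounds of Steps 1 to 3 on the SVD-FC weight matrix W.

    finetune_backbone(W): hypothetical callable that trains the other layers
        while the SVD-FC weights stay fixed (Step 2); its return value is ignored.
    finetune_all(W): hypothetical callable that continues training with the
        SVD-FC weights unfixed and returns the updated matrix (Step 3).
    """
    for _ in range(sot):
        U, s, _ = np.linalg.svd(W, full_matrices=False)
        W = U * s                  # Step 1: replace W with US
        finetune_backbone(W)       # Step 2: SVD-FC layer fixed
        W = finetune_all(W)        # Step 3: SVD-FC layer unfixed
    return W

# With no-op fine-tuning stages, the result is simply the US factor of the input.
W0 = np.random.default_rng(0).normal(size=(6, 4))
W_out = semiorthogonal_training(W0, lambda W: None, lambda W: W, sot=1)
print(np.round(W_out.T @ W_out, 6))
```

With real fine-tuning in Step 3 the columns drift away from exact orthogonality again, which is why the procedure is repeated for several rounds.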
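The inputs of the SVD-FC layer are defined as $Y = (y_1, y_2, \ldots, y_m)^T$, the outputs as $O = (o_1, o_2, \ldots, o_m)^T$, the weight matrix as $W = (w_1, w_2, \ldots, w_m)^T$, and the expected outputs as $A = (a_1, a_2, \ldots, a_m)^T$. The error function is defined as

$$E = \frac{1}{2} \sum_{k=1}^{l} (a_k - o_k)^2, \tag{10}$$

where $o_k = f\left(\sum_{j=0}^{m} w_{kj} y_j\right)$, $k = 1, 2, \ldots, l$. Then, the derivative of E with respect to $o_k$ is

$$\frac{\partial E}{\partial o_k} = -(a_k - o_k). \tag{11}$$

We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is as follows:

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}}. \tag{12}$$

We define an error signal $\delta_k^o = \partial E / \partial net_k$, where $net_k = \sum_{j} w_{kj} y_j$ is the input to the nonlinear function. Equation (12) is equivalent to

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = -\eta\, \delta_k^o \frac{\partial net_k}{\partial w_{kj}}. \tag{13}$$

According to equation (11), $\delta_k^o = \partial E / \partial net_k$ is equivalent to

$$\delta_k^o = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial net_k} = \frac{\partial E}{\partial o_k} f'(net_k) = \frac{\partial E}{\partial o_k} o_k' = -(a_k - o_k)\, o_k'. \tag{14}$$

We use the sigmoid $f(x) = 1/(1 + e^{-x})$ as the nonlinear function, for which $o_k' = o_k (1 - o_k)$. Since $\partial net_k / \partial w_{kj} = y_j$, equation (13) is equivalent to

$$\Delta w_{kj} = -\eta\, \delta_k^o\, y_j = \eta (a_k - o_k)\, o_k (1 - o_k)\, y_j. \tag{15}$$

In Step 1, the weight matrix W is decomposed by SVD and replaced with US, where $U = (q_1, q_2, \ldots, q_m)^T$ and $S = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$. Since $a_k - o_k$ is given, we define $Loss = a_k - o_k$. As a result, equation (15) is equivalent to

$$\Delta w_{kj} = \eta \cdot Loss \cdot \left[ o_k - \mathrm{sigmoid}\left( y_j \sum q_i \lambda_i + B \right)^2 \right] y_j. \tag{16}$$

Since $q_i \cdot q_j = 0$ ($i \ne j$) for the vectors in the left-unitary matrix U, the model operation is not affected by nonorthogonal eigenvectors $q_i$. This is the reason for excessively punishing synonymous sentences in Step 1. However, orthogonality has a positive effect on $\Delta w_{kj}$ in Step 2.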
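The update of equation (15) can be checked against a finite-difference gradient of the error in equation (10) for a single sigmoid output unit. In this sketch the target output is called `a`, and the weights, inputs, and learning rate are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, y):
    """Output o = f(sum_j w_j * y_j) of one unit with a sigmoid nonlinearity."""
    return sigmoid(sum(wj * yj for wj, yj in zip(w, y)))

def delta_rule_update(w, y, a, eta):
    """Weight updates of equation (15): eta * (a - o) * o * (1 - o) * y_j."""
    o = forward(w, y)
    return [eta * (a - o) * o * (1.0 - o) * yj for yj in y]

def numerical_update(w, y, a, eta, h=1e-6):
    """-eta * dE/dw_j by central differences, with E = 0.5 * (a - o)^2."""
    updates = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + h] + w[j + 1:]
        w_minus = w[:j] + [w[j] - h] + w[j + 1:]
        e_plus = 0.5 * (a - forward(w_plus, y)) ** 2
        e_minus = 0.5 * (a - forward(w_minus, y)) ** 2
        updates.append(-eta * (e_plus - e_minus) / (2 * h))
    return updates

w = [0.2, -0.4, 0.1]
y = [1.0, 0.5, -1.5]
analytic = delta_rule_update(w, y, a=1.0, eta=0.1)
numeric = numerical_update(w, y, a=1.0, eta=0.1)
print(analytic)
print(numeric)
```

The two lists agree to within finite-difference error, which confirms the sign convention of the update: the weights move so as to reduce the error.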
The purpose of SVD is to maintain the orthogonality of each weight vector in geometric space. When weight vectors are conditioned by orthogonal regularization, the correlation between weight vectors decreases. We use the following method in Step 3 to measure this correlation:

$$H = W^T W = \begin{bmatrix} \vec{w}_1^{\,T} \vec{w}_1 & \cdots & \vec{w}_1^{\,T} \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^{\,T} \vec{w}_1 & \cdots & \vec{w}_k^{\,T} \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix}, \tag{17}$$
where $W$ is a weight matrix that contains $k$ weight vectors $w_i$ ($i = 1, \ldots, k$) and $h_{ij}$ ($i, j = 1, \ldots, k$) is the dot product of $w_i$ and $w_j$. Let us define $S(W)$ as the correlation measurement of all column vectors in $W$:

$$S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} \left| h_{ij} \right|}. \tag{18}$$
When $W$ is an orthogonal matrix, the value of $S(W)$ is 1. When all weight vectors are identical, so that every off-diagonal entry $h_{ij}$ ($i \ne j$) equals the diagonal entries, $S(W)$ attains its minimum value $1/k$. Therefore, the value of $S(W)$ falls into $[1/k, 1]$. As a result, when $S(W)$ is close to $1/k$ or 0, the weight vectors are highly correlated.
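A short sketch of the measure in equations (17) and (18) makes the two extreme cases concrete. The function name `correlation_measure` and the toy matrices are assumptions chosen for illustration.

```python
import numpy as np

def correlation_measure(W):
    """S(W): trace of H = W^T W divided by the sum of |h_ij| (equation (18))."""
    H = W.T @ W
    return float(np.trace(H) / np.abs(H).sum())

k = 4
S_orthogonal = correlation_measure(np.eye(k))        # mutually orthogonal columns
S_identical = correlation_measure(np.ones((k, k)))   # k identical columns
```

Here `S_orthogonal` evaluates to 1 and `S_identical` to 1/k = 0.25, matching the bounds stated above.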
3.4. Complexity Analysis. Assume that the training sample size is $|C|$, the average number of words in each citation context is $|c|$, $C_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$. The training complexity of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. Using the Jacobi method as improved by C. F. Van Loan [12], computing the eigenvalues for the SVD decomposition of a matrix of size $K$ takes $O(K)$. Assume that the size of the weight matrix in the SVD-FC layer is $K$ and the number of channels of the input matrix is $C_{in}$. The computational cost for the SVD-FC layer is $O(2K^2 \cdot C_{in} + K)$.
4. Experiment

4.1. Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and abstracts of the cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and the 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural forms of named entities, no stemming is done, but all words are converted to lower case. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).
4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$ and the correct recommendations are $R_g \cap R_r$. Recall is defined as

$$\mathrm{recall} = \frac{\left| R_g \cap R_r \right|}{\left| R_g \right|}. \tag{19}$$
In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references. To address this problem, we select the following additional metrics.
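As a small illustration of equation (19), the sketch below computes recall for the top-k entries of a ranked list. The function name `recall_at_k` and the paper identifiers are hypothetical.

```python
def recall_at_k(ground_truth, ranked, k):
    """|R_g ∩ R_r| / |R_g|, where R_r is the top-k part of the ranked list."""
    relevant = set(ground_truth)
    if not relevant:
        return 0.0  # assumption: a query without references contributes 0
    recommended = set(ranked[:k])
    return len(relevant & recommended) / len(relevant)

ground_truth = {"p1", "p4", "p7"}            # hypothetical cited papers
ranked = ["p3", "p1", "p9", "p4", "p2"]      # hypothetical recommendation list

recall_1 = recall_at_k(ground_truth, ranked, 1)
recall_2 = recall_at_k(ground_truth, ranked, 2)
recall_5 = recall_at_k(ground_truth, ranked, 5)
```

For this toy list, recall grows from 0 at k = 1 to 1/3 at k = 2 and 2/3 at k = 5, while saying nothing about where in the list the hits occur.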
For a query $q$, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}, \tag{20}$$

where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
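Equation (20) can be sketched as below. Treating a query with no correct recommendation as contributing 0 is an assumption of this example, as are the function name and the toy data.

```python
def mean_reciprocal_rank(ground_truths, rankings):
    """Average of 1 / rank_q over all queries (equation (20))."""
    total = 0.0
    for relevant, ranked in zip(ground_truths, rankings):
        for position, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / position   # rank of the first correct hit
                break
    return total / len(rankings)

ground_truths = [{"a"}, {"x", "y"}]           # hypothetical ground truth per query
rankings = [["b", "a", "c"], ["x", "z", "y"]]  # hypothetical ranked lists

mrr = mean_reciprocal_rank(ground_truths, rankings)
```

The first query has its first hit at rank 2 and the second at rank 1, so `mrr` is (1/2 + 1) / 2 = 0.75.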
For each citation placeholder, we search for the papers that may be referenced at this placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric. In the MAP definition, $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered the relevant documents.
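The sketch below uses the standard definition of average precision, which is assumed here rather than taken from the text: precision at each rank i is weighted by the binary relevance R(d_i) and averaged over the relevant documents of the query. Function names and data are hypothetical.

```python
def average_precision(relevant, ranked):
    """Sum over ranks of precision@i * R(d_i), divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:       # R(d_i) = 1
            hits += 1
            score += hits / i     # precision at rank i
    return score / len(relevant)

def mean_average_precision(ground_truths, rankings):
    aps = [average_precision(r, l) for r, l in zip(ground_truths, rankings)]
    return sum(aps) / len(aps)

ground_truths = [{"a", "c"}, {"x"}]
rankings = [["a", "b", "c"], ["y", "x"]]

map_score = mean_average_precision(ground_truths, rankings)
```

The first query scores (1 + 2/3) / 2 = 5/6 and the second 1/2, so `map_score` is 2/3.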
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position $i$ is calculated as

$$\mathrm{NDCG}(d_1, \ldots, d_N) = \sum_i \frac{2^{\mathrm{rel}(d_i)} - 1}{\ln(i + 1)}, \tag{22}$$

where $\mathrm{rel}(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^* \rangle$ to weigh the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average NDCG score over all testing documents.
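A minimal sketch of the gain in equation (22) follows. Dividing by the gain of the ideal ordering is the usual normalization and is an assumption here, since the normalizing constant is not shown in the equation; the relevance grades are invented for illustration.

```python
import math

def dcg(relevances):
    """Sum of (2^rel - 1) / ln(i + 1) over positions i = 1..N."""
    return sum((2 ** rel - 1) / math.log(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering (assumed normalizer)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ndcg_ideal = ndcg([3, 2, 1, 0])      # already in the best order
ndcg_reversed = ndcg([0, 1, 2, 3])   # worst order of the same grades
```

The ideal ordering scores exactly 1, and any other ordering of the same grades scores lower.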
4.3. Baseline Comparison. We choose the following methods for comparison:

(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.
(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.
(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to n = 300.
(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to n = 600. For negative sampling, we set the number of negative samples to k = 10, where k is the number of noise words in the citation context. For noise-contrastive estimation, we set the number of noise samples to k = 1000.
(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and the number of recurrent layers is 2. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. The SVD-CNN model leads in most cases. More detailed analyses are given as follows.

First, we compare CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN significantly exceeds the other models in all metrics. We ascribe the success of our model to the content and correlation of our network. Because it lacks citation context information, W2V is clearly worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of the citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.

Second, we compare NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN perform worse than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper in its decoder, which is not sufficient for learning good embeddings.
4.4. The Influence on the Link Prediction of Reference Pattern Interactional Features. According to the section of the article in which the citation context appears, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets are combined to form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested in a ratio of 3:1. Tables 1 and 2 show the results on these datasets.

From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.

Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.

To better explore why mixed datasets are more complex than unmixed datasets, Figure 10 shows the change in S(W) during the training process of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets but poorly on mixed datasets, whereas SVD-CNN achieves almost the same performance on the two types of datasets. This indicates that the correlation arising from various reference patterns can significantly affect link prediction.

The change in S(W) is small on the unmixed datasets because the reference patterns of an unmixed dataset have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm still performs well on the unmixed datasets because their complexity is low.

Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.
4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several variants as follows:

(1) Using the originally learned $W$
(2) Replacing $W$ with $US$
(3) Replacing $W$ with $U$
(4) Replacing $W$ with $UV^T$
(5) Replacing $W$ with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix in the QR decomposition
(6) Replacing $W$ with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix $W$ after dimension reduction by PCA

After training converges, the different orthogonal matrices are used to replace the weight matrix $W$. We define T-cost as the time cost of replacing the weight, expressed as the proportion of the added time to the original time. As shown in Table 3, all types of decorrelation degrade the performance except for $W \to US$ and $W \to W_{PCA}$. However, the time cost of $W \to W_{PCA}$ is higher than that of $W \to US$.
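The replacements (1)-(5) can be computed with standard NumPy routines, as in the sketch below. The random matrix stands in for a learned weight matrix and the PCA variant is omitted because its exact construction is not specified here; both choices, and all names, are assumptions of this example.

```python
import numpy as np

def correlation_measure(W):
    """S(W) from equation (18)."""
    H = W.T @ W
    return float(np.trace(H) / np.abs(H).sum())

def decorrelation_variants(W):
    """Candidate replacements for W built from its SVD and QR decompositions."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Q, R = np.linalg.qr(W)
    D = np.diag(np.diag(R))          # diagonal part of the upper triangular factor
    return {
        "W": W,
        "US": U * s,                 # scales column i of U by singular value i
        "U": U,
        "UVT": U @ Vt,
        "QD": Q @ D,
    }

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # stand-in for a learned weight matrix
variants = decorrelation_variants(W)
scores = {name: correlation_measure(M) for name, M in variants.items()}
```

Every replacement has mutually orthogonal columns, so its S(W) equals 1, whereas the original matrix scores below 1. The variants differ in how much of the original scale information they keep, which is the aspect the comparison in Table 3 evaluates.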
4.6. Ablation Study. In our method, there are two essential parameters: a term sot, which denotes the number of SOT iterations, and a biased parameter $d_0$. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing $d_0 = 300$. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the weight matrix in the FC layer is highly correlated and S(W) has the lowest value. The recommendation performance then increases as sot grows, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

In our model, $d_0$ is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d_0$ for the same sot. When $d_0$ is small, the citation context carries very little information, which produces worse performance. The recommendation performance increases to a maximum point until $d_0$ reaches 300. It should be noted that although a larger $d_0$ is better, it significantly increases the training time. Therefore, we choose $d_0 = 300$.

Table 1: MRR metric on various datasets.

| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
| CNN | 0.3312 | 0.3294 | 0.3478 | 0.2773 | 0.2815 | 0.2978 |
| SVD-CNN | 0.3995 | 0.4078 | 0.3989 | 0.3878 | 0.3889 | 0.3845 |

Table 2: MAP metric on various datasets.

| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
| CNN | 0.3001 | 0.2909 | 0.3107 | 0.2572 | 0.2601 | 0.2637 |
| SVD-CNN | 0.3701 | 0.3655 | 0.3693 | 0.3498 | 0.3511 | 0.3539 |

Table 3: The comparison of related methods in Step 1.

Figure 8: Comparison of recall with different methods on CiteSeer (x-axis: number of recommended citations; y-axis: recall; methods: W2V, NPM, RBM, CP-LDA, SVD-CNN, NCN).

Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer (scores for top 10 recommendations; methods: CP-LDA, RBM, W2V, NPM, NCN, SVD-CNN).

Figure 10: The change in S(W) during training on unmixed datasets and mixed datasets (x-axis: sot; y-axis: S(W); datasets: introduction, related, main, introduction + related, introduction + main, related + main).

Figure 11: The performance impact of sot on CiteSeer.
5. Conclusion and Future Works

We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model when using the full text of papers.

Data Availability

Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at relevant places within the text as reference [4].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., "Cluscite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinsk, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central," 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/pdf/18020830201v1.pdf.
[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, UK, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of the TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
correlated, the individuals in each full-join description will also be highly correlated, which will greatly reduce retrieval performance. Sun et al. [25] proposed SVD-Net to show that guaranteeing the feature weight of the FC layer can increase the orthogonal constraint of the network and improve the accuracy. Zheng et al. [26] reported that regularization is an efficient method for improving the generalization ability of deep CNNs because it makes it possible to train more complex models while maintaining lower overfitting. They proposed a method for optimizing the feature boundary of a deep CNN through a two-stage training step to reduce the overfitting problem. However, the mixed features learned from a CNN potentially reduce the robustness of network models for identification or classification. To address this problem, Wang et al. [27] decomposed deep face features into two orthogonal components, representing age-related and identity-related features, to learn age-invariant deep face features. In this model, age-invariant deep features can be effectively obtained to improve AIFR performance. Chen et al. [28] proposed a group orthogonal convolutional neural network (GoCNN) model based on the idea of learning different groups of convolutional functions that are "orthogonal" to those in other groups, i.e., with no significant correlation among the produced features. Optimizing orthogonality among convolutional functions reduces the redundancy and increases the diversity within the architecture. Moreover, it can also yield a single CNN model with sufficient inherent diversity such that the model learns more diverse representations and has stronger generalization ability than vanilla CNNs.
3. Proposed Method

3.1. Problem Formulation. Context-aware citation recommendation is defined as the matching task between a citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is a convolutional neural network with two inputs and orthogonal constraints. It consists of the following main steps:

(1) We adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple granularity semantic features.
(2) The multiple granularity semantic features are then orthogonalized by an SVD-FC layer.
(3) We use fully connected layers to obtain the final vector representation. The logistic function or SVM is used to obtain the recommendation result.
3.2. Network Structure

3.2.1. Input Layer. Word2vec [29] is used to embed the input of our model. Each word is represented as a $d_0$-dimensional precomputed vector, where $d_0 = 300$. As a result, each sentence is represented as a feature matrix with dimension $d_0 \times s$. Through this layer, we can obtain the raw representation of the citation context $c$ and the candidate document $d$.

We also calculate the weight of common words according to the inputs. Then, we can obtain the basic input features $\mathrm{TF\text{-}IDF}(c, d)$ for our model, which is the product of $\mathrm{TF}(w_c, d)$ and $\mathrm{IDF}$ and reflects how important a word in the citation context $c$ is for a candidate document $d$ in the corpus [30]. Here, $w_c$ is a word in the citation context $c$. These two variables are calculated as follows:

$$\mathrm{TF}(w_c, d) = \frac{\mathrm{count}(w_c, d)}{\mathrm{top}(w^*, d)}, \qquad \mathrm{IDF} = \log \frac{N}{\mathrm{docs}(w_c, D)}, \tag{1}$$

where $\mathrm{count}(w_c, d)$ is the number of times the word $w_c$ appears in document $d$, $\mathrm{top}(w^*, d)$ is the number of occurrences of the word $w^*$ that appears most frequently in the candidate document $d$, $\mathrm{docs}(w_c, D)$ is the number of documents containing the word $w_c$ among all candidate citations $D$, and $N$ is the total number of candidate citations.
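Equation (1) can be sketched per word as follows. The tokenized toy corpus, the function names, and the choice to return 0 for a word that occurs in no candidate document are assumptions of this example; how the per-word scores are aggregated into TF-IDF(c, d) is not specified here and is left out.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """count(w_c, d) divided by the count of the most frequent word in d."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[word] / max(counts.values())

def idf(word, corpus):
    """log(N / docs(w_c, D)); returns 0 for unseen words (assumption)."""
    n_docs = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_docs) if n_docs else 0.0

def tf_idf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

corpus = [                                   # hypothetical candidate documents
    ["svd", "cnn", "svd"],
    ["cnn", "citation"],
    ["citation", "context", "citation"],
]
score_svd = tf_idf("svd", corpus[0], corpus)   # TF = 2/2, IDF = log(3/1)
score_cnn = tf_idf("cnn", corpus[0], corpus)   # TF = 1/2, IDF = log(3/2)
```

A word that is frequent in the document but rare in the corpus ("svd") receives a higher weight than one that is spread across documents ("cnn").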
3.2.2. Convolution Layer. The inputs of the convolution layer are the feature matrices of the citation context $c$ and the document $d$. The process of this layer is demonstrated in Figure 3. We first pad the two inputs with zero vectors so that they have the same length $s = \max(|c|, |d|)$. For every input, let $v_1, v_2, \ldots, v_s$ be the words in a sentence. We define $g_i \in \mathbb{R}^{w d_0}$, $0 < i < s + w - 1$, as the concatenation of $v_{i-w}, \ldots, v_i$. Then, this layer generates the feature $P_i \in \mathbb{R}^{d_1}$ for the phrase $v_{i-w}, \ldots, v_i$ as follows:

$$P_i = \tanh(W \cdot g_i + b), \tag{2}$$

where $W \in \mathbb{R}^{d_1 \times w d_0}$ is a convolution kernel and $b \in \mathbb{R}^{d_1}$ is the bias.
3.2.3. Average Pooling Layer. The pooling layer is usually used for feature compression. In our model, we choose average pooling because whole sentences or paragraphs can express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is "w-ap," which is the column average over a window of $w$ consecutive columns. After the convolution layer, an $s$-column feature map is converted into a new feature map with $s + w - 1$ columns. By using "w-ap," the new feature map is restored to $s$ columns. This architecture facilitates the extraction of more useful abstract features.

The second one is "all-ap," which averages over all columns. As shown in Figure 5, "all-ap" generates a representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document.
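The shapes involved in the convolution of equation (2) and in the two pooling layers can be traced with a small sketch. Zero-padding by w - 1 columns on each side to obtain s + w - 1 output columns is an assumption of this example, as are the dimensions and function names.

```python
import numpy as np

def wide_convolution(X, W, b, w):
    """Apply P_i = tanh(W g_i + b) to every window of w word vectors."""
    d0, s = X.shape
    padded = np.pad(X, ((0, 0), (w - 1, w - 1)))   # assumed zero padding
    columns = []
    for i in range(s + w - 1):
        g = padded[:, i:i + w].T.reshape(-1)       # concatenate w word vectors
        columns.append(np.tanh(W @ g + b))
    return np.stack(columns, axis=1)               # shape (d1, s + w - 1)

def w_ap(F, w):
    """Average each window of w consecutive columns."""
    n = F.shape[1] - w + 1
    return np.stack([F[:, i:i + w].mean(axis=1) for i in range(n)], axis=1)

def all_ap(F):
    """Average over all columns, giving one vector per feature map."""
    return F.mean(axis=1)

rng = np.random.default_rng(0)
d0, s, w, d1 = 5, 7, 3, 4                  # assumed toy dimensions
X = rng.normal(size=(d0, s))               # stand-in for word2vec columns
W = rng.normal(size=(d1, w * d0))
b = np.zeros(d1)

conv_out = wide_convolution(X, W, b, w)    # (4, 9): s + w - 1 columns
wap_out = w_ap(conv_out, w)                # (4, 7): back to s columns
allap_out = all_ap(conv_out)               # (4,): one vector
```

The w-ap output has the same number of columns as the input, so another convolution block can be stacked on top of it, while all-ap collapses a feature map into a single vector.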
Now we can obtain the features of the citation context and the independent features of the cited document. The next step is to obtain the semantic relationships between the citation context and the candidate paper. We use cosine similarity to measure the semantic relations:

$$\mathrm{sim}_j = \frac{\sum_{i=0}^{d_j} \left( C_{ji} \times D_{ji} \right)}{\sqrt{\sum_{i=0}^{d_j} \left( C_{ji} \right)^2 \times \sum_{i=0}^{d_j} \left( D_{ji} \right)^2}}, \quad j \in [1, 10], \tag{3}$$

where $C_j$ and $D_j$ are the distributed representations of the citation context and the candidate document after the $j$-th "all-ap" layer, respectively. A total of ten "all-ap" layers are used in our model; therefore, $j$ belongs to $[1, 10]$. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all $\mathrm{sim}_j$ and the basic features. It is then fed into the SVD-FC layer.

Figure 2: An overview of our model (citation context and document inputs, word2vec, ten convolution blocks with w-ap and all-ap pooling, spliced basic and all-ap features, SVD-FC, FC layers, and logistic/SVM output).

Figure 3: Convolution extraction generates phrases.

Figure 4: "W-ap" structure.

Figure 5: "All-ap" structure.
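Equation (3) is the ordinary cosine similarity applied once per pooling level. In the sketch below, random vectors stand in for the ten pairs of "all-ap" outputs; the vector length and names are assumptions.

```python
import numpy as np

def cosine_similarity(c, d):
    """sum(C_i * D_i) / sqrt(sum(C_i^2) * sum(D_i^2)), as in equation (3)."""
    denom = np.sqrt(np.sum(c ** 2) * np.sum(d ** 2))
    return float(np.dot(c, d) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
context_vecs = [rng.normal(size=16) for _ in range(10)]    # stand-ins for C_1..C_10
document_vecs = [rng.normal(size=16) for _ in range(10)]   # stand-ins for D_1..D_10

# One similarity per granularity; these are concatenated with the basic features.
sims = np.array([cosine_similarity(c, d)
                 for c, d in zip(context_vecs, document_vecs)])

sim_parallel = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
sim_orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

Parallel vectors give a similarity of 1 and orthogonal vectors give 0, so each `sims[j]` lies in [-1, 1].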
In most cases, we find that using the outputs of all pooling layers as the input of the SVD-FC layer improves the performance. The reason is that features from different layers represent different levels of semantics, so neglecting any layer causes information loss.

Next, we use the SVD-FC layer to learn the nonlinear combination features of citation relationships. This layer can force the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer can also reduce the negative impact of excessive parameters.
3.2.4. SVD-FC Layer. In this layer, we use SVD to factorize the weight matrix $W$ ($W = USV^T$) and replace it with $US$. Our experimental results show that this replacement can reduce the negative impact on the sample space.

The Euclidean distance between samples can be used to measure whether their feature expression changes in the sample space. Denoting $e_m$ and $e_n$ as the feature maps of two different samples, we can obtain two different outputs of the full connection operation by using the weight matrix $W$ or $US$ as follows:

$$p = e \times W, \tag{4}$$

$$q = e \times US. \tag{5}$$

As seen in the above equations, $q$ is the orthogonalized output, while $p$ is not orthogonalized. Then, we can obtain the following theorem.
Theorem 1. $p$ and $q$ in equations (4) and (5) generate the same Euclidean distance for samples $e_m$ and $e_n$.

Proof. The Euclidean distance $L$ between $p_m$ and $p_n$ is calculated as follows:

$$L = \left\| \vec{p}_m - \vec{p}_n \right\|_2 = \sqrt{(\vec{e}_m - \vec{e}_n)^T W W^T (\vec{e}_m - \vec{e}_n)} = \sqrt{(\vec{e}_m - \vec{e}_n)^T U S V^T V S^T U^T (\vec{e}_m - \vec{e}_n)}. \tag{6}$$

Since $V$ is an orthogonal matrix, equation (6) is equivalent to

$$L = \sqrt{(\vec{e}_m - \vec{e}_n)^T U S S^T U^T (\vec{e}_m - \vec{e}_n)} = \sqrt{(\vec{q}_m - \vec{q}_n)^T (\vec{q}_m - \vec{q}_n)} = \left\| \vec{q}_m - \vec{q}_n \right\|_2. \tag{7}$$

It can be seen that $\left\| \vec{p}_m - \vec{p}_n \right\|_2 = \left\| \vec{q}_m - \vec{q}_n \right\|_2$.

It should be noted that replacing the weight has no negative impact on, and does not change the discrimination ability for, the entire sample space. As shown in Figure 7, we use the SVD of the weight matrix $W$ to map the feature map to an orthogonal linear space.
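Theorem 1 can be checked numerically. The sketch below treats the feature maps as row vectors multiplied by the weight matrix, as in equations (4) and (5); the matrix sizes and random inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))                        # stand-in FC weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(s) Vt
US = U * s                                         # replacement weight matrix

e_m = rng.normal(size=8)                           # feature maps of two samples
e_n = rng.normal(size=8)

dist_p = np.linalg.norm(e_m @ W - e_n @ W)         # distance under W
dist_q = np.linalg.norm(e_m @ US - e_n @ US)       # distance under US
```

Both distances agree up to floating-point error, because multiplying by the orthogonal factor `Vt` does not change vector lengths.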
3.2.5. Output Layer. The citation recommendation problem is regarded as a classification task in our model. In this layer, a logistic function or SVM handles the binary classification task and predicts the final citation relationship.
33 Training Details
3.3.1. Embeddings. In our model, words are initialized with 300-dimensional word2vec embeddings, which are not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from $[-0.01, 0.01]$. We employ AdaGrad [31] and L2 regularization. We introduce adversarial training [32] for embeddings to make the model more robust. This is achieved by replacing the word vector $v$ obtained from word2vec with the perturbed word vector $v^*$:

$$v^* = v \times r_{adv}, \tag{8}$$

where $r_{adv}$ is the worst-case perturbation on the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function $\log p(y \mid x; \hat{\theta})$ around $x$, where $\hat{\theta}$ is a constant set to the current parameters of our model; it is used only to compute $r_{adv}$, and no gradients are backpropagated through it. With the linear approximation and the L2 norm constraint, the adversarial perturbation is

$$r_{adv} = -\epsilon \frac{g}{\lVert g \rVert_2}, \quad \text{where } g = \nabla_x \log p(y \mid x; \hat{\theta}). \tag{9}$$

This perturbation can be easily computed by using backpropagation in neural networks.

Figure 6: Generating the feature map (the all-ap feature $sim_i$, computed from $C_i$ and $D_i$, together with the basic feature).
Figure 7: SVD-FC layer (input feature, SVD-FC layer, output feature).
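The perturbation of equation (9) can be sketched in a few lines of plain Python. The function name, the example gradient, and the value of $\epsilon$ are hypothetical choices for illustration; in practice $g$ would come from backpropagation through the model.

```python
import math

def adversarial_perturbation(grad, epsilon):
    """Return r_adv = -epsilon * g / ||g||_2 for a gradient vector g (equation (9))."""
    norm = math.sqrt(sum(x * x for x in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]  # zero gradient: no direction to perturb along
    return [-epsilon * x / norm for x in grad]

# Illustrative gradient of log p(y | x) with respect to a 2-dimensional word vector.
g = [3.0, 4.0]
r_adv = adversarial_perturbation(g, epsilon=0.5)
print(r_adv)  # [-0.3, -0.4]
```

By construction the result has L2 norm $\epsilon$ whenever the gradient is nonzero.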
3.3.2. Layerwise Training. In our training steps, we define a conv-pooling block $b_t$ ($t \ge 2$), which consists of a convolution layer and a pooling layer. Our network model is then assembled from the initialization block $b_1$, which is initialized using word2vec, and $(n - 1)$ conv-pooling blocks.
First, we train the conv-pooling block $b_2$ after $b_1$ is trained. On this basis, the next conv-pooling block $b_3$ is created while keeping the previous block fixed. We repeat this procedure until all $(n - 1)$ conv-pooling blocks are trained.
Second, the following semiorthogonal training procedure is used to train the whole network.
Semiorthogonal training (SOT). SOT is crucial for training SVD-CNN and consists of the following three steps:
Step 1. Decompose the weight matrix by SVD, i.e., $W = USV^T$, where $W$ is the weight matrix of the linear layer, $U$ is the left-unitary matrix, $S$ is the singular value matrix, and $V$ is the right-unitary matrix. After that, we replace $W$ with $US$. Next, we take all eigenvectors of $US(US)^T$ as weight vectors.
Step 2. The backbone model is fine-tuned with the SVD-FC layer fixed.
Step 3. The model keeps fine-tuning with the SVD-FC layer unfixed.
Step 1 can generate orthogonal weights, but the prediction performance cannot be guaranteed. The reason is that excessive orthogonality will excessively punish synonymous sentences, which is inappropriate. Therefore, we introduce Steps 2 and 3 to solve this problem.
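The weight replacement in Step 1 can be illustrated with a small NumPy sketch. It covers only the decomposition and replacement; Steps 2 and 3 depend on the training framework and are not shown. The helper name `svd_replace` and the matrix size are assumptions made for this example.

```python
import numpy as np

def svd_replace(W):
    """Step 1 sketch: factor W = U S V^T and return US."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U * s  # scales column i of U by s[i], same as U @ np.diag(s)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))        # stand-in for a trained FC weight matrix
W_new = svd_replace(W)

gram_W = W.T @ W                   # generally has nonzero off-diagonal entries
gram_new = W_new.T @ W_new         # equals diag(s**2): columns are orthogonal
off_diag_W = gram_W - np.diag(np.diag(gram_W))
off_diag_new = gram_new - np.diag(np.diag(gram_new))
print(np.abs(off_diag_W).max(), np.abs(off_diag_new).max())
```

After the replacement the weight vectors (columns of $US$) are mutually orthogonal, which is the property that Steps 2 and 3 then relax and restore during fine-tuning.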
The inputs of SVD-FC are defined as $Y = (y_1, y_2, \ldots, y_m)^T$. The outputs are defined as $O = (o_1, o_2, \ldots, o_m)^T$. The weight matrix is defined as $W = (w_1, w_2, \ldots, w_m)^T$. The expected outputs are defined as $A = (a_1, a_2, \ldots, a_m)^T$. The error function is defined as

$$E = \frac{1}{2} \sum_{k=1}^{l} (a_k - o_k)^2, \tag{10}$$

where $o_k = f\left(\sum_{j=0}^{m} w_{kj} y_j\right)$, $k = 1, 2, \ldots, l$. Then, the derivative of $E$ with respect to $o_k$ is

$$\frac{\partial E}{\partial o_k} = -(a_k - o_k). \tag{11}$$
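We utilize the gradient descent strategy, which requires the gradient of the error with respect to the weights. The iterative update of the weights is as follows:

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}}. \tag{12}$$

We define an error signal $\delta_{o_k} = \partial E / \partial net_k$, where $net_k = \sum_j w_{kj} y_j$ is the input to $f$; equation (12) is equivalent to

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = -\eta \delta_{o_k} \frac{\partial net_k}{\partial w_{kj}}. \tag{13}$$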
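According to equation (11), $\delta_{o_k} = \partial E / \partial net_k$ is equivalent to

$$\delta_{o_k} = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial net_k} = \frac{\partial E}{\partial o_k} f'(net_k) = \frac{\partial E}{\partial o_k} o_k' = -(d_k - o_k) o_k', \tag{14}$$

where $d_k$ denotes the expected output $a_k$ of equation (10).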
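We use the sigmoid $f(x) = 1/(1 + e^{-x})$ as the nonlinear function, for which $o_k' = o_k(1 - o_k)$; since $\partial net_k / \partial w_{kj} = y_j$, equation (13) is equivalent to

$$\Delta w_{kj} = -\eta \delta_{o_k} y_j = \eta (d_k - o_k) o_k (1 - o_k) y_j. \tag{15}$$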
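The update of equation (15) can be traced for a single sigmoid unit with the following plain-Python sketch. It assumes the standard logistic sigmoid $1/(1 + e^{-x})$; the initial weights, inputs, target, and learning rate are made-up numbers for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def delta_rule_step(w, y, d, eta):
    """Apply one update of equation (15) to a single sigmoid unit.

    Returns the updated weights and the output o computed before the update.
    """
    net = sum(wj * yj for wj, yj in zip(w, y))
    o = sigmoid(net)
    w_new = [wj + eta * (d - o) * o * (1.0 - o) * yj for wj, yj in zip(w, y)]
    return w_new, o

w = [0.1, -0.2, 0.05]   # hypothetical initial weights w_kj
y = [1.0, 0.5, -1.0]    # hypothetical inputs y_j
d = 1.0                 # expected output d_k
w_new, o_before = delta_rule_step(w, y, d, eta=0.5)
_, o_after = delta_rule_step(w_new, y, d, eta=0.5)
print(o_before, o_after)  # the output moves toward the target d
```

One step moves the output toward the target, so the squared error of equation (10) decreases for this example.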
In Step 1, the weight matrix $W$ is decomposed by SVD and replaced with $US$, where $U = (q_1, q_2, \ldots, q_m)^T$ and $S = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$. Since $d_k - o_k$ is given, we define $Loss = d_k - o_k$. As a result, equation (15) is equivalent to

$$\Delta w_{kj} = \eta \, Loss \cdot \left[ o_k - \mathrm{sigmoid}\left( y_j \sum q_i \lambda_i + B \right)^2 \right] y_j. \tag{16}$$

The vectors $q_i$ belong to the left-unitary matrix $U$ and satisfy $q_i \cdot q_j = 0$ for $i \ne j$, so the model operation is not affected by nonorthogonal eigenvectors $q_i$. This is the reason for excessively punishing synonymous sentences in Step 1. However, orthogonality has a positive effect on $\Delta w_{kj}$ in Step 2.
The purpose of SVD is to maintain the orthogonality of each weight vector in geometric space. When weight vectors are conditioned by orthogonal regularization, the relevancy between weight vectors decreases. We use the following method in Step 3 to measure relevance:

$$H = W^T W = \begin{bmatrix} \vec{w}_1^T \vec{w}_1 & \cdots & \vec{w}_1^T \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^T \vec{w}_1 & \cdots & \vec{w}_k^T \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix}, \tag{17}$$
where $W$ is a weight matrix that contains $k$ weight vectors $w_i$ ($i = 1, \ldots, k$) and $h_{ij}$ ($i, j = 1, \ldots, k$) is the dot product of $w_i$ and $w_j$. Let us define $S(W)$ as the correlation measurement of all column vectors in $W$:

$$S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} \lvert h_{ij} \rvert}. \tag{18}$$
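When $W$ is an orthogonal matrix, that is, $h_{ij} = 0$ for all $i \ne j$, the value of $S(W)$ is 1. When the off-diagonal entries are as large in magnitude as the diagonal ones ($\lvert h_{ij} \rvert = h_{ii}$ for all $i \ne j$, i.e., all weight vectors coincide up to sign), $S(W)$ attains its minimum value $1/k$. Therefore, the value of $S(W)$ falls into $[1/k, 1]$. As a result, when $S(W)$ is close to $1/k$ (which itself approaches 0 for large $k$), the weight matrix has high relevance.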
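The measure $S(W)$ of equation (18) is straightforward to compute. The following NumPy sketch evaluates it at the two extremes discussed above; the function name and the $4 \times 4$ test matrices are illustrative assumptions.

```python
import numpy as np

def correlation_measure(W):
    """S(W) from equation (18): trace of H over the sum of |h_ij|, with H = W^T W."""
    H = W.T @ W
    return np.trace(H) / np.abs(H).sum()

k = 4
print(correlation_measure(np.eye(k)))        # orthogonal columns -> 1.0
print(correlation_measure(np.ones((k, k))))  # identical columns  -> 0.25 (= 1/k)
```

Values between these extremes indicate partially correlated weight vectors.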
3.4. Complexity Analysis. Assume that the training sample size is $|C|$, the average number of words in each citation context is $|c|$, $C_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$. The training complexity of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. With the Jacobi method, which was improved by C. F. Van Loan [12], computing the eigenvalues for the SVD of a matrix of size $K$ takes $O(K)$. Assume that the size of the weight matrix in the SVD-FC layer is $K$ and the number of channels of the input matrix is $C_{in}$. The computational cost for the SVD-FC layer is $O(2K^2 \cdot C_{in} + K)$.
4. Experiment
4.1. Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, each citation relationship is extracted as a pair consisting of a citation context and the abstract of the cited paper. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and the 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural forms of named entities, no stemming is done, but all words are converted to lower case. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.
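The context extraction described above (a fixed word window around the placeholder, lower-casing, stop-word removal, no stemming) can be sketched as follows. The function name, the tokenization by whitespace, and the tiny stop-word set are assumptions for illustration and are not the preprocessing code used to build the dataset.

```python
def citation_context(tokens, placeholder_index, window=50, stop_words=frozenset()):
    """Return lower-cased words within `window` tokens on each side of the placeholder."""
    start = max(0, placeholder_index - window)
    before = tokens[start:placeholder_index]
    after = tokens[placeholder_index + 1:placeholder_index + 1 + window]
    # No stemming: words are only lower-cased and filtered against the stop-word set.
    return [t.lower() for t in before + after if t.lower() not in stop_words]

tokens = "Neural models such as [] have improved Citation Recommendation".split()
ctx = citation_context(tokens, tokens.index("[]"), window=3,
                       stop_words={"as", "such", "have"})
print(ctx)  # ['models', 'improved', 'citation']
```

With `window=50` the same function would reproduce the 50-words-before and 50-words-after setting described in the text.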
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).
4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$ and the correct recommendations are $R_g \cap R_r$. Recall is defined as

$$\mathrm{recall} = \frac{\lvert R_g \cap R_r \rvert}{\lvert R_g \rvert}. \tag{19}$$
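Equation (19) translates directly into set operations. The sketch below uses hypothetical paper identifiers and is only meant to make the definition concrete.

```python
def recall(ground_truth, recommended):
    """|R_g intersect R_r| / |R_g| as in equation (19)."""
    r_g, r_r = set(ground_truth), set(recommended)
    return len(r_g & r_r) / len(r_g)

# Two of the four ground-truth references appear among the recommendations.
print(recall({"p1", "p2", "p3", "p4"}, ["p2", "p9", "p4"]))  # 0.5
```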
In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references. To address this problem, we select the following two additional metrics.
For a query $q$, let $rank_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$\mathrm{MRR} = \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{rank_q}, \tag{20}$$

where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
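A minimal sketch of equation (20) is given below. The input format (a list of ranked-list and relevant-set pairs) is an assumption, as is the convention that a query with no correct recommendation contributes 0 to the sum.

```python
def mean_reciprocal_rank(queries):
    """queries: list of (ranked_list, relevant_set) pairs; implements equation (20)."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first correct recommendation counts
    return total / len(queries)

# Reciprocal ranks are 1/2, 1/1, and 0 (no hit), so the mean is 0.5.
queries = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"}), (["m", "n"], {"z"})]
print(mean_reciprocal_rank(queries))  # 0.5
```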
For each citation placeholder, we search for the papers that may be referenced at this citation placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric. In its definition, $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.
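For concreteness, the sketch below computes MAP under the standard IR definition, in which the precision at each relevant position is averaged over the relevant documents of a query and then over all queries. This standard form is an assumption on our part for illustration, and the function names and example data are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision@i over the positions i where R(d_i) = 1."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:      # R(d_i) = 1
            hits += 1
            score += hits / i    # precision at cut-off i
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Relevant papers at ranks 1 and 3: (1/1 + 2/3) / 2, roughly 0.83.
print(average_precision(["a", "b", "c", "d"], {"a", "c"}))
```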
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position $i$ is calculated as

$$\mathrm{NDCG}(d_1, \ldots, d_N) = \sum_i \frac{2^{rel(d_i)} - 1}{\ln(i + 1)}, \tag{22}$$

where $rel(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^* \rangle$ to weigh the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average NDCG score over all testing documents.
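The gain and discount of equation (22) can be sketched as follows. Dividing by the score of the ideal ordering is the usual way to normalize and is assumed here; the relevance grades in the example are hypothetical 4-scale values.

```python
import math

def dcg(relevances):
    """Sum of (2^rel - 1) / ln(i + 1) over positions i = 1..N, as in equation (22)."""
    return sum((2 ** rel - 1) / math.log(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering; an assumed normalizer."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1]))  # slightly below 1: the last two items are swapped
print(ndcg([3, 2, 1, 0]))  # 1.0 for the ideal ordering
```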
4.3. Baseline Comparison. We choose the following methods for comparison:
(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.
(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.
(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a “word” (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to $n = 300$.
(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to $n = 600$. For negative sampling, we set the number of negative samples to $k = 10$, where $k$ is the number of noise words in the citation context. For noise-contrastive estimation, we set the number of noise samples to $k = 1000$.
(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and the number of recurrent layers is 2. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. The SVD-CNN model leads the performance in most cases. More detailed analyses are given as follows.
First, we perform a comparison among CP-LDA, RBM, W2V, and SVD-CNN. Our SVD-CNN exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation of our network. Due to the lack of citation context information, W2V is clearly worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of the citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect the semantic relations between the cited document and the citation context and thus may be limited by vocabulary.
Second, we compare the performance of NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for the decoder, which is not sufficient for learning good embeddings.
4.4. The Influence on the Link Prediction of Reference Pattern Interactional Features. According to the section in which the citation context appears in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested in a ratio of 3:1. In Tables 1 and 2, we show the results on the abovementioned datasets.
From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of citation recommendation tasks.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
To better explore why mixed datasets are more complex than unmixed datasets, in Figure 10, we show the change in $S(W)$ during the training process of SVD-CNN on the various datasets.
As shown in Figure 10, the increase in $S(W)$ on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets while performing poorly on mixed datasets. However, SVD-CNN achieves almost the same performance on the two types of datasets. This indicates that the correlation arising from various reference patterns can significantly affect link prediction.
The reason why the change in $S(W)$ is not large on the unmixed datasets is that the reference patterns of unmixed datasets have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.
Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.
4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several variants as follows:
(1) Using the originally learned $W$
(2) Replacing $W$ with $US$
(3) Replacing $W$ with $U$
(4) Replacing $W$ with $UV^T$
(5) Replacing $W$ with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix in Q-R decomposition
(6) Replacing $W$ with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix $W$ after dimension reduction by PCA
After training converges, different orthogonal matrices are used to replace the weight matrix $W$. We define T-cost as the time cost of replacing the weight, which is the proportion of the added time to the original time. As shown in Table 3, the other types of decorrelation degrade the performance, except for $W \to US$ and $W \to W_{PCA}$. However, the time cost of $W \to W_{PCA}$ is higher than that of $W \to US$.
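The replacement variants (1) to (5) can be constructed with standard NumPy routines, as in the sketch below. It only builds the matrices and reports the orthogonality measure $S(W)$ for each; it does not reproduce the accuracy or T-cost comparison of Table 3. The random matrix stands in for a trained weight matrix, the names are illustrative, and the PCA-based variant (6) is omitted because its construction is not specified in enough detail to reproduce here.

```python
import numpy as np

def s_measure(W):
    """S(W) of equation (18) with H = W^T W."""
    H = W.T @ W
    return np.trace(H) / np.abs(H).sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 4))                       # stand-in for a trained weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)  # W = U diag(s) Vt
Q, R = np.linalg.qr(W)                            # W = Q R, R upper triangular

variants = {
    "W": W,                    # (1) originally learned weights
    "US": U * s,               # (2) columns of U scaled by singular values
    "U": U,                    # (3)
    "UV^T": U @ Vt,            # (4)
    "QD": Q * np.diag(R),      # (5) D = diagonal of R
}
for name, M in variants.items():
    print(f"{name:5s} S = {s_measure(M):.4f}")
```

All four replacements have mutually orthogonal columns, so they share $S = 1$; they differ in how much of the scale information of $W$ they retain, which is what the comparison in Table 3 evaluates.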
4.6. Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations, and a biased parameter $d_0$. In this section, we conduct an ablation study of these parameters.
We first evaluate the effect of sot by empirically fixing $d_0 = 300$. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the weight matrix in the FC layer is highly correlated, and $S(W)$ has the lowest value. The recommendation performance then increases as sot increases, which indicates that reducing the degree of correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

Table 1: MRR metric on various datasets.
| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
|---|---|---|---|---|---|---|
| CNN | 0.3312 | 0.3294 | 0.3478 | 0.2773 | 0.2815 | 0.2978 |
| SVD-CNN | 0.3995 | 0.4078 | 0.3989 | 0.3878 | 0.3889 | 0.3845 |

Figure 8: Comparison of recall with different methods on CiteSeer (x-axis: number of recommended citations; y-axis: recall; methods: W2V, NPM, RBM, CP-LDA, SVD-CNN, and NCN).
Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer (scores for the top 10 recommendations; methods: CP-LDA, RBM, W2V, NPM, NCN, and SVD-CNN).

Table 2: MAP metric on various datasets.
| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
|---|---|---|---|---|---|---|
| CNN | 0.3001 | 0.2909 | 0.3107 | 0.2572 | 0.2601 | 0.2637 |
| SVD-CNN | 0.3701 | 0.3655 | 0.3693 | 0.3498 | 0.3511 | 0.3539 |
In our model, $d_0$ is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d_0$ for the same sot. When $d_0$ is small, the information content of the citation context is very small, which produces worse performance. The recommendation performance increases until $d_0$ reaches 300. It should be noted that, although a larger $d_0$ is better, a larger $d_0$ significantly increases the training time. Therefore, we choose $d_0 = 300$.

Figure 10: The change in $S(W)$ during training on unmixed datasets and mixed datasets (x-axis: sot; y-axis: $S(W)$; curves: Introduction, Related, Main, Introduction + related, Introduction + main, and Related + main).
Table 3: The comparison of related methods in Step 1.
Figure 11: The performance impact of sot on CiteSeer.
5. Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model when using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at [https://psu.app.box.com/v/refseer]. These prior datasets are cited at relevant places within the text as reference [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, “Conditional random field based sentence context identification: enhancing citation services for the research community,” in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., “Context-aware citation recommendation,” in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., “Citation recommendation without author supervision,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, “A neural probabilistic model for context based citation recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, “A neural network approach to quote recommendation in writings,” in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., “Cluscite: effective citation recommendation by information network-based clustering,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, “Neural citation network for context-aware citation recommendation,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, “Reference directed indexing: redeeming relevance for subject search in citation indexes,” Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinsk, “CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central,” 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, “Comparing citation contexts for information retrieval,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., “Content-based citation recommendation,” 2018, https://arxiv.org/pdf/18020830201v1.pdf.
[14] H. Jia and E. Saule, “Local is good: a fast citation recommendation approach,” Lecture Notes in Computer Science, Vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, “A personalized paper recommendation approach based on web paper mining and reviewer’s interest modelling,” in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, “Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, “Recommending citations for academic papers,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., “CiteSight: supporting contextual citation recommendation using differential search,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., “Recommending citations with translation model,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., “Recommending citations: translating papers into references,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., “Cross-language context-aware citation recommendation in scientific articles,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., “Neural photo editing with introspective adversarial networks,” in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., “Large scale GAN training for high fidelity natural image synthesis,” 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., “SVDNet for pedestrian retrieval,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, “Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process,” IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., “Orthogonal deep features decomposition for age-invariant face recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., “Training group orthogonal neural networks with privileged information,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., “Distributed representations of words and phrases and their compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, “Data mining,” Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., “Concept-based document recommendations for CiteSeer authors,” in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, “The TREC-8 question answering track report,” in Proceedings of the TREC’00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, “Utilizing context in generative bayesian models for linked corpus,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, “A discriminative approach to topic-based citation recommendation,” in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
Theorem 1 p and q in equations (4) and (5) will generatethe same Euclidean distance for samples em and en
Proof 0e Euclidean distance L between pm and pn iscalculated as follows
L pm
rarrminus pn
rarr2
emrarr
minus enrarr
( 1113857TWW
Temrarr
minus enrarr
( 1113857
1113969
emrarr
minus enrarr
( 1113857TUSVV
TS
TU
Temrarr
minus enrarr
( 1113857
1113969(6)
Since V is an orthogonal matrix equation (6) isequivalent to
L
emrarr
minus enrarr
( 1113857TUSS
TU
Temrarr
minus enrarr
( 1113857
1113969
qmrarr
minus qnrarr
( 1113857T
qmrarr
minus qnrarr
( 1113857
1113969
qmrarr
minus qnrarr
2
(7)
It can be seen that pm
rarrminus pn
rarr2 qm
rarrminus qn
rarr2
It should be noted that there are no negative impacts andno changes in discrimination ability for the entire samplespace when replacing the weight As shown in Figure 7 weuse SVD of weight matrix W to map the feature map to anorthogonal linear space
325 Output Layer 0e citation recommendation problemis regarded as a classification task in our model In this layerlogistics and SVM can deal with binary classification tasksand predict the final citation relationship
33 Training Details
331 Embeddings In our model words are initialized by300-dimensional word2vec embeddings and will notchange during training A single randomly initializedembedding is created for all unknown words by uniformsampling from[minus001 001] We employ AdaGrad [31] andL2 regularization We introduce adversarial training [32]for embeddings to make the model more robust 0eprocess is achieved by replacing the word vector v afterword2vec embeddings using word vector with disturbingvlowast
vlowast
v times radv (8)
where radv is the worst case of perturbation on the wordvector Goodfellow et al [33] approximated this value bylinearizing the loss function logp(y|x 1113954θ) around x where1113954θ is a constant set to the current parameters of our modeland it only participates in the calculation process of radvwithout a backpropagation algorithm With the linearapproximation and L2 norm constraint the adversarialperturbation is
All-ap-feature
simi
Ci Di
Basic-feature
Figure 6 Generating the feature map
SVD-FC layer input feature
SVD-FC layer output featureSVD-FC layer
Figure 7 SVD-FC layer
Computational Intelligence and Neuroscience 5
radv minusising
g2 whereg nablaxlogp(y|x 1113954θ) (9)
0is perturbation can be easily computed by usingbackpropagation in neural networks
332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks
First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained
Second the following semiorthogonal training proce-dure is used to train the whole network
Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps
Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer
Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem
The inputs of SVD-FC are defined as $Y = (y_1, y_2, \dots, y_m)^T$. The outputs are defined as $O = (o_1, o_2, \dots, o_m)^T$. The weight matrix is defined as $W = (w_1, w_2, \dots, w_m)^T$. The expected outputs are defined as $A = (a_1, a_2, \dots, a_m)^T$. The error function is defined as

$$E = \frac{1}{2} \sum_{k=1}^{l} (a_k - o_k)^2, \tag{10}$$

where $o_k = f\left(\sum_{j=0}^{m} w_{kj} y_j\right)$, $k = 1, 2, \dots, l$. Differentiating $E$ with respect to $o_k$ gives

$$\frac{\partial E}{\partial o_k} = -(a_k - o_k). \tag{11}$$
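We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is as follows:

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}}. \tag{12}$$

Let $net_k = \sum_j w_{kj} y_j$ denote the weighted input of output unit $k$. We define an error signal $\delta_k^o = \partial E / \partial net_k$; equation (12) is then equivalent to

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = -\eta\, \delta_k^o \frac{\partial net_k}{\partial w_{kj}}. \tag{13}$$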
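According to equation (11), $\delta_k^o = \partial E / \partial net_k$ is equivalent to

$$\delta_k^o = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial net_k} = \frac{\partial E}{\partial o_k} f'(net_k) = \frac{\partial E}{\partial o_k}\, o_k' = -(d_k - o_k)\, o_k', \tag{14}$$

where $d_k$ denotes the expected output of unit $k$ (written $a_k$ in equation (11)).

We use the sigmoid $f(x) = 1/(1 + e^{-x})$ as the nonlinear function, so that $o_k' = o_k (1 - o_k)$ and $\partial net_k / \partial w_{kj} = y_j$. Equation (13) is therefore equivalent to

$$\Delta w_{kj} = -\eta\, \delta_k^o\, y_j = \eta\, (d_k - o_k)\, o_k (1 - o_k)\, y_j. \tag{15}$$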
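The update in equation (15) can be checked numerically. The sketch below, with hypothetical shapes and a hypothetical learning rate, implements the update for a single sigmoid layer and provides the error function of equation (10), so that the update can be compared with a finite-difference estimate of $-\eta\, \partial E / \partial w_{kj}$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(W, y, d):
    """E = 1/2 * sum_k (d_k - o_k)^2 with o = sigmoid(W y)."""
    o = sigmoid(W @ y)
    return 0.5 * np.sum((d - o) ** 2)

def delta_rule_update(W, y, d, eta):
    """Delta w_kj = eta * (d_k - o_k) * o_k * (1 - o_k) * y_j, as in equation (15)."""
    o = sigmoid(W @ y)
    signal = (d - o) * o * (1.0 - o)   # equals -delta_k^o
    return eta * np.outer(signal, y)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # hypothetical layer: 4 inputs, 3 outputs
y = rng.normal(size=4)
d = np.array([0.0, 1.0, 0.5])          # hypothetical expected outputs
W_new = W + delta_rule_update(W, y, d, eta=0.1)
```

For a sufficiently small learning rate, one such step lowers the error $E$ on the sample it was computed from.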
In Step 1, the weight matrix $W$ is decomposed by SVD and replaced with $US$, where $U = (q_1, q_2, \dots, q_m)^T$ and $S = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$. Since $d_k - o_k$ is given, we define $Loss = d_k - o_k$. As a result, equation (15) is equivalent to

$$\Delta w_{kj} = \eta \cdot Loss \cdot \left[ o_k - \mathrm{sigmoid}\left( y_j \sum_i q_i \lambda_i + B \right)^{2} \right] y_j. \tag{16}$$

Because the vectors in the left-unitary matrix $U$ satisfy $q_i \cdot q_j = 0$ for $i \ne j$, the model operation is not affected by nonorthogonal eigenvectors $q_i$. This is the reason for excessively punishing synonymous sentences in Step 1. However, orthogonality has a positive effect on $\Delta w_{kj}$ in Step 2.
The purpose of SVD is to maintain the orthogonality of each weight vector in geometric space. When weight vectors are conditioned by orthogonal regularization, the relevancy between weight vectors decreases. We use the following method in Step 3 to measure relevance:

$$H = W^T W = \begin{bmatrix} \vec{w}_1^T \vec{w}_1 & \cdots & \vec{w}_1^T \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^T \vec{w}_1 & \cdots & \vec{w}_k^T \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix}, \tag{17}$$

where $W$ is a weight matrix that contains $k$ weight vectors $w_i$ ($i = 1, \dots, k$), and $h_{ij}$ ($i, j = 1, \dots, k$) is the dot product of $w_i$ and $w_j$. Let us define $S(W)$ as the correlation measurement of all column vectors in $W$:

$$S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} |h_{ij}|}. \tag{18}$$
When $W$ is an orthogonal matrix, so that $h_{ij} = 0$ for all $i \ne j$, the value of $S(W)$ is 1. When all weight vectors are identical, so that $h_{ij} = h_{ii}$ for all $i \ne j$, $S(W)$ attains its minimum value $1/k$. Therefore, the value of $S(W)$ falls into $[1/k, 1]$. As a result, when $S(W)$ is close to $1/k$ or 0, the weight matrix has high relevance.
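The correlation measurement of equation (18) is straightforward to compute. The following sketch uses a hypothetical function name and checks the two boundary cases: mutually orthogonal columns give $S(W) = 1$, and $k$ identical columns give $S(W) = 1/k$.

```python
import numpy as np

def correlation_measure(W):
    """S(W) = sum_i h_ii / sum_ij |h_ij| with H = W^T W (columns are weight vectors)."""
    H = W.T @ W
    return np.trace(H) / np.sum(np.abs(H))

s_orthogonal = correlation_measure(np.eye(4))        # orthogonal columns
s_identical = correlation_measure(np.ones((5, 3)))   # three identical columns
```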
3.4 Complexity Analysis. Assume that the training sample size is $|C|$, the average number of words in each citation context is $|c|$, $C_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$. The training complexity of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. With the Jacobi method, which was improved by C. F. Van Loan [12], computing the eigenvalues for the SVD decomposition of a matrix of size $K$ takes $O(K)$. Assume that the size of the weight matrix in the SVD-FC layer is $K$ and the number of channels of the input matrix is $C_{in}$. The computational cost for the SVD-FC layer is $O(2K^2 \cdot C_{in} + K)$.
4 Experiment
4.1 Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and the abstracts of cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and the 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural forms of named entities, no stemming is done, but all words are converted to lowercase. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.
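The context extraction described above can be sketched as follows. The tokenization, the placeholder token, the stop-word list, and the function name are assumptions made for illustration; only the window of 50 words on each side, the lowercasing, the stop-word removal, and the absence of stemming come from the description.

```python
# Illustrative stop-word list; the list actually used is not specified here.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def citation_context(tokens, cite_index, window=50, stop_words=STOP_WORDS):
    """Return the lowercased, stop-word-filtered words around a citation placeholder.

    tokens      list of word tokens of the citing paper
    cite_index  position of the citation placeholder in tokens
    window      number of words kept before and after the placeholder
    """
    start = max(0, cite_index - window)
    before = tokens[start:cite_index]
    after = tokens[cite_index + 1:cite_index + 1 + window]
    words = [t.lower() for t in before + after]   # lowercase only, no stemming
    return [w for w in words if w not in stop_words]

tokens = "The model of He et al [CITE] improves Recall in the ranking".split()
context = citation_context(tokens, tokens.index("[CITE]"), window=3)
# context == ['he', 'et', 'al', 'improves', 'recall']
```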
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).
4.2 Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$ and the correct recommendations are $R_g \cap R_r$. Recall is defined as

$$\mathrm{recall} = \frac{|R_g \cap R_r|}{|R_g|}. \tag{19}$$
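A minimal sketch of equation (19) for a list truncated to the top $k$ recommendations; the function name and the document identifiers are hypothetical.

```python
def recall_at_k(recommended, ground_truth, k):
    """|R_g ∩ R_r| / |R_g|, where R_r is the top-k part of the ranked list."""
    truth = set(ground_truth)
    if not truth:
        return 0.0  # convention chosen here for queries without references
    return len(truth & set(recommended[:k])) / len(truth)

ranked = ["p3", "p1", "p7", "p2"]      # hypothetical ranked recommendations
references = {"p1", "p2"}              # hypothetical ground truth R_g
r_at_2 = recall_at_k(ranked, references, 2)   # 0.5
r_at_4 = recall_at_k(ranked, references, 4)   # 1.0
```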
In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references. To address this problem, we select the following additional metrics.
For a query $q$, let $rank_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{rank_q}, \tag{20}$$

where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
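Equation (20) can be sketched as follows. The function name and the sample data are hypothetical, and treating a query with no correct recommendation as contributing 0 is an assumption of this sketch.

```python
def mean_reciprocal_rank(ranked_lists, ground_truths):
    """Average of 1 / rank_q over all queries (ranks start at 1)."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, ground_truths):
        truth = set(truth)
        for rank, doc in enumerate(ranked, start=1):
            if doc in truth:
                total += 1.0 / rank   # only the first correct hit counts
                break
    return total / len(ranked_lists)

mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["x", "y", "z"]],   # hypothetical ranked lists
    [{"b"}, {"x"}],                       # hypothetical ground truth per query
)
# mrr == 0.75, i.e. (1/2 + 1/1) / 2
```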
For each citation placeholder, we search for the papers that may be referenced at this citation placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric. In its definition, $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.
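A minimal sketch of MAP under the standard information retrieval definition, which is an assumption here: the average precision of a query is the sum of the precision values at the ranks where $R(d_i) = 1$, divided by the number of relevant documents, and MAP is the mean over all queries. Function names and data are hypothetical.

```python
def average_precision(ranked, relevant):
    """Sum of precision@i over relevant positions i, divided by |relevant|."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:          # R(d_i) = 1
            hits += 1
            total += hits / i        # precision of the top-i list
    return total / len(relevant)

def mean_average_precision(ranked_lists, relevant_sets):
    aps = [average_precision(r, t) for r, t in zip(ranked_lists, relevant_sets)]
    return sum(aps) / len(aps)

map_score = mean_average_precision(
    [["a", "b", "c", "d"], ["x", "y"]],   # hypothetical ranked lists
    [{"a", "c"}, {"y"}],                  # hypothetical cited papers per placeholder
)
```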
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position $i$ is calculated as

$$\mathrm{NDCG}(d_1, \dots, d_N) = \sum_{i} \frac{2^{rel(d_i)} - 1}{\ln(i + 1)}, \tag{22}$$

where $rel(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^* \rangle$ to weigh the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average NDCG score over all testing documents.
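The gain-and-discount sum of equation (22) can be sketched as follows. Dividing by the value of the ideally ordered list is the usual normalization and is an assumption of this sketch, as are the function names and the relevance grades.

```python
import math

def dcg(relevances):
    """Sum over positions i (starting at 1) of (2^rel - 1) / ln(i + 1)."""
    return sum((2 ** rel - 1) / math.log(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the same grades sorted in decreasing order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

best = ndcg([3, 2, 0])    # grades already in ideal order
worse = ndcg([0, 2, 3])   # same grades, reversed order
```

With this normalization, a perfectly ordered list scores 1 and any other ordering of the same grades scores lower.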
4.3 Baseline Comparison. We choose the following methods for comparison:

(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.
(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.
(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a “word” (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to $n = 300$.
(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to $n = 600$. For negative sampling, we set the number of negative samples $k = 10$, where $k$ is the number of noise words in the citation context. For noise contrastive estimation, we set the number of noise samples $k = 1000$.
(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and the number of recurrent layers is 2. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. The SVD-CNN model leads the performance in most cases. More detailed analyses are given as follows.

First, we perform a comparison among CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation of our network. Due to the lack of citation context information, we find that W2V is worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of the citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.

Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for the decoder, which is not sufficient for learning good embeddings.
4.4 The Influence on the Link Prediction of Reference Pattern Interactional Features. According to the section of the article in which a citation context appears, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets are combined pairwise to form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested at a ratio of 3:1. Tables 1 and 2 show the results on the abovementioned datasets.

From the results, we obtain the following observations.

First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.

Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.

To better explore why mixed datasets are more complex than unmixed datasets, Figure 10 shows the change in $S(W)$ during the training process of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in $S(W)$ on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets but poorly on mixed datasets, whereas SVD-CNN achieves almost the same performance on the two types of datasets. This indicates that the correlation arising from various reference patterns can significantly affect link prediction.

The reason why the change in $S(W)$ is not large on the unmixed datasets is that the reference patterns of unmixed datasets have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.

Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.
4.5 Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several variants as follows:

(1) Using the originally learned $W$
(2) Replacing $W$ with $US$
(3) Replacing $W$ with $U$
(4) Replacing $W$ with $UV^T$
(5) Replacing $W$ with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix in the QR decomposition
(6) Replacing $W$ with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix $W$ after dimension reduction by PCA

After convergence of training, different orthogonal matrices are used to replace the weight matrix $W$. We define T-cost as the time cost of replacing the weight, which is equivalent to the proportion of the added time to the original time. As shown in Table 3, the other types of decorrelation degrade the performance, except for $W \to US$ and $W \to W_{PCA}$. However, the time cost of $W \to W_{PCA}$ is higher than that of $W \to US$.
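The replacement matrices in variants (2) to (5) can be sketched with NumPy as follows. The function name and the matrix shape are hypothetical, and the PCA variant (6) is omitted because its construction depends on details not given here. Each returned matrix has mutually orthogonal columns, which the usage lines verify through the off-diagonal entries of its Gram matrix.

```python
import numpy as np

def decorrelation_variants(W):
    """Candidate replacements for W built from its SVD and QR factorizations."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Q, R = np.linalg.qr(W)                 # reduced QR: W = Q R, R upper triangular
    D = np.diag(np.diag(R))                # diagonal part of R
    return {
        "US": U * s,                       # variant (2)
        "U": U,                            # variant (3)
        "UV^T": U @ Vt,                    # variant (4)
        "QD": Q @ D,                       # variant (5)
    }

def max_off_diagonal(M):
    """Largest absolute off-diagonal entry of the Gram matrix M^T M."""
    G = M.T @ M
    return np.max(np.abs(G - np.diag(np.diag(G))))

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                # hypothetical learned weight matrix
variants = decorrelation_variants(W)
residuals = {name: max_off_diagonal(M) for name, M in variants.items()}
```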
4.6 Ablation Study. In our method, there are two essential parameters: sot, which denotes the number of SOT iterations, and a biased parameter $d_0$. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing $d_0 = 300$. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the weight matrix in the FC layer is highly correlated and $S(W)$ has the lowest value. The recommendation performance then increases as sot increases, which indicates that reducing the degree of correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

Table 1: MRR metric on various datasets.

Dataset                   CNN      SVD-CNN
Introduction              0.3312   0.3995
Related                   0.3294   0.4078
Main                      0.3478   0.3989
Introduction + related    0.2773   0.3878
Introduction + main       0.2815   0.3889
Related + main            0.2978   0.3845

Table 2: MAP metric on various datasets.

Dataset                   CNN      SVD-CNN
Introduction              0.3001   0.3701
Related                   0.2909   0.3655
Main                      0.3107   0.3693
Introduction + related    0.2572   0.3498
Introduction + main       0.2601   0.3511
Related + main            0.2637   0.3539

Figure 8: Comparison of recall with different methods on CiteSeer (x-axis: number of recommended citations; y-axis: recall; methods: W2V, NPM, RBM, CP-LDA, SVD-CNN, NCN).

Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer (bar chart of MRR, MAP, and nDCG scores for the top 10 recommendations; methods: CP-LDA, RBM, W2V, NPM, NCN, SVD-CNN).
In our model, $d_0$ is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d_0$ for the same sot. When $d_0$ is small, the information content of the citation context is very small, which produces worse performance. The recommendation performance increases until $d_0$ reaches 300. It should be noted that although a larger $d_0$ is better, a larger $d_0$ will significantly increase the training time. Therefore, we choose $d_0 = 300$.

Figure 10: The change in $S(W)$ during training on unmixed datasets and mixed datasets (x-axis: sot; y-axis: $S(W)$; datasets: Introduction, Related, Main, Introduction + related, Introduction + main, Related + main).

Table 3: The comparison of related methods in Step 1.

Figure 11: The performance impact of sot on CiteSeer.
5 Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model by using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at [https://psu.app.box.com/v/refseer]. These prior datasets are cited at relevant places within the text as reference [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, “Conditional random field based sentence context identification: enhancing citation services for the research community,” in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., “Context-aware citation recommendation,” in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., “Citation recommendation without author supervision,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, “A neural probabilistic model for context based citation recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, “A neural network approach to quote recommendation in writings,” in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., “Cluscite: effective citation recommendation by information network-based clustering,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, “Neural citation network for context-aware citation recommendation,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, “Reference directed indexing: redeeming relevance for subject search in citation indexes,” Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinsk, “CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central,” 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, “Comparing citation contexts for information retrieval,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., “Content-based citation recommendation,” 2018, httpsarxivorgpdf18020830201v1pdf.
[14] H. Jia and E. Saule, “Local is good: a fast citation recommendation approach,” Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, “A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling,” in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, “Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, “Recommending citations for academic papers,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., “CiteSight: supporting contextual citation recommendation using differential search,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., “Recommending citations with translation model,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., “Recommending citations: translating papers into references,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., “Cross-language context-aware citation recommendation in scientific articles,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., “Neural photo editing with introspective adversarial networks,” in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., “Large scale GAN training for high fidelity natural image synthesis,” 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., “SVDNet for pedestrian retrieval,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, “Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process,” IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., “Orthogonal deep features decomposition for age-invariant face recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., “Training group orthogonal neural networks with privileged information,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., “Distributed representations of words and phrases and their compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, “Data mining,” Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., “Concept-based document recommendations for CiteSeer authors,” in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, “The TREC-8 question answering track report,” in Proceedings of the TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, “Utilizing context in generative Bayesian models for linked corpus,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, “A discriminative approach to topic-based citation recommendation,” in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
where radv is the worst case of perturbation on the wordvector Goodfellow et al [33] approximated this value bylinearizing the loss function logp(y|x 1113954θ) around x where1113954θ is a constant set to the current parameters of our modeland it only participates in the calculation process of radvwithout a backpropagation algorithm With the linearapproximation and L2 norm constraint the adversarialperturbation is
All-ap-feature
simi
Ci Di
Basic-feature
Figure 6 Generating the feature map
SVD-FC layer input feature
SVD-FC layer output featureSVD-FC layer
Figure 7 SVD-FC layer
Computational Intelligence and Neuroscience 5
radv minusising
g2 whereg nablaxlogp(y|x 1113954θ) (9)
0is perturbation can be easily computed by usingbackpropagation in neural networks
332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks
First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained
Second the following semiorthogonal training proce-dure is used to train the whole network
Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps
Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer
Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem
0e inputs of SVD-FC are defined as Y
Y (y1 y2 ym)T 0e outputs are defined as O
O (o1 o2 om)T 0e weight matrix is defined asW (w1 w2 wm)T 0e expected outputs are defined asA (a1 a2 am)T 0e error function is defined as
E 12
1113944
l
k1ak minus ok( 1113857
2 (10)
where ok f(1113936mj0 wkjyj) k 1 2 l 0en E with re-
spect to ok is derived and the outcome is
zE
zok
minus ak minus ok( 1113857 (11)
We utilize the gradient descent strategy to find thegradient of the error with respect to weights 0e iterativeupdate of weights is as follows
Δwkj minusηzE
zwkj
(12)
We define an error signal δok zEz netk equation (12) is
equivalent to
Δwkj minusηzE
z netk
z netkzwkj
minusηδok
z netkzwkj
(13)
According to equation (11) δok zEz netk is equivalent
to
δok minus
zE
zok
zok
z netk minus
zE
zok
fprime netk( 1113857
zE
zok
okprime minus dk minus ok( 1113857ok
prime
(14)
We use the sigmoid f(x) 1(1 + ex) as the nonlinearfunction so equation (13) is equivalent to
Δwkj minusηδokyj η dk minus ok( 1113857ok 1 minus ok( 1113857yj (15)
In Step 1 the weight matrix W is decomposed by SVDand replaced with US U (q1 q2 qm)T andS diag(λ1 λ2 λm) Since dk minus ok is given we definethat Loss dk minus ok As a result equation (15) is equivalent to
Δwkj η Loss middot ok minus sigmoid yj 1113944 qiλi + B1113872 11138732
1113876 1113877yj (16)
qi middot qj 0 ine j are in the left-unitary matrix U so themodel operation is not affected by the nonorthogonal ei-genvectors qi 0is is the reason for excessively punishingsynonymous sentences in Step 1 However orthogonalityhas a positive effect on Δwkj in Step 2
The purpose of SVD is to maintain the orthogonality of each weight vector in geometric space. When weight vectors are conditioned by orthogonal regularization, the relevancy between weight vectors decreases. We use the following method in Step 3 to measure relevance:
$$H = W^T W = \begin{bmatrix} \vec{w}_1^T \vec{w}_1 & \cdots & \vec{w}_1^T \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^T \vec{w}_1 & \cdots & \vec{w}_k^T \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix}, \tag{17}$$
where $W$ is a weight matrix that contains $k$ weight vectors $w_i$ ($i = 1, \ldots, k$), and $h_{ij}$ ($i, j = 1, \ldots, k$) is the dot product of $w_i$ and $w_j$. Let us define $S(W)$ as the correlation measurement of all column vectors in $W$:
$$S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} |h_{ij}|}. \tag{18}$$
When $W$ is an orthogonal matrix, the value of $S(W)$ is 1. When all weight vectors are identical, so that $h_{ij} = h_{ii}$ for every $i \neq j$, $S(W)$ obtains the minimum value $1/k$. Therefore, the value of $S(W)$ falls into $[1/k, 1]$. As a result, when $S(W)$ is close to $1/k$ or 0, the weight matrix will have high relevance, that is, its weight vectors are strongly correlated.
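The two boundary cases of $S(W)$ are easy to verify by direct computation. The sketch below is an illustration only; the function name `correlation_measure` and the toy vectors are assumptions, and the weight vectors are passed as a plain list of columns rather than as a matrix object.

```python
def correlation_measure(columns):
    """S(W) from equation (18); `columns` is a list of weight vectors w_1..w_k."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    k = len(columns)
    # H = W^T W, so h_ij is the dot product of w_i and w_j (equation (17)).
    h = [[dot(columns[i], columns[j]) for j in range(k)] for i in range(k)]
    diagonal = sum(h[i][i] for i in range(k))
    total = sum(abs(h[i][j]) for i in range(k) for j in range(k))
    return diagonal / total

# Mutually orthogonal weight vectors: every off-diagonal h_ij is 0.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Three identical weight vectors: every h_ij equals h_ii.
identical = [[1.0, 2.0, 2.0]] * 3

print(correlation_measure(orthogonal))  # upper bound of the range
print(correlation_measure(identical))   # lower bound 1/k with k = 3
```

Any other set of weight vectors gives a value strictly between these two bounds, which is what makes $S(W)$ usable as a progress indicator during training.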
3.4 Complexity Analysis. Assume that the training sample size is $|C|$, the average number of words in each citation context is $|c|$, $C_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$. The training complexity of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. With the Jacobi method, which was improved by C. F. Van Loan [12], computing the eigenvalues for the SVD of a matrix of size $K$ takes $O(K)$. Assume that the size of the weight matrix in the SVD-FC layer is $K$ and the number of channels of the input matrix is $C_{in}$. The computational cost for the SVD-FC layer is $O(2K^2 \cdot C_{in} + K)$.
4 Experiment
4.1 Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and the abstracts of cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural styles of named entities, no stemming is done, but all words are converted to lowercase. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.
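The context extraction described above can be sketched as follows. This is an illustrative reading of the preprocessing, not the authors' pipeline: the function name `citation_context`, the whitespace tokenization, and the tiny stop-word set are all assumptions.

```python
# Illustrative stop-word list only; the list used for the dataset is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def citation_context(tokens, placeholder_index, window=50):
    """Return the cleaned words around a citation placeholder.

    tokens:            the words of the citing paper, in order
    placeholder_index: position of the citation placeholder in `tokens`
    window:            number of words kept on each side (50 in the paper)
    """
    before = tokens[max(0, placeholder_index - window):placeholder_index]
    after = tokens[placeholder_index + 1:placeholder_index + 1 + window]
    # Lowercase and drop stop words; no stemming, so tense and plural forms survive.
    return [w.lower() for w in before + after if w.lower() not in STOP_WORDS]

tokens = "The models proposed in [] improved Recall".split()
print(citation_context(tokens, placeholder_index=4, window=2))
```

The small `window` in the usage line is only there to keep the example readable; the paper's setting corresponds to the default of 50.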
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).
4.2 Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$ and the correct recommendations are $R_g \cap R_r$. Recall is defined as
$$\mathrm{recall} = \frac{|R_g \cap R_r|}{|R_g|}. \tag{19}$$
In our experiments, the number of recommended citations ranges from 1 to 10. Recall evaluation does not reveal the order of recommended references. To address this problem, we select the following two additional metrics.
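As a concrete reading of equation (19), the sketch below computes recall when $R_r$ is taken to be the top-$k$ entries of a ranked list. The function name `recall_at_k` and the paper identifiers are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Equation (19) with R_r taken as the top-k recommended citations."""
    recommended = set(ranked[:k])
    ground_truth = set(relevant)
    return len(recommended & ground_truth) / len(ground_truth)

ranked = ["p3", "p1", "p7", "p2"]   # hypothetical ranked recommendations
relevant = {"p1", "p2"}             # hypothetical ground-truth references R_g
for k in (1, 2, 4):
    print(k, recall_at_k(ranked, relevant, k))
```

Because the sets ignore order, swapping the first two entries of `ranked` leaves recall at $k = 2$ unchanged, which illustrates why rank-sensitive metrics are also needed.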
For a query $q$, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}, \tag{20}$$
where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
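Equation (20) translates directly into code. In the sketch below, a query whose list contains no correct recommendation contributes 0 to the sum; the text does not state how such queries are handled, so this is an assumption, as are the function name and the toy data.

```python
def mean_reciprocal_rank(rankings, ground_truths):
    """MRR over a set of queries (equation (20)).

    rankings:      one ranked list of recommended papers per query
    ground_truths: one set of correct papers per query
    """
    total = 0.0
    for ranked, relevant in zip(rankings, ground_truths):
        for position, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / position
                break  # only the first correct recommendation counts
        # Assumption: a query with no correct recommendation adds nothing.
    return total / len(rankings)

rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n"]]
ground_truths = [{"b"}, {"x"}, {"q"}]
print(mean_reciprocal_rank(rankings, ground_truths))  # (1/2 + 1 + 0) / 3
```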
For each citation placeholder, we search the papers that may be referenced at this citation placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric,
where $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.
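For reference, the sketch below implements the standard textbook form of MAP, in which average precision is the mean of the precision values at the positions of relevant documents. The formula printed as equation (21) is not reproduced in this text, so the code may differ in detail from the authors' definition; the function names are assumptions.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list (standard definition)."""
    hits = 0
    precision_sum = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:              # R(d_i) = 1
            hits += 1
            precision_sum += hits / i    # precision at position i
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, ground_truths):
    """Mean of the per-query average precision values."""
    scores = [average_precision(r, g) for r, g in zip(rankings, ground_truths)]
    return sum(scores) / len(scores)

rankings = [["a", "b", "c", "d"], ["x", "y"]]
ground_truths = [{"a", "c"}, {"y"}]
print(mean_average_precision(rankings, ground_truths))
```

In the usage example, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the printed value is their mean.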
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position $i$ is calculated as
$$\mathrm{NDCG}(d_1, \ldots, d_N) = \sum_{i} \frac{2^{\mathrm{rel}(d_i)} - 1}{\ln(i + 1)}, \tag{22}$$
where $\mathrm{rel}(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^* \rangle$ to weigh the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average NDCG score over all testing documents.
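The gain-and-discount sum in equation (22) can be sketched as below. The division by the score of the ideal ordering is the usual normalization step and is an assumption here, since the equation as printed shows only the sum; the function names and the relevance grades in the example are hypothetical.

```python
import math

def dcg(relevances):
    """Sum of (2^rel - 1) / ln(i + 1) over positions i = 1..N, as in equation (22)."""
    return sum((2 ** rel - 1) / math.log(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering.

    The normalization is assumed; it maps a perfect ranking to 1.
    """
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance grades on a 4-point scale (0..3), listed in ranked order.
print(ndcg([3, 2, 1, 0]))  # already in the ideal order
print(ndcg([0, 1, 3]))     # most relevant document ranked last
```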
4.3 Baseline Comparison. We choose the following methods for comparison:
(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.
(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.
(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to $n = 300$.
(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to $n = 600$. For negative sampling, we set the number of negative samples $k = 10$, where $k$ is the number of noise words in the citation context. For noise-contrastive estimation, we set the number of noise samples $k = 1000$.
(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and there are 2 recurrent layers. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. It is obvious that the SVD-FC model leads the performance in most cases. More detailed analyses are given as follows.
First, we perform a comparison among CP-LDA, RBM, W2V, and SVD-CNN. Our SVD-CNN completely and significantly exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation of our network. Due to the lack of citation context information, we find that W2V is obviously worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on citation context. However, the vector representations of citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the citation document and citation context and thus may be limited by vocabulary.
Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representation of words and documents relies solely on deep learning without restraint. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for a decoder, which is apparently not sufficient for learning good embedding.
4.4 The Influence on the Link Prediction of Reference Pattern Interactional Features. According to the chapter positions of the citation context in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested in a ratio of 3:1. In Tables 1 and 2, we show the results on the abovementioned datasets.
From the results, we obtain the following observations.
First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of citation recommendation tasks.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
To better explore why mixed datasets are more complex than unmixed datasets, in Figure 10 we show the change in $S(W)$ during the training process of SVD-CNN among various datasets.
As shown in Figure 10, the increase in $S(W)$ on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets while performing poorly on mixed datasets. However, SVD-CNN achieves almost the same performance on the two types of datasets. This shows that the correlation from various reference patterns can significantly affect the link prediction.
The reason why the change in $S(W)$ is not large on the unmixed datasets is that the reference patterns of unmixed datasets have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.
Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns, and our approach is more suitable for complex scenarios.
4.5 Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several varieties as follows:
(1) Using the originally learned $W$
(2) Replacing $W$ with $US$
(3) Replacing $W$ with $U$
(4) Replacing $W$ with $UV^T$
(5) Replacing $W$ with $QD$, where $D$ is the diagonal matrix extracted from the upper triangle matrix in Q-R decomposition
(6) Replacing $W$ with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix $W$ after the processing of dimension reduction by PCA
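Variants (1) to (5) in the list above can be built with standard linear algebra routines and compared through the correlation measure $S(W)$. The sketch below uses NumPy on a random toy matrix; it is an illustration under assumed shapes and names, not the experimental code, and variant (6) is omitted because the PCA step is not specified in enough detail here.

```python
import numpy as np

def correlation_measure(W):
    """S(W): trace of H over the sum of |h_ij|, with H = W^T W."""
    H = W.T @ W
    return np.trace(H) / np.abs(H).sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # toy weight matrix with 4 weight vectors as columns

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(s) V^T
Q, R = np.linalg.qr(W)                             # W = Q R, R upper triangular
D = np.diag(np.diag(R))                            # diagonal part of R

variants = {
    "W": W,
    "US": U @ np.diag(s),
    "U": U,
    "UV^T": U @ Vt,
    "QD": Q @ D,
}
scores = {name: correlation_measure(M) for name, M in variants.items()}
for name, score in scores.items():
    print(f"{name:5s} S = {score:.4f}")

# US keeps the Gram matrix of the rows: (US)(US)^T equals W W^T.
print(np.allclose(W @ W.T, variants["US"] @ variants["US"].T))
```

All four replacements have mutually orthogonal columns, so they reach the same value of $S$; what distinguishes $US$ in this sketch is that it also preserves $W W^T$, which is one plausible reason it retains discriminating ability while the others do not.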
After convergence of training, different orthogonal matrices are used to replace the weight matrix $W$. We define T-cost as the time cost of replacing the weight, which is equivalent to the proportion of the added time to the original time. As shown in Table 3, all other types of decorrelation degrade the performance, except for $W \to US$ and $W \to W_{PCA}$. However, the time cost of $W \to W_{PCA}$ is more than that of $W \to US$.
4.6 Ablation Study. In our method, there are two essential parameters: a term sot, which means the number of SOT iterations, and a biased parameter $d_0$. In this section, we conduct an ablation study of these parameters.
We first evaluate the effectiveness of sot by empirically fixing $d_0 = 300$. Since sot defines the loop time of orthogonal constraint training, it should be set as a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the
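Table 1: MRR metric on various datasets.

| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
|---|---|---|---|---|---|---|
| CNN | 0.3312 | 0.3294 | 0.3478 | 0.2773 | 0.2815 | 0.2978 |
| SVD-CNN | 0.3995 | 0.4078 | 0.3989 | 0.3878 | 0.3889 | 0.3845 |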
Figure 8: Comparison of recall with different methods on CiteSeer. [Plot: recall versus number of recommended citations; curves for W2V, NPM, RBM, CP-LDA, SVD-CNN, and NCN.]
Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer. [Bar chart titled "MRR, MAP, and nDCG scores for top 10 recommendations"; bars for CP-LDA, RBM, W2V, NPM, NCN, and SVD-CNN.]
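Table 2: MAP metric on various datasets.

| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
|---|---|---|---|---|---|---|
| CNN | 0.3001 | 0.2909 | 0.3107 | 0.2572 | 0.2601 | 0.2637 |
| SVD-CNN | 0.3701 | 0.3655 | 0.3693 | 0.3498 | 0.3511 | 0.3539 |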
weight matrix in the FC layer is highly correlated, and $S(W)$ has the lowest value. The recommendation performance then increases as sot is increased, which indicates that reducing the correlative degree of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.
In our model, $d_0$ is the dimension of citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d_0$ on the same sot. When $d_0$ is small, the information content of the citation context is very small and produces worse performance. The recommendation performance increases to a maximum point until $d_0$ reaches 300. It should be noted
Figure 10: The change in S(W) during training on unmixed datasets and mixed datasets. [Plot: S(W) versus sot; curves for Introduction, Related, Main, Introduction + related, Introduction + main, and Related + main.]
Table 3: The comparison of related methods in Step 1.
Figure 11: The performance impact of sot on CiteSeer.
that although the larger $d_0$ is better, the larger $d_0$ will significantly increase the training time. Therefore, we choose $d_0 = 300$.
5 Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which can essentially make each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model by using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at [https://psu.app.box.com/v/refseer]. These prior datasets are cited at relevant places within the text as references [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer, et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei, et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu, et al., "Cluscite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinsk, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central," 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power, et al., "Content-based citation recommendation," 2018, httpsarxivorgpdf18020830201v1pdf.
[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan, et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan, et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria, et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang, et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, UK, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie, et al., "Neural photo editing with introspective adversarial networks," in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan, et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng, et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng, et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng, et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen, et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju, et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, "The trec-8 question answering track report," in Proceedings of the TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
12 Computational Intelligence and Neuroscience
radv minusising
g2 whereg nablaxlogp(y|x 1113954θ) (9)
0is perturbation can be easily computed by usingbackpropagation in neural networks
332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks
First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained
Second the following semiorthogonal training proce-dure is used to train the whole network
Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps
Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer
Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem
0e inputs of SVD-FC are defined as Y
Y (y1 y2 ym)T 0e outputs are defined as O
O (o1 o2 om)T 0e weight matrix is defined asW (w1 w2 wm)T 0e expected outputs are defined asA (a1 a2 am)T 0e error function is defined as
E 12
1113944
l
k1ak minus ok( 1113857
2 (10)
where ok f(1113936mj0 wkjyj) k 1 2 l 0en E with re-
spect to ok is derived and the outcome is
zE
zok
minus ak minus ok( 1113857 (11)
We utilize the gradient descent strategy to find thegradient of the error with respect to weights 0e iterativeupdate of weights is as follows
Δwkj minusηzE
zwkj
(12)
We define an error signal δok zEz netk equation (12) is
equivalent to
Δwkj minusηzE
z netk
z netkzwkj
minusηδok
z netkzwkj
(13)
According to equation (11) δok zEz netk is equivalent
to
δok minus
zE
zok
zok
z netk minus
zE
zok
fprime netk( 1113857
zE
zok
okprime minus dk minus ok( 1113857ok
prime
(14)
We use the sigmoid f(x) 1(1 + ex) as the nonlinearfunction so equation (13) is equivalent to
Δwkj minusηδokyj η dk minus ok( 1113857ok 1 minus ok( 1113857yj (15)
In Step 1 the weight matrix W is decomposed by SVDand replaced with US U (q1 q2 qm)T andS diag(λ1 λ2 λm) Since dk minus ok is given we definethat Loss dk minus ok As a result equation (15) is equivalent to
Δwkj η Loss middot ok minus sigmoid yj 1113944 qiλi + B1113872 11138732
1113876 1113877yj (16)
qi middot qj 0 ine j are in the left-unitary matrix U so themodel operation is not affected by the nonorthogonal ei-genvectors qi 0is is the reason for excessively punishingsynonymous sentences in Step 1 However orthogonalityhas a positive effect on Δwkj in Step 2
0e purpose of SVD is to maintain the orthogonality ofeach weight vector in geometric space When weight vectorsare conditioned by orthogonal regularization the relevancybetween weight vectors decreases We use the followingmethods in Step 3 to measure relevance
H WTW
w1rarrT
w1rarr
middot middot middot w1rarrT
wkrarr
⋮ ⋱ ⋮
wkrarrT
w1rarr
middot middot middot wkrarrT
wkrarr
⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
h11 middot middot middot h1k
⋮ ⋱ ⋮
hk1 middot middot middot hkk
⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
(17)
where W is a weight matrix that contains k weight vectorswi (i 1 k)hij (i j 1 k) is the dot product of wi
and wj Let us define S(W) as the correlation measurementof all column vectors in W
S(W) 1113936
ki1 hii
1113936ki1 1113936
kj1 hij
11138681113868111386811138681113868
11138681113868111386811138681113868 (18)
When W is an orthogonal matrix the value of S(W) is 1When ine j S(W) obtains the minimum value (1k)0erefore we can see that the value of S(W) falls into
6 Computational Intelligence and Neuroscience
[(1k) 1] As a result when S(W) is close to 1k or 0 theweight matrix will have high relevance
34 Complexity Analysis Assume that the training samplesize is |C| the average number of words in each citationcontext is |c| Cl is the number of kernels in the l-th layer andwis the size of the sliding window For one convolution layerthe training complexity is O(Clminus1 middot Cl middot w middot (s minus w + 1)) 0etraining complexity of one w-ap layer is O(C2
l middot w middot s) 0etraining complexity of one all-ap layer isO(C2
l middot (s minus w + 1))which was improved by C F Van Loan [12] computing theeigenvalue for SVD matrix decomposition with K size takesO(K) on the way of JACOBI Assume that the size of theweight matrix in the SVD-FC layer isK and the channel ofthe input matrix is Cin 0e computational cost for the SVD-FC layer is O(2K2 middot Cin + K)
4 Experiment
41 Dataset We use the CiteSeer dataset [34] to evaluatethe performance of our model 0e dataset was publishedby Huang et al [4] In this dataset citation relationshipsare extracted by a pair of citation contexts and the ab-stracts of cited papers A citation context includes thesentence where the citation placeholder appears and thesentences before and after the citation placeholderWithin each paper in the corpus the 50 words before and50 words after each citation reference are treated as thecorresponding citation context (a discussion on thenumber of words can be found in [7]) Before wordembedding we also remove stop words from the contextsTo preserve the time-sensitive pastpresentfuture tensesof verbs and the singularplural styles of named entitiesno stemming is done but all words are transferred tolower-case 0e training set contains 3989547 pairs ofreference contexts and citations and the test set contains1021685 citation relations
Following common practice in information retrieval(IR) we employ the following four evaluation metrics toevaluate recommendation results recall mean reciprocalrank (MRR) mean average precision (MAP) and normal-ized discounted cumulative gain (nDCG)
42 EvaluationMetric For each query in the test set we usethe original set of references as the ground truth Rg Assumethat the set of recommended citations is Rr and the correctrecommendations are Rg capRr Recall is defined as
recall Rg capRr
11138681113868111386811138681113868
11138681113868111386811138681113868
Rg
(19)
In our experiments the number of recommended ci-tations ranges from 1 to 10 Recall evaluation does not revealthe order of recommended references To address thisproblem we select the following two additional metrics
For a query q let rankq be the rank of the first correctrecommendation within the list MRR [35] is defined as
MRR 1
|Q|1113944qisinQ
1rankq
(20)
where Q is the testing set MRR reveals the average rankingof the first correct recommendation
For each citation placeholder we search the papers thatmay be referenced at this citation placeholder Each retrievalmodel returns a ranked list of papers Since there may be oneor more references for one citation context we use meanaverage precision (MAP) as the evaluation metric
where R(di) is a binary function indicating whether doc-ument di is relevant or not For our problem the papers citedat the citation placeholder are considered relevantdocuments
We use normalized discounted cumulative gain (NDCG)to measure the ranked recommendation list 0e NDCGvalue of a ranking list at position i is calculated as
NDCG d1 dN( 1113857 1113944i
2rel di( ) minus 1lni+1 (22)
where rel (di) is the 4-scale relevance of document di in theranked list We use the average cocited probability [2] oflangdi dlowastrang to weigh the citation relevance score of di to dlowast(anoriginal citation of the query) We report the average NDCGscore over all testing documents
43 BaselineComparison We choose the following methodsfor comparison
Cite-PLSA-LDA (CP-LDA) [36] we use the originalimplementation provided by the author 0e number oftopics is set to 60
(i) Restricted Boltzmann Machine (RBM-CS) [37] Wetrain two layers of RBM-CS according to the sug-gestion of the author We set the hidden layer size to600
(ii) Word2vec Model (W2V) [29] We use the word2vecmodel to learn words and document representa-tions 0e cited document is treated as a ldquowordrdquo (adocument uses a unique marker when it is cited bydifferent papers) 0e dimensions of the word anddocument vectors are set to n 300
(iii) Neural Probabilistic Model (NPM) [4] We followthe original implementation 0e dimensions of theword and document representation vector are set ton 600 For negative sampling we set the numberof negative samples k 10 where k is the number ofnoise words in the citation context For noisecontrast estimation we set the number of noisesamples k 1000
(iv) Neural Citation Network (NCN) [7] In NCN thegradient clipping is 5 the dropout probability is 02and the recurrent layers are 2 0e region sizes for
Computational Intelligence and Neuroscience 7
the encoder are set to 4 4 and 5 and the region sizesfor the author network are set to 1 and 2
Figures 8 and 9 show the performance of eachmethod onthe CiteSeer dataset It is obvious that the SVD-FC modelleads the performance in most cases More detailed analysesare given as follows
First we perform a comparison among CP-LDA RBMW2V and SVD-CNN Our SVD-CNN completely andsignificantly exceeds other models in all metrics 0e successof ourmodel is ascribed to the content and correlation of ournetwork Due to the lack of citation context information wefind that W2V is obviously worse than other methods interms of all metrics CP-LDA works much better than W2Vwhich indicates that link information is very important forfinding relevant papers RBM-CS shows a clear performancegain over W2V because RBM-CS automatically discoverstopical aspects of each paper based on citation contextHowever the vector representations of citation context inRBM-CS are extracted by traditional word vector repre-sentations which fully neglect semantic relations betweenthe citation document and citation context and thus may belimited by vocabulary
Second, we compare the performance of NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for its decoder, which is apparently not sufficient for learning good embeddings.
4.4. The Influence on the Link Prediction of Reference Pattern Interactional Features. According to the section in which each citation context appears in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets are combined pairwise to form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested in a ratio of 3:1. Tables 1 and 2 show the results on the abovementioned datasets.
From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of citation recommendation tasks.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
To better explore why mixed datasets are more complex than unmixed datasets, Figure 10 shows the change in S(W) during the training process of SVD-CNN on the various datasets.
As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets but poorly on mixed datasets. In contrast, SVD-CNN achieves almost the same performance on the two types of datasets. This indicates that the correlation arising from various reference patterns can significantly affect link prediction.
The change in S(W) is not large on the unmixed datasets because the reference patterns of an unmixed dataset have similar features that belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. Nevertheless, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.
Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.
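As a rough illustration of how a correlation measure such as S(W) behaves, the sketch below assumes the SVDNet-style definition [25]: the sum of the diagonal entries of the Gram matrix G = W^T W divided by the sum of the absolute values of all its entries. This definition and the function name `correlation_score` are assumptions made for illustration and are not taken from this section. Under this assumption, the score is 1 when the k weight vectors are orthogonal and drops to 1/k when they all coincide.

```python
import numpy as np


def correlation_score(W: np.ndarray) -> float:
    """Assumed S(W): trace(G) / sum(|G|), with G = W^T W.

    Columns of W are treated as the weight vectors.
    """
    G = W.T @ W
    return float(np.trace(G) / np.abs(G).sum())


k = 4
orthogonal = np.eye(k)        # k mutually orthogonal weight vectors
identical = np.ones((k, k))   # k identical (fully correlated) weight vectors

print(correlation_score(orthogonal))  # 1.0
print(correlation_score(identical))   # 0.25, i.e. 1/k
```

A larger score therefore corresponds to a less correlated weight matrix, which matches the reading of Figure 10 given above.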
4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several variants as follows:
(1) Using the originally learned W
(2) Replacing W with US
(3) Replacing W with U
(4) Replacing W with UV^T
(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper triangular matrix in the QR decomposition
(6) Replacing W with W_PCA, where W_PCA is the diagonal matrix extracted from the weight matrix W after dimension reduction by PCA
After training converges, the different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, expressed as the proportion of the added time to the original time. As shown in Table 3, all types of decorrelation except W ⟶ US and W ⟶ W_PCA degrade the performance. However, the time cost of W ⟶ W_PCA is higher than that of W ⟶ US.
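The replacements in variants (2) to (5) can be sketched numerically as follows. This is a minimal illustration on a random toy matrix, not the authors' implementation; the matrix shape, the seed, and the helper name `gram` are assumptions. It checks one property that may help explain why W ⟶ US behaves differently from the other variants: since W = USV^T with V orthogonal, W W^T equals (US)(US)^T, while the columns of US are mutually orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # toy weight matrix (hypothetical shape)

# Thin SVD: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)
S = np.diag(s)

W_us = U @ S                      # variant (2): W -> US
W_u = U                           # variant (3): W -> U
W_uvt = U @ Vt                    # variant (4): W -> UV^T
Q, R = np.linalg.qr(W)
W_qd = Q @ np.diag(np.diag(R))    # variant (5): W -> QD


def gram(M: np.ndarray) -> np.ndarray:
    """Return M @ M.T, which determines inner products of projected inputs."""
    return M @ M.T


# US keeps W @ W.T unchanged, so inner products x^T W W^T y are preserved.
print(np.allclose(gram(W), gram(W_us)))

# The columns of US are orthogonal: (US)^T (US) is diagonal.
cols = W_us.T @ W_us
print(np.allclose(cols, np.diag(np.diag(cols))))

# Dropping the singular values (variant 3) changes W @ W.T in general.
print(np.allclose(gram(W), gram(W_u)))
```

Variants (3) to (5) also yield orthogonal columns, but they rescale or rotate the projection, so distances computed from the projected features are generally not the same as with the learned W.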
4.6. Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations, and a biased parameter d0. In this section, we conduct an ablation study of these parameters.
We first evaluate the effectiveness of sot by empirically fixing d0 = 300. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the
Table 1: MRR metric on various datasets.
Model     Introduction  Related  Main    Introduction + related  Introduction + main  Related + main
CNN       0.3312        0.3294   0.3478  0.2773                  0.2815               0.2978
SVD-CNN   0.3995        0.4078   0.3989  0.3878                  0.3889               0.3845
Figure 8: Comparison of recall with different methods on CiteSeer (x-axis: number of recommended citations, 20 to 100; y-axis: recall, 0.00 to 0.60; methods: W2V, NPM, RBM, CP-LDA, SVD-CNN, and NCN).
Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer (scores for top 10 recommendations; y-axis: 0 to 0.4; methods: CP-LDA, RBM, W2V, NPM, NCN, and SVD-CNN).
Table 2: MAP metric on various datasets.
Model     Introduction  Related  Main    Introduction + related  Introduction + main  Related + main
CNN       0.3001        0.2909   0.3107  0.2572                  0.2601               0.2637
SVD-CNN   0.3701        0.3655   0.3693  0.3498                  0.3511               0.3539
weight matrix in the FC layer is highly correlated, and S(W) has the lowest value. The recommendation performance then increases as sot increases, which indicates that reducing the degree of correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.
In our model, d0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d0 for the same sot. When d0 is small, the representation carries very little information about the citation context, which produces worse performance. The recommendation performance increases until d0 reaches 300, where it attains its maximum. It should be noted
Figure 10: The change in S(W) during training on unmixed datasets and mixed datasets (x-axis: sot, 0 to 11; y-axis: S(W), 0 to 0.5; datasets: Introduction, Related, Main, Introduction + related, Introduction + main, and Related + main).
Table 3: The comparison of related methods in Step 1.
Figure 11: The performance impact of sot on CiteSeer.
that, although a larger d0 is better, a larger d0 significantly increases the training time. Therefore, we choose d0 = 300.
5 Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model when using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at relevant places within the text as reference [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., "Cluscite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinsk, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central," 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, httpsarxivorgpdf18020830201v1pdf.
[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, Vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, "The trec-8 question answering track report," in Proceedings of the TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
[(1/k), 1]. As a result, when S(W) is close to 1/k or 0, the weight matrix will have high relevance.
3.4. Complexity Analysis. Assume that the training sample size is $|C|$, the average number of words in each citation context is $|c|$, $C_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$. The training complexity of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. With the Jacobi method, which was improved by C. F. Van Loan [12], computing the eigenvalues for the SVD of a matrix of size $K$ takes $O(K)$. Assume that the size of the weight matrix in the SVD-FC layer is $K$ and the number of channels of the input matrix is $C_{in}$. The computational cost for the SVD-FC layer is $O(2K^2 \cdot C_{in} + K)$.
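To get a feeling for the relative size of these terms, the expressions inside the O(·) bounds can be evaluated directly. The sketch below does only that; the function names and the example values (64 kernels, window size 4, s = 100, K = 300) are hypothetical choices for illustration, and constant factors hidden by the O-notation are ignored.

```python
def conv_cost(c_prev: int, c_l: int, w: int, s: int) -> int:
    """Leading term for one convolution layer: C_{l-1} * C_l * w * (s - w + 1)."""
    return c_prev * c_l * w * (s - w + 1)


def w_ap_cost(c_l: int, w: int, s: int) -> int:
    """Leading term for one w-ap layer: C_l^2 * w * s."""
    return c_l ** 2 * w * s


def all_ap_cost(c_l: int, w: int, s: int) -> int:
    """Leading term for one all-ap layer: C_l^2 * (s - w + 1)."""
    return c_l ** 2 * (s - w + 1)


def svd_fc_cost(k: int, c_in: int) -> int:
    """Leading term for the SVD-FC layer: 2 * K^2 * C_in + K."""
    return 2 * k ** 2 * c_in + k


# Hypothetical configuration, for illustration only.
print(conv_cost(c_prev=1, c_l=64, w=4, s=100))   # 24832
print(w_ap_cost(c_l=64, w=4, s=100))             # 1638400
print(all_ap_cost(c_l=64, w=4, s=100))           # 397312
print(svd_fc_cost(k=300, c_in=64))               # 11520300
```

With these example values the quadratic term in K dominates, which suggests that the size of the SVD-FC weight matrix is the main lever on the added cost.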
4 Experiment
4.1. Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, each citation relationship is represented as a pair consisting of a citation context and the abstract of the cited paper. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural forms of named entities, no stemming is done, but all words are converted to lowercase. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.
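The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the pipeline used to build the dataset: the whitespace tokenization, the `[]` placeholder, the tiny stop-word list, and the function name `citation_contexts` are all assumptions.

```python
import re

# Toy stop-word list (assumption); a real pipeline would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def citation_contexts(text: str, placeholder: str = "[]", window: int = 50):
    """Return one token list per citation placeholder found in `text`.

    Takes up to `window` words before and after each placeholder,
    lowercases them, strips punctuation, and removes stop words.
    No stemming is applied, so tense and plural forms are preserved.
    """
    tokens = text.split()
    contexts = []
    for i, token in enumerate(tokens):
        if placeholder in token:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            words = [re.sub(r"[^\w-]", "", w.lower()) for w in before + after]
            contexts.append([w for w in words if w and w not in STOP_WORDS])
    return contexts


sample = "Neural models for The citation recommendation [] have improved Results"
print(citation_contexts(sample, window=3))
# [['citation', 'recommendation', 'have', 'improved', 'results']]
```

Other placeholders that fall inside a window are reduced to empty strings by the punctuation filter and dropped, so each context contains only words.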
Following common practice in information retrieval(IR) we employ the following four evaluation metrics toevaluate recommendation results recall mean reciprocalrank (MRR) mean average precision (MAP) and normal-ized discounted cumulative gain (nDCG)
4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$, so that the correct recommendations are $R_g \cap R_r$. Recall is defined as
$$\mathrm{recall} = \frac{|R_g \cap R_r|}{|R_g|}. \tag{19}$$
In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references. To address this problem, we select the following additional metrics.
For a query $q$, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}, \tag{20}$$
where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
For each citation placeholder, we search for the papers that may be referenced at this citation placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric,
where $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position $i$ is calculated as
$$\mathrm{NDCG}(d_1, \ldots, d_N) = \sum_{i} \frac{2^{rel(d_i)} - 1}{\ln(i + 1)}, \tag{22}$$
where $rel(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^{*} \rangle$ to weigh the citation relevance score of $d_i$ to $d^{*}$ (an original citation of the query). We report the average NDCG score over all testing documents.
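The four metrics can be sketched in a few lines of code. The sketch below follows the common textbook forms and is not the evaluation script used in the experiments; in particular, dividing average precision by the number of relevant documents and normalizing the discounted gain by that of the ideal ordering are assumptions, as are the function names. MRR and MAP are then the means of `reciprocal_rank` and `average_precision` over all queries.

```python
import math


def recall(recommended, relevant):
    """|R_g ∩ R_r| / |R_g| for one query."""
    return len(set(recommended) & set(relevant)) / len(relevant)


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first correct recommendation, or 0 if there is none."""
    for rank, doc in enumerate(recommended, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(recommended, relevant):
    """Mean of precision@rank over the ranks that hold a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(recommended, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def dcg(gains):
    """Sum of (2^rel - 1) / ln(i + 1) over positions i = 1..N."""
    return sum((2 ** g - 1) / math.log(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering (assumption)."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


recommended = ["p3", "p1", "p7", "p2"]   # hypothetical ranked list
relevant = {"p1", "p2"}                  # hypothetical ground truth
print(recall(recommended, relevant))             # 1.0
print(reciprocal_rank(recommended, relevant))    # 0.5
print(average_precision(recommended, relevant))  # 0.5
```

In the example, the first correct paper appears at rank 2, so the reciprocal rank is 1/2, and the precision values at the two hits (1/2 and 2/4) average to 0.5.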
4.3. Baseline Comparison. We choose the following methods for comparison.
Cite-PLSA-LDA (CP-LDA) [36]: We use the original implementation provided by the author. The number of topics is set to 60.
[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014
[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017
[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014
[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096
[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017
[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via
implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018
[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018
[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017
[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013
[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014
[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011
[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016
[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014
[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008
[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000
[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010
[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009
12 Computational Intelligence and Neuroscience
the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. The SVD-CNN model leads in most cases. More detailed analyses are given as follows.
First, we compare CP-LDA, RBM, W2V, and SVD-CNN. Our SVD-CNN significantly exceeds the other models in all metrics. We ascribe the success of our model to the content and correlation of our network. Due to the lack of citation context information, W2V is clearly worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of the citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.
Second, we compare the performance of NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN perform worse than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only uses the title of the cited paper for its decoder, which is apparently not sufficient for learning good embeddings.
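The comparison above relies on recall, MRR, MAP, and nDCG. As a reminder of what these numbers measure, the following is a minimal sketch of the per-query quantities, assuming binary relevance (a paper is either a correct citation or not). The function names and the toy ranking are hypothetical and are not taken from the authors' evaluation code; MRR and MAP are the means of the reciprocal rank and average precision over all citation contexts.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant papers that appear in the top-k of the ranking."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant paper, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks at which a relevant paper occurs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal

# Hypothetical ranking for one citation context (paper ids are made up).
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p4"}

print(recall_at_k(ranked, relevant, 3))      # one of two relevant papers in the top 3
print(reciprocal_rank(ranked, relevant))     # first relevant paper is at rank 2
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, relevant, 5))
```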
4.4 The Influence on the Link Prediction of Reference Pattern Interactional Features
According to the section of the article in which the citation context appears, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets are combined pairwise to form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are tested in a ratio of 3:1. Tables 1 and 2 show the results on the abovementioned datasets.
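The dataset construction described above can be sketched as follows. This is only an illustration under stated assumptions: each example is taken to be a hypothetical (context, cited_id, section) triple, a mixed dataset is assumed to be the concatenation of two section-specific datasets, and the 3:1 ratio is read as a 75%/25% split. None of these details are confirmed by the text beyond the section names and the ratio.

```python
import random
from itertools import combinations

def build_datasets(pairs):
    """Group (context, cited_id, section) triples by section, then merge sections pairwise."""
    unmixed = {}
    for context, cited_id, section in pairs:
        unmixed.setdefault(section, []).append((context, cited_id))
    # Assumption: a mixed dataset is the concatenation of two unmixed datasets.
    mixed = {f"{a} + {b}": unmixed[a] + unmixed[b]
             for a, b in combinations(sorted(unmixed), 2)}
    return unmixed, mixed

def split_3_to_1(examples, seed=0):
    """Shuffle and split into 75% / 25% parts (one reading of the 3:1 ratio)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]

# Toy data: 8 made-up pairs per section.
sections = ["introduction", "related", "main"] * 8
pairs = [(f"ctx{i}", f"paper{i % 5}", sec) for i, sec in enumerate(sections)]

unmixed, mixed = build_datasets(pairs)
train, test = split_3_to_1(mixed["introduction + related"])
print(sorted(mixed), len(train), len(test))
```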
From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
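To make the second observation concrete, the short sketch below computes how much of its mean score each model loses when moving from unmixed to mixed datasets. The scores are copied from Tables 1 and 2 (read as 0.3312 and so on); averaging over the three datasets of each type is a summary chosen here for illustration, not a statistic reported by the authors.

```python
from statistics import mean

# Scores copied from Tables 1 and 2, with decimal points restored.
scores = {
    "MRR": {
        "CNN":     {"unmixed": [0.3312, 0.3294, 0.3478], "mixed": [0.2773, 0.2815, 0.2978]},
        "SVD-CNN": {"unmixed": [0.3995, 0.4078, 0.3989], "mixed": [0.3878, 0.3889, 0.3845]},
    },
    "MAP": {
        "CNN":     {"unmixed": [0.3001, 0.2909, 0.3107], "mixed": [0.2572, 0.2601, 0.2637]},
        "SVD-CNN": {"unmixed": [0.3701, 0.3655, 0.3693], "mixed": [0.3498, 0.3511, 0.3539]},
    },
}

def relative_drop(model_scores):
    """Fractional loss of the mean score when moving from unmixed to mixed datasets."""
    unmixed = mean(model_scores["unmixed"])
    mixed = mean(model_scores["mixed"])
    return (unmixed - mixed) / unmixed

for metric, models in scores.items():
    for model, model_scores in models.items():
        print(f"{metric} {model}: {relative_drop(model_scores):.1%}")
```

Under this summary, the CNN baseline loses roughly 15% of its mean MRR on mixed datasets, whereas SVD-CNN loses roughly 4%.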
To better explore why mixed datasets are more complex than unmixed datasets, Figure 10 shows the change in S(W) during the training process of SVD-CNN on the various datasets.
As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets but poorly on mixed datasets. However, SVD-CNN achieves almost the same performance on the two types of datasets. This indicates that the correlation arising from various reference patterns can significantly affect link prediction.
The change in S(W) is not large on the unmixed datasets because the reference patterns of unmixed datasets have similar features that belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.
Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.
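The discussion above uses S(W) as a measure of how orthogonal the weight matrix is. This section does not restate its definition, so the sketch below assumes the form used by SVDNet [25]: the sum of the diagonal entries of the Gram matrix of W divided by the sum of the absolute values of all its entries. Treating the columns of W as the weight vectors is also an assumption. Under this definition the score is 1 when the weight vectors are mutually orthogonal and falls to 1/k when all k vectors are identical.

```python
def gram(W):
    """Gram matrix G = W^T W for a matrix given as a list of rows."""
    cols = list(zip(*W))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def s_of_w(W):
    """Assumed definition: sum of diag(G) divided by the sum of |G| over all entries."""
    G = gram(W)
    diag = sum(G[i][i] for i in range(len(G)))
    total = sum(abs(g) for row in G for g in row)
    return diag / total

W_orth = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]   # two orthogonal weight vectors
W_corr = [[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # two identical weight vectors

print(s_of_w(W_orth))   # orthogonal columns give the maximum score
print(s_of_w(W_corr))   # fully correlated columns give 1/k with k = 2
```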
4.5 Comparison with Other Types of Decorrelation
In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several variants as follows:
(1) Using the originally learned $W$
(2) Replacing $W$ with $US$
(3) Replacing $W$ with $U$
(4) Replacing $W$ with $UV^{T}$
(5) Replacing $W$ with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix in the QR decomposition
(6) Replacing $W$ with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix $W$ after dimension reduction by PCA
After training converges, different orthogonal matrices are used to replace the weight matrix $W$. We define T-cost as the time cost of replacing the weight, expressed as the proportion of the added time to the original time. As shown in Table 3, all other types of decorrelation degrade the performance except for $W \to US$ and $W \to W_{PCA}$. However, the time cost of $W \to W_{PCA}$ is higher than that of $W \to US$.
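The replacements (1) to (5) can be reproduced on a toy matrix with NumPy, as sketched below. The random matrix, the helper `off_diagonal_max`, and the check on $WW^{T}$ are illustrative assumptions. In particular, reading "maintaining the discriminating ability" as preserving $WW^{T}$ (and hence the distances between projected features) follows the argument in SVDNet [25] and is not stated in this section. Variant (6) is omitted because the text does not specify the PCA procedure in enough detail.

```python
import numpy as np

def off_diagonal_max(M):
    """Largest absolute off-diagonal entry of the Gram matrix M^T M."""
    G = M.T @ M
    return np.abs(G - np.diag(np.diag(G))).max()

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
W[:, 1] = W[:, 0] + 0.1 * rng.normal(size=6)      # make two columns strongly correlated

U, s, Vt = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(s) @ Vt
Q, R = np.linalg.qr(W)                            # W = Q @ R, R upper triangular

variants = {
    "W": W,
    "US": U * s,              # scales column j of U by s[j], same as U @ diag(s)
    "U": U,
    "UV^T": U @ Vt,
    "QD": Q * np.diag(R),     # same as Q @ diag(diag(R))
}

for name, M in variants.items():
    keeps_products = np.allclose(M @ M.T, W @ W.T)
    print(f"{name:5s} off-diagonal max = {off_diagonal_max(M):.2e}  "
          f"keeps W W^T: {keeps_products}")
```

All four replacements produce mutually orthogonal columns, but in this sketch only $US$ also leaves $WW^{T}$ unchanged, since $(US)(US)^{T} = US^{2}U^{T} = WW^{T}$.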
4.6 Ablation Study
In our method, there are two essential parameters: sot, which is the number of SOT iterations, and a biased parameter d0. In this section, we conduct an ablation study of these parameters.
We first evaluate the effectiveness of sot by empirically fixing d0 = 300. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance. In this situation, the weight matrix in the FC layer is highly correlated, and S(W) has the lowest value. The recommendation performance then increases as sot increases, which indicates that reducing the degree of correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.
In our model, d0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d0 for the same sot. When d0 is small, the information content of the citation context is very small, which produces worse performance. The recommendation performance increases to a maximum as d0 reaches 300. It should be noted that although a larger d0 performs better, it also significantly increases the training time. Therefore, we choose d0 = 300.

Table 1: MRR metric on various datasets.

| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
|---|---|---|---|---|---|---|
| CNN | 0.3312 | 0.3294 | 0.3478 | 0.2773 | 0.2815 | 0.2978 |
| SVD-CNN | 0.3995 | 0.4078 | 0.3989 | 0.3878 | 0.3889 | 0.3845 |

Figure 8: Comparison of recall with different methods on CiteSeer. (Plot of recall against the number of recommended citations for W2V, NPM, RBM, CP-LDA, SVD-CNN, and NCN.)

Figure 9: Comparison of MRR, MAP, and nDCG with different methods on CiteSeer. (Bar chart of MRR, MAP, and nDCG scores for the top 10 recommendations for CP-LDA, RBM, W2V, NPM, NCN, and SVD-CNN.)

Table 2: MAP metric on various datasets.

| Model | Introduction | Related | Main | Introduction + related | Introduction + main | Related + main |
|---|---|---|---|---|---|---|
| CNN | 0.3001 | 0.2909 | 0.3107 | 0.2572 | 0.2601 | 0.2637 |
| SVD-CNN | 0.3701 | 0.3655 | 0.3693 | 0.3498 | 0.3511 | 0.3539 |

Figure 10: The change in S(W) during training on unmixed datasets and mixed datasets. (Plot of S(W) against sot for the introduction, related, main, introduction + related, introduction + main, and related + main datasets.)

Table 3: The comparison of related methods in Step 1.

Figure 11: The performance impact of sot on CiteSeer.
5 Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model when using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at relevant places within the text as reference [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, “Conditional random field based sentence context identification: enhancing citation services for the research community,” in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., “Context-aware citation recommendation,” in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., “Citation recommendation without author supervision,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, “A neural probabilistic model for context based citation recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, “A neural network approach to quote recommendation in writings,” in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., “Cluscite: effective citation recommendation by information network-based clustering,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, “Neural citation network for context-aware citation recommendation,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, “Reference directed indexing: redeeming relevance for subject search in citation indexes,” Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinski, “CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central,” 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, “Comparing citation contexts for information retrieval,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., “Content-based citation recommendation,” 2018, httpsarxivorgpdf18020830201v1pdf.
[14] H. Jia and E. Saule, “Local is good: a fast citation recommendation approach,” Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, “A personalized paper recommendation approach based on web paper mining and reviewer’s interest modelling,” in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, “Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, “Recommending citations for academic papers,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., “CiteSight: supporting contextual citation recommendation using differential search,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., “Recommending citations with translation model,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., “Recommending citations: translating papers into references,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., “Cross-language context-aware citation recommendation in scientific articles,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., “Neural photo editing with introspective adversarial networks,” in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., “Large scale GAN training for high fidelity natural image synthesis,” 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., “SVDNet for pedestrian retrieval,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, “Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process,” IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., “Orthogonal deep features decomposition for age-invariant face recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., “Training group orthogonal neural networks with privileged information,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., “Distributed representations of words and phrases and their compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, “Data mining,” Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., “Concept-based document recommendations for CiteSeer authors,” in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, “The TREC-8 question answering track report,” in Proceedings of the TREC’00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, “Utilizing context in generative bayesian models for linked corpus,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, “A discriminative approach to topic-based citation recommendation,” in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
12 Computational Intelligence and Neuroscience
orthogonal constraint training it should be set as anonnegative value Figure 11 illustrates the MRR with sotfrom 0 to 10 on the CiteSeer dataset We can see that the
performance improves as the value of sot increasesWhen sot 0 the model has no decorrelation andachieves the worst performance In this situation the
Table 1 MRR metric on various datasets
Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03312 03294 03478 02773 02815 02978SVD-CNN 03995 04078 03989 03878 03889 03845
060055050045040035030025Re
call
020015010005000
20 40 60Number of recommended citations
80 100
W2VsNPMs
RBMsCP_LDAs
SVD_CNNsNCNs
Figure 8 Comparison of recall with different methods on CiteSeer
MRR MAP and nDCG scores for top 10 recommendations04
035
03
025
02
015
01
005
0
0091600997
MRR MAP nCDG
00662
01843
02667
03687
00912009982
00663
01835
02418
03352
01288 0135601476
0256602592
03448
CP-LDARBM
W2VNPM
NCNSVD-CNN
Figure 9 Comparison of MRR MAP and nDCG with different methods on CiteSeer
Table 2 MAP metric on various datasets
Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03001 02909 03107 02572 02601 02637SVD-CNN 03701 03655 03693 03498 03511 03539
Computational Intelligence and Neuroscience 9
weight matrix in the FC layer is highly correlated andS(W) has the lowest value 0e recommendation per-formance then increases while adding sot which indi-cates that reducing the correlative degree of the weightmatrix in the FC layer is critical for improving perfor-mance When sot 10 our model achieves the bestperformance
In our model d0 is the dimension of citation contextand cited document representations Figure 12 shows howthe performance of SVD-CNN varies with d0 on the samesot When d0 is small the information content of thecitation context is very small and produces worse per-formance 0e recommendation performance increases toa maximum point until d0 reaches 300 It should be noted
05
045
04
035
03
S (W
)
025
02
015
01
005
00 1 2 3 4 5
Sot6 7 8 9 10 11
IntroductionRelatedMain
Introduction + relatedIntroduction + mainRelated + main
Figure 10 0e change in S(W) during training on unmixed datasets and mixed datasets
Table 3 0e comparison of related methods in Step 1
Figure 11 0e performance impact of sot on CiteSeer
10 Computational Intelligence and Neuroscience
that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300
5 Conclusion and Future Works
We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers
Data Availability
Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]
Conflicts of Interest
0e authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)
References
[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013
[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010
[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011
[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015
[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016
[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014
[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017
[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003
[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003
[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures
based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680
[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008
[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985
[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf
[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018
[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009
[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007
[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007
[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014
[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011
[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014
[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014
[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017
[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014
[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096
[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017
[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via
implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018
[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018
[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017
[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013
[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014
[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011
[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016
[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014
[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008
[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000
[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010
[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009
12 Computational Intelligence and Neuroscience
weight matrix in the FC layer is highly correlated and S(W) has the lowest value. The recommendation performance then increases as sot increases, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.
In our model, d0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d0 for a fixed sot. When d0 is small, the representation carries little information about the citation context, which leads to worse performance. The recommendation performance increases until d0 reaches 300, where it peaks. It should be noted
Figure 10: The change in S(W) during training on unmixed datasets and mixed datasets. (Plot of S(W), from 0 to 0.5, against sot, from 0 to 11, for six datasets: Introduction, Related, Main, Introduction + related, Introduction + main, and Related + main.)
Table 3: The comparison of related methods in Step 1.
Figure 11: The performance impact of sot on CiteSeer.
that although a larger d0 performs better, it also significantly increases the training time. Therefore, we choose d0 = 300.
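To make the role of S(W) in the discussion above concrete, the following is a minimal NumPy sketch, not the training procedure used in the paper. It assumes the SVDNet-style definition from [25], in which S(W) is the ratio of the diagonal mass of the Gram matrix of W to its total absolute mass, so that highly correlated weight vectors give a low S(W) and orthogonal ones give 1. It also assumes that the columns of W are the weight vectors. The function names `correlation_score` and `decorrelate`, the matrix sizes, and the random data are illustrative only.

```python
import numpy as np


def correlation_score(W: np.ndarray) -> float:
    """Assumed definition: S(W) = sum_i g_ii / sum_ij |g_ij|, with G = W^T W.

    The score is 1.0 when the columns of W are mutually orthogonal and
    approaches 1/k (k = number of columns) when they are highly correlated.
    """
    G = W.T @ W
    return float(np.trace(G) / np.abs(G).sum())


def decorrelate(W: np.ndarray) -> np.ndarray:
    """Replace W by U @ diag(s), where W = U @ diag(s) @ Vt is the thin SVD."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return U * s  # scales column j of U by s[j]; the columns stay orthogonal


# Illustrative data: 8 weight vectors of dimension 300 sharing one direction.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
W = base + 0.1 * rng.normal(size=(300, 8))
W_orth = decorrelate(W)

print(f"S(W) before SVD: {correlation_score(W):.3f}")
print(f"S(W) after SVD:  {correlation_score(W_orth):.3f}")
```

In this sketch the factorized matrix keeps the same product W W^T as the original, because dropping the orthogonal factor V^T does not change it, while its columns become mutually orthogonal. That is one way to read the statement that lowering the correlation of the FC weights raises S(W).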
5 Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization for the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which makes the vectors in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model currently takes only the abstract as the content of the cited paper. In the future, we will explore the performance of our model when using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at relevant places within the text as reference [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).
References
[1] M. A. Angrosh, S. Cranefield, and N. Stanger, “Conditional random field based sentence context identification: enhancing citation services for the research community,” in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., “Context-aware citation recommendation,” in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., “Citation recommendation without author supervision,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, “A neural probabilistic model for context based citation recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, “A neural network approach to quote recommendation in writings,” in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., “Cluscite: effective citation recommendation by information network-based clustering,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, “Neural citation network for context-aware citation recommendation,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[9] S. Bradshaw, “Reference directed indexing: redeeming relevance for subject search in citation indexes,” Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinski, “CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central,” 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, “Comparing citation contexts for information retrieval,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., “Content-based citation recommendation,” 2018, https://arxiv.org/pdf/18020830201v1.pdf.
[14] H. Jia and E. Saule, “Local is good: a fast citation recommendation approach,” Lecture Notes in Computer Science, Vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, “A personalized paper recommendation approach based on web paper mining and reviewer’s interest modelling,” in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, “Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, “Recommending citations for academic papers,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., “CiteSight: supporting contextual citation recommendation using differential search,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., “Recommending citations with translation model,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., “Recommending citations: translating papers into references,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., “Cross-language context-aware citation recommendation in scientific articles,” in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., “Neural photo editing with introspective adversarial networks,” in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., “Large scale GAN training for high fidelity natural image synthesis,” 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., “SVDNet for pedestrian retrieval,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, “Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process,” IEEE Access, vol. 6, no. 1109, pp. 15844–15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., “Orthogonal deep features decomposition for age-invariant face recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., “Training group orthogonal neural networks with privileged information,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., “Distributed representations of words and phrases and their compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, “Data mining,” Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., “Concept-based document recommendations for CiteSeer authors,” in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, “The TREC-8 question answering track report,” in Proceedings of the TREC’00, pp. 77–82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, “Utilizing context in generative bayesian models for linked corpus,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, “A discriminative approach to topic-based citation recommendation,” in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
12 Computational Intelligence and Neuroscience
that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300
5 Conclusion and Future Works
We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers
Data Availability
Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]
Conflicts of Interest
0e authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)
References
[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013
[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010
[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011
[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015
[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016
[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014
[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017
[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003
[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003
[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures
based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680
[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008
[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985
[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf
[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018
[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009
[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007
[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007
[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014
[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011
[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014
[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014
[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017
[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014
[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096
[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017
[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via
implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018
[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018
[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017
[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013
[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014
[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011
[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016
[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014
[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008
[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000
[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010
[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009
12 Computational Intelligence and Neuroscience
based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680
[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008
[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985
[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf
[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018
[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009
[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007
[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007
[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014
[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011
[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014
[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014
[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017
[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014
[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096
[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017
[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via
implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018
[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018
[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017
[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013
[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014
[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011
[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016
[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014
[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008
[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000
[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010
[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009