Research Article

SVD-CNN: A Convolutional Neural Network Model with Orthogonal Constraints Based on SVD for Context-Aware Citation Recommendation

Shaoyu Tao, Chaoyuan Shen, Li Zhu, and Tao Dai

School of Software Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China

Correspondence should be addressed to Li Zhu: zhuli@xjtu.edu.cn

Received 27 November 2019; Revised 28 September 2020; Accepted 5 October 2020; Published 23 October 2020

Academic Editor: Giosuè Lo Bosco

Copyright © 2020 Shaoyu Tao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Computational Intelligence and Neuroscience, vol. 2020, Article ID 5343214, 12 pages. https://doi.org/10.1155/2020/5343214

Context-aware citation recommendation aims to automatically predict suitable citations for a given citation context, which is essentially helpful for researchers when writing scientific papers. In existing neural network-based approaches, overcorrelation in the weight matrix influences semantic similarity, which is a difficult problem to solve. In this paper, we propose a novel context-aware citation recommendation approach that can essentially improve the orthogonality of the weight matrix and explore more accurate citation patterns. We quantitatively show that the various reference patterns in a paper have interactional features that can significantly affect link prediction. We conduct experiments on the CiteSeer dataset. The results show that our model is superior to baseline models in all metrics.

1. Introduction

Citation recommendation, which helps researchers quickly find appropriate and relevant literature, is a rapidly developing research area [1]. Within this area, context-aware citation recommendation is a particular type that predicts citations for a given citation context [2]. The citation context is usually a few sentences before and after the placeholder, such as "[]". The key problem for context-aware citation recommendation is how to measure the similarity between the citation context and a specific scientific paper.

Similar to other NLP tasks (e.g., information retrieval (IR) and text mining), the simplest solution for context-aware citation recommendation calculates a relevance score between a citation context and candidate papers via Euclidean distance [3] and then selects the salient citations. However, simple text similarity is obviously too coarse to be a good measurement. In recent years, neural network models have been widely used to recommend documents due to their efficiency and effectiveness [4–7]. Neural network models can be regarded as better solutions than traditional machine learning methods for simplifying feature

engineering tasks and having the ability to deal with large-scale data. However, the weight vectors in existing neural network-based models are usually strongly correlated. In fact, a critical assumption of using similarity measurements, such as Euclidean distance or cosine distance, is that the entries in the feature vectors should be as independent as possible [8]. When the weight vectors are overcorrelated, some entries of the descriptor will dominate the measurement and cause poor ranking results. The above problems seriously affect the performance of citation recommendation because citing activity exhibits strong orthogonality. Assume there are three types of citations in a paper: "field-reference" (red color), "method-reference" (purple color), and "math-reference" (blue color). "Field-reference" usually appears in the introduction and cites scientific articles that use the same techniques in other research fields. "Method-reference" usually appears in related work and cites scientific articles solving the same task. "Math-reference" usually appears in the main part of the paper, which describes the researcher's method in detail, and its citations are more related to mathematical theorems. It is obvious that these three types of citations have strong orthogonality. In the


neural network model, these three citation types are usually mapped into a matrix and can be seen as base vectors for inputs. As shown in Figure 1, the vectors in the mapping matrix learned by traditional neural network models are not orthogonal. When a sample is mapped by $\vec{w}_1$, $\vec{w}_2$, and $\vec{w}_3$, apparently $\vec{w}_1$ and $\vec{w}_3$ will dominate the output and consequently create low discriminative ability. A more satisfactory $\vec{w}_2'$ (yellow color) imposes orthogonality.

[Figure 1: Distribution of the weight vectors of the reference types in geometric space.]

To address the aforementioned problems, we propose a neural network model with orthogonal regularization for context-aware citation recommendation. Our model uses a CNN to extract the semantic features of the citation context and candidate papers. We then add an orthogonal constraint based on SVD to weaken the correlation of the weight vectors in the FC layer, which allows the model to learn well-interpretable features for citation contexts and papers. To the best of our knowledge, this is the first work that addresses context-aware citation recommendation with a CNN and orthogonal constraint framework. Experimental results show that our model significantly outperforms the baseline methods.

2. Related Work

2.1. Citation Recommendation. A variety of citation recommendation approaches have been proposed in the literature, including text similarity-based [9, 10], topic model-based [11, 12], probabilistic model-based [13], translation model-based [7], and collaborative filtering-based [14] methods. Sun et al. [15] proposed a method for recommending appropriate papers to academic reviewers using a similarity-based algorithm. Their method builds preference vectors for reviewers based on their publication history and calculates the similarity between the preference vector and each candidate document vector; the literature with high similarity is recommended to the corresponding reviewers. Shaparenko and Joachims [16] considered the relevance of the citation context and the paper content and applied a language model to the recommendation task. Strohman et al. [17] showed that using text similarity alone was not ideal for recommending citations, because scholars tend to coin new words to describe their own achievements, while two scholars who study the same topic may use different expressions for the same concept and method. To address this problem, Strohman et al. [17] regarded each document as a node in a directed graph when performing citation recommendation; they believe that a similarity measurement enriched with reference information can reflect the citation behavior of a node more authentically. Livne et al. [18] proposed a citation recommendation method that couples the enriched citation context of the literature with various techniques, including machine learning, when making recommendations. Some works addressed the language gap between cited papers and citation contexts and attempted to use translation models or distributed semantic representations. Lu et al. [19] assumed that the languages used in citation contexts and in cited papers are different and used a translation model to bridge this gap. He et al. [3] combined a language model, a topic model, and a feature model to find the appropriate citation context. Huang et al. [20] treated the appearance of cited papers as a particular language and represented the cited papers by unique IDs regarded as new "words"; the probability of citing a paper given a citation context is then directly estimated by a translation model. Tang et al. [21] proposed a joint embedding model to learn a low-dimensional embedding space for both contexts and citations.

In recent years, neural networks have shown better performance in many fields, and some researchers have attempted to recommend citations using neural networks. Huang et al. [4] learned distributed word representations for citation contexts and associated document embeddings via a feedforward neural network and then estimated the probability of citing a paper given a citation context. Tan et al. [5] proposed an LSTM-based neural network method for quote recommendation tasks; they focused on the characteristics of quotes and trained neural networks to bridge the language gap. Their neural network model learned semantic representations of arbitrary-length texts from a large corpus.

2.2. Orthogonal Constraint in Deep Learning. One of the greatest advantages of orthogonal matrices is that they preserve the norm of a vector under multiplication. This property is useful in gradient backpropagation, especially for dealing with gradient explosion and gradient vanishing problems. Orthogonal regularization is widely used in many fields. Brock et al. [22] used orthogonal regularization to improve the generalization performance of image-generation editing tasks using generative adversarial networks (GANs) [23]. They further expanded their work into BigGAN [24]. Their results showed that, by applying orthogonal regularization, the generator allows fine-tuning the tradeoff between fidelity and diversity of samples by truncating hidden spaces, which can make the model achieve the best performance in class-conditional image synthesis. Another advantage of orthogonal matrices is that they benefit deep representation learning. If the weight vectors of the fully connected layer in a convolutional neural network are highly correlated, the entries of each fully connected descriptor will also be highly correlated, which greatly reduces retrieval performance. Sun et al. [25] proposed SVD-Net to show that guaranteeing the orthogonality of the feature weights of the FC layer can strengthen the orthogonal constraint of the network and improve accuracy. Zheng et al. [26] reported that regularization is an efficient method for improving the generalization ability of a deep CNN because it makes it possible to train more complex models while maintaining lower overfitting; they proposed a method for optimizing the feature boundary of a deep CNN through a two-stage training step to reduce the overfitting problem. However, the mixed features learned by a CNN potentially reduce the robustness of network models for identification or classification. To address this problem, Wang et al. [27] decomposed deep face features into two orthogonal components representing age-related and identity-related features in order to learn age-invariant deep face features; in their model, age-invariant deep features can be effectively obtained to improve AIFR performance. Chen et al. [28] proposed a group orthogonal convolutional neural network (GoCNN) model based on the idea of learning different groups of convolutional functions that are "orthogonal" to those in other groups, i.e., with no significant correlation among the produced features. Optimizing orthogonality among convolutional functions reduces redundancy and increases diversity within the architecture. Moreover, it can also yield a single CNN model with sufficient inherent diversity, such that the model learns more diverse representations and has stronger generalization ability than vanilla CNNs.

3. Proposed Method

3.1. Problem Formulation. Context-aware citation recommendation is defined as a matching task between the citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is essentially a convolutional neural network with two inputs and orthogonal constraints, and it consists of the following main steps:

(1) We adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple-granularity semantic features.

(2) The multiple-granularity semantic features are then orthogonally constrained by an SVD-FC layer.

(3) We use fully connected layers to obtain the final vector representation. A logistic function or SVM is used to obtain the recommendation result.

3.2. Network Structure

3.2.1. Input Layer. Word2vec [29] is used to embed the input of our model. Each word is represented as a $d_0$-dimensional precomputed vector, where $d_0 = 300$. As a result, each sentence is represented as a feature matrix of dimension $d_0 \times s$. Through this layer, we obtain the raw representations of the citation context c and the candidate document d.

We also calculate the weight of common words according to the inputs. Then, we obtain the basic input feature $\mathrm{TF\text{-}IDF}(c, d)$ for our model, which is the product of $\mathrm{TF}(w_c, d)$ and IDF and reflects how important a word in citation context c is for a candidate document d in the corpus [30]. Here, $w_c$ is a word in citation context c. These two variables are calculated as follows:

$$\mathrm{TF}(w_c, d) = \frac{\mathrm{count}(w_c, d)}{\mathrm{top}(w^*, d)}, \qquad \mathrm{IDF} = \log\frac{N}{\mathrm{docs}(w_c, D)} \tag{1}$$

where $\mathrm{count}(w_c, d)$ is the number of times the word $w_c$ appears in document d, $\mathrm{top}(w^*, d)$ is the occurrence count of the word $w^*$ that appears most frequently in the candidate document d, $\mathrm{docs}(w_c, D)$ is the number of documents containing the word $w_c$ among all candidate citations D, and N is the total number of candidate citations.
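As an illustration of equation (1), the following minimal Python sketch computes the TF-IDF weight for one context word; the toy tokenized corpus is a hypothetical stand-in for the preprocessed CiteSeer text.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    # count(w_c, d) normalized by top(w*, d), the count of the most frequent word
    counts = Counter(doc_tokens)
    return counts[word] / max(counts.values())

def idf(word, corpus):
    n_docs = sum(1 for doc in corpus if word in doc)  # docs(w_c, D)
    return math.log(len(corpus) / n_docs) if n_docs else 0.0

corpus = [["svd", "cnn", "citation"], ["citation", "context"], ["svd", "net"]]
print(tf("citation", corpus[0]) * idf("citation", corpus))  # TF-IDF for one word
```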

3.2.2. Convolution Layer. The inputs of the convolution layer are the feature matrices of citation context c and document d. The process of this layer is demonstrated in Figure 3. We first pad the two inputs with zero vectors to the same length $s = \max(|c|, |d|)$. For every input, let $v_1, v_2, \ldots, v_s$ be the words in a sentence. We define $g_i \in \mathbb{R}^{w d_0}$, $0 < i < s + w - 1$, as the concatenation of $v_{i-w}, \ldots, v_i$. Then this layer generates the feature $P_i \in \mathbb{R}^{d_1}$ for the phrase $v_{i-w}, \ldots, v_i$ as follows:

$$P_i = \tanh(W \cdot g_i + b) \tag{2}$$

where $W \in \mathbb{R}^{d_1 \times w d_0}$ is a convolution kernel and $b \in \mathbb{R}^{d_1}$ is the bias.
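A minimal numpy sketch of this wide convolution, with toy sizes standing in for the paper's $d_0 = 300$:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, w, s = 4, 6, 3, 10           # toy sizes; the paper uses d0 = 300
X = rng.normal(size=(s, d0))         # sentence as s word vectors
W = rng.normal(size=(d1, w * d0))    # convolution kernel W in R^{d1 x w*d0}
b = rng.normal(size=d1)              # bias b in R^{d1}

# Pad with zero vectors so every w-word window exists (wide convolution).
Xp = np.vstack([np.zeros((w - 1, d0)), X, np.zeros((w - 1, d0))])
P = np.stack([np.tanh(W @ Xp[i:i + w].ravel() + b)   # P_i = tanh(W . g_i + b)
              for i in range(s + w - 1)])
print(P.shape)  # (s + w - 1, d1) phrase features
```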

3.2.3. Average Pooling Layer. The pooling layer is usually used for feature compression. In our model, we choose average pooling. The reason is that whole sentences or paragraphs can express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is "w-ap", which is the column average over each window of w continuous columns. After the convolution layer, an s-column feature map is converted into a new (s + w − 1)-column feature map. By using "w-ap", the new feature map is recovered to s columns. This architecture facilitates the extraction of more useful abstract features.

The second one is "all-ap", which averages over all columns. As shown in Figure 5, "all-ap" generates a representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document.
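Both pooling operations are plain column averages; a small sketch with the same toy sizes:

```python
import numpy as np

def w_ap(features, w):
    # Average each window of w consecutive columns: an (s + w - 1)-column
    # feature map is recovered back to s columns.
    n = features.shape[0]
    return np.stack([features[i:i + w].mean(axis=0) for i in range(n - w + 1)])

def all_ap(features):
    # Average over all columns: one representation vector per feature map.
    return features.mean(axis=0)

F = np.random.default_rng(1).normal(size=(12, 6))  # s + w - 1 = 12, d1 = 6
print(w_ap(F, 3).shape)   # (10, 6): back to s = 10 columns
print(all_ap(F).shape)    # (6,): whole-sentence representation
```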

Now we can obtain the features of the citation context and the independent features of the cited document. The next step is to obtain the semantic relationships between the citation context and the candidate paper. We use cosine similarity to measure the semantic relations:


$$\mathrm{sim}_j = \frac{\sum_{i=0}^{d_j} C_{ji} \times D_{ji}}{\sqrt{\sum_{i=0}^{d_j} (C_{ji})^2 \times \sum_{i=0}^{d_j} (D_{ji})^2}}, \quad j \in [1, 10] \tag{3}$$

where $C_j$ and $D_j$ are the distributed representations of the citation context and the candidate document after the j-th "all-ap" layer, respectively. A total of ten "all-ap" layers are used in our model; therefore, j belongs to [1, 10]. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all $\mathrm{sim}_j$ and the basic features. Then it is fed into the SVD-FC layer.

[Figure 2: An overview of our model. The citation context and the candidate document are each embedded by word2vec and passed through a stack of convolution and w-ap layers (1st through 10th); the all-ap features are spliced with the based-feature, fed through the SVD-FC layer ($W = USV^T$) and FC layers, and classified by logistic regression/SVM.]
[Figure 3: Convolution extraction generates phrases.]
[Figure 4: "W-ap" structure.]
[Figure 5: "All-ap" structure.]
[Figure 6: Generating the feature map: the all-ap features $\mathrm{sim}_i$ of $(C_i, D_i)$ are spliced with the basic feature.]
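A sketch of this multigranularity matching, with random vectors standing in for the ten all-ap outputs and the TF-IDF based feature:

```python
import numpy as np

def sim(c_vec, d_vec):
    # Cosine similarity of equation (3) for one "all-ap" output pair.
    return float(c_vec @ d_vec / (np.linalg.norm(c_vec) * np.linalg.norm(d_vec)))

rng = np.random.default_rng(2)
# One all-ap vector per block for context C_j and document D_j, j = 1..10.
C = [rng.normal(size=6) for _ in range(10)]
D = [rng.normal(size=6) for _ in range(10)]
sims = [sim(c, d) for c, d in zip(C, D)]
base = rng.normal(size=4)               # stand-in for the TF-IDF based feature
feature = np.concatenate([sims, base])  # spliced input to the SVD-FC layer
print(feature.shape)
```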

In most cases, we find that if we use all outputs of the pooling layers as the input of the SVD-FC layer, the performance will be improved. The reason is that features from different layers represent different levels of semantics; neglecting any layer will obviously cause information loss.

Next, we use the SVD-FC layer to learn the nonlinear combination features of citation relationships. This layer can force the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer can also reduce the negative impact of excessive parameters.

3.2.4. SVD-FC Layer. In this layer, we use SVD to factorize the weight matrix W ($W = USV^T$) and replace it with US. Our experimental results show that this replacing operation reduces the negative impact on the sample space.

The Euclidean distance between samples can be used to measure whether their feature expression changes in a sample space. Denoting $e_m$ and $e_n$ as the feature maps of two different samples, we can obtain two different outputs of the fully connected operation by using the weight matrix W or US as follows:

$$p = e \times W \tag{4}$$

$$q = e \times US \tag{5}$$

As seen in the above equations, q is the orthogonalized output, while p is the unorthogonalized one. Then we can obtain the following theorem.

Theorem 1. p and q in equations (4) and (5) generate the same Euclidean distance for samples $e_m$ and $e_n$.

Proof. The Euclidean distance L between $p_m$ and $p_n$ is calculated as follows:

$$L = \|\vec{p}_m - \vec{p}_n\|_2 = \sqrt{(\vec{e}_m - \vec{e}_n)^T W W^T (\vec{e}_m - \vec{e}_n)} = \sqrt{(\vec{e}_m - \vec{e}_n)^T U S V^T V S^T U^T (\vec{e}_m - \vec{e}_n)} \tag{6}$$

Since V is an orthogonal matrix, equation (6) is equivalent to

$$L = \sqrt{(\vec{e}_m - \vec{e}_n)^T U S S^T U^T (\vec{e}_m - \vec{e}_n)} = \sqrt{(\vec{q}_m - \vec{q}_n)^T (\vec{q}_m - \vec{q}_n)} = \|\vec{q}_m - \vec{q}_n\|_2 \tag{7}$$

It can be seen that $\|\vec{p}_m - \vec{p}_n\|_2 = \|\vec{q}_m - \vec{q}_n\|_2$.

It should be noted that there are no negative impacts and no changes in discrimination ability for the entire sample space when replacing the weight. As shown in Figure 7, we use the SVD of the weight matrix W to map the feature map to an orthogonal linear space.

[Figure 7: The SVD-FC layer maps its input feature to an orthogonalized output feature.]
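Theorem 1 is easy to verify numerically; the sketch below replaces a random weight matrix W with US and checks that the pairwise sample distance is unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 8))                 # learned FC weight matrix
U, s, Vt = np.linalg.svd(W)                 # W = U S V^T
US = U * s                                  # replace W with US (S is diagonal)

e_m, e_n = rng.normal(size=8), rng.normal(size=8)
p = (e_m @ W, e_n @ W)                      # p = e x W   (equation (4))
q = (e_m @ US, e_n @ US)                    # q = e x US  (equation (5))

# Theorem 1: both mappings give the same pairwise Euclidean distance.
print(np.linalg.norm(p[0] - p[1]), np.linalg.norm(q[0] - q[1]))
```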

3.2.5. Output Layer. The citation recommendation problem is regarded as a classification task in our model. In this layer, logistic regression or SVM handles the binary classification task and predicts the final citation relationship.

3.3. Training Details

3.3.1. Embeddings. In our model, words are initialized with 300-dimensional word2vec embeddings and are not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [−0.01, 0.01]. We employ AdaGrad [31] and L2 regularization. We introduce adversarial training [32] for embeddings to make the model more robust. This is achieved by replacing the word vector v obtained from the word2vec embedding with a perturbed word vector $v^*$:

$$v^* = v + r_{adv} \tag{8}$$

where $r_{adv}$ is the worst-case perturbation of the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function $\log p(y \mid x; \hat{\theta})$ around x, where $\hat{\theta}$ is a constant set to the current parameters of our model; it only participates in the calculation of $r_{adv}$ and receives no backpropagated gradients. With the linear approximation and an L2-norm constraint, the adversarial perturbation is

$$r_{adv} = -\epsilon \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_x \log p(y \mid x; \hat{\theta}) \tag{9}$$

This perturbation can be easily computed by using backpropagation in neural networks.
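As a sketch of equations (8) and (9), the snippet below computes $r_{adv}$ for a simple logistic model whose gradient is derived by hand; the model, its parameter vector theta, and the norm bound eps are stand-ins for the actual network:

```python
import numpy as np

def adversarial_perturbation(x, y, theta, eps=0.02):
    # g = grad_x log p(y | x; theta) for a logistic model p(y=1|x) = sigmoid(theta.x);
    # equation (9): r_adv = -eps * g / ||g||_2, with eps an assumed norm bound.
    p = 1.0 / (1.0 + np.exp(-theta @ x))
    g = (y - p) * theta                     # gradient of the log-likelihood w.r.t. x
    return -eps * g / np.linalg.norm(g)

rng = np.random.default_rng(4)
v, theta = rng.normal(size=5), rng.normal(size=5)    # v: a word embedding
v_adv = v + adversarial_perturbation(v, 1.0, theta)  # perturbed embedding v*
print(v_adv)
```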

3.3.2. Layerwise Training. In our training steps, we define a conv-pooling block $b_t$ ($t \ge 2$), which consists of a convolution layer and a pooling layer. Our network model is then assembled from the initialization block $b_1$, which is initialized using word2vec, and $(n - 1)$ conv-pooling blocks.

First, we train the conv-pooling block $b_2$ after $b_1$ is trained. On this basis, the next conv-pooling block $b_3$ is created by keeping the previous blocks fixed. We repeat this procedure until all $(n - 1)$ conv-pooling blocks are trained.

Second, the following semiorthogonal training procedure is used to train the whole network.

Semiorthogonal training (SOT) is crucial for training SVD-CNN and consists of the following three steps:

Step 1. Decompose the weight matrix by SVD, i.e., $W = USV^T$, where W is the weight matrix of the linear layer, U is the left-unitary matrix, S is the singular value matrix, and V is the right-unitary matrix. After that, replace W with US. Next, take all eigenvectors of $US(US)^T$ as weight vectors.

Step 2. The backbone model is fine-tuned with the SVD-FC layer fixed.

Step 3. The model keeps fine-tuning with the SVD-FC layer unfixed.

Step 1 can generate orthogonal weights, but the prediction performance cannot be guaranteed. The reason is that excessive orthogonality will excessively punish synonymous sentences, which is apparently inappropriate. Therefore, we introduce Steps 2 and 3 to solve this problem.
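A schematic of the SOT loop might look as follows; `train` is a hypothetical stand-in for real fine-tuning by backpropagation, and only the decompose-and-replace step mirrors Step 1 literally:

```python
import numpy as np

def sot_step(W):
    # Step 1 of SOT: decompose W = U S V^T and replace W with US,
    # whose column vectors are mutually orthogonal.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U * s

def train(W, fixed):
    # Stand-in for fine-tuning: the real model updates all CNN/FC weights;
    # here we only perturb W when the SVD-FC layer is not fixed.
    return W if fixed else W + 0.01 * np.random.default_rng(5).normal(size=W.shape)

W = np.random.default_rng(6).normal(size=(8, 8))
for _ in range(3):              # sot iterations
    W = sot_step(W)             # Step 1: decompose and replace
    W = train(W, fixed=True)    # Step 2: fine-tune backbone, SVD-FC fixed
    W = train(W, fixed=False)   # Step 3: fine-tune with SVD-FC unfixed
```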

The inputs of the SVD-FC layer are defined as $Y = (y_1, y_2, \ldots, y_m)^T$, the outputs as $O = (o_1, o_2, \ldots, o_m)^T$, the weight matrix as $W = (w_1, w_2, \ldots, w_m)^T$, and the expected outputs as $A = (a_1, a_2, \ldots, a_m)^T$. The error function is defined as

$$E = \frac{1}{2} \sum_{k=1}^{l} (a_k - o_k)^2 \tag{10}$$

where $o_k = f\big(\sum_{j=0}^{m} w_{kj} y_j\big)$, $k = 1, 2, \ldots, l$. Differentiating E with respect to $o_k$ gives

$$\frac{\partial E}{\partial o_k} = -(a_k - o_k) \tag{11}$$

We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} \tag{12}$$

Defining an error signal $\delta^o_k = \partial E / \partial \mathrm{net}_k$, equation (12) is equivalent to

$$\Delta w_{kj} = -\eta \frac{\partial E}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -\eta \delta^o_k \frac{\partial \mathrm{net}_k}{\partial w_{kj}} \tag{13}$$

According to equation (11), $\delta^o_k = \partial E / \partial \mathrm{net}_k$ is equivalent to

$$\delta^o_k = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial \mathrm{net}_k} = \frac{\partial E}{\partial o_k} f'(\mathrm{net}_k) = \frac{\partial E}{\partial o_k} o_k' = -(a_k - o_k) o_k' \tag{14}$$

We use the sigmoid $f(x) = 1/(1 + e^{-x})$ as the nonlinear function, so equation (13) is equivalent to

$$\Delta w_{kj} = -\eta \delta^o_k y_j = \eta (a_k - o_k) o_k (1 - o_k) y_j \tag{15}$$

In Step 1, the weight matrix W is decomposed by SVD and replaced with US, where $U = (q_1, q_2, \ldots, q_m)^T$ and $S = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$. Since $a_k - o_k$ is given, we define $\mathrm{Loss} = a_k - o_k$. As a result, equation (15) is equivalent to

$$\Delta w_{kj} = \eta\, \mathrm{Loss} \cdot \left[ o_k - \mathrm{sigmoid}\Big(y_j \sum_i q_i \lambda_i + B\Big)^2 \right] y_j \tag{16}$$

The vectors of the left-unitary matrix U satisfy $q_i \cdot q_j = 0$ ($i \ne j$), so the model operation is not affected by nonorthogonal eigenvectors $q_i$. This is the reason why synonymous sentences are excessively punished in Step 1. However, orthogonality has a positive effect on $\Delta w_{kj}$ in Step 2.

The purpose of SVD is to maintain the orthogonality of each weight vector in geometric space. When the weight vectors are conditioned by orthogonal regularization, the relevance between weight vectors decreases. We use the following method in Step 3 to measure relevance:

$$H = W^T W = \begin{bmatrix} \vec{w}_1^T \vec{w}_1 & \cdots & \vec{w}_1^T \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^T \vec{w}_1 & \cdots & \vec{w}_k^T \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix} \tag{17}$$

where W is a weight matrix that contains k weight vectors $w_i$ ($i = 1, \ldots, k$), and $h_{ij}$ ($i, j = 1, \ldots, k$) is the dot product of $w_i$ and $w_j$. Let us define S(W) as the correlation measurement of all column vectors in W:

$$S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} |h_{ij}|} \tag{18}$$

When W is an orthogonal matrix, the value of S(W) is 1; when the off-diagonal products $h_{ij}$ ($i \ne j$) are maximal, S(W) attains its minimum value 1/k. Therefore, the value of S(W) falls into [1/k, 1]. As a result, when S(W) is close to 1/k, the weight matrix has high relevance.
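Equation (18) and both of its extremes can be checked directly:

```python
import numpy as np

def s_of_w(W):
    # Equation (18): S(W) = sum_i h_ii / sum_ij |h_ij|, with H = W^T W.
    H = W.T @ W
    return np.trace(H) / np.abs(H).sum()

k = 5
Q = np.linalg.qr(np.random.default_rng(7).normal(size=(k, k)))[0]
print(s_of_w(Q))                # orthogonal columns: S(W) = 1.0
print(s_of_w(np.ones((k, k))))  # fully correlated columns: S(W) = 1/k = 0.2
```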

3.4. Complexity Analysis. Assume that the training sample size is |C| and the average number of words in each citation context is |c|; let $C_l$ be the number of kernels in the l-th layer and w the size of the sliding window. For one convolution layer, the training complexity is $O(C_{l-1} \cdot C_l \cdot w \cdot (s - w + 1))$. The training complexity of one w-ap layer is $O(C_l^2 \cdot w \cdot s)$, and that of one all-ap layer is $O(C_l^2 \cdot (s - w + 1))$. As shown by Van Loan [12], computing the eigenvalues for the SVD decomposition of a matrix of size K takes O(K) with the Jacobi method. Assume that the size of the weight matrix in the SVD-FC layer is K and that the channel count of the input matrix is $C_{in}$; the computational cost of the SVD-FC layer is then $O(2K^2 \cdot C_{in} + K)$.

4. Experiment

4.1. Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and the abstracts of cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after the citation placeholder. Within each paper in the corpus, the 50 words before and the 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural styles of named entities, no stemming is done, but all words are transferred to lowercase. The training set contains 3,989,547 pairs of reference contexts and citations, and the test set contains 1,021,685 citation relations.

Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).

4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$; the correct recommendations are then $R_g \cap R_r$. Recall is defined as

$$\mathrm{recall} = \frac{|R_g \cap R_r|}{|R_g|} \tag{19}$$

In our experiments, the number of recommended citations ranges from 1 to 10. Recall evaluation does not reveal the order of the recommended references. To address this problem, we select the following additional metrics.

For a query q, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q} \tag{20}$$

where Q is the testing set. MRR reveals the average ranking of the first correct recommendation.

For each citation placeholder, we search for the papers that may be referenced at that placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as an evaluation metric:

$$\mathrm{MAP}(d_1, \ldots, d_N) = \frac{\sum_i \big(R(d_i)/i\big) \sum_{j \le i} R(d_j)}{\sum_i R(d_i)} \tag{21}$$

where $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.

We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position i is calculated as

$$\mathrm{NDCG}(d_1, \ldots, d_N) = \sum_i \frac{2^{\mathrm{rel}(d_i)} - 1}{\ln(i + 1)} \tag{22}$$

where $\mathrm{rel}(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average co-cited probability [2] of $\langle d_i, d^* \rangle$ to weight the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average NDCG score over all testing documents.
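For reference, a minimal sketch of these metrics for one toy query; the NDCG here follows equation (22) as printed, without ideal-DCG normalization:

```python
import numpy as np

def recall(rec, truth):
    return len(set(rec) & set(truth)) / len(truth)             # equation (19)

def mrr(ranked_lists, truths):
    rr = []                                                    # equation (20)
    for rec, truth in zip(ranked_lists, truths):
        rank = next((i for i, d in enumerate(rec, 1) if d in truth), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

def average_precision(rec, truth):
    rel = [1 if d in truth else 0 for d in rec]                # equation (21)
    hits = np.cumsum(rel)
    return sum(r * h / i for i, (r, h) in enumerate(zip(rel, hits), 1)) / max(sum(rel), 1)

def ndcg(rec, rel_of):
    # Equation (22): sum of (2^rel - 1) / ln(i + 1) over ranked positions i.
    return sum((2 ** rel_of.get(d, 0) - 1) / np.log(i + 1) for i, d in enumerate(rec, 1))

rec, truth = ["p3", "p1", "p9"], {"p1", "p9"}
print(recall(rec, truth), mrr([rec], [truth]),
      average_precision(rec, truth), ndcg(rec, {"p1": 3, "p9": 2}))
```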

4.3. Baseline Comparison. We choose the following methods for comparison:

(i) Cite-PLSA-LDA (CP-LDA) [36]: We use the original implementation provided by the author. The number of topics is set to 60.

(ii) Restricted Boltzmann Machine (RBM-CS) [37]: We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.

(iii) Word2vec Model (W2V) [29]: We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to n = 300.

(iv) Neural Probabilistic Model (NPM) [4]: We follow the original implementation. The dimensions of the word and document representation vectors are set to n = 600. For negative sampling, we set the number of negative samples k = 10, where k is the number of noise words in the citation context. For noise contrast estimation, we set the number of noise samples k = 1000.

(v) Neural Citation Network (NCN) [7]: In NCN, the gradient clipping is 5, the dropout probability is 0.2, and the number of recurrent layers is 2. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.

Figures 8 and 9 show the performance of each method on the CiteSeer dataset. It is obvious that SVD-CNN leads the performance in most cases. More detailed analyses are given as follows.

First, we perform a comparison among CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN completely and significantly exceeds the other models in all metrics. The success of our model is ascribed to its use of content and to the decorrelation in our network. Due to the lack of citation context information, we find that W2V is obviously worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on citation context. However, the vector representations of citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.

Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN has no orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for the decoder, which is apparently not sufficient for learning good embeddings.

4.4. The Influence of Reference Pattern Interactional Features on Link Prediction. According to the chapter position of the citation context in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are split for training and testing at a ratio of 3:1. In Tables 1 and 2, we show the results on the abovementioned datasets.

From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on the unmixed datasets than on the mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of citation recommendation tasks.

Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.

To better explore why mixed datasets are more complex than unmixed datasets, Figure 10 shows the change in S(W) during the training process of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on the unmixed datasets while achieving poor performance on the mixed datasets. However, SVD-CNN achieves almost the same performance on the two types of datasets. This proves that the correlation arising from various reference patterns can significantly affect link prediction.

The reason why the change in S(W) is not large on the unmixed datasets is that the reference patterns of an unmixed dataset have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm already performs well on the unmixed datasets because their complexity is low.

Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.

4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with the following variants:

(1) Using the originally learned W.

(2) Replacing W with US.

(3) Replacing W with U.

(4) Replacing W with $UV^T$.

(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper-triangular matrix in the Q-R decomposition.

(6) Replacing W with $W_{PCA}$, where $W_{PCA}$ is the diagonal matrix extracted from the weight matrix W after dimension reduction by PCA.

After training converges, the different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, namely, the proportion of added time relative to the original time. As shown in Table 3, all types of decorrelation except $W \rightarrow US$ and $W \rightarrow W_{PCA}$ degrade the performance; however, the time cost of $W \rightarrow W_{PCA}$ is larger than that of $W \rightarrow US$.
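These replacements are easy to reproduce with standard factorizations; the sketch below (omitting the PCA variant) checks how each one changes the pairwise sample distance, which only $W \rightarrow US$ preserves exactly:

```python
import numpy as np

rng = np.random.default_rng(8)
W = rng.normal(size=(6, 6))
U, s, Vt = np.linalg.svd(W)
Q, R = np.linalg.qr(W)
variants = {
    "W -> US  ": U * s,
    "W -> U   ": U,
    "W -> UV^T": U @ Vt,
    "W -> QD  ": Q * np.diag(R),  # D: diagonal of the upper-triangular R factor
}
e = rng.normal(size=(2, 6))                       # two sample feature maps
print("W        ", np.linalg.norm(e[0] @ W - e[1] @ W))
for name, M in variants.items():                  # only US matches the W distance
    print(name, np.linalg.norm(e[0] @ M - e[1] @ M))
```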

4.6. Ablation Study. In our method, there are two essential parameters: a term sot, which denotes the number of SOT iterations, and a dimension parameter $d_0$. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing $d_0 = 300$. Since sot defines the loop count of orthogonal constraint training, it should be set as a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance.

Table 1: MRR metric on various datasets.

Model   | Introduction | Related | Main   | Introduction + Related | Introduction + Main | Related + Main
CNN     | 0.3312       | 0.3294  | 0.3478 | 0.2773                 | 0.2815              | 0.2978
SVD-CNN | 0.3995       | 0.4078  | 0.3989 | 0.3878                 | 0.3889              | 0.3845

[Figure 8: Comparison of recall with different methods (W2V, NPM, RBM, CP-LDA, NCN, SVD-CNN) on CiteSeer; recall (0.0-0.6) is plotted against the number of recommended citations (up to 100).]

[Figure 9: MRR, MAP, and nDCG scores for top-10 recommendations with different methods on CiteSeer.
Metric | CP-LDA | RBM    | W2V    | NPM    | NCN    | SVD-CNN
MRR    | 0.0916 | 0.0997 | 0.0662 | 0.1843 | 0.2667 | 0.3687
MAP    | 0.0912 | 0.0998 | 0.0663 | 0.1835 | 0.2418 | 0.3352
nDCG   | 0.1288 | 0.1356 | 0.1476 | 0.2566 | 0.2592 | 0.3448]

Table 2: MAP metric on various datasets.

Model   | Introduction | Related | Main   | Introduction + Related | Introduction + Main | Related + Main
CNN     | 0.3001       | 0.2909  | 0.3107 | 0.2572                 | 0.2601              | 0.2637
SVD-CNN | 0.3701       | 0.3655  | 0.3693 | 0.3498                 | 0.3511              | 0.3539

In this situation, the weight matrix in the FC layer is highly correlated, and S(W) has the lowest value. The recommendation performance then increases while adding sot, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

In our model, $d_0$ is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d_0$ at the same sot. When $d_0$ is small, the information content of the citation context is very small and produces worse performance. The recommendation performance increases to a maximum point until $d_0$ reaches 300.

[Figure 10: The change in S(W) during training (sot from 0 to 10) on the unmixed datasets (Introduction, Related, Main) and the mixed datasets (Introduction + Related, Introduction + Main, Related + Main).]

Table 3: The comparison of related methods in Step 1.

           | W    | W → US | W → U | W → UV^T | W → QD | W → W_PCA
Rank-1 (%) | 63.6 | 63.6   | 61.7  | 61.7     | 61.6   | 63.6
mAP (%)    | 39.0 | 39.0   | 37.1  | 37.1     | 37.3   | 39.0
T-cost (%) | 0    | 36.27  | 36.27 | 36.27    | 35.33  | 57.65

[Figure 11: The performance impact of sot on CiteSeer: MRR with logistic regression (MRR-LR) and SVM (MRR-SVM), and S(W) with SOT versus without SOT, for sot from 1 to 9.]

It should be noted that, although a larger $d_0$ performs better, it also significantly increases the training time. Therefore, we choose $d_0 = 300$.

[Figure 12: The performance impact of $d_0$ on CiteSeer.
d_0       | 100    | 200    | 300    | 400    | 500
MRR       | 0.3302 | 0.3512 | 0.3687 | 0.3701 | 0.3722
MAP       | 0.3003 | 0.3225 | 0.3352 | 0.3398 | 0.3409
nDCG      | 0.3101 | 0.3312 | 0.3448 | 0.3486 | 0.3499
Recall@10 | 0.5456 | 0.5689 | 0.5801 | 0.5842 | 0.5867]

5. Conclusion and Future Works

We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which essentially makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper; in the future, we will explore the performance of our model using the full text of papers.

Data Availability

Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at the relevant places within the text as reference [4].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).

References

[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.

[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.

[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.

[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.

[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.

[6] X. Ren, J. Liu, X. Yu et al., "Cluscite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.

[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.

[10] N. Meuschke, B. Gipp, and M. Lipinsk, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central," 2015, http://hdl.handle.net/2142/73680.

[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.

[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.

[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/pdf/1802.08301v1.pdf.

[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, Vol. 10772, Springer, Berlin, Germany, 2018.

[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.

[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.

[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.

[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.

[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.

[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, UK, July 2014.

[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in International Conference on Learning Representations, 2017.

[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.

[24] A. Brock, J. Donahue, K. Simonyan et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.

[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.

[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, pp. 15844–15869, 2018.

[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.

[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.

[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.

[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.

[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.

[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.

[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.

[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.

[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.

[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.

[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.

12 Computational Intelligence and Neuroscience

Page 2: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

neural network model these three citation types are usuallymapped into a matrix and can be seen as base vectors forinputs As shown in Figure 1 vectors in the mapping matrixlearned by traditional neural network models are not or-thogonal When a sample is mapped by w1

rarr w2rarr and w3

rarrapparently w1

rarr and w3rarr will dominate the output and con-

sequently create low discriminative ability A more satis-factory w2prime

rarr(yellow color) imposes orthogonality

To address the aforementioned problems we propose aneural network model with orthogonal regularization forcontext-aware citation recommendation Our model usesCNN to extract the semantic features for citation contextand candidate papers We then add the orthogonal con-straint based on SVD in our model to weaken the correlationof weight vectors in the FC layer which can learn goodinterpretable features for citation context and papers To thebest of our knowledge this is the first work that addresses thecontext-aware citation recommendation with the CNN andorthogonal constraint framework Experimental resultsshow that our model significantly outperforms otherbaseline methods

2 Related Work

21 Citation Recommendation A variety of citation rec-ommendation approaches have been proposed in the lit-erature including text similarity-based [9 10] topic model-based [11 12] probabilistic model-based [13] translationmodel-based [7] and collaborative filtering-based [14] Sunet al [15] proposed amethod for recommending appropriatepapers for academic reviewers by using the similarity-basedalgorithm 0eir method builds preference vectors for re-viewers based on published history information and cal-culates the similarity between the preference vector andcandidate document vector 0e literature with high simi-larity is recommended to corresponding reviewers Sha-parenko and Joachims [16] considered the relevance ofcitation context and the paper content and applied a lan-guage model to the recommendation task Strohman et al[17] showed that using text similarity alone was not ideal forrecommending citations because scholars tend to constructnew words to describe their own achievements while twoscholars who study the same topic may use different ex-pressions for the same concept and method To address thisproblem Strohman et al [17] regarded the document as anode in a directed graph to perform citation recommen-dations 0ey believe that the similarity measurement withreference information can reflect the reference situation of anode more authentically Livne et al [18] proposed a citationrecommendation method by coupling the enriched citationcontext of the literature and adopted various techniquesincluding machine learning when making recommenda-tions Some works addressed the language gap between citedpapers and citation contexts and attempted to use transla-tionmodels or distributed semantic representations Lu et al[19] assumed that the languages used in the citation contextsand in the cited papers were different and used a translationmodel to solve this problem He et al [3] combined alanguage model topic model and feature model to find the

appropriate citation context Huang et al [20] assumed thatthe appearance of cited papers was a particular language andrepresented the cited papers in unique IDs regarded as newldquowordsrdquo 0e probability of citing a paper given a citationcontext is directly estimated by using a translation modelTang et al [21] proposed a joint embedding model to learn alow-dimensional embedding space for both contexts andcitations

In recent years neural networks have shown betterperformance in many fields Some researchers haveattempted to recommend citations by using neural net-works Huang et al [4] learned a distributed word rep-resentation for citation context and associated documentembedding via a feedforward neural network and thenestimated the probability of citing a paper by a given ci-tation context Tan et al [5] proposed a neural networkmethod based on LSTM to solve quote recommended tasks0ey focused on the characteristics of quotes and trainedneural networks to bridge the language gap A neuralnetwork model learned the semantic representations ofarbitrary length texts from a large corpus

22 Orthogonal Constraint in Deep Learning One of thegreatest advantages of orthogonal matrices is that thenorm of the matrix is changed when it is multiplied by amatrix 0is property is useful in gradient back-propagation especially to deal with gradient explosionand gradient dissipation problems Orthogonal regula-rization is widely used in many fields Brock et al [22]used orthogonal regularization to improve the general-ization performance of image generation editor tasks byusing generative adversarial networks (GANs) [23] 0eyfurther expanded their work into BigGAN [24] 0e re-sults in their work showed that by applying orthogonalregularization the generator allows fine-tuning thetradeoff between fidelity and diversity of samples bytruncating hidden spaces which can make the modelachieve the best performance in the image synthesis ofclass conditions Another advantage of orthogonal ma-trices is that they benefit from deep representationlearning If the weight vectors of the full connection layerin the convolutional neural network are highly

w1w2

w3w2prime

Figure 1 Distribution of the weight vector of the reference type ingeometric space

2 Computational Intelligence and Neuroscience

correlated the individuals in each full-join descriptionwill also be highly correlated which will highly reduceretrieval performance Sun et al [25] proposed SVD-Netto show that guaranteeing the feature weight of the FClayer can increase the orthogonal constraint of the net-work and improve the accuracy Zheng et al [26] re-ported that regularization was an efficient method forimproving the generalization ability of deep CNN be-cause it makes it possible to train more complex modelswhile maintaining lower overfitting Zheng et al [26]proposed a method for optimizing the feature boundaryof a deep CNN through a two-stage training step to re-duce the overfitting problem However the mixed fea-tures learned from CNN potentially reduce therobustness of network models for identification orclassification To address this problem Wang et al [27]decomposed deep face features into two orthogonalcomponents to represent age-related and identity-relatedfeatures to learn the age-invariant deep face features Inthe above model age-invariant deep features can be ef-fectively obtained to improve AIFR performance Chenet al [28] proposed a group orthogonal convolutionalneural network (GoCNN) model based on the idea oflearning different groups of convolutional functions thatare ldquoorthogonalrdquo to those in other groups ie with nosignificant correlation among the produced featuresOptimizing orthogonality among convolutional func-tions reduces the redundancy and increases the diversitywithin the architecture Moreover it can also obtain asingle CNN model with sufficient inherent diversity suchthat the model learns more diverse representations andhas stronger generalization ability than vanilla CNNs

3. Proposed Method

3.1. Problem Formulation. Context-aware citation recommendation is defined as a matching task between a citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is a convolutional neural network with two inputs and orthogonal constraints, and it consists of the following main steps:

(1) We adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple-granularity semantic features.

(2) The multiple-granularity semantic features then have orthogonality imposed by an SVD-FC layer.

(3) We use fully connected layers to obtain the final vector representation. The logistic function or an SVM is used to obtain the recommendation result.

3.2. Network Structure

3.2.1. Input Layer. Word2vec [29] is used to embed the input of our model. Each word is represented as a d_0-dimensional precomputed vector, where d_0 = 300. As a result, each sentence is represented as a feature matrix with dimension d_0 × s. Through this layer, we can obtain the raw representations of the citation context c and a candidate document d.
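As an illustration of this step, the following minimal sketch builds the d_0 × s feature matrix from a toy lookup table; the `embeddings` dictionary and the whitespace tokenization are stand-ins, not the trained word2vec table used in the paper:

```python
import numpy as np

d0 = 300  # embedding dimension used in the paper

# Hypothetical precomputed word2vec table; in practice this would be
# loaded from trained word2vec vectors.
embeddings = {w: np.random.randn(d0) for w in ["citation", "context", "model"]}
unk = np.random.uniform(-0.01, 0.01, d0)  # shared vector for unknown words

def sentence_matrix(tokens):
    """Stack word vectors column-wise into a d0 x s feature matrix."""
    cols = [embeddings.get(t, unk) for t in tokens]
    return np.stack(cols, axis=1)  # shape (d0, s)

c = sentence_matrix("the citation context".split())
print(c.shape)  # (300, 3)
```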

We also calculate the weight of common words according to the inputs. Then we obtain the basic input feature TF−IDF(c, d) for our model, which is the product of TF(w_c, d) and IDF and reflects how important a word in citation context c is for a candidate document d in the corpus [30]; w_c is a word in citation context c. These two variables are calculated as follows:

\[ \mathrm{TF}(w_c, d) = \frac{\mathrm{count}(w_c, d)}{\mathrm{top}(w_\ast, d)}, \qquad \mathrm{IDF} = \log \frac{N}{\mathrm{docs}(w_c, D)} \tag{1} \]

where count(w_c, d) is the number of times the word w_c appears in document d, top(w_*, d) is the number of occurrences of the word w_* that appears most frequently in candidate document d, docs(w_c, D) is the number of documents containing the word w_c among all candidate citations D, and N is the total number of candidate citations.
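A small sketch of equation (1), under the assumption that documents are given as token lists; the max(1, ...) guard for words absent from every document is our addition:

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """TF(w_c, d): occurrences of the word divided by the count of the
    most frequent word in the candidate document."""
    counts = Counter(doc_tokens)
    return counts[word] / max(counts.values())

def idf(word, all_docs):
    """IDF: log of total candidate documents over documents containing
    the word (guarded against zero for unseen words)."""
    containing = sum(word in doc for doc in all_docs)
    return math.log(len(all_docs) / max(1, containing))

docs = [["svd", "cnn", "citation"], ["topic", "model", "citation"], ["svd", "net"]]
print(tf("svd", docs[0]) * idf("svd", docs))  # TF-IDF of "svd" for docs[0]
```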

3.2.2. Convolution Layer. The inputs of the convolution layer are the feature matrices of citation context c and document d. The process of this layer is demonstrated in Figure 3. We first pad the two inputs with zero vectors to the same length s = max(|c|, |d|). For every input, let v_1, v_2, ..., v_s be the words in a sentence. We define g_i ∈ R^{w d_0} (0 < i < s + w) as the concatenation of v_{i−w}, ..., v_i. Then this layer generates the feature P_i ∈ R^{d_1} for the phrase v_{i−w}, ..., v_i as follows:

\[ P_i = \tanh(W \cdot g_i + b) \tag{2} \]

where W ∈ R^{d_1 × w d_0} is a convolution kernel and b ∈ R^{d_1} is the bias.
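The following sketch mirrors equation (2) with plain numpy; the dimensions d1 = 50 and w = 3 are illustrative choices, not values reported by the paper:

```python
import numpy as np

d0, d1, w = 300, 50, 3  # embedding dim, number of kernels, window size

rng = np.random.default_rng(0)
W = rng.standard_normal((d1, w * d0)) * 0.01  # convolution kernel
b = np.zeros(d1)

def conv_layer(X):
    """X: d0 x s sentence matrix, zero-padded so every window is full.
    Returns a d1 x (s + w - 1) feature map, one column per phrase."""
    d0_, s = X.shape
    pad = np.zeros((d0_, w - 1))
    Xp = np.hstack([pad, X, pad])            # wide-convolution padding
    cols = []
    for i in range(s + w - 1):
        g = Xp[:, i:i + w].T.reshape(-1)     # concatenate w word vectors
        cols.append(np.tanh(W @ g + b))      # P_i = tanh(W . g_i + b)
    return np.stack(cols, axis=1)

F = conv_layer(rng.standard_normal((d0, 10)))
print(F.shape)  # (50, 12): an s-column input becomes s + w - 1 columns
```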

3.2.3. Average Pooling Layer. The pooling layer is usually used for feature compression. In our model, we choose average pooling because whole sentences or paragraphs can express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is "w-ap", which averages columns over each window of w consecutive columns. After the convolution layer, an s-column feature map has been converted into an (s + w − 1)-column feature map; by using "w-ap", the feature map is reduced back to s columns. This architecture facilitates the extraction of more useful abstract features.

The second one is "all-ap", which averages over all columns. As shown in Figure 5, "all-ap" generates one representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document.
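Both pooling operations reduce to column averages; a minimal numpy sketch, assuming the (s + w − 1)-column feature map produced by the convolution layer above:

```python
import numpy as np

def w_ap(F, w):
    """'w-ap': average over each window of w consecutive columns,
    shrinking an (s + w - 1)-column map back to s columns."""
    d1, n = F.shape
    s = n - w + 1
    return np.stack([F[:, i:i + w].mean(axis=1) for i in range(s)], axis=1)

def all_ap(F):
    """'all-ap': average all columns into one representation vector."""
    return F.mean(axis=1)

F = np.random.randn(50, 12)   # feature map from the convolution layer
print(w_ap(F, 3).shape)       # (50, 10)
print(all_ap(F).shape)        # (50,)
```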

Now we can obtain features of the citation context and independent features of the cited document. The next step is to obtain the semantic relationships between the citation context and the candidate paper. We use cosine similarity to measure these semantic relations:


\[ \mathrm{sim}_j = \frac{\sum_{i=0}^{d_j} C_{ji} \times D_{ji}}{\sqrt{\sum_{i=0}^{d_j} C_{ji}^2 \times \sum_{i=0}^{d_j} D_{ji}^2}}, \qquad j \in [1, 10] \tag{3} \]

Figure 2: An overview of our model. The citation context and the document are embedded by word2vec, passed through stacked convolution and "w-ap" pooling blocks, and the "all-ap" features of the 1st through 10th blocks are spliced with the basic feature; the result feeds the SVD-FC layer (W replaced by US via SVD), FC layers, and a logistic/SVM output.

Figure 3: Convolution extraction generates phrases (s columns to s + w − 1 columns).

Figure 4: "w-ap" structure (s + w − 1 columns back to s columns).

Figure 5: "all-ap" structure.


where C_j and D_j are the distributed representations of the citation context and the candidate document after the j-th "all-ap" layer, respectively. A total of ten "all-ap" layers are used in our model; therefore, j belongs to [1, 10]. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all sim_j and the basic features. It is then fed into the SVD-FC layer.
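A sketch of equation (3), with toy 50-dimensional "all-ap" vectors standing in for the real representations at each granularity:

```python
import numpy as np

def cosine_sim(c_vec, d_vec):
    """Equation (3): cosine similarity between the 'all-ap' vectors of
    the citation context and the candidate document at one granularity."""
    num = float(c_vec @ d_vec)
    den = np.sqrt(float(c_vec @ c_vec) * float(d_vec @ d_vec))
    return num / den

# one similarity score per 'all-ap' layer, j = 1..10
C = [np.random.randn(50) for _ in range(10)]   # context representations
D = [np.random.randn(50) for _ in range(10)]   # document representations
sims = [cosine_sim(Cj, Dj) for Cj, Dj in zip(C, D)]
print(len(sims))  # 10 multi-granularity scores, spliced with basic features
```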

In most cases, we find that using all outputs of the pooling layers as the input of the SVD-FC layer improves performance. The reason is that features from different layers represent different levels of semantics; neglecting any layer would obviously cause information loss.

Next, we use the SVD-FC layer to learn nonlinear combination features of citation relationships. This layer forces the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer can also reduce the negative impact of excessive parameters.

3.2.4. SVD-FC Layer. In this layer, we use SVD to factorize the weight matrix W (W = USV^T) and replace it with US. Our experimental results show that this replacement operation reduces the negative impact on the sample space.

The Euclidean distance between samples can be used to measure whether their feature expression changes in a sample space. Denoting e_m and e_n as the feature maps of two different samples, we obtain two different outputs of the fully connected operation by using the weight matrix W or US as follows:

\[ p = e \times W \tag{4} \]

\[ q = e \times US \tag{5} \]

As seen in the above equations, q is the orthogonalized output, while p is unorthogonalized. Then we obtain the following theorem.

Theorem 1. p and q in equations (4) and (5) generate the same Euclidean distance for samples e_m and e_n.

Proof. The Euclidean distance L between p_m and p_n is calculated as follows:

\[ L = \|\vec{p}_m - \vec{p}_n\|_2 = \sqrt{(\vec{e}_m - \vec{e}_n)^T W W^T (\vec{e}_m - \vec{e}_n)} = \sqrt{(\vec{e}_m - \vec{e}_n)^T U S V^T V S^T U^T (\vec{e}_m - \vec{e}_n)} \tag{6} \]

Since V is an orthogonal matrix, equation (6) is equivalent to

\[ L = \sqrt{(\vec{e}_m - \vec{e}_n)^T U S S^T U^T (\vec{e}_m - \vec{e}_n)} = \sqrt{(\vec{q}_m - \vec{q}_n)^T (\vec{q}_m - \vec{q}_n)} = \|\vec{q}_m - \vec{q}_n\|_2 \tag{7} \]

It can be seen that \(\|\vec{p}_m - \vec{p}_n\|_2 = \|\vec{q}_m - \vec{q}_n\|_2\).

It should be noted that there are no negative impacts and no changes in discrimination ability for the entire sample space when replacing the weights. As shown in Figure 7, we use the SVD of weight matrix W to map the feature map to an orthogonal linear space.
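Theorem 1 is also easy to verify numerically; the sketch below checks on random data that distances under W and under US coincide (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
e_m, e_n = rng.standard_normal(64), rng.standard_normal(64)
W = rng.standard_normal((64, 32))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
US = U * s                       # equivalent to U @ np.diag(s)

p_dist = np.linalg.norm(e_m @ W - e_n @ W)    # distance under W
q_dist = np.linalg.norm(e_m @ US - e_n @ US)  # distance under US

print(np.isclose(p_dist, q_dist))  # True: the distance is preserved
```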

3.2.5. Output Layer. The citation recommendation problem is regarded as a classification task in our model. In this layer, logistic regression and SVM can deal with binary classification tasks and predict the final citation relationship.

3.3. Training Details

3.3.1. Embeddings. In our model, words are initialized by 300-dimensional word2vec embeddings and are not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [−0.01, 0.01]. We employ AdaGrad [31] and L2 regularization. We introduce adversarial training [32] for embeddings to make the model more robust. The process is achieved by replacing the word vector v produced by the word2vec embedding with a disturbed word vector v*:

\[ v^\ast = v + r_{adv} \tag{8} \]

where r_adv is the worst-case perturbation of the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function log p(y|x; θ̂) around x, where θ̂ is a constant set to the current parameters of our model; it participates only in the calculation of r_adv and is not updated by backpropagation. With the linear approximation and an L2-norm constraint, the adversarial perturbation is

Figure 6: Generating the feature map (the "all-ap" features sim_i from C_i and D_i are spliced with the basic feature).

Figure 7: SVD-FC layer (mapping the input feature to an orthogonal output feature).


\[ r_{adv} = -\epsilon \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_x \log p(y|x; \hat{\theta}) \tag{9} \]

This perturbation can be easily computed by using backpropagation in neural networks.
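A toy sketch of equations (8) and (9); the closed-form gradient of a one-weight logistic model stands in for backpropagation in the full network, and eps is an illustrative perturbation norm:

```python
import numpy as np

def adv_perturbation(v, y, w, eps=0.02):
    """Fast-gradient perturbation of an embedding v (equation (9)):
    r_adv = -eps * g / ||g||_2, with g = grad_v log p(y | v).
    Here p is a toy logistic model with weight w, so the gradient is
    available in closed form; a real model would use backprop."""
    p = 1.0 / (1.0 + np.exp(-(w @ v)))         # p(y=1 | v)
    g = (y - p) * w                             # grad of log-likelihood wrt v
    r_adv = -eps * g / np.linalg.norm(g)
    return v + r_adv                            # disturbed embedding v*

rng = np.random.default_rng(2)
v, w = rng.standard_normal(300), rng.standard_normal(300)
v_star = adv_perturbation(v, y=1, w=w)
print(np.linalg.norm(v_star - v))  # ~0.02, the perturbation budget
```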

3.3.2. Layerwise Training. In our training steps, we define a conv-pooling block b_t (t ≥ 2), which consists of a convolution layer and a pooling layer. Our network model is then assembled from the initialization block b_1, which is initialized using word2vec, and (n − 1) conv-pooling blocks.

First, we train the conv-pooling block b_2 after b_1 is trained. On this basis, the next conv-pooling block b_3 is created while keeping the previous block fixed. We repeat this procedure until all (n − 1) conv-pooling blocks are trained.

Second, the following semiorthogonal training procedure is used to train the whole network.

Semiorthogonal training (SOT) is crucial for training SVD-CNN and consists of the following three steps:

Step 1. Decompose the weight matrix by SVD, i.e., W = USV^T, where W is the weight matrix of the linear layer, U is the left-unitary matrix, S is the singular value matrix, and V is the right-unitary matrix. After that, we replace W with US. Next, we take all eigenvectors of US(US)^T as weight vectors.

Step 2. The backbone model is fine-tuned with the SVD-FC layer fixed.

Step 3. The model keeps fine-tuning with the SVD-FC layer unfixed.

Step 1 generates orthogonal weights, but prediction performance cannot be guaranteed. The reason is that excessive orthogonality will excessively punish synonymous sentences, which is apparently inappropriate. Therefore, we introduce Steps 2 and 3 to solve this problem.
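A minimal sketch of Step 1 (the fine-tuning in Steps 2 and 3 is ordinary training and is not shown); it also checks that the replacement weight US has mutually orthogonal columns:

```python
import numpy as np

def sot_step1(W):
    """SOT Step 1: decompose W = U S V^T and replace W with US so the
    weight vectors of the SVD-FC layer become mutually orthogonal."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U * s  # US; its Gram matrix (US)^T (US) = S^2 is diagonal

W = np.random.randn(128, 64)
W_new = sot_step1(W)
gram = W_new.T @ W_new
off_diag = gram - np.diag(np.diag(gram))
print(np.allclose(off_diag, 0))  # True: off-diagonal correlations vanish
```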

The inputs of the SVD-FC layer are defined as Y = (y_1, y_2, ..., y_m)^T, the outputs as O = (o_1, o_2, ..., o_m)^T, the weight matrix as W = (w_1, w_2, ..., w_m)^T, and the expected outputs as A = (a_1, a_2, ..., a_m)^T. The error function is defined as

\[ E = \frac{1}{2} \sum_{k=1}^{l} (a_k - o_k)^2 \tag{10} \]

where o_k = f(\sum_{j=0}^{m} w_{kj} y_j), k = 1, 2, ..., l. Differentiating E with respect to o_k gives

\[ \frac{\partial E}{\partial o_k} = -(a_k - o_k) \tag{11} \]

We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is as follows:

\[ \Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} \tag{12} \]

We define an error signal \(\delta_k^o = -\partial E / \partial \mathrm{net}_k\); equation (12) is then equivalent to

\[ \Delta w_{kj} = -\eta \frac{\partial E}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = \eta \delta_k^o \frac{\partial \mathrm{net}_k}{\partial w_{kj}} \tag{13} \]

According to equation (11), \(\delta_k^o\) is equivalent to

\[ \delta_k^o = -\frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial \mathrm{net}_k} = -\frac{\partial E}{\partial o_k} f'(\mathrm{net}_k) = -\frac{\partial E}{\partial o_k} o_k' = (a_k - o_k) o_k' \tag{14} \]

We use the sigmoid f(x) = 1/(1 + e^{−x}) as the nonlinear function, so equation (13) is equivalent to

\[ \Delta w_{kj} = \eta \delta_k^o y_j = \eta (a_k - o_k) o_k (1 - o_k) y_j \tag{15} \]

In Step 1, the weight matrix W is decomposed by SVD and replaced with US, where U = (q_1, q_2, ..., q_m)^T and S = diag(λ_1, λ_2, ..., λ_m). Since a_k − o_k is given, we define Loss = a_k − o_k. As a result, equation (15) is equivalent to

\[ \Delta w_{kj} = \eta \left[ \mathrm{Loss} \cdot o_k - \mathrm{sigmoid}\Big( y_j \sum_i q_i \lambda_i + B \Big)^2 \right] y_j \tag{16} \]

Since the vectors q_i of the left-unitary matrix U satisfy q_i · q_j = 0 for i ≠ j, the model operation is not affected by nonorthogonal eigenvectors q_i; this is why Step 1 excessively punishes synonymous sentences. However, orthogonality has a positive effect on Δw_kj in Step 2.

The purpose of SVD is to maintain the orthogonality of the weight vectors in geometric space. When the weight vectors are conditioned by orthogonal regularization, the relevancy between weight vectors decreases. We use the following method in Step 3 to measure relevance:

\[ H = W^T W = \begin{bmatrix} \vec{w}_1^T \vec{w}_1 & \cdots & \vec{w}_1^T \vec{w}_k \\ \vdots & \ddots & \vdots \\ \vec{w}_k^T \vec{w}_1 & \cdots & \vec{w}_k^T \vec{w}_k \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{1k} \\ \vdots & \ddots & \vdots \\ h_{k1} & \cdots & h_{kk} \end{bmatrix} \tag{17} \]

where W is a weight matrix that contains k weight vectors w_i (i = 1, ..., k), and h_ij (i, j = 1, ..., k) is the dot product of w_i and w_j. Let us define S(W) as the correlation measurement of all column vectors in W:

\[ S(W) = \frac{\sum_{i=1}^{k} h_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} |h_{ij}|} \tag{18} \]

When W is an orthogonal matrix, the value of S(W) is 1; when the weight vectors are fully correlated, S(W) obtains its minimum value 1/k. Therefore, the value of S(W) falls into [1/k, 1]. As a result, when S(W) is close to 1/k, the weight matrix has high relevance.
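A direct implementation of equation (18), with two toy weight matrices illustrating the extremes of the [1/k, 1] range:

```python
import numpy as np

def S(W):
    """Equation (18): trace of H = W^T W divided by the sum of the
    absolute values of all its entries; equals 1 for orthogonal columns."""
    H = W.T @ W
    return np.trace(H) / np.abs(H).sum()

k = 8
W_orth = np.linalg.qr(np.random.randn(k, k))[0]          # orthogonal columns
W_corr = np.ones((k, k)) + 0.01 * np.random.randn(k, k)  # highly correlated
print(S(W_orth))  # ~1.0
print(S(W_corr))  # close to 1/k = 0.125
```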

3.4. Complexity Analysis. Assume that the training sample size is |C|, the average number of words in each citation context is |c|, C_l is the number of kernels in the l-th layer, and w is the size of the sliding window. For one convolution layer, the training complexity is O(C_{l−1} · C_l · w · (s − w + 1)). The training complexity of one w-ap layer is O(C_l^2 · w · s), and the training complexity of one all-ap layer is O(C_l^2 · (s − w + 1)). As shown by C. F. Van Loan [12], computing the eigenvalues for the SVD decomposition of a matrix of size K takes O(K) with the Jacobi method. Assume that the size of the weight matrix in the SVD-FC layer is K and the number of input channels is C_in; the computational cost of the SVD-FC layer is then O(2K^2 · C_in + K).

4. Experiment

4.1. Dataset. We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and abstracts of cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after it. Within each paper in the corpus, the 50 words before and 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural styles of named entities, no stemming is done, but all words are converted to lowercase. The training set contains 3,989,547 pairs of citation contexts and citations, and the test set contains 1,021,685 citation relations.

Following common practice in information retrieval (IR), we employ the following four metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).

4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth R_g. Assume that the set of recommended citations is R_r; then the correct recommendations are R_g ∩ R_r. Recall is defined as

\[ \mathrm{recall} = \frac{|R_g \cap R_r|}{|R_g|} \tag{19} \]

In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references; to address this, we select the following two additional metrics.

For a query q, let rank_q be the rank of the first correct recommendation within the list. MRR [35] is defined as

\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q} \tag{20} \]

where Q is the testing set. MRR reveals the average rank of the first correct recommendation.

For each citation placeholder, we search for the papers that may be referenced at this placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as an evaluation metric:

\[ \mathrm{MAP}(d_1, \ldots, d_N) = \frac{\sum_i (R(d_i)/i) \sum_{j \le i} R(d_j)}{\sum_i R(d_i)} \tag{21} \]

where R(d_i) is a binary function indicating whether document d_i is relevant. For our problem, the papers cited at the citation placeholder are considered the relevant documents.

We use normalized discounted cumulative gain (nDCG) to measure the ranked recommendation list. The nDCG value of a ranking list at position i is calculated as

\[ \mathrm{NDCG}(d_1, \ldots, d_N) = \sum_i \frac{2^{\mathrm{rel}(d_i)} - 1}{\ln(i + 1)} \tag{22} \]

where rel(d_i) is the 4-scale relevance of document d_i in the ranked list. We use the average cocited probability [2] of ⟨d_i, d*⟩ to weigh the citation relevance score of d_i to d* (an original citation of the query). We report the average nDCG score over all testing documents.
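For concreteness, the four metrics of equations (19)–(22) can be implemented in a few lines; the toy query below is illustrative only:

```python
import math

def recall(rec, gt):
    """Equation (19): |R_g intersect R_r| / |R_g|."""
    return len(set(rec) & set(gt)) / len(gt)

def mrr(ranked_lists, gts):
    """Equation (20): mean reciprocal rank of the first correct hit."""
    total = 0.0
    for rec, gt in zip(ranked_lists, gts):
        rank = next((i for i, d in enumerate(rec, 1) if d in gt), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_lists)

def average_precision(rec, gt):
    """Equation (21): precision accumulated at each relevant position,
    normalized by the number of relevant documents retrieved."""
    hits, score = 0, 0.0
    for i, d in enumerate(rec, 1):
        if d in gt:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def ndcg(rec, rel):
    """Equation (22): rel maps a document to its graded relevance."""
    return sum((2 ** rel.get(d, 0) - 1) / math.log(i + 1)
               for i, d in enumerate(rec, 1))

recs, gts = [["p1", "p3", "p2"]], [{"p2", "p3"}]
print(recall(recs[0], gts[0]), mrr(recs, gts), average_precision(recs[0], gts[0]))
```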

4.3. Baseline Comparison. We choose the following methods for comparison:

(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the authors. The number of topics is set to 60.

(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the authors' suggestion. We set the hidden layer size to 600.

(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to n = 300.

(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to n = 600. For negative sampling, we set the number of negative samples k = 10, where k is the number of noise words in the citation context. For noise contrastive estimation, we set the number of noise samples k = 1000.

(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and there are 2 recurrent layers. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.

Figures 8 and 9 show the performance of each method on the CiteSeer dataset. It is obvious that SVD-CNN leads in most cases. More detailed analyses are given as follows.

First, we compare CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN clearly and significantly exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation modeling of our network. Due to the lack of citation context information, W2V is obviously worse than the other methods in all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.

Second, we compare NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without constraints. NPM recommends citations based on trained distributed representations. NCN further enhances performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for its decoder, which is apparently not sufficient for learning good embeddings.

4.4. The Influence of Reference Pattern Interactional Features on Link Prediction. According to the chapter positions of citation contexts in the articles, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of citation contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are split at a ratio of 3:1 for training and testing. Tables 1 and 2 show the results on the abovementioned datasets.

From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on unmixed datasets than on mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of citation recommendation.

Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.

To better explore why mixed datasets are more complex than unmixed datasets, Figure 10 shows the change in S(W) during the training of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on unmixed datasets while achieving poor performance on mixed datasets; however, SVD-CNN achieves almost the same performance on the two types of datasets. This proves that the correlation among various reference patterns can significantly affect link prediction.

The reason why the change in S(W) is small on the unmixed datasets is that the reference patterns within an unmixed dataset have similar features belonging to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. Nevertheless, a citation recommendation algorithm can perform well on the unmixed datasets because their complexity is low.

Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.

4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix; however, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with the following variants:

(1) Using the originally learned W

(2) Replacing W with US

(3) Replacing W with U

(4) Replacing W with UV^T

(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper triangular matrix in QR decomposition

(6) Replacing W with W_PCA, where W_PCA is the diagonal matrix extracted from the weight matrix W after dimension reduction by PCA

After training converges, the different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, i.e., the proportion of the added time to the original time. As shown in Table 3, all other types of decorrelation degrade the performance except W⟶US and W⟶W_PCA; however, the time cost of W⟶W_PCA is higher than that of W⟶US.

4.6. Ablation Study. In our method, there are two essential parameters: a term sot, which is the number of SOT iterations, and a biased parameter d_0. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing d_0 = 300. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance.

Table 1: MRR metric on various datasets.

Model     Introduction  Related  Main    Introduction+related  Introduction+main  Related+main
CNN       0.3312        0.3294   0.3478  0.2773                0.2815             0.2978
SVD-CNN   0.3995        0.4078   0.3989  0.3878                0.3889             0.3845

Figure 8: Comparison of recall with different methods on CiteSeer (recall versus the number of recommended citations, for W2V, NPM, RBM, CP-LDA, NCN, and SVD-CNN).

Figure 9: Comparison of MRR, MAP, and nDCG scores for the top 10 recommendations with different methods (CP-LDA, RBM, W2V, NPM, NCN, and SVD-CNN) on CiteSeer.

Table 2: MAP metric on various datasets.

Model     Introduction  Related  Main    Introduction+related  Introduction+main  Related+main
CNN       0.3001        0.2909   0.3107  0.2572                0.2601             0.2637
SVD-CNN   0.3701        0.3655   0.3693  0.3498                0.3511             0.3539


In this situation, the weight matrix in the FC layer is highly correlated and S(W) has its lowest value. The recommendation performance then increases as sot is added, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

In our model, d_0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d_0 under the same sot. When d_0 is small, the information content of the citation context is very small, which produces worse performance. The recommendation performance increases to a maximum as d_0 reaches 300.

Figure 10: The change in S(W) during training on unmixed datasets (introduction, related, main) and mixed datasets (introduction+related, introduction+main, related+main).

Table 3: Comparison of the related methods in Step 1.

          W      W⟶US   W⟶U    W⟶UV^T  W⟶QD   W⟶W_PCA
Rank-1    63.6   63.6    61.7    61.7     61.6    63.6
mAP       39.0   39.0    37.1    37.1     37.3    39.0
T-cost    0      36.27   36.27   36.27    35.33   57.65

Figure 11: The performance impact of sot on CiteSeer (MRR with logistic regression and SVM outputs, and S(W) with and without SOT, for sot from 1 to 9).


It should be noted that, although a larger d_0 is better, it also significantly increases the training time; therefore, we choose d_0 = 300.

Figure 12: The performance impact of d_0 on CiteSeer. Vector dimension of the input layer: 100, 200, 300, 400, 500; MRR: 0.3302, 0.3512, 0.3687, 0.3701, 0.3722; MAP: 0.3003, 0.3225, 0.3352, 0.3398, 0.3409; nDCG: 0.3101, 0.3312, 0.3448, 0.3486, 0.3499; Recall@10: 0.5456, 0.5689, 0.5801, 0.5842, 0.5867.

5. Conclusion and Future Works

We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which essentially makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper; in the future, we will explore the performance of our model using the full text of papers.

Data Availability

Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at the relevant places within the text as reference [4].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).

References

[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.

[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.

[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.

[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.

[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.

[6] X. Ren, J. Liu, X. Yu et al., "ClusCite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.

[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.

[10] N. Meuschke, B. Gipp, and M. Lipinski, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed Central," 2015, http://hdl.handle.net/2142/73680.


[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.

[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.

[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/pdf/1802.08301v1.pdf.

[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.

[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.

[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.

[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.

[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.

[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.

[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in Proceedings of the International Conference on Learning Representations, 2017.

[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.

[24] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.

[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.

[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, pp. 15844–15869, 2018.

[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.

[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.

[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.

[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.

[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.

[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.

[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.

[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.

[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of TREC, pp. 77–82, Gaithersburg, MD, USA, 2000.

[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.

[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.

12 Computational Intelligence and Neuroscience

Page 3: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

correlated the individuals in each full-join descriptionwill also be highly correlated which will highly reduceretrieval performance Sun et al [25] proposed SVD-Netto show that guaranteeing the feature weight of the FClayer can increase the orthogonal constraint of the net-work and improve the accuracy Zheng et al [26] re-ported that regularization was an efficient method forimproving the generalization ability of deep CNN be-cause it makes it possible to train more complex modelswhile maintaining lower overfitting Zheng et al [26]proposed a method for optimizing the feature boundaryof a deep CNN through a two-stage training step to re-duce the overfitting problem However the mixed fea-tures learned from CNN potentially reduce therobustness of network models for identification orclassification To address this problem Wang et al [27]decomposed deep face features into two orthogonalcomponents to represent age-related and identity-relatedfeatures to learn the age-invariant deep face features Inthe above model age-invariant deep features can be ef-fectively obtained to improve AIFR performance Chenet al [28] proposed a group orthogonal convolutionalneural network (GoCNN) model based on the idea oflearning different groups of convolutional functions thatare ldquoorthogonalrdquo to those in other groups ie with nosignificant correlation among the produced featuresOptimizing orthogonality among convolutional func-tions reduces the redundancy and increases the diversitywithin the architecture Moreover it can also obtain asingle CNN model with sufficient inherent diversity suchthat the model learns more diverse representations andhas stronger generalization ability than vanilla CNNs

3 Proposed Method

31 Problem Formulation 0e context-aware citation rec-ommendation is defined as the matching task between citationcontext and candidate papers 0e main architecture of ourmodel is shown in Figure 2 Our model is actually a con-volutional neural network with two inputs and orthogonalconstraints Our model consists of the following main steps

(1) We adopt word2vec to obtain the raw input vectorsand then use CNNs to extract multiple granularitysemantic features

(2) 0e multiple granularity semantic feature is thenimposed orthogonally by an SVD-FC layer

(3) We use fully connected layers to obtain the finalvector representation 0e logistic function or SVMis used to obtain the recommendation result

32 Network Structure

321 Input Layer Word2vec [29] is used to embed the inputof our model Each word is represented as a d0 dimensionalprecomputed vector where d0 300 As a result each sentenceis represented as a feature matrix with dimension d0 times s0rough this layer we can obtain the raw representation ofcitation context c and candidate document d

We also calculate the weight of common wordsaccording to the inputs 0en we can obtain the basic inputfeatures TF minus IDF(c d) for our model which is the productof TF(wc d) and IDF to reflect how important a word incitation context c is for a candidate document d in the corpus[30] wc is a word in citation context c 0ese two variablesare calculated as follows

TF wc d( 1113857 count wc d( 1113857

top wlowast d( 1113857

IDF logN

docs wc D( 1113857

(1)

where count(wc d) is the number of words wc that appear indocument d top(wlowast d) is the occurrence number of theword wlowast that appears most frequently in this candidatedocument d docs(wc D) is the number of documentscontaining the word wc in all candidate citations D N is thetotal number of candidate citations

322 Convolution Layer 0e inputs of the convolutionlayer are the feature matrix of citation context c and doc-ument d 0e process of this layer is demonstrated in Fig-ure 3 We first pad the two inputs to have the same lengths max(c d) by zero vectors For every input letv1 v2 vs be the words in a sentenceWe define gi isin Rwd0 0lt ilt s + w minus 1 as the concatenation of viminusw vi 0enthis layer generates the feature Pi isin Rd1 for the phrasesviminusw vi as follows

Pi tanh W middot gi + b( 1113857 (2)

whereW isin Rd1timeswd0 is a convolution kernel and b isin Rd1 is thebias

323 Average Pooling Layer 0e pooling layer is usuallyused for feature compression In our model we chooseaverage pooling 0e reason is that whole sentences orparagraphs can express more meaningful semantics Asshown in Figure 4 we design two pooling layers 0e firstone is ldquow-aprdquo which is the column average for thewindow of w continuous columns After the convolutionlayer an s column feature map is converted into a news + w minus 1 column feature map By using ldquow-aprdquo the newfeature map is recovered into the s column 0is archi-tecture facilitates the extraction of more useful abstractfeatures

0e second one is ldquoall-aprdquo which normalizes all col-umns As shown in Figure 5 ldquoall-aprdquo generates a repre-sentation vector for each feature map 0e generated featurecombines the information of the whole citation context orcited document

Now we can obtain the features of citation context andindependent features of the cited document 0e next step isto obtain the semantic relationships between the citationcontext and the candidate paper We use cosine similarity tomeasure the semantic relations

Computational Intelligence and Neuroscience 3

simj 1113936

dj

i0 Cji times Dji1113872 1113873

1113936dj

i0 Cji1113872 11138732

times 1113936dj

i0 Dji1113872 11138732

1113969 (j isin [1 10]) (3)

Citation context Document

SVD-FC

Word2vet Word2vet

Convolution

W-ap W-ap

FC

FC

LogisticsSVM

W-ap W-ap

Convolution

Convolution Convolution10 th

1st

USSVDw

Splice

All-ap All-ap

All-ap All-ap

Based-feature

All-ap-feature

Figure 2 An overview of our model

s + w minus 1ws

Figure 3 Convolution extraction generates phrases

ss + w minus 1

Figure 4 ldquoW-aprdquo structure

Figure 5 ldquoAll-aprdquo structure

4 Computational Intelligence and Neuroscience

where Cj and Dj are the distributed representation of ci-tation context and candidate document after the j-th ldquoall-aprdquo layer respectively A total of ten ldquoall-aprdquo layers arecarried out in our model 0erefore j belongs to [1 10] 0ebenefit is that we can obtain the semantic relation betweenthe citation context and the cited document with multiplegranularities As shown in Figure 6 the final output featureconsists of all simj and basic features 0en it is fed into theSVD-FC layer

In most cases we find that if we use all outputs of poollayers as the input of the SVD-FC layer the performance willbe improved0e reason is that features from different layersrepresent the different levels of semantics Neglecting anylayers will obviously cause information loss problems

Next we use the SVD-FC layer to learn the nonlinearcombination features of citation relationships0is layer canforce vectors in the feature map independent and orthogonalto each other 0e added SVD-FC layer can also reduce thenegative impact of excessive parameters

324 SVD-FC Layer In this layer we use SVD to factorizethe weight matrix W (W USVT) and replace it with USOur experimental results show that replacing operations canreduce the negative impact on the sample space

0e Euclidean distance between samples can be used tomeasure whether their feature expression changes in asample space Denoting em and en as the feature maps of twodifferent samples we can obtain two different outputs of thefull connection operation by using the weight matrix W orUS as follows

p e times W (4)

q e times US (5)

As seen in the above equations q is orthogonalizedoutput while p is unorthogonalized0en we can obtain thefollowing theorem

Theorem 1 p and q in equations (4) and (5) will generatethe same Euclidean distance for samples em and en

Proof 0e Euclidean distance L between pm and pn iscalculated as follows

L pm

rarrminus pn

rarr2

emrarr

minus enrarr

( 1113857TWW

Temrarr

minus enrarr

( 1113857

1113969

emrarr

minus enrarr

( 1113857TUSVV

TS

TU

Temrarr

minus enrarr

( 1113857

1113969(6)

Since V is an orthogonal matrix equation (6) isequivalent to

L

emrarr

minus enrarr

( 1113857TUSS

TU

Temrarr

minus enrarr

( 1113857

1113969

qmrarr

minus qnrarr

( 1113857T

qmrarr

minus qnrarr

( 1113857

1113969

qmrarr

minus qnrarr

2

(7)

It can be seen that pm

rarrminus pn

rarr2 qm

rarrminus qn

rarr2

It should be noted that there are no negative impacts andno changes in discrimination ability for the entire samplespace when replacing the weight As shown in Figure 7 weuse SVD of weight matrix W to map the feature map to anorthogonal linear space

325 Output Layer 0e citation recommendation problemis regarded as a classification task in our model In this layerlogistics and SVM can deal with binary classification tasksand predict the final citation relationship

33 Training Details

331 Embeddings In our model words are initialized by300-dimensional word2vec embeddings and will notchange during training A single randomly initializedembedding is created for all unknown words by uniformsampling from[minus001 001] We employ AdaGrad [31] andL2 regularization We introduce adversarial training [32]for embeddings to make the model more robust 0eprocess is achieved by replacing the word vector v afterword2vec embeddings using word vector with disturbingvlowast

vlowast

v times radv (8)

where radv is the worst case of perturbation on the wordvector Goodfellow et al [33] approximated this value bylinearizing the loss function logp(y|x 1113954θ) around x where1113954θ is a constant set to the current parameters of our modeland it only participates in the calculation process of radvwithout a backpropagation algorithm With the linearapproximation and L2 norm constraint the adversarialperturbation is

All-ap-feature

simi

Ci Di

Basic-feature

Figure 6 Generating the feature map

SVD-FC layer input feature

SVD-FC layer output featureSVD-FC layer

Figure 7 SVD-FC layer

Computational Intelligence and Neuroscience 5

radv minusising

g2 whereg nablaxlogp(y|x 1113954θ) (9)

0is perturbation can be easily computed by usingbackpropagation in neural networks

332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks

First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained

Second the following semiorthogonal training proce-dure is used to train the whole network

Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps

Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer

Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem

0e inputs of SVD-FC are defined as Y

Y (y1 y2 ym)T 0e outputs are defined as O

O (o1 o2 om)T 0e weight matrix is defined asW (w1 w2 wm)T 0e expected outputs are defined asA (a1 a2 am)T 0e error function is defined as

E 12

1113944

l

k1ak minus ok( 1113857

2 (10)

where ok f(1113936mj0 wkjyj) k 1 2 l 0en E with re-

spect to ok is derived and the outcome is

zE

zok

minus ak minus ok( 1113857 (11)

We utilize the gradient descent strategy to find thegradient of the error with respect to weights 0e iterativeupdate of weights is as follows

Δwkj minusηzE

zwkj

(12)

We define an error signal δok zEz netk equation (12) is

equivalent to

Δwkj minusηzE

z netk

z netkzwkj

minusηδok

z netkzwkj

(13)

According to equation (11) δok zEz netk is equivalent

to

δok minus

zE

zok

zok

z netk minus

zE

zok

fprime netk( 1113857

zE

zok

okprime minus dk minus ok( 1113857ok

prime

(14)

We use the sigmoid f(x) 1(1 + ex) as the nonlinearfunction so equation (13) is equivalent to

Δwkj minusηδokyj η dk minus ok( 1113857ok 1 minus ok( 1113857yj (15)

In Step 1 the weight matrix W is decomposed by SVDand replaced with US U (q1 q2 qm)T andS diag(λ1 λ2 λm) Since dk minus ok is given we definethat Loss dk minus ok As a result equation (15) is equivalent to

Δwkj η Loss middot ok minus sigmoid yj 1113944 qiλi + B1113872 11138732

1113876 1113877yj (16)

qi middot qj 0 ine j are in the left-unitary matrix U so themodel operation is not affected by the nonorthogonal ei-genvectors qi 0is is the reason for excessively punishingsynonymous sentences in Step 1 However orthogonalityhas a positive effect on Δwkj in Step 2

0e purpose of SVD is to maintain the orthogonality ofeach weight vector in geometric space When weight vectorsare conditioned by orthogonal regularization the relevancybetween weight vectors decreases We use the followingmethods in Step 3 to measure relevance

H WTW

w1rarrT

w1rarr

middot middot middot w1rarrT

wkrarr

⋮ ⋱ ⋮

wkrarrT

w1rarr

middot middot middot wkrarrT

wkrarr

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

h11 middot middot middot h1k

⋮ ⋱ ⋮

hk1 middot middot middot hkk

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

(17)

where W is a weight matrix that contains k weight vectorswi (i 1 k)hij (i j 1 k) is the dot product of wi

and wj Let us define S(W) as the correlation measurementof all column vectors in W

S(W) 1113936

ki1 hii

1113936ki1 1113936

kj1 hij

11138681113868111386811138681113868

11138681113868111386811138681113868 (18)

When W is an orthogonal matrix the value of S(W) is 1When ine j S(W) obtains the minimum value (1k)0erefore we can see that the value of S(W) falls into

6 Computational Intelligence and Neuroscience

[(1k) 1] As a result when S(W) is close to 1k or 0 theweight matrix will have high relevance

34 Complexity Analysis Assume that the training samplesize is |C| the average number of words in each citationcontext is |c| Cl is the number of kernels in the l-th layer andwis the size of the sliding window For one convolution layerthe training complexity is O(Clminus1 middot Cl middot w middot (s minus w + 1)) 0etraining complexity of one w-ap layer is O(C2

l middot w middot s) 0etraining complexity of one all-ap layer isO(C2

l middot (s minus w + 1))which was improved by C F Van Loan [12] computing theeigenvalue for SVD matrix decomposition with K size takesO(K) on the way of JACOBI Assume that the size of theweight matrix in the SVD-FC layer isK and the channel ofthe input matrix is Cin 0e computational cost for the SVD-FC layer is O(2K2 middot Cin + K)

4 Experiment

41 Dataset We use the CiteSeer dataset [34] to evaluatethe performance of our model 0e dataset was publishedby Huang et al [4] In this dataset citation relationshipsare extracted by a pair of citation contexts and the ab-stracts of cited papers A citation context includes thesentence where the citation placeholder appears and thesentences before and after the citation placeholderWithin each paper in the corpus the 50 words before and50 words after each citation reference are treated as thecorresponding citation context (a discussion on thenumber of words can be found in [7]) Before wordembedding we also remove stop words from the contextsTo preserve the time-sensitive pastpresentfuture tensesof verbs and the singularplural styles of named entitiesno stemming is done but all words are transferred tolower-case 0e training set contains 3989547 pairs ofreference contexts and citations and the test set contains1021685 citation relations

Following common practice in information retrieval(IR) we employ the following four evaluation metrics toevaluate recommendation results recall mean reciprocalrank (MRR) mean average precision (MAP) and normal-ized discounted cumulative gain (nDCG)

42 EvaluationMetric For each query in the test set we usethe original set of references as the ground truth Rg Assumethat the set of recommended citations is Rr and the correctrecommendations are Rg capRr Recall is defined as

recall Rg capRr

11138681113868111386811138681113868

11138681113868111386811138681113868

Rg

(19)

In our experiments the number of recommended ci-tations ranges from 1 to 10 Recall evaluation does not revealthe order of recommended references To address thisproblem we select the following two additional metrics

For a query q let rankq be the rank of the first correctrecommendation within the list MRR [35] is defined as

MRR 1

|Q|1113944qisinQ

1rankq

(20)

where Q is the testing set MRR reveals the average rankingof the first correct recommendation

For each citation placeholder we search the papers thatmay be referenced at this citation placeholder Each retrievalmodel returns a ranked list of papers Since there may be oneor more references for one citation context we use meanaverage precision (MAP) as the evaluation metric

MAP d1 dN( 1113857 1113936i R di( 1113857i( 11138571113936jleiR dj1113872 1113873

1113936iR di( 1113857 (21)

where R(di) is a binary function indicating whether doc-ument di is relevant or not For our problem the papers citedat the citation placeholder are considered relevantdocuments

We use normalized discounted cumulative gain (NDCG)to measure the ranked recommendation list 0e NDCGvalue of a ranking list at position i is calculated as

NDCG d1 dN( 1113857 1113944i

2rel di( ) minus 1lni+1 (22)

where rel (di) is the 4-scale relevance of document di in theranked list We use the average cocited probability [2] oflangdi dlowastrang to weigh the citation relevance score of di to dlowast(anoriginal citation of the query) We report the average NDCGscore over all testing documents

43 BaselineComparison We choose the following methodsfor comparison

Cite-PLSA-LDA (CP-LDA) [36] we use the originalimplementation provided by the author 0e number oftopics is set to 60

(i) Restricted Boltzmann Machine (RBM-CS) [37] Wetrain two layers of RBM-CS according to the sug-gestion of the author We set the hidden layer size to600

(ii) Word2vec Model (W2V) [29] We use the word2vecmodel to learn words and document representa-tions 0e cited document is treated as a ldquowordrdquo (adocument uses a unique marker when it is cited bydifferent papers) 0e dimensions of the word anddocument vectors are set to n 300

(iii) Neural Probabilistic Model (NPM) [4] We followthe original implementation 0e dimensions of theword and document representation vector are set ton 600 For negative sampling we set the numberof negative samples k 10 where k is the number ofnoise words in the citation context For noisecontrast estimation we set the number of noisesamples k 1000

(iv) Neural Citation Network (NCN) [7] In NCN thegradient clipping is 5 the dropout probability is 02and the recurrent layers are 2 0e region sizes for

Computational Intelligence and Neuroscience 7

the encoder are set to 4 4 and 5 and the region sizesfor the author network are set to 1 and 2

Figures 8 and 9 show the performance of eachmethod onthe CiteSeer dataset It is obvious that the SVD-FC modelleads the performance in most cases More detailed analysesare given as follows

First we perform a comparison among CP-LDA RBMW2V and SVD-CNN Our SVD-CNN completely andsignificantly exceeds other models in all metrics 0e successof ourmodel is ascribed to the content and correlation of ournetwork Due to the lack of citation context information wefind that W2V is obviously worse than other methods interms of all metrics CP-LDA works much better than W2Vwhich indicates that link information is very important forfinding relevant papers RBM-CS shows a clear performancegain over W2V because RBM-CS automatically discoverstopical aspects of each paper based on citation contextHowever the vector representations of citation context inRBM-CS are extracted by traditional word vector repre-sentations which fully neglect semantic relations betweenthe citation document and citation context and thus may belimited by vocabulary

Second we compare the performance among NPMNCN and SVD-CNN It is not surprising that NPM andNCN achieve worse performance than SVD-CNN since theirdistributed representation of words and documents reliessolely on deep learning without restraint NPM recommendscitations based on trained distributed representations NCNfurther enhances the performance by considering authorinformation and using a more sophisticated neural networkarchitecture However the CNN in NCN does not haveorthogonal constraints which makes it difficult to capturedifferent types of citing activities In addition NCN onlyutilizes the title of the cited paper for a decoder which isapparently not sufficient for learning good embedding

44 e Influence on the Link Prediction of Reference PatternInteractionalFeatures According to the chapter positions ofcitation context in the article we divide the training set intothree parts the introduction part contains 1307885 pairs ofreference contexts and citations the related word partcontains 1599897 pairs of citations and the main partcontains 1024783 pairs Furthermore these datasets formthree mixed datasets In this part of the experiment we usethe CNN model without SVD as the baseline 0ese datasetsare tested in a ratio of 3 1 In Tables 1 and 2 we show theresults on the abovementioned datasets

From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on the unmixed datasets than on the mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.

Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.

To better explore why mixed datasets are more complex than unmixed datasets, in Figure 10 we show the change in S(W) during the training process of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on the unmixed datasets while achieving poor performance on the mixed datasets, whereas SVD-CNN achieves almost the same performance on the two types of datasets. This proves that the correlation from various reference patterns can significantly affect link prediction.

The reason why the change in S(W) is not large on the unmixed datasets is that the reference patterns of an unmixed dataset have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. However, a citation recommendation algorithm still performs well on the unmixed datasets because their complexity is low.

Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well on mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.

4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with the following varieties (a numpy sketch of these replacements is given after the list):

(1) Using the originally learned W

(2) Replacing W with US

(3) Replacing W with U

(4) Replacing W with UV^T

(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper triangular matrix in QR decomposition

(6) Replacing W with W_PCA, where W_PCA is the diagonal matrix extracted from the weight matrix W after dimension reduction by PCA
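The sketch below builds the replacements above from a learned weight matrix, assuming numpy. The SVD- and QR-based variants follow the descriptions in the list; the PCA variant (6) is only loosely specified in the text, so it is left out rather than guessed at.

```python
import numpy as np

def decorrelation_variants(W: np.ndarray) -> dict:
    """Build the weight replacements compared in Table 3 (a sketch).

    W = U S V^T is the thin SVD; each variant replaces the learned W
    with one of the factor products below.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    S = np.diag(s)
    Q, R = np.linalg.qr(W)
    D = np.diag(np.diag(R))      # diagonal of the upper-triangular factor
    return {
        "W":   W,                # (1) originally learned weights
        "US":  U @ S,            # (2) orthogonal directions, singular values kept
        "U":   U,                # (3) orthonormal directions only
        "UVT": U @ Vt,           # (4) nearest orthogonal matrix to W
        "QD":  Q @ D,            # (5) QR-based replacement
    }

variants = decorrelation_variants(np.random.randn(512, 256))
```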

After convergence of training, the different orthogonal matrices above are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, which is the proportion of the added time to the original training time. As shown in Table 3, all types of decorrelation except W → US and W → W_PCA degrade the performance. However, the time cost of W → W_PCA is higher than that of W → US.

4.6. Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations, and the biased parameter d0. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing d0 = 300. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance; in this situation, the weight matrix in the FC layer is highly correlated and S(W) has the lowest value. The recommendation performance then increases as sot is added, which indicates that reducing the correlative degree of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance. A sketch of the loop controlled by sot is given below.
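The following PyTorch-style sketch shows what one pass of the loop controlled by sot could look like, assuming the three SOT steps described in the training section. TinyModel, model.svd_fc, and train_fn are illustrative names, not the authors' code.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    # Stand-in for the full SVD-CNN; only the decorrelated FC layer matters here.
    def __init__(self):
        super().__init__()
        self.svd_fc = nn.Linear(128, 128, bias=False)

def semiorthogonal_training(model, train_fn, sot=10):
    """One SOT iteration: decompose and replace, fine-tune fixed, fine-tune unfixed."""
    for _ in range(sot):
        with torch.no_grad():
            # Step 1: W = U S V^T; replace W with U S.
            U, S, Vh = torch.linalg.svd(model.svd_fc.weight.data, full_matrices=False)
            model.svd_fc.weight.data = U @ torch.diag(S)
        model.svd_fc.weight.requires_grad_(False)  # Step 2: fine-tune backbone, SVD-FC fixed.
        train_fn(model)
        model.svd_fc.weight.requires_grad_(True)   # Step 3: keep fine-tuning, SVD-FC unfixed.
        train_fn(model)

# No-op trainer keeps the sketch runnable; a real train_fn would run fine-tuning epochs.
semiorthogonal_training(TinyModel(), train_fn=lambda m: None, sot=2)
```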

Table 1: MRR metric on various datasets.

         Introduction  Related  Main    Introduction+related  Introduction+main  Related+main
CNN      0.3312        0.3294   0.3478  0.2773                0.2815             0.2978
SVD-CNN  0.3995        0.4078   0.3989  0.3878                0.3889             0.3845

[Figure 8: Comparison of recall with different methods on CiteSeer. x-axis: number of recommended citations (up to 100); y-axis: recall; curves: W2V, NPM, RBM, CP-LDA, SVD-CNN, NCN.]

[Figure 9: Comparison of MRR, MAP, and nDCG scores for top-10 recommendations with different methods (CP-LDA, RBM, W2V, NPM, NCN, SVD-CNN) on CiteSeer. SVD-CNN scores the highest on all three metrics (MRR 0.3687, MAP 0.3352, nDCG 0.3448).]

Table 2: MAP metric on various datasets.

         Introduction  Related  Main    Introduction+related  Introduction+main  Related+main
CNN      0.3001        0.2909   0.3107  0.2572                0.2601             0.2637
SVD-CNN  0.3701        0.3655   0.3693  0.3498                0.3511             0.3539



In our model, d0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d0 under the same sot. When d0 is small, the information content of the citation context representation is very limited, which produces worse performance. The recommendation performance increases up to a maximum as d0 reaches 300. It should be noted that although a larger d0 is better, a larger d0 significantly increases the training time. Therefore, we choose d0 = 300.

[Figure 10: The change in S(W) during training (sot from 0 to 10) on the unmixed datasets (Introduction, Related, Main) and the mixed datasets (Introduction + related, Introduction + main, Related + main).]

Table 3: The comparison of related methods in Step 1.

        W     W → US  W → U  W → UV^T  W → QD  W → W_PCA
Rank-1  63.6  63.6    61.7   61.7      61.6    63.6
mAP     39.0  39.0    37.1   37.1      37.3    39.0
T-cost  0     36.27   36.27  36.27     35.33   57.65

[Figure 11: The performance impact of sot on CiteSeer: MRR (for the LR and SVM output layers, left axis) and S(W) (with SOT and without SOT, right axis) plotted against sot.]


[Figure 12: The performance impact of d0 on CiteSeer. x-axis: vector dimension of the input layer (d0); y-axis: value of the evaluation metric. Recovered values:
d0          100     200     300     400     500
MRR         0.3302  0.3512  0.3687  0.3701  0.3722
MAP         0.3003  0.3225  0.3352  0.3398  0.3409
nDCG        0.3101  0.3312  0.3448  0.3486  0.3499
Recall@10   0.5456  0.5689  0.5801  0.5842  0.5867]

5. Conclusion and Future Works

We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which essentially makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper; in the future, we will explore the performance of our model using the full text of papers.

Data Availability

Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at the relevant places within the text as reference [4].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).

References

[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.

[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.

[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.

[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.

[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.

[6] X. Ren, J. Liu, X. Yu et al., "ClusCite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.

[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.

[10] N. Meuschke, B. Gipp, and M. Lipinsk, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed Central," 2015, http://hdl.handle.net/2142/73680.

[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.

[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.

[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/abs/1802.08301.

[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.

[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.

[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.

[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.

[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.

[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.

[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in Proceedings of the International Conference on Learning Representations, 2017.

[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.

[24] A. Brock, J. Donahue, K. Simonyan et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.

[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.

[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, pp. 15844–15869, 2018.

[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.

[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.

[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.

[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.

[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.

[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.

[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.

[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.

[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.

[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.

[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.

12 Computational Intelligence and Neuroscience

Page 4: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

simj 1113936

dj

i0 Cji times Dji1113872 1113873

1113936dj

i0 Cji1113872 11138732

times 1113936dj

i0 Dji1113872 11138732

1113969 (j isin [1 10]) (3)

Citation context Document

SVD-FC

Word2vet Word2vet

Convolution

W-ap W-ap

FC

FC

LogisticsSVM

W-ap W-ap

Convolution

Convolution Convolution10 th

1st

USSVDw

Splice

All-ap All-ap

All-ap All-ap

Based-feature

All-ap-feature

Figure 2 An overview of our model

s + w minus 1ws

Figure 3 Convolution extraction generates phrases

ss + w minus 1

Figure 4 ldquoW-aprdquo structure

Figure 5 ldquoAll-aprdquo structure

4 Computational Intelligence and Neuroscience

where Cj and Dj are the distributed representation of ci-tation context and candidate document after the j-th ldquoall-aprdquo layer respectively A total of ten ldquoall-aprdquo layers arecarried out in our model 0erefore j belongs to [1 10] 0ebenefit is that we can obtain the semantic relation betweenthe citation context and the cited document with multiplegranularities As shown in Figure 6 the final output featureconsists of all simj and basic features 0en it is fed into theSVD-FC layer

In most cases we find that if we use all outputs of poollayers as the input of the SVD-FC layer the performance willbe improved0e reason is that features from different layersrepresent the different levels of semantics Neglecting anylayers will obviously cause information loss problems

Next we use the SVD-FC layer to learn the nonlinearcombination features of citation relationships0is layer canforce vectors in the feature map independent and orthogonalto each other 0e added SVD-FC layer can also reduce thenegative impact of excessive parameters

324 SVD-FC Layer In this layer we use SVD to factorizethe weight matrix W (W USVT) and replace it with USOur experimental results show that replacing operations canreduce the negative impact on the sample space

0e Euclidean distance between samples can be used tomeasure whether their feature expression changes in asample space Denoting em and en as the feature maps of twodifferent samples we can obtain two different outputs of thefull connection operation by using the weight matrix W orUS as follows

p e times W (4)

q e times US (5)

As seen in the above equations q is orthogonalizedoutput while p is unorthogonalized0en we can obtain thefollowing theorem

Theorem 1 p and q in equations (4) and (5) will generatethe same Euclidean distance for samples em and en

Proof 0e Euclidean distance L between pm and pn iscalculated as follows

L pm

rarrminus pn

rarr2

emrarr

minus enrarr

( 1113857TWW

Temrarr

minus enrarr

( 1113857

1113969

emrarr

minus enrarr

( 1113857TUSVV

TS

TU

Temrarr

minus enrarr

( 1113857

1113969(6)

Since V is an orthogonal matrix equation (6) isequivalent to

L

emrarr

minus enrarr

( 1113857TUSS

TU

Temrarr

minus enrarr

( 1113857

1113969

qmrarr

minus qnrarr

( 1113857T

qmrarr

minus qnrarr

( 1113857

1113969

qmrarr

minus qnrarr

2

(7)

It can be seen that pm

rarrminus pn

rarr2 qm

rarrminus qn

rarr2

It should be noted that there are no negative impacts andno changes in discrimination ability for the entire samplespace when replacing the weight As shown in Figure 7 weuse SVD of weight matrix W to map the feature map to anorthogonal linear space

325 Output Layer 0e citation recommendation problemis regarded as a classification task in our model In this layerlogistics and SVM can deal with binary classification tasksand predict the final citation relationship

33 Training Details

331 Embeddings In our model words are initialized by300-dimensional word2vec embeddings and will notchange during training A single randomly initializedembedding is created for all unknown words by uniformsampling from[minus001 001] We employ AdaGrad [31] andL2 regularization We introduce adversarial training [32]for embeddings to make the model more robust 0eprocess is achieved by replacing the word vector v afterword2vec embeddings using word vector with disturbingvlowast

vlowast

v times radv (8)

where radv is the worst case of perturbation on the wordvector Goodfellow et al [33] approximated this value bylinearizing the loss function logp(y|x 1113954θ) around x where1113954θ is a constant set to the current parameters of our modeland it only participates in the calculation process of radvwithout a backpropagation algorithm With the linearapproximation and L2 norm constraint the adversarialperturbation is

All-ap-feature

simi

Ci Di

Basic-feature

Figure 6 Generating the feature map

SVD-FC layer input feature

SVD-FC layer output featureSVD-FC layer

Figure 7 SVD-FC layer

Computational Intelligence and Neuroscience 5

radv minusising

g2 whereg nablaxlogp(y|x 1113954θ) (9)

0is perturbation can be easily computed by usingbackpropagation in neural networks

332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks

First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained

Second the following semiorthogonal training proce-dure is used to train the whole network

Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps

Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer

Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem

0e inputs of SVD-FC are defined as Y

Y (y1 y2 ym)T 0e outputs are defined as O

O (o1 o2 om)T 0e weight matrix is defined asW (w1 w2 wm)T 0e expected outputs are defined asA (a1 a2 am)T 0e error function is defined as

E 12

1113944

l

k1ak minus ok( 1113857

2 (10)

where ok f(1113936mj0 wkjyj) k 1 2 l 0en E with re-

spect to ok is derived and the outcome is

zE

zok

minus ak minus ok( 1113857 (11)

We utilize the gradient descent strategy to find thegradient of the error with respect to weights 0e iterativeupdate of weights is as follows

Δwkj minusηzE

zwkj

(12)

We define an error signal δok zEz netk equation (12) is

equivalent to

Δwkj minusηzE

z netk

z netkzwkj

minusηδok

z netkzwkj

(13)

According to equation (11) δok zEz netk is equivalent

to

δok minus

zE

zok

zok

z netk minus

zE

zok

fprime netk( 1113857

zE

zok

okprime minus dk minus ok( 1113857ok

prime

(14)

We use the sigmoid f(x) 1(1 + ex) as the nonlinearfunction so equation (13) is equivalent to

Δwkj minusηδokyj η dk minus ok( 1113857ok 1 minus ok( 1113857yj (15)

In Step 1 the weight matrix W is decomposed by SVDand replaced with US U (q1 q2 qm)T andS diag(λ1 λ2 λm) Since dk minus ok is given we definethat Loss dk minus ok As a result equation (15) is equivalent to

Δwkj η Loss middot ok minus sigmoid yj 1113944 qiλi + B1113872 11138732

1113876 1113877yj (16)

qi middot qj 0 ine j are in the left-unitary matrix U so themodel operation is not affected by the nonorthogonal ei-genvectors qi 0is is the reason for excessively punishingsynonymous sentences in Step 1 However orthogonalityhas a positive effect on Δwkj in Step 2

0e purpose of SVD is to maintain the orthogonality ofeach weight vector in geometric space When weight vectorsare conditioned by orthogonal regularization the relevancybetween weight vectors decreases We use the followingmethods in Step 3 to measure relevance

H WTW

w1rarrT

w1rarr

middot middot middot w1rarrT

wkrarr

⋮ ⋱ ⋮

wkrarrT

w1rarr

middot middot middot wkrarrT

wkrarr

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

h11 middot middot middot h1k

⋮ ⋱ ⋮

hk1 middot middot middot hkk

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

(17)

where W is a weight matrix that contains k weight vectorswi (i 1 k)hij (i j 1 k) is the dot product of wi

and wj Let us define S(W) as the correlation measurementof all column vectors in W

S(W) 1113936

ki1 hii

1113936ki1 1113936

kj1 hij

11138681113868111386811138681113868

11138681113868111386811138681113868 (18)

When W is an orthogonal matrix the value of S(W) is 1When ine j S(W) obtains the minimum value (1k)0erefore we can see that the value of S(W) falls into

6 Computational Intelligence and Neuroscience

[(1k) 1] As a result when S(W) is close to 1k or 0 theweight matrix will have high relevance

34 Complexity Analysis Assume that the training samplesize is |C| the average number of words in each citationcontext is |c| Cl is the number of kernels in the l-th layer andwis the size of the sliding window For one convolution layerthe training complexity is O(Clminus1 middot Cl middot w middot (s minus w + 1)) 0etraining complexity of one w-ap layer is O(C2

l middot w middot s) 0etraining complexity of one all-ap layer isO(C2

l middot (s minus w + 1))which was improved by C F Van Loan [12] computing theeigenvalue for SVD matrix decomposition with K size takesO(K) on the way of JACOBI Assume that the size of theweight matrix in the SVD-FC layer isK and the channel ofthe input matrix is Cin 0e computational cost for the SVD-FC layer is O(2K2 middot Cin + K)

4 Experiment

41 Dataset We use the CiteSeer dataset [34] to evaluatethe performance of our model 0e dataset was publishedby Huang et al [4] In this dataset citation relationshipsare extracted by a pair of citation contexts and the ab-stracts of cited papers A citation context includes thesentence where the citation placeholder appears and thesentences before and after the citation placeholderWithin each paper in the corpus the 50 words before and50 words after each citation reference are treated as thecorresponding citation context (a discussion on thenumber of words can be found in [7]) Before wordembedding we also remove stop words from the contextsTo preserve the time-sensitive pastpresentfuture tensesof verbs and the singularplural styles of named entitiesno stemming is done but all words are transferred tolower-case 0e training set contains 3989547 pairs ofreference contexts and citations and the test set contains1021685 citation relations

Following common practice in information retrieval(IR) we employ the following four evaluation metrics toevaluate recommendation results recall mean reciprocalrank (MRR) mean average precision (MAP) and normal-ized discounted cumulative gain (nDCG)

42 EvaluationMetric For each query in the test set we usethe original set of references as the ground truth Rg Assumethat the set of recommended citations is Rr and the correctrecommendations are Rg capRr Recall is defined as

recall Rg capRr

11138681113868111386811138681113868

11138681113868111386811138681113868

Rg

(19)

In our experiments the number of recommended ci-tations ranges from 1 to 10 Recall evaluation does not revealthe order of recommended references To address thisproblem we select the following two additional metrics

For a query q let rankq be the rank of the first correctrecommendation within the list MRR [35] is defined as

MRR 1

|Q|1113944qisinQ

1rankq

(20)

where Q is the testing set MRR reveals the average rankingof the first correct recommendation

For each citation placeholder we search the papers thatmay be referenced at this citation placeholder Each retrievalmodel returns a ranked list of papers Since there may be oneor more references for one citation context we use meanaverage precision (MAP) as the evaluation metric

MAP d1 dN( 1113857 1113936i R di( 1113857i( 11138571113936jleiR dj1113872 1113873

1113936iR di( 1113857 (21)

where R(di) is a binary function indicating whether doc-ument di is relevant or not For our problem the papers citedat the citation placeholder are considered relevantdocuments

We use normalized discounted cumulative gain (NDCG)to measure the ranked recommendation list 0e NDCGvalue of a ranking list at position i is calculated as

NDCG d1 dN( 1113857 1113944i

2rel di( ) minus 1lni+1 (22)

where rel (di) is the 4-scale relevance of document di in theranked list We use the average cocited probability [2] oflangdi dlowastrang to weigh the citation relevance score of di to dlowast(anoriginal citation of the query) We report the average NDCGscore over all testing documents

43 BaselineComparison We choose the following methodsfor comparison

Cite-PLSA-LDA (CP-LDA) [36] we use the originalimplementation provided by the author 0e number oftopics is set to 60

(i) Restricted Boltzmann Machine (RBM-CS) [37] Wetrain two layers of RBM-CS according to the sug-gestion of the author We set the hidden layer size to600

(ii) Word2vec Model (W2V) [29] We use the word2vecmodel to learn words and document representa-tions 0e cited document is treated as a ldquowordrdquo (adocument uses a unique marker when it is cited bydifferent papers) 0e dimensions of the word anddocument vectors are set to n 300

(iii) Neural Probabilistic Model (NPM) [4] We followthe original implementation 0e dimensions of theword and document representation vector are set ton 600 For negative sampling we set the numberof negative samples k 10 where k is the number ofnoise words in the citation context For noisecontrast estimation we set the number of noisesamples k 1000

(iv) Neural Citation Network (NCN) [7] In NCN thegradient clipping is 5 the dropout probability is 02and the recurrent layers are 2 0e region sizes for

Computational Intelligence and Neuroscience 7

the encoder are set to 4 4 and 5 and the region sizesfor the author network are set to 1 and 2

Figures 8 and 9 show the performance of eachmethod onthe CiteSeer dataset It is obvious that the SVD-FC modelleads the performance in most cases More detailed analysesare given as follows

First we perform a comparison among CP-LDA RBMW2V and SVD-CNN Our SVD-CNN completely andsignificantly exceeds other models in all metrics 0e successof ourmodel is ascribed to the content and correlation of ournetwork Due to the lack of citation context information wefind that W2V is obviously worse than other methods interms of all metrics CP-LDA works much better than W2Vwhich indicates that link information is very important forfinding relevant papers RBM-CS shows a clear performancegain over W2V because RBM-CS automatically discoverstopical aspects of each paper based on citation contextHowever the vector representations of citation context inRBM-CS are extracted by traditional word vector repre-sentations which fully neglect semantic relations betweenthe citation document and citation context and thus may belimited by vocabulary

Second we compare the performance among NPMNCN and SVD-CNN It is not surprising that NPM andNCN achieve worse performance than SVD-CNN since theirdistributed representation of words and documents reliessolely on deep learning without restraint NPM recommendscitations based on trained distributed representations NCNfurther enhances the performance by considering authorinformation and using a more sophisticated neural networkarchitecture However the CNN in NCN does not haveorthogonal constraints which makes it difficult to capturedifferent types of citing activities In addition NCN onlyutilizes the title of the cited paper for a decoder which isapparently not sufficient for learning good embedding

44 e Influence on the Link Prediction of Reference PatternInteractionalFeatures According to the chapter positions ofcitation context in the article we divide the training set intothree parts the introduction part contains 1307885 pairs ofreference contexts and citations the related word partcontains 1599897 pairs of citations and the main partcontains 1024783 pairs Furthermore these datasets formthree mixed datasets In this part of the experiment we usethe CNN model without SVD as the baseline 0ese datasetsare tested in a ratio of 3 1 In Tables 1 and 2 we show theresults on the abovementioned datasets

From the results we obtain the following observationsFirst both CNN and SVD-CNN outperform unmixed

datasets over mixed datasets across the different evaluationmetrics which shows that the diversity of reference patternsincreases the difficulty of citation recommendation tasks

Second in Tables 1 and 2 we observe that our model isparticularly good at resolving the difficulties in mixeddatasets which come from the diversity of referencepatterns

To better explore why mixed datasets are more complexthan unmixed datasets in Figure 10 we show the change in

S(W) during the training process of SVD-CNN amongvarious datasets

As shown in Figure 10 the increase in S(W) on themixed datasets indicates that SVD-CNN is good at decor-relation We can also see in Tables 1 and 2 that the CNNmodel has pretty performance on unmixed datasets whileachieving poor performance on mixed datasets HoweverSVD-CNN achieves almost the same performance on thetwo types of datasets 0is proves that the correlation fromvarious reference patterns can significantly affect the linkprediction

0e reason why the change in S(W) is not large on theunmixed datasets is that reference patterns of unmixeddatasets have similar features which belong to the samecategory As a result the orthogonality of the weight matrixis hard to improve on unmixed datasets However a citationrecommendation algorithm has pretty performance on theunmixed datasets because there are low complexities

Although mixed datasets are more complicated thanunmixed datasets SVD-CNN still performs well in mixeddatasets 0is indicates that SVD-CNN reduces the negativeimpact of the correlation of reference patterns and ourapproach is more suitable for complex scenarios

45 Comparison with Other Types of Decorrelation In ad-dition to SVD there are still some other methods fordecorrelating the feature matrix However these methodscannot maintain the discriminating ability of the CNNmodel To illustrate this we compare SVD with severalvarieties as follows

(1) Using the originally learned W

(2) Replacing W with US

(3) Replacing W with U

(4) Replacing W with UVT

(5) Replacing Wwith Q D where D is the diagonalmatrix extracted from the upper triangle matrix inQ-R decomposition

(6) Replacing W with WPCA where WPCA is the diagonalmatrix extracted from the weight matrix W after theprocessing of dimension reduction by PCA

After convergence of training different orthogonalmatrices are used to replace the weight matrix W We defineT-cost as the time cost of replacing the weight which isequivalent to the proportion of the added time to the originaltime As shown in Table 3 other types of decorrelationdegrade the performance in addition to W⟶ US andW⟶WPCA However the time cost of W⟶WPCA ismore than that of W⟶ US

46 Ablation Study In our method there are two essentialparameters a term sot which means the number of SOTiterations and a biased parameter d0 In this section weconduct an ablation study of these parameters

We first evaluate the effectiveness of sot by empiri-cally fixing d0 300 Since sot defines the loop time of

8 Computational Intelligence and Neuroscience

orthogonal constraint training it should be set as anonnegative value Figure 11 illustrates the MRR with sotfrom 0 to 10 on the CiteSeer dataset We can see that the

performance improves as the value of sot increasesWhen sot 0 the model has no decorrelation andachieves the worst performance In this situation the

Table 1 MRR metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03312 03294 03478 02773 02815 02978SVD-CNN 03995 04078 03989 03878 03889 03845

060055050045040035030025Re

call

020015010005000

20 40 60Number of recommended citations

80 100

W2VsNPMs

RBMsCP_LDAs

SVD_CNNsNCNs

Figure 8 Comparison of recall with different methods on CiteSeer

MRR MAP and nDCG scores for top 10 recommendations04

035

03

025

02

015

01

005

0

0091600997

MRR MAP nCDG

00662

01843

02667

03687

00912009982

00663

01835

02418

03352

01288 0135601476

0256602592

03448

CP-LDARBM

W2VNPM

NCNSVD-CNN

Figure 9 Comparison of MRR MAP and nDCG with different methods on CiteSeer

Table 2 MAP metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03001 02909 03107 02572 02601 02637SVD-CNN 03701 03655 03693 03498 03511 03539

Computational Intelligence and Neuroscience 9

weight matrix in the FC layer is highly correlated andS(W) has the lowest value 0e recommendation per-formance then increases while adding sot which indi-cates that reducing the correlative degree of the weightmatrix in the FC layer is critical for improving perfor-mance When sot 10 our model achieves the bestperformance

In our model d0 is the dimension of citation contextand cited document representations Figure 12 shows howthe performance of SVD-CNN varies with d0 on the samesot When d0 is small the information content of thecitation context is very small and produces worse per-formance 0e recommendation performance increases toa maximum point until d0 reaches 300 It should be noted

05

045

04

035

03

S (W

)

025

02

015

01

005

00 1 2 3 4 5

Sot6 7 8 9 10 11

IntroductionRelatedMain

Introduction + relatedIntroduction + mainRelated + main

Figure 10 0e change in S(W) during training on unmixed datasets and mixed datasets

Table 3 0e comparison of related methods in Step 1

W W⟶ US W⟶ U W⟶ UVT W⟶ Q D W⟶WPCA

Rank-1 636 636 617 617 616 636mAP 390 390 371 371 373 390T-cost 0 3627 3627 3627 3533 5765

055050045040035030

MRR

025020015010005000

05

S (W

)

10

04

03

02

01

001 3 5

Sot7 9

MRRLRMRRSVM

S (W) SOTS (W) NO-SOT

Figure 11 0e performance impact of sot on CiteSeer

10 Computational Intelligence and Neuroscience

that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300

5 Conclusion and Future Works

We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers

Data Availability

Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]

Conflicts of Interest

0e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)

References

[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013

[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010

[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011

[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015

[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016

[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014

[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017

[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003

[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003

[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures

06

055

05

045

Valu

e of e

valu

atio

n sta

ndar

d

04

035

03

025100 200 300 400 500

03302 03512 03687 03701 0372203003 03225 03352 03398 0340903101 03312 03448 03486 0349905456 05689 05801

Vector dimension of input layer05842 05867

Various d_0 for the effects of the model

MRRMAPnDCGRecall10

Figure 12 0e performance impact of d0 on CiteSeer

Computational Intelligence and Neuroscience 11

based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680

[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008

[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985

[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf

[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018

[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009

[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007

[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007

[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014

[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011

[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014

[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014

[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017

[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014

[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096

[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017

[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via

implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018

[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018

[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017

[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013

[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014

[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011

[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016

[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014

[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008

[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000

[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010

[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009

12 Computational Intelligence and Neuroscience

Page 5: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

where Cj and Dj are the distributed representation of ci-tation context and candidate document after the j-th ldquoall-aprdquo layer respectively A total of ten ldquoall-aprdquo layers arecarried out in our model 0erefore j belongs to [1 10] 0ebenefit is that we can obtain the semantic relation betweenthe citation context and the cited document with multiplegranularities As shown in Figure 6 the final output featureconsists of all simj and basic features 0en it is fed into theSVD-FC layer

In most cases we find that if we use all outputs of poollayers as the input of the SVD-FC layer the performance willbe improved0e reason is that features from different layersrepresent the different levels of semantics Neglecting anylayers will obviously cause information loss problems

Next we use the SVD-FC layer to learn the nonlinearcombination features of citation relationships0is layer canforce vectors in the feature map independent and orthogonalto each other 0e added SVD-FC layer can also reduce thenegative impact of excessive parameters

324 SVD-FC Layer In this layer we use SVD to factorizethe weight matrix W (W USVT) and replace it with USOur experimental results show that replacing operations canreduce the negative impact on the sample space

0e Euclidean distance between samples can be used tomeasure whether their feature expression changes in asample space Denoting em and en as the feature maps of twodifferent samples we can obtain two different outputs of thefull connection operation by using the weight matrix W orUS as follows

p e times W (4)

q e times US (5)

As seen in the above equations q is orthogonalizedoutput while p is unorthogonalized0en we can obtain thefollowing theorem

Theorem 1 p and q in equations (4) and (5) will generatethe same Euclidean distance for samples em and en

Proof 0e Euclidean distance L between pm and pn iscalculated as follows

L pm

rarrminus pn

rarr2

emrarr

minus enrarr

( 1113857TWW

Temrarr

minus enrarr

( 1113857

1113969

emrarr

minus enrarr

( 1113857TUSVV

TS

TU

Temrarr

minus enrarr

( 1113857

1113969(6)

Since V is an orthogonal matrix equation (6) isequivalent to

L

emrarr

minus enrarr

( 1113857TUSS

TU

Temrarr

minus enrarr

( 1113857

1113969

qmrarr

minus qnrarr

( 1113857T

qmrarr

minus qnrarr

( 1113857

1113969

qmrarr

minus qnrarr

2

(7)

It can be seen that pm

rarrminus pn

rarr2 qm

rarrminus qn

rarr2

It should be noted that there are no negative impacts andno changes in discrimination ability for the entire samplespace when replacing the weight As shown in Figure 7 weuse SVD of weight matrix W to map the feature map to anorthogonal linear space

325 Output Layer 0e citation recommendation problemis regarded as a classification task in our model In this layerlogistics and SVM can deal with binary classification tasksand predict the final citation relationship

33 Training Details

331 Embeddings In our model words are initialized by300-dimensional word2vec embeddings and will notchange during training A single randomly initializedembedding is created for all unknown words by uniformsampling from[minus001 001] We employ AdaGrad [31] andL2 regularization We introduce adversarial training [32]for embeddings to make the model more robust 0eprocess is achieved by replacing the word vector v afterword2vec embeddings using word vector with disturbingvlowast

vlowast

v times radv (8)

where radv is the worst case of perturbation on the wordvector Goodfellow et al [33] approximated this value bylinearizing the loss function logp(y|x 1113954θ) around x where1113954θ is a constant set to the current parameters of our modeland it only participates in the calculation process of radvwithout a backpropagation algorithm With the linearapproximation and L2 norm constraint the adversarialperturbation is

All-ap-feature

simi

Ci Di

Basic-feature

Figure 6 Generating the feature map

SVD-FC layer input feature

SVD-FC layer output featureSVD-FC layer

Figure 7 SVD-FC layer

Computational Intelligence and Neuroscience 5

radv minusising

g2 whereg nablaxlogp(y|x 1113954θ) (9)

0is perturbation can be easily computed by usingbackpropagation in neural networks

332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks

First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained

Second the following semiorthogonal training proce-dure is used to train the whole network

Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps

Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer

Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem

0e inputs of SVD-FC are defined as Y

Y (y1 y2 ym)T 0e outputs are defined as O

O (o1 o2 om)T 0e weight matrix is defined asW (w1 w2 wm)T 0e expected outputs are defined asA (a1 a2 am)T 0e error function is defined as

E 12

1113944

l

k1ak minus ok( 1113857

2 (10)

where ok f(1113936mj0 wkjyj) k 1 2 l 0en E with re-

spect to ok is derived and the outcome is

zE

zok

minus ak minus ok( 1113857 (11)

We utilize the gradient descent strategy to find thegradient of the error with respect to weights 0e iterativeupdate of weights is as follows

Δwkj minusηzE

zwkj

(12)

We define an error signal δok zEz netk equation (12) is

equivalent to

Δwkj minusηzE

z netk

z netkzwkj

minusηδok

z netkzwkj

(13)

According to equation (11) δok zEz netk is equivalent

to

δok minus

zE

zok

zok

z netk minus

zE

zok

fprime netk( 1113857

zE

zok

okprime minus dk minus ok( 1113857ok

prime

(14)

We use the sigmoid f(x) 1(1 + ex) as the nonlinearfunction so equation (13) is equivalent to

Δwkj minusηδokyj η dk minus ok( 1113857ok 1 minus ok( 1113857yj (15)

In Step 1 the weight matrix W is decomposed by SVDand replaced with US U (q1 q2 qm)T andS diag(λ1 λ2 λm) Since dk minus ok is given we definethat Loss dk minus ok As a result equation (15) is equivalent to

Δwkj η Loss middot ok minus sigmoid yj 1113944 qiλi + B1113872 11138732

1113876 1113877yj (16)

qi middot qj 0 ine j are in the left-unitary matrix U so themodel operation is not affected by the nonorthogonal ei-genvectors qi 0is is the reason for excessively punishingsynonymous sentences in Step 1 However orthogonalityhas a positive effect on Δwkj in Step 2

0e purpose of SVD is to maintain the orthogonality ofeach weight vector in geometric space When weight vectorsare conditioned by orthogonal regularization the relevancybetween weight vectors decreases We use the followingmethods in Step 3 to measure relevance

H WTW

w1rarrT

w1rarr

middot middot middot w1rarrT

wkrarr

⋮ ⋱ ⋮

wkrarrT

w1rarr

middot middot middot wkrarrT

wkrarr

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

h11 middot middot middot h1k

⋮ ⋱ ⋮

hk1 middot middot middot hkk

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

(17)

where W is a weight matrix that contains k weight vectorswi (i 1 k)hij (i j 1 k) is the dot product of wi

and wj Let us define S(W) as the correlation measurementof all column vectors in W

S(W) 1113936

ki1 hii

1113936ki1 1113936

kj1 hij

11138681113868111386811138681113868

11138681113868111386811138681113868 (18)

When W is an orthogonal matrix the value of S(W) is 1When ine j S(W) obtains the minimum value (1k)0erefore we can see that the value of S(W) falls into

6 Computational Intelligence and Neuroscience

[(1k) 1] As a result when S(W) is close to 1k or 0 theweight matrix will have high relevance

34 Complexity Analysis Assume that the training samplesize is |C| the average number of words in each citationcontext is |c| Cl is the number of kernels in the l-th layer andwis the size of the sliding window For one convolution layerthe training complexity is O(Clminus1 middot Cl middot w middot (s minus w + 1)) 0etraining complexity of one w-ap layer is O(C2

l middot w middot s) 0etraining complexity of one all-ap layer isO(C2

l middot (s minus w + 1))which was improved by C F Van Loan [12] computing theeigenvalue for SVD matrix decomposition with K size takesO(K) on the way of JACOBI Assume that the size of theweight matrix in the SVD-FC layer isK and the channel ofthe input matrix is Cin 0e computational cost for the SVD-FC layer is O(2K2 middot Cin + K)

4 Experiment

41 Dataset We use the CiteSeer dataset [34] to evaluatethe performance of our model 0e dataset was publishedby Huang et al [4] In this dataset citation relationshipsare extracted by a pair of citation contexts and the ab-stracts of cited papers A citation context includes thesentence where the citation placeholder appears and thesentences before and after the citation placeholderWithin each paper in the corpus the 50 words before and50 words after each citation reference are treated as thecorresponding citation context (a discussion on thenumber of words can be found in [7]) Before wordembedding we also remove stop words from the contextsTo preserve the time-sensitive pastpresentfuture tensesof verbs and the singularplural styles of named entitiesno stemming is done but all words are transferred tolower-case 0e training set contains 3989547 pairs ofreference contexts and citations and the test set contains1021685 citation relations

Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).

4.2. Evaluation Metric. For each query in the test set, we use the original set of references as the ground truth $R_g$. Assume that the set of recommended citations is $R_r$, so the correct recommendations are $R_g \cap R_r$. Recall is defined as

$$\mathrm{recall} = \frac{|R_g \cap R_r|}{|R_g|}. \quad (19)$$

In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references. To address this problem, we select the following two additional metrics.

For a query $q$, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}, \quad (20)$$

where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.

For each citation placeholder, we search for the papers that may be referenced at it. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as an evaluation metric:

$$\mathrm{MAP}(d_1, \ldots, d_N) = \frac{\sum_i \big( R(d_i)/i \big) \sum_{j \le i} R(d_j)}{\sum_i R(d_i)}, \quad (21)$$

where $R(d_i)$ is a binary function indicating whether document $d_i$ is relevant or not. For our problem, the papers cited at the citation placeholder are considered relevant documents.

We use normalized discounted cumulative gain (nDCG) to measure the ranked recommendation list. The nDCG value of a ranking list at position $i$ is calculated as

$$\mathrm{nDCG}(d_1, \ldots, d_N) = \sum_i \frac{2^{\mathrm{rel}(d_i)} - 1}{\ln(i + 1)}, \quad (22)$$

where $\mathrm{rel}(d_i)$ is the 4-scale relevance of document $d_i$ in the ranked list. We use the average cocited probability [2] of $\langle d_i, d^* \rangle$ to weigh the citation relevance score of $d_i$ to $d^*$ (an original citation of the query). We report the average nDCG score over all testing documents.
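The four metrics are straightforward to implement. The sketch below (ours, with a toy query) follows equations (19)-(22) literally; in particular, the dcg helper computes the sum exactly as written in equation (22), without an ideal-DCG normalizer, since none appears in the formula.

```python
import math

def recall_at_k(recommended, ground_truth, k):
    """Equation (19): |Rg intersect Rr| / |Rg| for the top-k recommendations."""
    return len(set(recommended[:k]) & set(ground_truth)) / len(ground_truth)

def mrr(ranked_lists, ground_truths):
    """Equation (20): mean reciprocal rank of the first correct hit."""
    rr = []
    for recs, gt in zip(ranked_lists, ground_truths):
        rank = next((i for i, d in enumerate(recs, 1) if d in gt), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def average_precision(recommended, ground_truth):
    """Per-query term of equation (21); R(d) is binary relevance."""
    hits, score = 0, 0.0
    for i, d in enumerate(recommended, 1):
        if d in ground_truth:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def dcg(recommended, rel):
    """Equation (22) taken literally: sum of (2^rel - 1) / ln(i + 1)."""
    return sum((2 ** rel.get(d, 0) - 1) / math.log(i + 1)
               for i, d in enumerate(recommended, 1))

# Toy query: papers p2 and p5 are the ground-truth citations.
recs, gt = ["p1", "p2", "p3", "p5"], ["p2", "p5"]
print(recall_at_k(recs, gt, k=4))          # 1.0
print(mrr([recs], [gt]))                   # 0.5 (first hit at rank 2)
print(average_precision(recs, gt))         # (1/2 + 2/4) / 2 = 0.5
print(dcg(recs, {"p2": 3, "p5": 2}))       # graded-relevance gain
```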

4.3. Baseline Comparison. We choose the following methods for comparison:

(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.

(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the suggestion of the author. We set the hidden layer size to 600.

(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to n = 300.

(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to n = 600. For negative sampling, we set the number of negative samples k = 10, where k is the number of noise words in the citation context. For noise contrast estimation, we set the number of noise samples k = 1000.

(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and there are 2 recurrent layers. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
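For intuition, here is a minimal gensim sketch of the W2V baseline's setup (our reconstruction; the pseudo-word tokens, toy corpus, and all hyperparameters other than the 300-dimensional vectors are assumptions). Injecting the cited document as a pseudo-word puts context words and documents in one shared vector space, so candidate documents can be ranked by similarity to the context:

```python
from gensim.models import Word2Vec

# Each training "sentence" is a citation context with the cited paper
# appended as a pseudo-word token, so words and documents end up in
# one shared 300-dimensional space (n = 300, as in the baseline setup).
sentences = [
    ["neural", "probabilistic", "citation", "recommendation", "DOC_42"],
    ["topic", "models", "linked", "corpus", "DOC_7"],
    ["citation", "context", "similarity", "DOC_42"],
]
model = Word2Vec(sentences, vector_size=300, min_count=1, epochs=50, seed=1)

# Rank candidate tokens (including document pseudo-words) by similarity
# to the words of a query context.
print(model.wv.most_similar(positive=["citation", "recommendation"], topn=2))
```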

Figures 8 and 9 show the performance of each method on the CiteSeer dataset. It is obvious that the SVD-CNN model leads the performance in most cases. More detailed analyses are given as follows.

First, we perform a comparison among CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN completely and significantly exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation modeling of our network. Due to the lack of citation context information, W2V is obviously worse than the other methods in terms of all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.

Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without restraint. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for the decoder, which is apparently not sufficient for learning good embeddings.

4.4. The Influence of Interactional Features of Reference Patterns on Link Prediction. According to the chapter position of the citation context in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of reference contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are split into training and test sets at a ratio of 3:1. Tables 1 and 2 show the results on the abovementioned datasets.

From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on the unmixed datasets than on the mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.

Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in the mixed datasets, which come from the diversity of reference patterns.

To better explore why the mixed datasets are more complex than the unmixed datasets, Figure 10 shows the change in S(W) during the training process of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs fairly well on the unmixed datasets while achieving poor performance on the mixed datasets. However, SVD-CNN achieves almost the same performance on the two types of datasets. This proves that the correlation arising from diverse reference patterns can significantly affect link prediction.

The reason why the change in S(W) is not large on the unmixed datasets is that the reference patterns of an unmixed dataset have similar features, since they belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on the unmixed datasets. Nevertheless, a citation recommendation algorithm still performs well on the unmixed datasets because their complexity is low.

Although the mixed datasets are more complicated than the unmixed ones, SVD-CNN still performs well on them. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns, and our approach is more suitable for complex scenarios.

4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are still other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with the following variants (a numpy sketch of these replacements is given after the list):

(1) Using the originally learned W.
(2) Replacing W with US.
(3) Replacing W with U.
(4) Replacing W with UV^T.
(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper triangular matrix in the QR decomposition.
(6) Replacing W with W_PCA, where W_PCA is the diagonal matrix extracted from the weight matrix W after dimension reduction by PCA.
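The first five replacements can be sketched directly with numpy (our illustration; the W_PCA variant is omitted because the paper's description of its construction is too terse to reproduce faithfully). Note that variants (2)-(5) all drive S(W) to 1, yet Table 3 shows that only W -> US and W -> W_PCA preserve accuracy, so decorrelation alone is not sufficient:

```python
import numpy as np

def s_of_w(M):
    """Correlation measure S(W) from equation (18)."""
    H = M.T @ M
    return np.trace(H) / np.abs(H).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))              # hypothetical trained FC weights

U, s, Vt = np.linalg.svd(W)                  # W = U @ np.diag(s) @ Vt
Q, R = np.linalg.qr(W)

variants = {
    "W":     W,                              # (1) original weights
    "US":    U @ np.diag(s),                 # (2) orthogonal directions, singular values kept
    "U":     U,                              # (3) singular values dropped
    "UV^T":  U @ Vt,                         # (4) nearest orthogonal matrix to W
    "QD":    Q @ np.diag(np.diag(R)),        # (5) Q times the diagonal of R
}

for name, M in variants.items():
    print(f"{name:5s}  S(W) = {s_of_w(M):.3f}")   # 1.000 for all but the original W
```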

After training converges, the different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weights, measured as the proportion of added time relative to the original training time. As shown in Table 3, all types of decorrelation other than W -> US and W -> W_PCA degrade the performance. However, the time cost of W -> W_PCA is higher than that of W -> US.

4.6. Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations, and d_0, the dimension of the input representations. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing d_0 = 300. Since sot defines the number of loops of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance.

Table 1: MRR metric on various datasets.

           Introduction   Related   Main     Introduction + related   Introduction + main   Related + main
CNN        0.3312         0.3294    0.3478   0.2773                   0.2815                0.2978
SVD-CNN    0.3995         0.4078    0.3989   0.3878                   0.3889                0.3845

Figure 8: Comparison of recall with different methods on CiteSeer (recall versus the number of recommended citations, for W2V, NPM, RBM-CS, CP-LDA, NCN, and SVD-CNN).

Figure 9: Comparison of MRR, MAP, and nDCG scores for top-10 recommendations with different methods on CiteSeer (CP-LDA, RBM-CS, W2V, NPM, NCN, and SVD-CNN); SVD-CNN scores highest (MRR 0.3687, MAP 0.3352, nDCG 0.3448).

Table 2: MAP metric on various datasets.

           Introduction   Related   Main     Introduction + related   Introduction + main   Related + main
CNN        0.3001         0.2909    0.3107   0.2572                   0.2601                0.2637
SVD-CNN    0.3701         0.3655    0.3693   0.3498                   0.3511                0.3539


In this situation, the weight matrix in the FC layer is highly correlated, and S(W) has the lowest value. The recommendation performance then increases with sot, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.

In our model, d_0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d_0 at the same sot. When d_0 is small, the information content of the citation context representation is very small, which produces worse performance. The recommendation performance increases up to a maximum point as d_0 reaches 300.

Figure 10: The change in S(W) during training (S(W) versus sot, 0-10) on the unmixed datasets (Introduction, Related, Main) and the mixed datasets (Introduction + related, Introduction + main, Related + main).

Table 3: The comparison of related methods in Step 1.

              W      W -> US   W -> U   W -> UV^T   W -> QD   W -> W_PCA
Rank-1 (%)    63.6   63.6      61.7     61.7        61.6      63.6
mAP (%)       39.0   39.0      37.1     37.1        37.3      39.0
T-cost (%)    0      36.27     36.27    36.27       35.33     57.65

Figure 11: The performance impact of sot on CiteSeer (MRR and S(W) versus sot, 1-9; curves for MRR-LR, MRR-SVM, S(W) with SOT, and S(W) without SOT).


It should be noted that although a larger d_0 gives better results, it also significantly increases the training time. Therefore, we choose d_0 = 300.

Figure 12: The performance impact of d_0 on CiteSeer (MRR, MAP, nDCG, and Recall@10 versus the input-layer vector dimension d_0 in {100, 200, 300, 400, 500}; at d_0 = 300, MRR = 0.3687, MAP = 0.3352, nDCG = 0.3448, and Recall@10 = 0.5801).

5. Conclusion and Future Works

We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weights of the FC layer, which essentially makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model using the full text of papers.

Data Availability

Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at relevant places within the text as reference [4].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).

References

[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.
[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.
[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.
[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.
[6] X. Ren, J. Liu, X. Yu et al., "ClusCite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.
[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093-1096, Shinjuku, Japan, August 2017.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499-510, 2003.
[10] N. Meuschke, B. Gipp, and M. Lipinski, "CITREC: an evaluation framework for citation-based similarity measures based on TREC genomics and PubMed Central," 2015, http://hdl.handle.net/2142/73680.
[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213-222, Napa Valley, CA, USA, October 2008.
[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.
[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/abs/1802.08301.
[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.
[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.
[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.
[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.
[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.
[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.
[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.
[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in International Conference on Learning Representations, 2017.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.
[24] A. Brock, J. Donahue, K. Simonyan et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.
[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820-3828, Venice, Italy, October 2017.
[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, pp. 15844-15869, 2018.
[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.
[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.
[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1-17, 2014.
[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121-2159, 2011.
[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.
[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.
[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of TREC'00, pp. 77-82, Gaithersburg, MD, USA, 2000.
[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.
[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.

Page 6: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

radv minusising

g2 whereg nablaxlogp(y|x 1113954θ) (9)

0is perturbation can be easily computed by usingbackpropagation in neural networks

332 Layerwise Training In our training steps we defineconv-pooling block bt (tge 2) which consists of a convo-lution layer and a pooling layer Our network model is thenassembled by the initialization block b1 that initializes usingword2vec and (n minus 1) conv-pooling blocks

First we train the conv-pooling block b2 after b1 istrained On this basis the next conv-pooling block b3 iscreated by keeping the previous block fixed We repeat thisprocedure until all (n minus 1) conv-pooling blocks are trained

Second the following semiorthogonal training proce-dure is used to train the whole network

Semiorthogonal training (SOT) it is crucial to trainSVD-CNN which consists of the following three steps

Step 1 Decompose the weight matrix by SVD ieW USVT W is the weight matrix of the linear layerU is the left-unitary matrix S is the singular valuematrix V is the right-unitary matrix After that wereplace W with US Next we take all eigenvectors ofUS(US)T as weight vectorsStep 2 0e backbone model is fine-tuned by fixing theSVD-FC layerStep 3 0e model keeps fine-tuning with the unfixedSVD-FC layer

Step 1 can generate orthogonal weights but the per-formance of prediction cannot be guaranteed 0e reason isthat over orthogonality will excessively punish synonymoussentences which is apparently inappropriate 0erefore weintroduce Steps 2 and 3 to solve the above problem

0e inputs of SVD-FC are defined as Y

Y (y1 y2 ym)T 0e outputs are defined as O

O (o1 o2 om)T 0e weight matrix is defined asW (w1 w2 wm)T 0e expected outputs are defined asA (a1 a2 am)T 0e error function is defined as

E 12

1113944

l

k1ak minus ok( 1113857

2 (10)

where ok f(1113936mj0 wkjyj) k 1 2 l 0en E with re-

spect to ok is derived and the outcome is

zE

zok

minus ak minus ok( 1113857 (11)

We utilize the gradient descent strategy to find thegradient of the error with respect to weights 0e iterativeupdate of weights is as follows

Δwkj minusηzE

zwkj

(12)

We define an error signal δok zEz netk equation (12) is

equivalent to

Δwkj minusηzE

z netk

z netkzwkj

minusηδok

z netkzwkj

(13)

According to equation (11) δok zEz netk is equivalent

to

δok minus

zE

zok

zok

z netk minus

zE

zok

fprime netk( 1113857

zE

zok

okprime minus dk minus ok( 1113857ok

prime

(14)

We use the sigmoid f(x) 1(1 + ex) as the nonlinearfunction so equation (13) is equivalent to

Δwkj minusηδokyj η dk minus ok( 1113857ok 1 minus ok( 1113857yj (15)

In Step 1 the weight matrix W is decomposed by SVDand replaced with US U (q1 q2 qm)T andS diag(λ1 λ2 λm) Since dk minus ok is given we definethat Loss dk minus ok As a result equation (15) is equivalent to

Δwkj η Loss middot ok minus sigmoid yj 1113944 qiλi + B1113872 11138732

1113876 1113877yj (16)

qi middot qj 0 ine j are in the left-unitary matrix U so themodel operation is not affected by the nonorthogonal ei-genvectors qi 0is is the reason for excessively punishingsynonymous sentences in Step 1 However orthogonalityhas a positive effect on Δwkj in Step 2

0e purpose of SVD is to maintain the orthogonality ofeach weight vector in geometric space When weight vectorsare conditioned by orthogonal regularization the relevancybetween weight vectors decreases We use the followingmethods in Step 3 to measure relevance

H WTW

w1rarrT

w1rarr

middot middot middot w1rarrT

wkrarr

⋮ ⋱ ⋮

wkrarrT

w1rarr

middot middot middot wkrarrT

wkrarr

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

h11 middot middot middot h1k

⋮ ⋱ ⋮

hk1 middot middot middot hkk

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

(17)

where W is a weight matrix that contains k weight vectorswi (i 1 k)hij (i j 1 k) is the dot product of wi

and wj Let us define S(W) as the correlation measurementof all column vectors in W

S(W) 1113936

ki1 hii

1113936ki1 1113936

kj1 hij

11138681113868111386811138681113868

11138681113868111386811138681113868 (18)

When W is an orthogonal matrix the value of S(W) is 1When ine j S(W) obtains the minimum value (1k)0erefore we can see that the value of S(W) falls into

6 Computational Intelligence and Neuroscience

[(1k) 1] As a result when S(W) is close to 1k or 0 theweight matrix will have high relevance

34 Complexity Analysis Assume that the training samplesize is |C| the average number of words in each citationcontext is |c| Cl is the number of kernels in the l-th layer andwis the size of the sliding window For one convolution layerthe training complexity is O(Clminus1 middot Cl middot w middot (s minus w + 1)) 0etraining complexity of one w-ap layer is O(C2

l middot w middot s) 0etraining complexity of one all-ap layer isO(C2

l middot (s minus w + 1))which was improved by C F Van Loan [12] computing theeigenvalue for SVD matrix decomposition with K size takesO(K) on the way of JACOBI Assume that the size of theweight matrix in the SVD-FC layer isK and the channel ofthe input matrix is Cin 0e computational cost for the SVD-FC layer is O(2K2 middot Cin + K)

4 Experiment

41 Dataset We use the CiteSeer dataset [34] to evaluatethe performance of our model 0e dataset was publishedby Huang et al [4] In this dataset citation relationshipsare extracted by a pair of citation contexts and the ab-stracts of cited papers A citation context includes thesentence where the citation placeholder appears and thesentences before and after the citation placeholderWithin each paper in the corpus the 50 words before and50 words after each citation reference are treated as thecorresponding citation context (a discussion on thenumber of words can be found in [7]) Before wordembedding we also remove stop words from the contextsTo preserve the time-sensitive pastpresentfuture tensesof verbs and the singularplural styles of named entitiesno stemming is done but all words are transferred tolower-case 0e training set contains 3989547 pairs ofreference contexts and citations and the test set contains1021685 citation relations

Following common practice in information retrieval(IR) we employ the following four evaluation metrics toevaluate recommendation results recall mean reciprocalrank (MRR) mean average precision (MAP) and normal-ized discounted cumulative gain (nDCG)

42 EvaluationMetric For each query in the test set we usethe original set of references as the ground truth Rg Assumethat the set of recommended citations is Rr and the correctrecommendations are Rg capRr Recall is defined as

recall Rg capRr

11138681113868111386811138681113868

11138681113868111386811138681113868

Rg

(19)

In our experiments the number of recommended ci-tations ranges from 1 to 10 Recall evaluation does not revealthe order of recommended references To address thisproblem we select the following two additional metrics

For a query q let rankq be the rank of the first correctrecommendation within the list MRR [35] is defined as

MRR 1

|Q|1113944qisinQ

1rankq

(20)

where Q is the testing set MRR reveals the average rankingof the first correct recommendation

For each citation placeholder we search the papers thatmay be referenced at this citation placeholder Each retrievalmodel returns a ranked list of papers Since there may be oneor more references for one citation context we use meanaverage precision (MAP) as the evaluation metric

MAP d1 dN( 1113857 1113936i R di( 1113857i( 11138571113936jleiR dj1113872 1113873

1113936iR di( 1113857 (21)

where R(di) is a binary function indicating whether doc-ument di is relevant or not For our problem the papers citedat the citation placeholder are considered relevantdocuments

We use normalized discounted cumulative gain (NDCG)to measure the ranked recommendation list 0e NDCGvalue of a ranking list at position i is calculated as

NDCG d1 dN( 1113857 1113944i

2rel di( ) minus 1lni+1 (22)

where rel (di) is the 4-scale relevance of document di in theranked list We use the average cocited probability [2] oflangdi dlowastrang to weigh the citation relevance score of di to dlowast(anoriginal citation of the query) We report the average NDCGscore over all testing documents

43 BaselineComparison We choose the following methodsfor comparison

Cite-PLSA-LDA (CP-LDA) [36] we use the originalimplementation provided by the author 0e number oftopics is set to 60

(i) Restricted Boltzmann Machine (RBM-CS) [37] Wetrain two layers of RBM-CS according to the sug-gestion of the author We set the hidden layer size to600

(ii) Word2vec Model (W2V) [29] We use the word2vecmodel to learn words and document representa-tions 0e cited document is treated as a ldquowordrdquo (adocument uses a unique marker when it is cited bydifferent papers) 0e dimensions of the word anddocument vectors are set to n 300

(iii) Neural Probabilistic Model (NPM) [4] We followthe original implementation 0e dimensions of theword and document representation vector are set ton 600 For negative sampling we set the numberof negative samples k 10 where k is the number ofnoise words in the citation context For noisecontrast estimation we set the number of noisesamples k 1000

(iv) Neural Citation Network (NCN) [7] In NCN thegradient clipping is 5 the dropout probability is 02and the recurrent layers are 2 0e region sizes for

Computational Intelligence and Neuroscience 7

the encoder are set to 4 4 and 5 and the region sizesfor the author network are set to 1 and 2

Figures 8 and 9 show the performance of eachmethod onthe CiteSeer dataset It is obvious that the SVD-FC modelleads the performance in most cases More detailed analysesare given as follows

First we perform a comparison among CP-LDA RBMW2V and SVD-CNN Our SVD-CNN completely andsignificantly exceeds other models in all metrics 0e successof ourmodel is ascribed to the content and correlation of ournetwork Due to the lack of citation context information wefind that W2V is obviously worse than other methods interms of all metrics CP-LDA works much better than W2Vwhich indicates that link information is very important forfinding relevant papers RBM-CS shows a clear performancegain over W2V because RBM-CS automatically discoverstopical aspects of each paper based on citation contextHowever the vector representations of citation context inRBM-CS are extracted by traditional word vector repre-sentations which fully neglect semantic relations betweenthe citation document and citation context and thus may belimited by vocabulary

Second we compare the performance among NPMNCN and SVD-CNN It is not surprising that NPM andNCN achieve worse performance than SVD-CNN since theirdistributed representation of words and documents reliessolely on deep learning without restraint NPM recommendscitations based on trained distributed representations NCNfurther enhances the performance by considering authorinformation and using a more sophisticated neural networkarchitecture However the CNN in NCN does not haveorthogonal constraints which makes it difficult to capturedifferent types of citing activities In addition NCN onlyutilizes the title of the cited paper for a decoder which isapparently not sufficient for learning good embedding

44 e Influence on the Link Prediction of Reference PatternInteractionalFeatures According to the chapter positions ofcitation context in the article we divide the training set intothree parts the introduction part contains 1307885 pairs ofreference contexts and citations the related word partcontains 1599897 pairs of citations and the main partcontains 1024783 pairs Furthermore these datasets formthree mixed datasets In this part of the experiment we usethe CNN model without SVD as the baseline 0ese datasetsare tested in a ratio of 3 1 In Tables 1 and 2 we show theresults on the abovementioned datasets

From the results we obtain the following observationsFirst both CNN and SVD-CNN outperform unmixed

datasets over mixed datasets across the different evaluationmetrics which shows that the diversity of reference patternsincreases the difficulty of citation recommendation tasks

Second in Tables 1 and 2 we observe that our model isparticularly good at resolving the difficulties in mixeddatasets which come from the diversity of referencepatterns

To better explore why mixed datasets are more complexthan unmixed datasets in Figure 10 we show the change in

S(W) during the training process of SVD-CNN amongvarious datasets

As shown in Figure 10 the increase in S(W) on themixed datasets indicates that SVD-CNN is good at decor-relation We can also see in Tables 1 and 2 that the CNNmodel has pretty performance on unmixed datasets whileachieving poor performance on mixed datasets HoweverSVD-CNN achieves almost the same performance on thetwo types of datasets 0is proves that the correlation fromvarious reference patterns can significantly affect the linkprediction

0e reason why the change in S(W) is not large on theunmixed datasets is that reference patterns of unmixeddatasets have similar features which belong to the samecategory As a result the orthogonality of the weight matrixis hard to improve on unmixed datasets However a citationrecommendation algorithm has pretty performance on theunmixed datasets because there are low complexities

Although mixed datasets are more complicated thanunmixed datasets SVD-CNN still performs well in mixeddatasets 0is indicates that SVD-CNN reduces the negativeimpact of the correlation of reference patterns and ourapproach is more suitable for complex scenarios

45 Comparison with Other Types of Decorrelation In ad-dition to SVD there are still some other methods fordecorrelating the feature matrix However these methodscannot maintain the discriminating ability of the CNNmodel To illustrate this we compare SVD with severalvarieties as follows

(1) Using the originally learned W

(2) Replacing W with US

(3) Replacing W with U

(4) Replacing W with UVT

(5) Replacing Wwith Q D where D is the diagonalmatrix extracted from the upper triangle matrix inQ-R decomposition

(6) Replacing W with WPCA where WPCA is the diagonalmatrix extracted from the weight matrix W after theprocessing of dimension reduction by PCA

After convergence of training different orthogonalmatrices are used to replace the weight matrix W We defineT-cost as the time cost of replacing the weight which isequivalent to the proportion of the added time to the originaltime As shown in Table 3 other types of decorrelationdegrade the performance in addition to W⟶ US andW⟶WPCA However the time cost of W⟶WPCA ismore than that of W⟶ US

46 Ablation Study In our method there are two essentialparameters a term sot which means the number of SOTiterations and a biased parameter d0 In this section weconduct an ablation study of these parameters

We first evaluate the effectiveness of sot by empiri-cally fixing d0 300 Since sot defines the loop time of

8 Computational Intelligence and Neuroscience

orthogonal constraint training it should be set as anonnegative value Figure 11 illustrates the MRR with sotfrom 0 to 10 on the CiteSeer dataset We can see that the

performance improves as the value of sot increasesWhen sot 0 the model has no decorrelation andachieves the worst performance In this situation the

Table 1 MRR metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03312 03294 03478 02773 02815 02978SVD-CNN 03995 04078 03989 03878 03889 03845

060055050045040035030025Re

call

020015010005000

20 40 60Number of recommended citations

80 100

W2VsNPMs

RBMsCP_LDAs

SVD_CNNsNCNs

Figure 8 Comparison of recall with different methods on CiteSeer

MRR MAP and nDCG scores for top 10 recommendations04

035

03

025

02

015

01

005

0

0091600997

MRR MAP nCDG

00662

01843

02667

03687

00912009982

00663

01835

02418

03352

01288 0135601476

0256602592

03448

CP-LDARBM

W2VNPM

NCNSVD-CNN

Figure 9 Comparison of MRR MAP and nDCG with different methods on CiteSeer

Table 2 MAP metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03001 02909 03107 02572 02601 02637SVD-CNN 03701 03655 03693 03498 03511 03539

Computational Intelligence and Neuroscience 9

weight matrix in the FC layer is highly correlated andS(W) has the lowest value 0e recommendation per-formance then increases while adding sot which indi-cates that reducing the correlative degree of the weightmatrix in the FC layer is critical for improving perfor-mance When sot 10 our model achieves the bestperformance

In our model d0 is the dimension of citation contextand cited document representations Figure 12 shows howthe performance of SVD-CNN varies with d0 on the samesot When d0 is small the information content of thecitation context is very small and produces worse per-formance 0e recommendation performance increases toa maximum point until d0 reaches 300 It should be noted

05

045

04

035

03

S (W

)

025

02

015

01

005

00 1 2 3 4 5

Sot6 7 8 9 10 11

IntroductionRelatedMain

Introduction + relatedIntroduction + mainRelated + main

Figure 10 0e change in S(W) during training on unmixed datasets and mixed datasets

Table 3 0e comparison of related methods in Step 1

W W⟶ US W⟶ U W⟶ UVT W⟶ Q D W⟶WPCA

Rank-1 636 636 617 617 616 636mAP 390 390 371 371 373 390T-cost 0 3627 3627 3627 3533 5765

055050045040035030

MRR

025020015010005000

05

S (W

)

10

04

03

02

01

001 3 5

Sot7 9

MRRLRMRRSVM

S (W) SOTS (W) NO-SOT

Figure 11 0e performance impact of sot on CiteSeer

10 Computational Intelligence and Neuroscience

that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300

5 Conclusion and Future Works

We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers

Data Availability

Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]

Conflicts of Interest

0e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)

References

[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013

[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010

[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011

[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015

[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016

[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014

[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017

[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003

[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003

[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures

06

055

05

045

Valu

e of e

valu

atio

n sta

ndar

d

04

035

03

025100 200 300 400 500

03302 03512 03687 03701 0372203003 03225 03352 03398 0340903101 03312 03448 03486 0349905456 05689 05801

Vector dimension of input layer05842 05867

Various d_0 for the effects of the model

MRRMAPnDCGRecall10

Figure 12 0e performance impact of d0 on CiteSeer

Computational Intelligence and Neuroscience 11

based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680

[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008

[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985

[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf

[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018

[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009

[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007

[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007

[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014

[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011

[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014

[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014

[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017

[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014

[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096

[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017

[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via

implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018

[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018

[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017

[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013

[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014

[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011

[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016

[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014

[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008

[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000

[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010

[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009

12 Computational Intelligence and Neuroscience

Page 7: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

[(1k) 1] As a result when S(W) is close to 1k or 0 theweight matrix will have high relevance

34 Complexity Analysis Assume that the training samplesize is |C| the average number of words in each citationcontext is |c| Cl is the number of kernels in the l-th layer andwis the size of the sliding window For one convolution layerthe training complexity is O(Clminus1 middot Cl middot w middot (s minus w + 1)) 0etraining complexity of one w-ap layer is O(C2

l middot w middot s) 0etraining complexity of one all-ap layer isO(C2

l middot (s minus w + 1))which was improved by C F Van Loan [12] computing theeigenvalue for SVD matrix decomposition with K size takesO(K) on the way of JACOBI Assume that the size of theweight matrix in the SVD-FC layer isK and the channel ofthe input matrix is Cin 0e computational cost for the SVD-FC layer is O(2K2 middot Cin + K)

4 Experiment

41 Dataset We use the CiteSeer dataset [34] to evaluatethe performance of our model 0e dataset was publishedby Huang et al [4] In this dataset citation relationshipsare extracted by a pair of citation contexts and the ab-stracts of cited papers A citation context includes thesentence where the citation placeholder appears and thesentences before and after the citation placeholderWithin each paper in the corpus the 50 words before and50 words after each citation reference are treated as thecorresponding citation context (a discussion on thenumber of words can be found in [7]) Before wordembedding we also remove stop words from the contextsTo preserve the time-sensitive pastpresentfuture tensesof verbs and the singularplural styles of named entitiesno stemming is done but all words are transferred tolower-case 0e training set contains 3989547 pairs ofreference contexts and citations and the test set contains1021685 citation relations

Following common practice in information retrieval(IR) we employ the following four evaluation metrics toevaluate recommendation results recall mean reciprocalrank (MRR) mean average precision (MAP) and normal-ized discounted cumulative gain (nDCG)

42 EvaluationMetric For each query in the test set we usethe original set of references as the ground truth Rg Assumethat the set of recommended citations is Rr and the correctrecommendations are Rg capRr Recall is defined as

recall Rg capRr

11138681113868111386811138681113868

11138681113868111386811138681113868

Rg

(19)

In our experiments the number of recommended ci-tations ranges from 1 to 10 Recall evaluation does not revealthe order of recommended references To address thisproblem we select the following two additional metrics

For a query q let rankq be the rank of the first correctrecommendation within the list MRR [35] is defined as

MRR 1

|Q|1113944qisinQ

1rankq

(20)

where Q is the testing set MRR reveals the average rankingof the first correct recommendation

For each citation placeholder we search the papers thatmay be referenced at this citation placeholder Each retrievalmodel returns a ranked list of papers Since there may be oneor more references for one citation context we use meanaverage precision (MAP) as the evaluation metric

MAP d1 dN( 1113857 1113936i R di( 1113857i( 11138571113936jleiR dj1113872 1113873

1113936iR di( 1113857 (21)

where R(di) is a binary function indicating whether doc-ument di is relevant or not For our problem the papers citedat the citation placeholder are considered relevantdocuments

We use normalized discounted cumulative gain (NDCG)to measure the ranked recommendation list 0e NDCGvalue of a ranking list at position i is calculated as

NDCG d1 dN( 1113857 1113944i

2rel di( ) minus 1lni+1 (22)

where rel (di) is the 4-scale relevance of document di in theranked list We use the average cocited probability [2] oflangdi dlowastrang to weigh the citation relevance score of di to dlowast(anoriginal citation of the query) We report the average NDCGscore over all testing documents

43 BaselineComparison We choose the following methodsfor comparison

Cite-PLSA-LDA (CP-LDA) [36] we use the originalimplementation provided by the author 0e number oftopics is set to 60

(i) Restricted Boltzmann Machine (RBM-CS) [37] Wetrain two layers of RBM-CS according to the sug-gestion of the author We set the hidden layer size to600

(ii) Word2vec Model (W2V) [29] We use the word2vecmodel to learn words and document representa-tions 0e cited document is treated as a ldquowordrdquo (adocument uses a unique marker when it is cited bydifferent papers) 0e dimensions of the word anddocument vectors are set to n 300

(iii) Neural Probabilistic Model (NPM) [4] We followthe original implementation 0e dimensions of theword and document representation vector are set ton 600 For negative sampling we set the numberof negative samples k 10 where k is the number ofnoise words in the citation context For noisecontrast estimation we set the number of noisesamples k 1000

(iv) Neural Citation Network (NCN) [7] In NCN thegradient clipping is 5 the dropout probability is 02and the recurrent layers are 2 0e region sizes for

Computational Intelligence and Neuroscience 7

the encoder are set to 4 4 and 5 and the region sizesfor the author network are set to 1 and 2

Figures 8 and 9 show the performance of eachmethod onthe CiteSeer dataset It is obvious that the SVD-FC modelleads the performance in most cases More detailed analysesare given as follows

First we perform a comparison among CP-LDA RBMW2V and SVD-CNN Our SVD-CNN completely andsignificantly exceeds other models in all metrics 0e successof ourmodel is ascribed to the content and correlation of ournetwork Due to the lack of citation context information wefind that W2V is obviously worse than other methods interms of all metrics CP-LDA works much better than W2Vwhich indicates that link information is very important forfinding relevant papers RBM-CS shows a clear performancegain over W2V because RBM-CS automatically discoverstopical aspects of each paper based on citation contextHowever the vector representations of citation context inRBM-CS are extracted by traditional word vector repre-sentations which fully neglect semantic relations betweenthe citation document and citation context and thus may belimited by vocabulary

Second we compare the performance among NPMNCN and SVD-CNN It is not surprising that NPM andNCN achieve worse performance than SVD-CNN since theirdistributed representation of words and documents reliessolely on deep learning without restraint NPM recommendscitations based on trained distributed representations NCNfurther enhances the performance by considering authorinformation and using a more sophisticated neural networkarchitecture However the CNN in NCN does not haveorthogonal constraints which makes it difficult to capturedifferent types of citing activities In addition NCN onlyutilizes the title of the cited paper for a decoder which isapparently not sufficient for learning good embedding

44 e Influence on the Link Prediction of Reference PatternInteractionalFeatures According to the chapter positions ofcitation context in the article we divide the training set intothree parts the introduction part contains 1307885 pairs ofreference contexts and citations the related word partcontains 1599897 pairs of citations and the main partcontains 1024783 pairs Furthermore these datasets formthree mixed datasets In this part of the experiment we usethe CNN model without SVD as the baseline 0ese datasetsare tested in a ratio of 3 1 In Tables 1 and 2 we show theresults on the abovementioned datasets

From the results we obtain the following observationsFirst both CNN and SVD-CNN outperform unmixed

datasets over mixed datasets across the different evaluationmetrics which shows that the diversity of reference patternsincreases the difficulty of citation recommendation tasks

Second in Tables 1 and 2 we observe that our model isparticularly good at resolving the difficulties in mixeddatasets which come from the diversity of referencepatterns

To better explore why mixed datasets are more complexthan unmixed datasets in Figure 10 we show the change in

S(W) during the training process of SVD-CNN amongvarious datasets

As shown in Figure 10 the increase in S(W) on themixed datasets indicates that SVD-CNN is good at decor-relation We can also see in Tables 1 and 2 that the CNNmodel has pretty performance on unmixed datasets whileachieving poor performance on mixed datasets HoweverSVD-CNN achieves almost the same performance on thetwo types of datasets 0is proves that the correlation fromvarious reference patterns can significantly affect the linkprediction

0e reason why the change in S(W) is not large on theunmixed datasets is that reference patterns of unmixeddatasets have similar features which belong to the samecategory As a result the orthogonality of the weight matrixis hard to improve on unmixed datasets However a citationrecommendation algorithm has pretty performance on theunmixed datasets because there are low complexities

Although the mixed datasets are more complicated than the unmixed ones, SVD-CNN still performs well on them. This indicates that SVD-CNN reduces the negative impact of the correlation among reference patterns and that our approach is more suitable for complex scenarios.

4.5. Comparison with Other Types of Decorrelation. In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with several varieties as follows:

(1) Using the originally learned W.

(2) Replacing W with US.

(3) Replacing W with U.

(4) Replacing W with UV^T.

(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper triangular matrix R in the QR decomposition.

(6) Replacing W with W_PCA, where W_PCA is the matrix obtained from the weight matrix W after dimension reduction by PCA.

After the training converges, the different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, i.e., the proportion of the added time to the original time. As shown in Table 3, the other types of decorrelation degrade the performance, except W → US and W → W_PCA. However, the time cost of W → W_PCA is higher than that of W → US.
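The candidate replacements above amount to a few lines of linear algebra. The following NumPy sketch is our illustration, not the authors' code; it assumes W has at least as many rows as columns, and since the exact construction of W_PCA in the paper is not fully specified, a PCA projection of W is used here.

```python
import numpy as np

def decorrelation_variants(W):
    # Variants (1)-(6): candidate replacements for a trained FC weight matrix W.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U S V^T
    S = np.diag(s)
    Q, R = np.linalg.qr(W)
    D = np.diag(np.diag(R))        # diagonal of the upper triangular factor
    Wc = W - W.mean(axis=0)        # PCA on the rows of W (assumed construction)
    _, _, Vt_pca = np.linalg.svd(Wc, full_matrices=False)
    return {
        "W": W,                 # (1) original weight
        "US": U @ S,            # (2) orthogonal directions scaled by singular values
        "U": U,                 # (3) orthonormal basis only
        "UV^T": U @ Vt,         # (4) nearest orthogonal matrix to W
        "QD": Q @ D,            # (5) QR-based analogue of US
        "W_PCA": W @ Vt_pca.T,  # (6) PCA-based variant
    }
```

Replacing W with US keeps the singular-value spectrum of W, which is consistent with W → US and W → W_PCA preserving Rank-1 and mAP in Table 3, while W → U and W → UV^T, which discard the spectrum, lose accuracy.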

4.6. Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations (i.e., rounds of orthogonal constraint training), and d0, the dimension of the citation context and cited document representations. In this section, we conduct an ablation study of these parameters.

We first evaluate the effectiveness of sot by empirically fixing d0 = 300.


Since sot defines the number of rounds of orthogonal constraint training, it should be set to a nonnegative value. Figure 11 illustrates the MRR with sot ranging from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of sot increases. When sot = 0, the model has no decorrelation and achieves the worst performance.

Table 1: MRR on the various datasets.

          Introduction   Related   Main     Introduction+related   Introduction+main   Related+main
CNN       0.3312         0.3294    0.3478   0.2773                 0.2815              0.2978
SVD-CNN   0.3995         0.4078    0.3989   0.3878                 0.3889              0.3845

[Figure 8: Comparison of recall with different methods (W2V, NPM, RBM, CP-LDA, NCN, SVD-CNN) on CiteSeer. x-axis: number of recommended citations (0-100); y-axis: recall (0.00-0.60).]

[Figure 9: Comparison of MRR, MAP, and nDCG (scores for the top 10 recommendations) with different methods on CiteSeer. Bar values:

          MRR      MAP      nDCG
CP-LDA    0.0916   0.0912   0.1288
RBM       0.0997   0.0998   0.1356
W2V       0.0662   0.0663   0.1476
NPM       0.1843   0.1835   0.2566
NCN       0.2667   0.2418   0.2592
SVD-CNN   0.3687   0.3352   0.3448]

Table 2: MAP on the various datasets.

          Introduction   Related   Main     Introduction+related   Introduction+main   Related+main
CNN       0.3001         0.2909    0.3107   0.2572                 0.2601              0.2637
SVD-CNN   0.3701         0.3655    0.3693   0.3498                 0.3511              0.3539


In this situation (sot = 0), the weight matrix in the FC layer is highly correlated and S(W) has its lowest value. The recommendation performance then increases as sot grows, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.
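One way to read the sot loop is as a decorrelate-then-fine-tune cycle in the spirit of SVDNet's restraint-and-relaxation training [25]. The PyTorch sketch below is our interpretation under that assumption; model, fc_layer, and train_one_epoch are placeholder hooks, not the authors' API, and the FC weight is assumed to have at least as many input features as output features.

```python
import torch

def sot_training(model, fc_layer, train_one_epoch, sot=10, epochs_per_iter=1):
    # Each SOT iteration: decorrelate the FC weight via SVD, then fine-tune.
    for _ in range(sot):
        with torch.no_grad():
            W = fc_layer.weight.t()                         # columns = weight vectors
            U, S, Vh = torch.linalg.svd(W, full_matrices=False)
            fc_layer.weight.copy_((U @ torch.diag(S)).t())  # restraint step: W -> US
        for _ in range(epochs_per_iter):                    # relaxation step: fine-tune
            train_one_epoch(model)
    return model
```

With sot = 0 the loop body never runs, which corresponds to the no-decorrelation baseline in Figure 11.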

In our model, d0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d0 under the same sot. When d0 is small, the information content of the citation context representation is very limited, which produces worse performance. The recommendation performance then increases with d0 and begins to saturate around d0 = 300.

[Figure 10: The change in S(W) during training on the unmixed datasets (Introduction, Related, Main) and the mixed datasets (Introduction+related, Introduction+main, Related+main). x-axis: sot (0-11); y-axis: S(W) (0.00-0.50).]

Table 3: Comparison of the related methods in Step 1.

          W      W→US    W→U     W→UV^T   W→QD    W→W_PCA
Rank-1    63.6   63.6    61.7    61.7     61.6    63.6
mAP       39.0   39.0    37.1    37.1     37.3    39.0
T-cost    0      36.27   36.27   36.27    35.33   57.65

[Figure 11: The performance impact of sot on CiteSeer. x-axis: sot (1-9); left y-axis: MRR (0.00-0.55); right y-axis: S(W). Curves: MRR (LR), MRR (SVM), S(W) with SOT, S(W) without SOT.]


It should be noted that although a larger d0 performs slightly better, it also significantly increases the training time. Therefore, we choose d0 = 300.

[Figure 12: The performance impact of d0 on CiteSeer. Values by vector dimension of the input layer:

            100      200      300      400      500
MRR         0.3302   0.3512   0.3687   0.3701   0.3722
MAP         0.3003   0.3225   0.3352   0.3398   0.3409
nDCG        0.3101   0.3312   0.3448   0.3486   0.3499
Recall@10   0.5456   0.5689   0.5801   0.5842   0.5867]

5. Conclusion and Future Works

We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which essentially makes each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper; in the future, we will explore the performance of our model using the full text of papers.

Data Availability

Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at the relevant places within the text as reference [4].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).

References

[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.

[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.

[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.

[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.

[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.

[6] X. Ren, J. Liu, X. Yu et al., "ClusCite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.

[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.

[10] N. Meuschke, B. Gipp, and M. Lipinsk, "CITREC: an evaluation framework for citation-based similarity measures based on TREC Genomics and PubMed Central," 2015, http://hdl.handle.net/2142/73680.


[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.

[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.

[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/pdf/1802.08301v1.pdf.

[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.

[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.

[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.

[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.

[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.

[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.

[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in International Conference on Learning Representations, 2017.

[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.

[24] A. Brock, J. Donahue, K. Simonyan et al., "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.

[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.

[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, pp. 15844–15869, 2018.

[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.

[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.

[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.

[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.

[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.

[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.

[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.

[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.

[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.

[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.

[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.


Page 8: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

the encoder are set to 4 4 and 5 and the region sizesfor the author network are set to 1 and 2

Figures 8 and 9 show the performance of eachmethod onthe CiteSeer dataset It is obvious that the SVD-FC modelleads the performance in most cases More detailed analysesare given as follows

First we perform a comparison among CP-LDA RBMW2V and SVD-CNN Our SVD-CNN completely andsignificantly exceeds other models in all metrics 0e successof ourmodel is ascribed to the content and correlation of ournetwork Due to the lack of citation context information wefind that W2V is obviously worse than other methods interms of all metrics CP-LDA works much better than W2Vwhich indicates that link information is very important forfinding relevant papers RBM-CS shows a clear performancegain over W2V because RBM-CS automatically discoverstopical aspects of each paper based on citation contextHowever the vector representations of citation context inRBM-CS are extracted by traditional word vector repre-sentations which fully neglect semantic relations betweenthe citation document and citation context and thus may belimited by vocabulary

Second we compare the performance among NPMNCN and SVD-CNN It is not surprising that NPM andNCN achieve worse performance than SVD-CNN since theirdistributed representation of words and documents reliessolely on deep learning without restraint NPM recommendscitations based on trained distributed representations NCNfurther enhances the performance by considering authorinformation and using a more sophisticated neural networkarchitecture However the CNN in NCN does not haveorthogonal constraints which makes it difficult to capturedifferent types of citing activities In addition NCN onlyutilizes the title of the cited paper for a decoder which isapparently not sufficient for learning good embedding

44 e Influence on the Link Prediction of Reference PatternInteractionalFeatures According to the chapter positions ofcitation context in the article we divide the training set intothree parts the introduction part contains 1307885 pairs ofreference contexts and citations the related word partcontains 1599897 pairs of citations and the main partcontains 1024783 pairs Furthermore these datasets formthree mixed datasets In this part of the experiment we usethe CNN model without SVD as the baseline 0ese datasetsare tested in a ratio of 3 1 In Tables 1 and 2 we show theresults on the abovementioned datasets

From the results we obtain the following observationsFirst both CNN and SVD-CNN outperform unmixed

datasets over mixed datasets across the different evaluationmetrics which shows that the diversity of reference patternsincreases the difficulty of citation recommendation tasks

Second in Tables 1 and 2 we observe that our model isparticularly good at resolving the difficulties in mixeddatasets which come from the diversity of referencepatterns

To better explore why mixed datasets are more complexthan unmixed datasets in Figure 10 we show the change in

S(W) during the training process of SVD-CNN amongvarious datasets

As shown in Figure 10 the increase in S(W) on themixed datasets indicates that SVD-CNN is good at decor-relation We can also see in Tables 1 and 2 that the CNNmodel has pretty performance on unmixed datasets whileachieving poor performance on mixed datasets HoweverSVD-CNN achieves almost the same performance on thetwo types of datasets 0is proves that the correlation fromvarious reference patterns can significantly affect the linkprediction

0e reason why the change in S(W) is not large on theunmixed datasets is that reference patterns of unmixeddatasets have similar features which belong to the samecategory As a result the orthogonality of the weight matrixis hard to improve on unmixed datasets However a citationrecommendation algorithm has pretty performance on theunmixed datasets because there are low complexities

Although mixed datasets are more complicated thanunmixed datasets SVD-CNN still performs well in mixeddatasets 0is indicates that SVD-CNN reduces the negativeimpact of the correlation of reference patterns and ourapproach is more suitable for complex scenarios

45 Comparison with Other Types of Decorrelation In ad-dition to SVD there are still some other methods fordecorrelating the feature matrix However these methodscannot maintain the discriminating ability of the CNNmodel To illustrate this we compare SVD with severalvarieties as follows

(1) Using the originally learned W

(2) Replacing W with US

(3) Replacing W with U

(4) Replacing W with UVT

(5) Replacing Wwith Q D where D is the diagonalmatrix extracted from the upper triangle matrix inQ-R decomposition

(6) Replacing W with WPCA where WPCA is the diagonalmatrix extracted from the weight matrix W after theprocessing of dimension reduction by PCA

After convergence of training different orthogonalmatrices are used to replace the weight matrix W We defineT-cost as the time cost of replacing the weight which isequivalent to the proportion of the added time to the originaltime As shown in Table 3 other types of decorrelationdegrade the performance in addition to W⟶ US andW⟶WPCA However the time cost of W⟶WPCA ismore than that of W⟶ US

46 Ablation Study In our method there are two essentialparameters a term sot which means the number of SOTiterations and a biased parameter d0 In this section weconduct an ablation study of these parameters

We first evaluate the effectiveness of sot by empiri-cally fixing d0 300 Since sot defines the loop time of

8 Computational Intelligence and Neuroscience

orthogonal constraint training it should be set as anonnegative value Figure 11 illustrates the MRR with sotfrom 0 to 10 on the CiteSeer dataset We can see that the

performance improves as the value of sot increasesWhen sot 0 the model has no decorrelation andachieves the worst performance In this situation the

Table 1 MRR metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03312 03294 03478 02773 02815 02978SVD-CNN 03995 04078 03989 03878 03889 03845

060055050045040035030025Re

call

020015010005000

20 40 60Number of recommended citations

80 100

W2VsNPMs

RBMsCP_LDAs

SVD_CNNsNCNs

Figure 8 Comparison of recall with different methods on CiteSeer

MRR MAP and nDCG scores for top 10 recommendations04

035

03

025

02

015

01

005

0

0091600997

MRR MAP nCDG

00662

01843

02667

03687

00912009982

00663

01835

02418

03352

01288 0135601476

0256602592

03448

CP-LDARBM

W2VNPM

NCNSVD-CNN

Figure 9 Comparison of MRR MAP and nDCG with different methods on CiteSeer

Table 2 MAP metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03001 02909 03107 02572 02601 02637SVD-CNN 03701 03655 03693 03498 03511 03539

Computational Intelligence and Neuroscience 9

weight matrix in the FC layer is highly correlated andS(W) has the lowest value 0e recommendation per-formance then increases while adding sot which indi-cates that reducing the correlative degree of the weightmatrix in the FC layer is critical for improving perfor-mance When sot 10 our model achieves the bestperformance

In our model d0 is the dimension of citation contextand cited document representations Figure 12 shows howthe performance of SVD-CNN varies with d0 on the samesot When d0 is small the information content of thecitation context is very small and produces worse per-formance 0e recommendation performance increases toa maximum point until d0 reaches 300 It should be noted

05

045

04

035

03

S (W

)

025

02

015

01

005

00 1 2 3 4 5

Sot6 7 8 9 10 11

IntroductionRelatedMain

Introduction + relatedIntroduction + mainRelated + main

Figure 10 0e change in S(W) during training on unmixed datasets and mixed datasets

Table 3 0e comparison of related methods in Step 1

W W⟶ US W⟶ U W⟶ UVT W⟶ Q D W⟶WPCA

Rank-1 636 636 617 617 616 636mAP 390 390 371 371 373 390T-cost 0 3627 3627 3627 3533 5765

055050045040035030

MRR

025020015010005000

05

S (W

)

10

04

03

02

01

001 3 5

Sot7 9

MRRLRMRRSVM

S (W) SOTS (W) NO-SOT

Figure 11 0e performance impact of sot on CiteSeer

10 Computational Intelligence and Neuroscience

that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300

5 Conclusion and Future Works

We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers

Data Availability

Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]

Conflicts of Interest

0e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)

References

[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013

[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010

[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011

[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015

[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016

[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014

[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017

[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003

[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003

[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures

06

055

05

045

Valu

e of e

valu

atio

n sta

ndar

d

04

035

03

025100 200 300 400 500

03302 03512 03687 03701 0372203003 03225 03352 03398 0340903101 03312 03448 03486 0349905456 05689 05801

Vector dimension of input layer05842 05867

Various d_0 for the effects of the model

MRRMAPnDCGRecall10

Figure 12 0e performance impact of d0 on CiteSeer

Computational Intelligence and Neuroscience 11

based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680

[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008

[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985

[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf

[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018

[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009

[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007

[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007

[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014

[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011

[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014

[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014

[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017

[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014

[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096

[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017

[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via

implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018

[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018

[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017

[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013

[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014

[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011

[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016

[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014

[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008

[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000

[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010

[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009

12 Computational Intelligence and Neuroscience

Page 9: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

orthogonal constraint training it should be set as anonnegative value Figure 11 illustrates the MRR with sotfrom 0 to 10 on the CiteSeer dataset We can see that the

performance improves as the value of sot increasesWhen sot 0 the model has no decorrelation andachieves the worst performance In this situation the

Table 1 MRR metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03312 03294 03478 02773 02815 02978SVD-CNN 03995 04078 03989 03878 03889 03845

060055050045040035030025Re

call

020015010005000

20 40 60Number of recommended citations

80 100

W2VsNPMs

RBMsCP_LDAs

SVD_CNNsNCNs

Figure 8 Comparison of recall with different methods on CiteSeer

MRR MAP and nDCG scores for top 10 recommendations04

035

03

025

02

015

01

005

0

0091600997

MRR MAP nCDG

00662

01843

02667

03687

00912009982

00663

01835

02418

03352

01288 0135601476

0256602592

03448

CP-LDARBM

W2VNPM

NCNSVD-CNN

Figure 9 Comparison of MRR MAP and nDCG with different methods on CiteSeer

Table 2 MAP metric on various datasets

Introduction Related Main Introduction + related Introduction +main Related +mainCNN 03001 02909 03107 02572 02601 02637SVD-CNN 03701 03655 03693 03498 03511 03539

Computational Intelligence and Neuroscience 9

weight matrix in the FC layer is highly correlated andS(W) has the lowest value 0e recommendation per-formance then increases while adding sot which indi-cates that reducing the correlative degree of the weightmatrix in the FC layer is critical for improving perfor-mance When sot 10 our model achieves the bestperformance

In our model d0 is the dimension of citation contextand cited document representations Figure 12 shows howthe performance of SVD-CNN varies with d0 on the samesot When d0 is small the information content of thecitation context is very small and produces worse per-formance 0e recommendation performance increases toa maximum point until d0 reaches 300 It should be noted

05

045

04

035

03

S (W

)

025

02

015

01

005

00 1 2 3 4 5

Sot6 7 8 9 10 11

IntroductionRelatedMain

Introduction + relatedIntroduction + mainRelated + main

Figure 10 0e change in S(W) during training on unmixed datasets and mixed datasets

Table 3 0e comparison of related methods in Step 1

W W⟶ US W⟶ U W⟶ UVT W⟶ Q D W⟶WPCA

Rank-1 636 636 617 617 616 636mAP 390 390 371 371 373 390T-cost 0 3627 3627 3627 3533 5765

055050045040035030

MRR

025020015010005000

05

S (W

)

10

04

03

02

01

001 3 5

Sot7 9

MRRLRMRRSVM

S (W) SOTS (W) NO-SOT

Figure 11 0e performance impact of sot on CiteSeer

10 Computational Intelligence and Neuroscience

that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300

5 Conclusion and Future Works

We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers

Data Availability

Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]

Conflicts of Interest

0e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)

References

[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013

[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010

[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011

[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015

[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016

[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014

[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017

[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003

[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003

[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures

06

055

05

045

Valu

e of e

valu

atio

n sta

ndar

d

04

035

03

025100 200 300 400 500

03302 03512 03687 03701 0372203003 03225 03352 03398 0340903101 03312 03448 03486 0349905456 05689 05801

Vector dimension of input layer05842 05867

Various d_0 for the effects of the model

MRRMAPnDCGRecall10

Figure 12 0e performance impact of d0 on CiteSeer

Computational Intelligence and Neuroscience 11

based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680

[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008

[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985

[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf

[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018

[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009

[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007

[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007

[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014

[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011

[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014

[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014

[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017

[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014

[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096

[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017

[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via

implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018

[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018

[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017

[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013

[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014

[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011

[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016

[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014

[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008

[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000

[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010

[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009

12 Computational Intelligence and Neuroscience

Page 10: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

weight matrix in the FC layer is highly correlated andS(W) has the lowest value 0e recommendation per-formance then increases while adding sot which indi-cates that reducing the correlative degree of the weightmatrix in the FC layer is critical for improving perfor-mance When sot 10 our model achieves the bestperformance

In our model d0 is the dimension of citation contextand cited document representations Figure 12 shows howthe performance of SVD-CNN varies with d0 on the samesot When d0 is small the information content of thecitation context is very small and produces worse per-formance 0e recommendation performance increases toa maximum point until d0 reaches 300 It should be noted

05

045

04

035

03

S (W

)

025

02

015

01

005

00 1 2 3 4 5

Sot6 7 8 9 10 11

IntroductionRelatedMain

Introduction + relatedIntroduction + mainRelated + main

Figure 10 0e change in S(W) during training on unmixed datasets and mixed datasets

Table 3 0e comparison of related methods in Step 1

W W⟶ US W⟶ U W⟶ UVT W⟶ Q D W⟶WPCA

Rank-1 636 636 617 617 616 636mAP 390 390 371 371 373 390T-cost 0 3627 3627 3627 3533 5765

055050045040035030

MRR

025020015010005000

05

S (W

)

10

04

03

02

01

001 3 5

Sot7 9

MRRLRMRRSVM

S (W) SOTS (W) NO-SOT

Figure 11 0e performance impact of sot on CiteSeer

10 Computational Intelligence and Neuroscience

that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300

5 Conclusion and Future Works

We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers

Data Availability

Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]

Conflicts of Interest

0e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)

References

[1] M A Angrosh S Cranefield and N Stanger ldquoConditionalrandom field based sentence context identification enhancingcitation services for the research communityrdquo in Proceedingsof the First Australasian Web Conference Adelaide AustraliaJanuary 2013

[2] Q He J Pei D Kifer et al ldquoContext-aware citation rec-ommendationrdquo in Proceedings of the International Conferenceon World Wide Web Raleigh NC USA April 2010

[3] Q He D Kifer J Pei et al ldquoCitation recommendationwithout author supervisionrdquo in Proceedings of the FourthACM international Conference on Web Search and DataMining Hong Kong China February 2011

[4] W Huang ldquoA neural probabilistic model for context basedcitation recommendationrdquo in Proceedings of the AAAIConference on Artificial Intelligence Austin TX USA January2015

[5] J Tan X Wan and J Xiao ldquoA neural network approach toquote recommendation in writingsrdquo in Proceedings of theACM International on Conference on Information andKnowledge Management Indianapolis IN USA October2016

[6] X Ren J Liu X Yu et al ldquoCluscite effective citation rec-ommendation by information network-based clusteringrdquo inProceedings of the 20th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining New YorkNY USA August 2014

[7] T Ebesu and Y Fang ldquoNeural citation network for context-aware citation recommendationrdquo in Proceedings of the 40thInternational ACM SIGIR Conference on Research and De-velopment in Information Retrieval pp 1093ndash1096 ShinjukuJapan August 2017

[8] D M Blei A Y Ng and M I Jordan ldquoLatentdirichlet allocationrdquo Journal of Machine Learning Researchvol 3 pp 993ndash1022 2003

[9] S Bradshaw ldquoReference directed indexing redeeming rele-vance for subject search in citation indexesrdquo Research andAdvanced Technology for Digital Libraries vol 2769pp 499ndash510 2003

[10] N Meuschke B Gipp and M Lipinsk ldquoCITREC an eval-uation framework for citation-based similarity measures

06

055

05

045

Valu

e of e

valu

atio

n sta

ndar

d

04

035

03

025100 200 300 400 500

03302 03512 03687 03701 0372203003 03225 03352 03398 0340903101 03312 03448 03486 0349905456 05689 05801

Vector dimension of input layer05842 05867

Various d_0 for the effects of the model

MRRMAPnDCGRecall10

Figure 12 0e performance impact of d0 on CiteSeer

Computational Intelligence and Neuroscience 11

based on TREC genomics and PubMed centralrdquo 2015 httphdlhandlenet214273680

[11] A Ritchie S Robertson and S Teufel ldquoComparing CitationContexts for information Retrievalrdquo in Proceedings of the 17thACM Conference on Information and Knowledge Manage-ment pp 213ndash222 Napa Valley CA USA October 2008

[12] C F Van Loan e Block Jacobi Method for Computing theSingular Value Decomposition Department of ComputerScience Cornell University Ithaca NY USA 1985

[13] C Bhagavatula S Feldman R Power et al ldquoContent-basedcitation recommendationrdquo 2018 httpsarxivorgpdf18020830201v1pdf

[14] H Jia and E Saule ldquoLocal is good a fast citation recom-mendation approachrdquo Lecture Notes in Computer ScienceVol 10772 Springer Berlin Germany 2018

[15] Y Sun W Ni and R Men ldquoA personalized paper recom-mendation approach based on web paper mining and re-viewerrsquos interest modellingrdquo in Proceedings of theInternational Conference on Research Challenges in ComputerScience Shanghai China December 2009

[16] B Shaparenko and T Joachims ldquoInformation genealogyUncovering the flow of ideas in non-hyperlinked documentdatabasesrdquo in Proceedings of the ACM SIGKDD internationalConference on Knowledge Discovery and Data Mining SanJose CA USA August 2007

[17] T Strohman W B Croft and D Jensen ldquoRecommendingcitations for academic papersrdquo in Proceedings of the Annualinternational ACM SIGIR Conference on Research and De-velopment in information Retrieval Amsterdam NetherlandsJuly 2007

[18] A Livne V Gokuladas J Teevan et al ldquoCiteSight supportingcontextual citation recommendation using differentialsearchrdquo in Proceedings of the International ACM SIGIRConference on Research amp Development in informationRetrieval Gold Coast Australia July 2014

[19] Y Lu J He D Shan et al ldquoRecommending citations withtranslation modelrdquo in Proceedings of the ACM internationalConference on Information and Knowledge ManagementGlasgow UK October 2011

[20] W Huang P Mitra S Kataria et al ldquoRecommending cita-tions translating papers into referencesrdquo in Proceedings of theACM international Conference on Information and KnowledgeManagement Shanghai China November 2014

[21] X Tang X Wan X Zhang et al ldquoCross-language context-aware citation recommendation in scientific articlesrdquo inProceedings of the International ACM SIGIR Conference onResearch amp Development in information Retrieval Gold CoastUK July 2014

[22] A Brock T Lim J M Ritchie et al ldquoNeural photo edi-tingwith introspective adversarial networksrdquo in InternationalConference on Learning Representations 2017

[23] I J Goodfellow J Pouget-Abadie MMirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the International Con-ference on Neural Information Processing Systems MontrealCanada December 2014

[24] A Brock J Donahue K Simonyan et al ldquoLarge scale GANtraining for high fidelity natural image synthesisrdquo 2018httpsarxivorgabs180911096

[25] Y Sun L Zheng W Deng et al ldquoSVDNet for pedestrianretrievalrdquo in Proceedings of the 2017 IEEE InternationalConference on Computer Vision (ICCV) pp 3820ndash3828Venice Italy October 2017

[26] Q Zheng M Yang J Yang Q Zhang and X ZhangldquoImprovement of generalization ability of deep CNN via

implicit regularization in two-stage training processrdquo IEEEAccess vol 6 no 1109 pp 15844ndash15869 2018

[27] Y Wang D Gong Z Zheng et al ldquoOrthogonal deep featuresdecomposition for age-invariant face recognitionrdquo in Pro-ceedings of the European Conference on Computer Vision(ECCV) Munich Germany September 2018

[28] Y Chen X Jin J Feng et al ldquoTraining group orthogonalneural networks with privileged informationrdquo in Proceedingsof the Twenty-Sixth International Joint Conference on ArtificialIntelligence Melbourne Australia August 2017

[29] T Mikolov I Sutskever K Chen et al ldquoDistributedrepresentations of words and phrases and their composi-tionalityrdquo in Proceedings of the 26th International Con-ference on Neural Information Processing Systems LakeTahoe NV USA December 2013

[30] A Rajaraman and J D Ullman ldquoData miningrdquo Mining ofMassive Datasets vol 3 no 2 pp 1ndash17 2014

[31] J Duchi E Hazan and Y Singer ldquoAdaptive subgradientmethods for online learning and stochastic optimizationrdquoJournal of Machine Learning Research vol 12 no 7pp 2121ndash2159 2011

[32] T Miyato A M Dai and I Goodfellow ldquoAdversarial trainingmethods for semi-supervised text classificationrdquo in Pro-ceedings of the International Conference on LearningRepresentations San Juan Puerto Rico May 2016

[33] I J Goodfellow J Shlens and C Szegedy ldquoExplaining andharnessing adversarial examplesrdquo in Proceedings of the In-ternational Conference on Learning Representations BanffCanada April 2014

[34] K Chandrasekaran S Gauch P Lakkaraju et al ldquoConcept-based document recommendations for CiteSeer authorsrdquo inProceedings of the International Conference on AdaptiveHypermedia and Adaptive Web-Based Systems HannoverGermany August 2008

[35] E Voorhees ldquo0e trec-8 question answering track reportrdquo inProceedings of the TRECrsquo00 pp 77ndash82 Gaithersburg MDUSA 2000

[36] S Kataria P Mitra and S Bhatia ldquoUtilizing context ingenerative bayesian models for linked corpusrdquo in Proceedingsof the Twenty-Fourth AAAI Conference on ArtificialIntelligence Atlanta GA USA July 2010

[37] J Tang and J Zhang ldquoA discriminative approach to topic-based citation recommendationrdquo in Proceedings of the Pacific-Asia Conference Hyderabad India July 2009

12 Computational Intelligence and Neuroscience

Page 11: SVD-CNN:AConvolutionalNeuralNetworkModelwith ...

that although the larger d0 is better the larger d0 willsignificantly increase the training time 0erefore wechoose d0 300

5 Conclusion and Future Works

We propose a convolutional neural network model withorthogonal regularization to solve the context-aware citationrecommendation task In our model orthogonal regulari-zation is achieved by using SVD to factorize the weight of theFC layer which can essentially make each vector in thefeature map more independent 0e orthogonal regulari-zation also enhances the feature extraction ability of CNN0e experimental results show that SVD-CNN outperformsthe other compared methods on CiteSeer Our model onlytakes the abstract as the content of the cited paper In thefuture we will explore the performance of our model byusing the full text of papers

Data Availability

Previously reported CiteSeer data were used to support thisstudy and are available at [httpspsuappboxcomvrefseer] 0ese prior datasets are cited at relevant placeswithin the text as references [4]

Conflicts of Interest

0e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

0is work was partially supported by the National NaturalScience Foundation of China (project no 61373046) andthe National Key Research and Development Programs ofChina (project nos 2018AAA0101100 and2019YFB2102500)

References

[1] M. A. Angrosh, S. Cranefield, and N. Stanger, "Conditional random field based sentence context identification: enhancing citation services for the research community," in Proceedings of the First Australasian Web Conference, Adelaide, Australia, January 2013.

[2] Q. He, J. Pei, D. Kifer et al., "Context-aware citation recommendation," in Proceedings of the International Conference on World Wide Web, Raleigh, NC, USA, April 2010.

[3] Q. He, D. Kifer, J. Pei et al., "Citation recommendation without author supervision," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, February 2011.

[4] W. Huang, "A neural probabilistic model for context based citation recommendation," in Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.

[5] J. Tan, X. Wan, and J. Xiao, "A neural network approach to quote recommendation in writings," in Proceedings of the ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, October 2016.

[6] X. Ren, J. Liu, X. Yu et al., "ClusCite: effective citation recommendation by information network-based clustering," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 2014.

[7] T. Ebesu and Y. Fang, "Neural citation network for context-aware citation recommendation," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1093–1096, Shinjuku, Japan, August 2017.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[9] S. Bradshaw, "Reference directed indexing: redeeming relevance for subject search in citation indexes," Research and Advanced Technology for Digital Libraries, vol. 2769, pp. 499–510, 2003.

[10] N. Meuschke, B. Gipp, and M. Lipinski, "CITREC: an evaluation framework for citation-based similarity measures based on TREC Genomics and PubMed Central," 2015, http://hdl.handle.net/2142/73680.

[11] A. Ritchie, S. Robertson, and S. Teufel, "Comparing citation contexts for information retrieval," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 213–222, Napa Valley, CA, USA, October 2008.

[12] C. F. Van Loan, The Block Jacobi Method for Computing the Singular Value Decomposition, Department of Computer Science, Cornell University, Ithaca, NY, USA, 1985.

[13] C. Bhagavatula, S. Feldman, R. Power et al., "Content-based citation recommendation," 2018, https://arxiv.org/pdf/1802.08301v1.pdf.

[14] H. Jia and E. Saule, "Local is good: a fast citation recommendation approach," Lecture Notes in Computer Science, vol. 10772, Springer, Berlin, Germany, 2018.

[15] Y. Sun, W. Ni, and R. Men, "A personalized paper recommendation approach based on web paper mining and reviewer's interest modelling," in Proceedings of the International Conference on Research Challenges in Computer Science, Shanghai, China, December 2009.

[16] B. Shaparenko and T. Joachims, "Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007.

[17] T. Strohman, W. B. Croft, and D. Jensen, "Recommending citations for academic papers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, Netherlands, July 2007.

[18] A. Livne, V. Gokuladas, J. Teevan et al., "CiteSight: supporting contextual citation recommendation using differential search," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[19] Y. Lu, J. He, D. Shan et al., "Recommending citations with translation model," in Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, October 2011.

[20] W. Huang, P. Mitra, S. Kataria et al., "Recommending citations: translating papers into references," in Proceedings of the ACM International Conference on Information and Knowledge Management, Shanghai, China, November 2014.

[21] X. Tang, X. Wan, X. Zhang et al., "Cross-language context-aware citation recommendation in scientific articles," in Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, July 2014.

[22] A. Brock, T. Lim, J. M. Ritchie et al., "Neural photo editing with introspective adversarial networks," in Proceedings of the International Conference on Learning Representations, 2017.

[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014.

[24] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," 2018, https://arxiv.org/abs/1809.11096.

[25] Y. Sun, L. Zheng, W. Deng et al., "SVDNet for pedestrian retrieval," in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3820–3828, Venice, Italy, October 2017.

[26] Q. Zheng, M. Yang, J. Yang, Q. Zhang, and X. Zhang, "Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process," IEEE Access, vol. 6, pp. 15844–15869, 2018.

[27] Y. Wang, D. Gong, Z. Zheng et al., "Orthogonal deep features decomposition for age-invariant face recognition," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.

[28] Y. Chen, X. Jin, J. Feng et al., "Training group orthogonal neural networks with privileged information," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017.

[29] T. Mikolov, I. Sutskever, K. Chen et al., "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2013.

[30] A. Rajaraman and J. D. Ullman, "Data mining," Mining of Massive Datasets, vol. 3, no. 2, pp. 1–17, 2014.

[31] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.

[32] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.

[33] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proceedings of the International Conference on Learning Representations, Banff, Canada, April 2014.

[34] K. Chandrasekaran, S. Gauch, P. Lakkaraju et al., "Concept-based document recommendations for CiteSeer authors," in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Hannover, Germany, August 2008.

[35] E. Voorhees, "The TREC-8 question answering track report," in Proceedings of TREC'00, pp. 77–82, Gaithersburg, MD, USA, 2000.

[36] S. Kataria, P. Mitra, and S. Bhatia, "Utilizing context in generative Bayesian models for linked corpus," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, July 2010.

[37] J. Tang and J. Zhang, "A discriminative approach to topic-based citation recommendation," in Proceedings of the Pacific-Asia Conference, Hyderabad, India, July 2009.
