
Pre-Release Prediction of Crowd Opinion on Movies by Label Distribution Learning

Xin Geng∗ and Peng Hou
School of Computer Science and Engineering

Southeast University, Nanjing, China
{xgeng, hpeng}@seu.edu.cn

Abstract

This paper studies an interesting problem: is it possible to predict the crowd opinion about a movie before the movie is actually released? The crowd opinion is here expressed by the distribution of ratings given by a sufficient number of people. Consequently, pre-release crowd opinion prediction can be regarded as a Label Distribution Learning (LDL) problem. In order to solve this problem, a Label Distribution Support Vector Regressor (LDSVR) is proposed in this paper. The basic idea of LDSVR is to fit a sigmoid function to each component of the label distribution simultaneously by a multi-output support vector machine. Experimental results show that LDSVR can accurately predict people's rating distribution for a movie based only on the pre-release metadata of the movie.

1 Introduction

The movie industry is a worldwide business worth tens of billions of dollars. Thousands of new movies are produced and shown in movie theatres each year, among which some are successful and many are not. For movie producers, increasing cost and competition boost the investment risk. For movie audiences, the prevalent immodest advertising and promotion makes it hard to choose a movie worth watching. Therefore, both sides demand a reliable prediction of what people will think about a particular movie before it is actually released, or even during its planning phase. However, to the best of our knowledge, there is little work, up to the present, on pre-release prediction of the crowd opinion about movies.

Aside from the unstructured reviews and discussions about the movie [Diao et al., 2014], an explicit and well-structured reflection of the crowd opinion might be the distribution of ratings given by the audience who have watched the movie, just as many movie review web sites, such as IMDb and Netflix, collect from their users. Note that the average rating is not a good indicator of the crowd opinion because the averaging process mingles those

∗This research was supported by NSFC (61273300, 61232007), JiangsuSF (BK20140022), and the Key Lab of Computer Network and Information Integration of Ministry of Education of China.

who dislike it. From the marketing point of view, a movie with controversial crowd opinions (i.e., the rating distribution concentrates at both low and high ratings) is generally much more successful than another movie with a consistently medium rating, yet the two might have similar average ratings. Fig. 1 gives a typical example of such a case. According to data from IMDb, the movies Twilight and I, Frankenstein both have the same average rating of 5.2/10. But the top two popular ratings in the rating distribution of Twilight are the lowest rating 1 (15.7%) and the highest rating 10 (15.3%), while those of I, Frankenstein concentrate at the medium ratings 6 (21.4%) and 5 (20.1%). As a result, the budget/gross ratio of Twilight is $37m/$191m while that of I, Frankenstein is $65m/$19m. Obviously, the former movie is the better one to invest in and watch. Note that the usage of the rating distribution is not limited to gross prediction, but extends to marketing strategy, advertising design, movie recommendation, etc.

It is worth emphasizing, as will be further discussed in Section 2, that predicting the crowd opinion (the overall rating distribution) is quite different from predicting the individual opinion (a person-specific rating). The latter problem has been extensively studied in the area of recommender systems [Adomavicius and Tuzhilin, 2005]. While a personalized rating is valuable when recommending a movie to a particular user, it does not mean much when analysing the crowd opinion toward a movie. Moreover, recommendation according to individual rating prediction is worthwhile even after many users have already watched the movie, as long as the target user has not. But crowd opinion prediction is generally only useful before any user ratings are available. As a result, the prediction should only be based on the metadata available before the movie is released.

Instead of putting pre-release crowd opinion prediction in the recommender system scenario, this paper regards it as a Label Distribution Learning (LDL) [Geng et al., 2013; 2010] problem, since the rating distribution can be naturally viewed as a label distribution for each movie. According to the characteristics of the movie rating distribution, we propose a novel Label Distribution Support Vector Regressor (LDSVR), which gives multivariate and probabilistic output. The key idea of LDSVR is to fit a sigmoid function to each component of the distribution simultaneously by a support vector machine.

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)


Figure 1: Even with the same average rating, different rating distributions (crowd opinions) of (a) Twilight and (b) I, Frankenstein result in different marketing performance.

The rest of this paper is organized as follows. Section 2 introduces existing work related to pre-release crowd opinion prediction. After that, the proposed method LDSVR is described in Section 3. Then, experimental results are reported in Section 4. Finally, conclusions and discussions are given in Section 5.

2 Related Work

Movie rating prediction (abbreviated as RP) is a widely investigated topic in the context of recommender systems [Adomavicius and Tuzhilin, 2005], where the problem of movie recommendation is reduced to predicting movie ratings for a particular user and then recommending the movie with the highest predicted rating to that user. The RP methods in such a case are usually classified into three categories [Marovic et al., 2011], i.e., content-based methods [Soares and Viana, 2014; Pilaszy and Tikk, 2009], collaborative methods [Bresler et al., 2014; Diao et al., 2014], and hybrid methods [Amolochitis et al., 2014; Jin et al., 2005]. However, the pre-release crowd opinion prediction (abbreviated as PCOP) problem raised in this paper is fundamentally different from the aforementioned RP problem, mainly in the following three aspects.

1. RP usually aims to predict the rating for a particular user, while PCOP aims to predict the overall rating distribution generated from many users' ratings, i.e., the crowd opinion rather than the individual opinion.

2. Existing RP methods usually require previous ratings of already-watched movies from the target user (content-based methods) or from other users with similar preferences (collaborative methods). On the other hand, PCOP methods do not require any previous rating data once the model is trained, because they do not need to learn the preference of any particular user.

3. Most RP methods, especially the prevailing collaborative methods, cannot predict the rating of a movie until it is officially released. Many of them further rely on a sufficient number of users who have already watched and rated that movie. However, PCOP methods can make predictions before the movie is actually released, or even as early as during its planning phase.

Another more related work is Label Distribution Learning (LDL) [Geng and Ji, 2013], recently proposed to deal with a new machine learning paradigm where each instance is annotated by a label distribution rather than a single label (single-label learning) or multiple labels (multi-label learning). The label distribution covers a certain number of labels, representing the degree to which each label describes the instance. Let X = R^q denote the input space, Y = {y_1, y_2, ..., y_c} denote the complete set of labels, and d_x^y denote the description degree of the label y ∈ Y to the instance x ∈ X. Given a training set S = {(x_1, d_1), (x_2, d_2), ..., (x_n, d_n)}, where d_i = [d_{x_i}^{y_1}, d_{x_i}^{y_2}, ..., d_{x_i}^{y_c}]^T is the label distribution associated with the instance x_i, the goal of LDL is to learn a mapping from x ∈ R^q to d ∈ R^c based on S. Geng et al. [2013] construct the mapping via a conditional mass function p(y|x) formulated as a maximum entropy model. Then they use Improved Iterative Scaling (IIS) [Pietra et al., 1997] or BFGS [Nocedal and Wright, 2006] to minimize the Kullback-Leibler divergence between the predicted distribution and the ground truth distribution, resulting in two LDL algorithms named IIS-LLD and BFGS-LLD, respectively. They also propose a neural-network-based approach named CPNN for the case where the labels can be ordered [Geng et al., 2013]. As a learning framework more general than single-label and multi-label learning, LDL has been successfully applied to various problems, such as facial age estimation [Geng et al., 2013; 2010], head pose estimation [Geng and Xia, 2014], and multi-label ranking for natural scene images [Geng and Luo, 2014]. If we regard the movie pre-release feature vector as the instance x, the rating as the label y, and the rating distribution as the label distribution d, then the PCOP problem can be naturally viewed as an LDL problem.
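To make the mapping from raw ratings to a label distribution concrete, the sketch below normalizes per-rating vote counts into a distribution d whose components sum to 1. This is an illustration, not code from the paper; the vote counts are invented.

```python
import numpy as np

def rating_distribution(counts):
    """Normalize raw per-rating vote counts into a label distribution d:
    non-negative description degrees that sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# Hypothetical vote counts for one movie on a 1-5 star scale
# (illustrative numbers only, not data from the paper).
d = rating_distribution([120, 80, 200, 350, 250])
# Each entry of d is the description degree of one rating label.
```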

3 Label Distribution Support Vector Regression

In this section, we propose a Label Distribution Support Vector Regressor (LDSVR) for pre-release crowd opinion prediction. Compared with standard SVR [Drucker et al., 1996], LDSVR must address two additional issues: (1) How to output a distribution composed of multiple components? (2) How to constrain each component of the distribution within the range of a probability, i.e., [0, 1]? The first issue might be tackled by building a single-output SVR for each component respectively. But as pointed out in [Perez-Cruz et al., 2002], this will cause the problem that some examples beyond the insensitive zone might be penalized more than once. It is also prohibitive to view the distribution as a structure and solve the problem via structured prediction [Tsochantaridis et al., 2004], because there is no direct way to define the auxiliary discriminant function that measures the compatibility between a movie and its rating distribution. A more rational solution to this issue might root in Multivariate Support Vector Regression (M-SVR) [Fernandez et al., 2004], which can output multiple variables simultaneously. For the second issue, we are inspired by the common practice in classification tasks of fitting a sigmoid function after the SVM when a probabilistic output for each class is expected [Platt, 1999]. For regression tasks, the sigmoid function can directly act as the target model of the regression instead of a post-process. Thus, the basic idea of LDSVR is, in short, to fit a sigmoid function to each component of the label distribution simultaneously by a support vector machine.
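The multi-output model form can be sketched as follows: one shared sigmoid is applied element-wise to a linear score per distribution component. This is a minimal NumPy illustration with invented toy dimensions; the actual parameters are fitted by the SVM machinery derived below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_distribution(W, b, phi_x):
    """Element-wise sigmoid vector output: every one of the c components
    of the label distribution is squashed into (0, 1) simultaneously."""
    return sigmoid(W @ phi_x + b)

# Toy sizes: c = 5 rating levels, H = 3 feature dimensions (illustrative).
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 3)), np.zeros(5)
d_hat = predict_distribution(W, b, rng.normal(size=3))
d_hat /= d_hat.sum()   # final renormalization so the components sum to 1
```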

Suppose the label distribution d of the instance x is modeled by an element-wise sigmoid vector

d = f(x) = 1 / (1 + exp(−Wϕ(x) − b)) = g(z) = 1 / (1 + exp(−z)),    (1)

where ϕ(x) is a nonlinear transformation of x to a higher-dimensional feature space R^H, W ∈ R^{c×H} and b ∈ R^c are the model parameters, z = Wϕ(x) + b, and g(z) is a short representation for the vector obtained by applying the sigmoid function g(·) to each element of z. Then, we can generalize the single-output SVR by minimizing the sum of the target functions on all dimensions

Γ(W, b) = (1/2) Σ_{j=1}^{c} ‖w_j‖² + C Σ_{i=1}^{n} L(u_i),    (2)

where w_j is the transpose of the j-th row of W and L(u_i) is the loss function for the i-th example. In standard SVR [Drucker et al., 1996], the unidimensional loss is defined as a hinge loss function

L_h(u_i^j) = { 0,          u_i^j < ε,
             { u_i^j − ε,  u_i^j ≥ ε,    (3)

u_i^j = |d_i^j − f^j(x_i)|,    (4)

where d_i^j and f^j(x_i) are the j-th elements in the corresponding vectors. This creates an insensitive zone determined by ε around the estimate, i.e., losses smaller than ε are ignored. If we directly sum the loss functions on all dimensions, i.e., L(u_i) = Σ_j L_h(u_i^j), then, as pointed out in [Perez-Cruz et al., 2002], some examples beyond the insensitive zone might be penalized more than once. Fig. 2 illustrates

[Figure 2 appears here; its annotations label the regions ρ1 and ρ2 and the bounds 0 < u_+^j = g(z^j + 4ε) − g(z^j) ≤ ε and 0 < u_−^j = g(z^j) − g(z^j − 4ε) ≤ ε.]

Figure 2: The insensitive zones around the estimate f(x) = g(z) in the bivariate (d^1 and d^2) output space. The black square represents the hyper-cubic insensitive zone for the two single-output SVRs. The blue circle represents the hyper-spherical insensitive zone for M-SVR. The shaded area represents the insensitive zone for LDSVR.

this problem via a bivariate regression case, where the black square represents the hyper-cubic insensitive zone for the two single-output SVRs. As can be seen, all the examples falling into the area ρ1 will be penalized once, while those falling into the area ρ2 will be penalized twice. Also, the L1-norm u_i^j in Eq. (3) is calculated dimension by dimension, which makes the solution complexity grow linearly with the dimensionality [Fernandez et al., 2004]. As suggested in [Fernandez et al., 2004], we can instead use the L2-norm to define L(u_i) so that all dimensions join the same constraint and yield the same support vectors, i.e.,

L(u_i) = { 0,           u_i < ε,
         { (u_i − ε)²,  u_i ≥ ε,    (5)

u_i = ‖e_i‖ = √(e_i^T e_i),    (6)

e_i = d_i − f(x_i).    (7)

This generates a hyper-spherical insensitive zone with radius ε, which is represented by the blue circle in Fig. 2. Unfortunately, substituting Eqs. (1), (5)-(7) into Eq. (2) does not lead to a convex quadratic form, and the optimization process will not depend only on inner products. Therefore, it is hard to find the optimum as well as to apply the kernel trick.
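The contrast between the summed per-dimension hinge loss of Eq. (3) and the joint L2-norm loss of Eqs. (5)-(7) can be sketched numerically. The bivariate error below is a hypothetical example with ε = 0.1; the numbers are invented for illustration.

```python
import numpy as np

EPS = 0.1  # insensitivity parameter

def hinge(u):
    """Per-dimension epsilon-insensitive hinge loss, Eq. (3)."""
    return max(0.0, u - EPS)

def l2_insensitive(e):
    """Joint quadratic epsilon-insensitive loss, Eqs. (5)-(7): one shared
    constraint on the Euclidean norm of the whole error vector."""
    u = np.linalg.norm(e)
    return 0.0 if u < EPS else (u - EPS) ** 2

# An error whose components both exceed epsilon falls in the rho_2 region
# of Fig. 2: summing Eq. (3) over dimensions penalizes the example once
# per dimension, while the spherical zone yields a single joint term.
e = np.array([0.15, 0.15])
per_dim_terms = [hinge(abs(v)) for v in e]   # two nonzero penalty terms
joint_term = l2_insensitive(e)               # one joint penalty term
```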

To solve this problem, we propose an alternative loss function which reforms the minimization of Eq. (2) into a convex quadratic programming process depending only on inner products. Note that Eq. (6) calculates the Euclidean distance from the estimate f(x_i) = g(z_i) to the ground truth d_i. We can instead measure the loss by calculating how far away from z_i another point z′_i ∈ R^c should move to get the same output as the ground truth, i.e., g(z′_i) = d_i. Solving this


[Figure 3 appears here.]

Figure 3: The relationship between u_i and u′_i. The boundaries of the insensitive zone defined on u′_i have equal distance to the sigmoid curve horizontally, but not vertically.

equation yields z′_i = −log(1/d_i − 1). Thus, the distance u′_i from z′_i to z_i can be calculated by

u′_i = ‖e′_i‖ = √((e′_i)^T e′_i),    (8)

e′_i = z′_i − z_i = −log(1/d_i − 1) − (Wϕ(x_i) + b).    (9)

The relationship between u_i and u′_i is illustrated in Fig. 3 and quantified in the following lemma.

Lemma 1. u′_i ≥ 4u_i for any x_i, d_i, W, and b.

The proof of Lemma 1 is given in the Appendix. Replacing u_i with u′_i/4 in the loss function Eq. (5) generates an insensitive zone around the sigmoid function, as represented by the shaded area in Fig. 3. Note that the vertical distance (along the d axis) from the two boundaries of the insensitive zone to the sigmoid curve might differ. This results in an insensitive zone in the bivariate output space as represented by the shaded area in Fig. 2. It can be derived from Lemma 1 that, in Fig. 2, 0 < u_+^j ≤ ε and 0 < u_−^j ≤ ε for j = 1, 2. Thus, although not strictly isotropic, the shaded area is a reasonable approximation to the ideal hyper-spherical insensitive zone (the blue circle) so long as ε is small.
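Lemma 1 can be spot-checked numerically: because the sigmoid's slope never exceeds 1/4, a vertical (output-space) gap u forces a horizontal (input-space) gap u′ of at least 4u. The sketch below checks this per dimension at random points; it is illustrative code, not part of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(d):
    """Inverse sigmoid: z' = -log(1/d - 1), so that g(z') = d."""
    return -np.log(1.0 / d - 1.0)

rng = np.random.default_rng(1)
for _ in range(1000):
    z = rng.normal(scale=3.0)       # current estimate position
    d = rng.uniform(0.01, 0.99)     # ground-truth output component
    u = abs(d - sigmoid(z))         # output-space distance, as in Eq. (6)
    u_prime = abs(logit(d) - z)     # input-space distance, as in Eq. (8)
    assert u_prime >= 4.0 * u - 1e-9, "Lemma 1 violated"
```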

Replacing u_i with u′_i/4 in Eq. (2) yields a new target function Γ′(W, b). It is trivial to prove the following theorem with Lemma 1.

Theorem 1. Γ′(W, b) is an upper bound for Γ(W, b).

Therefore, minimizing Γ′(W, b) is equivalent to minimizing an upper bound of Γ(W, b).

It is still hard to minimize Γ′(W, b) as standard SVR does, via solving its dual problem. Instead, we directly solve the primal problem with an iterative quasi-Newton method called Iterative Re-Weighted Least Squares (IRWLS) [Perez-Cruz et al., 2000]. Firstly, Γ′(W, b) is approximated by its first-order Taylor expansion at the solution of the current k-th iteration, denoted by W^(k) and b^(k):

Γ″(W, b) = (1/2) Σ_{j=1}^{c} ‖w_j‖² + C Σ_{i=1}^{n} [ L(u′_i^(k)/4) + (dL(u′)/du′)|_{u′_i^(k)/4} · ((e′_i^(k))^T / (4u′_i^(k))) (e′_i − e′_i^(k)) ],    (10)

where e′_i^(k) and u′_i^(k) are calculated from W^(k) and b^(k).

Then, a quadratic approximation is further constructed from Eq. (10):

Γ‴(W, b) = (1/2) Σ_{j=1}^{c} ‖w_j‖² + C Σ_{i=1}^{n} [ L(u′_i^(k)/4) + (dL(u′)/du′)|_{u′_i^(k)/4} · (u′_i² − (u′_i^(k))²) / (4u′_i^(k)) ]
         = (1/2) Σ_{j=1}^{c} ‖w_j‖² + (1/2) Σ_{i=1}^{n} a_i u′_i² + τ,    (11)

where

a_i = (C / (2u′_i^(k))) (dL(u′)/du′)|_{u′_i^(k)/4} = { 0,                               u′_i^(k) < 4ε,
                                                    { C(u′_i^(k) − 4ε) / (4u′_i^(k)),  u′_i^(k) ≥ 4ε,    (12)

and τ is a constant term that does not depend on W and b. Eq. (11) is a weighted least squares problem whose optimum can be effectively found by letting the gradient equal zero and then solving a system of linear equations for j = 1, ..., c:

[ Φ^T D_a Φ + I   Φ^T a ] [ w_j ]   [ −Φ^T D_a log(1/d^j − 1) ]
[ a^T Φ           1^T a ] [ b_j ] = [ −a^T log(1/d^j − 1)     ],    (13)

where Φ = [ϕ(x_1), ..., ϕ(x_n)]^T, a = [a_1, ..., a_n]^T, (D_a)_{ij} = a_i δ_{ij} (δ_{ij} is the Kronecker delta), d^j = [d_1^j, ..., d_n^j]^T, and 1 is a vector of all ones. Then, the direction of the optimal solution of Eq. (11) is used as the descending direction for the optimization of Γ′(W, b), and the solution for the next iteration (W^(k+1) and b^(k+1)) is obtained via a line search algorithm [Nocedal and Wright, 2006] along this direction.
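The IRWLS weighting of Eq. (12) can be sketched as follows, with ε and C as in Section 4. This is a small illustration, not the authors' implementation; the residual values are invented.

```python
import numpy as np

C, EPS = 1.0, 0.1   # penalty and insensitivity parameters

def irwls_weights(u_prime):
    """IRWLS weights a_i of Eq. (12). Examples whose residual u'_i lies
    inside the scaled insensitive zone (u'_i < 4*eps) get zero weight and
    drop out of the weighted least-squares system of Eq. (11)."""
    u_prime = np.asarray(u_prime, dtype=float)
    denom = np.maximum(4.0 * u_prime, 1e-12)     # guard against u'_i = 0
    return np.where(u_prime < 4.0 * EPS, 0.0,
                    C * (u_prime - 4.0 * EPS) / denom)

a = irwls_weights([0.2, 0.5, 1.0])   # weights: 0.0, 0.05, 0.15
```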

According to the Representer Theorem [Schölkopf and Smola, 2001], w_j may be represented as a linear combination of the training examples in the feature space, i.e., w_j = Φ^T β_j. Substituting this expression into Eq. (13) yields

[ K + D_a^{−1}   1     ] [ β_j ]   [ −log(1/d^j − 1)     ]
[ a^T K          1^T a ] [ b_j ] = [ −a^T log(1/d^j − 1) ],    (14)

where K_{ij} = k(x_i, x_j) = ϕ^T(x_i)ϕ(x_j) is the kernel matrix. Accordingly, the later line search for Γ′(W, b) can be performed in terms of β_j. Finally, after the optimal solution β_j is obtained, w_j = Φ^T β_j and b_j are substituted into Eq. (1) and the label distribution can be calculated indirectly in the original input space X, rather than the high-dimensional feature space R^H, via the kernel matrix K.
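One inner solve of the kernelized system Eq. (14) can be sketched as below. For simplicity the sketch assumes all IRWLS weights are strictly positive (so D_a is invertible); a real implementation would first drop zero-weight examples from the system. The data, weight values, and kernel width are invented.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def solve_eq14(K, a, d_j):
    """Solve the linear system of Eq. (14) for one output dimension j,
    returning the dual coefficients beta_j and the bias b_j."""
    n = len(a)
    z_prime = -np.log(1.0 / d_j - 1.0)    # target z' with g(z') = d_j
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K + np.diag(1.0 / a)      # K + D_a^{-1}
    A[:n, n] = 1.0                        # column of ones
    A[n, :n] = a @ K                      # a^T K
    A[n, n] = a.sum()                     # 1^T a
    rhs = np.concatenate([z_prime, [a @ z_prime]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))              # 20 toy training examples
d_j = rng.uniform(0.05, 0.95, size=20)    # j-th distribution component
a = np.full(20, 0.1)                      # pretend IRWLS weights (all > 0)
K = rbf_kernel(X, sigma=1.0)
beta_j, b_j = solve_eq14(K, a, d_j)
pred = 1.0 / (1.0 + np.exp(-(K @ beta_j + b_j)))   # g(K beta_j + b_j)
```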


Table 1: Pre-release Metadata Included in the Data Set

Attribute            Type  θ   #Values
Genre                C     0   24
Color                C     0   2
Director             C     5   402
1st Actor            C     5   386
2nd Actor            C     5   210
3rd Actor            C     5   103
Country              C     5   33
Language             C     5   23
Writer               C     10  16
Editor               C     10  115
Cinematographer      C     10  173
Art Direction        C     10  39
Costume Designer     C     10  110
Music By             C     10  157
Sound                C     10  26
Production Company   C     20  31
Year                 N     –   –
Running Time         N     –   –
Budget               N     –   –

4 Experiments

4.1 Methodology

The data set used in the experiments includes 7,755 movies and 54,242,292 ratings from 478,656 different users. The ratings come from Netflix and are on a scale from 1 to 5 integral stars. Each movie has, on average, 6,994 ratings. The rating distribution is calculated for each movie as an indicator of the crowd opinion on that movie. The pre-release metadata are crawled from IMDb according to the unique movie IDs. Table 1 lists all the metadata included in the data set. Note that all the attributes in Table 1 can be retrieved before the movie is officially released, or even during its planning phase. No post-release attributes are included in the data set, although some of them, such as the box office gross, might be closely related to the crowd opinion. There are both numeric (N) and categorical (C) attributes in this data set. Some categorical attributes, typically human names, might have too many different values. In such a case, we set a threshold θ and re-assign a new value 'other' to those values that appear fewer times in the data set than the threshold. This filters out most categories with limited influence on the crowd opinion, e.g., little-known actors. The categorical attributes are then transformed into numeric ones by replacing each k-valued categorical attribute with k binary attributes, one for each value, indicating whether the attribute takes that value or not. Finally, all the attributes are normalized to the same scale through min-max normalization.
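The preprocessing pipeline described above can be sketched as follows. The director names, counts, and θ value below are invented for illustration; they are not data from the paper.

```python
import numpy as np
from collections import Counter

def collapse_rare(values, theta):
    """Re-assign the value 'other' to categorical values that appear
    fewer than theta times in the data set."""
    counts = Counter(values)
    return [v if counts[v] >= theta else "other" for v in values]

def one_hot(values):
    """Replace a k-valued categorical attribute with k binary attributes."""
    cats = sorted(set(values))
    return np.array([[1.0 if v == c else 0.0 for c in cats]
                     for v in values]), cats

def min_max(col):
    """Min-max normalize a numeric attribute to [0, 1]."""
    col = np.asarray(col, dtype=float)
    span = col.max() - col.min()
    return (col - col.min()) / span if span > 0 else np.zeros_like(col)

directors = ["Nolan"] * 6 + ["Smith"] * 2 + ["Lee"] * 7    # invented data
collapsed = collapse_rare(directors, theta=5)               # Smith -> 'other'
X_director, categories = one_hot(collapsed)
budgets = min_max([2e6, 5e7, 1.5e8])                        # invented budgets
```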

As mentioned in Section 3, LDSVR deals with two challenges simultaneously: (1) multivariate output and (2) probability output. In order to show the advantage of solving these two problems simultaneously, LDSVR is compared with two baseline variants. The first is to fit a sigmoid function to each dimension separately, i.e., replace u_i^j in Eq. (3) with the absolute value of the j-th element of e′_i calculated by Eq. (9), and then use L_h(|e′_i^j|) as the loss function. This variant is named S-SVR, standing for Sigmoid SVR, which solves challenge (2) but not (1). The second variant is to first run a standard M-SVR [Fernandez et al., 2004], and then perform a post-process where the outputs are shifted by a common bias determined by the minimum regression output over the training set and then divided by the sum of all elements. This variant is named M-SVRp, standing for M-SVR plus post-process, which solves challenge (1) but not (2). Note that the output of all regression methods is finally normalized by dividing by the sum of all elements to make sure that Σ_j d^j = 1. This is reasonable in most cases, where only the relative relationship matters in the distribution. LDSVR is also compared with existing typical LDL algorithms including BFGS-LLD [Geng and Ji, 2013], IIS-LLD [Geng et al., 2010], AA-kNN [Geng and Ji, 2013], and CPNN [Geng et al., 2013].

The performance of the algorithms is evaluated by measures commonly used in LDL, i.e., the average distance or similarity between the predicted and ground truth label distributions. As suggested in [Geng and Ji, 2013], six measures are used in the experiments, which include four distance measures (the smaller the better), i.e., Euclidean, Sørensen, Squared χ², and Kullback-Leibler (K-L), and two similarity measures (the larger the better), i.e., Intersection and Fidelity.
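A sketch of the six measures on a pair of distributions is given below. The formulas follow common LDL usage (K-L is taken from the ground truth p toward the prediction q, an assumption); this is illustrative code, not the authors' evaluation script.

```python
import numpy as np

def ldl_measures(p, q):
    """Distance/similarity measures between a ground-truth label
    distribution p and a predicted distribution q (both sum to 1,
    all entries assumed strictly positive)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return {
        "euclidean":    np.linalg.norm(p - q),
        "sorensen":     np.abs(p - q).sum() / (p + q).sum(),
        "squared_chi2": ((p - q) ** 2 / (p + q)).sum(),
        "kl":           (p * np.log(p / q)).sum(),
        "intersection": np.minimum(p, q).sum(),
        "fidelity":     np.sqrt(p * q).sum(),
    }

m = ldl_measures([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
# For identical distributions the four distances are 0 and the two
# similarities are 1.
```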

The algorithm parameters used in the experiments are empirically determined. The parameter selection process is nested into the 10-fold cross validation. In detail, the whole data set is first randomly split into 10 chunks. Each time, one chunk is used as the test set, another is used as the validation set, and the remaining 8 chunks are used as the training set. Then, the model is trained with different parameter settings on the training set and tested on the validation set. This procedure is repeated for 10 folds, and the parameter setting with the best average performance is selected. After that, the original validation set is merged into the training set and the test set remains unchanged. The model is trained with the selected parameter setting on the updated training set and tested on the test set. This procedure is repeated for 10 folds, and the mean value and standard deviation of each evaluation measure are reported. The final parameter settings for the compared algorithms are as follows. All kernel-based methods (LDSVR, S-SVR and M-SVRp) use the RBF kernel with the scaling factor σ equal to the average distance among the training examples. The penalty parameter C in Eq. (2) is set to 1. The insensitivity parameter ε is set to 0.1. All iterative algorithms terminate when the difference between adjacent steps is smaller than 10⁻¹⁰. The number of neighbors k in AA-kNN is set to 10, and the number of hidden neurons in CPNN is set to 80.
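The nested cross-validation protocol above can be sketched as index bookkeeping. Which chunk serves as the validation set in each fold is not specified in the paper; rotating to the next chunk, as below, is an assumption made for illustration.

```python
import numpy as np

def nested_cv_folds(n, n_folds=10, seed=0):
    """Yield (train, validation, test) index arrays: in each fold one chunk
    is the test set, one chunk is the validation set, and the remaining
    8 chunks form the training set."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(n), n_folds)
    for k in range(n_folds):
        v = (k + 1) % n_folds                      # assumed validation chunk
        train = np.concatenate([chunks[j] for j in range(n_folds)
                                if j not in (k, v)])
        yield train, chunks[v], chunks[k]

splits = list(nested_cv_folds(100))
```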

4.2 Results

The experimental results of the seven algorithms on the movie data set are tabulated in Table 2. For the four distance measures (Euclidean, Sørensen, Squared χ², and K-L), '↓' indicates 'the smaller the better'. For the two similarity measures (Intersection and Fidelity), '↑' indicates 'the larger the better'. The best performance on each measure is highlighted in boldface. On each measure, the algorithms are ranked in decreasing order of their performance. The ranks are given in brackets right after the measure values.


Table 2: Experimental Results (mean±std(rank)) of the Seven Compared Algorithms

Algorithm   Euclidean ↓      Sørensen ↓        Squared χ² ↓     K-L ↓            Intersection ↑    Fidelity ↑
LDSVR       .1587±.0026(1)   .1564±.0027(1)    .0887±.0031(1)   .0921±.0035(1)   .8436±.0027(1)    .9764±.0010(1)
S-SVR       .1734±.0024(2)   .1723±.0023(2)    .1040±.0030(2)   .1059±.0036(2)   .8277±.0023(2)    .9722±.0009(2)
M-SVRp      .1843±.0031(3)   .1814±.0034(3.5)  .1084±.0033(3)   .1073±.0030(3)   .8186±.0034(3.5)  .9710±.0010(3)
BFGS-LLD    .1853±.0033(4)   .1814±.0033(3.5)  .1176±.0042(4)   .1265±.0050(4)   .8186±.0033(3.5)  .9683±.0012(4)
IIS-LLD     .1866±.0041(5)   .1828±.0044(5)    .1195±.0054(5)   .1288±.0070(6)   .8172±.0044(5)    .9676±.0014(5)
AA-kNN      .1917±.0045(6)   .1899±.0047(6)    .1246±.0062(6)   .1274±.0069(5)   .8101±.0047(6)    .9664±.0018(6)
CPNN        .2209±.0148(7)   .2153±.0150(7)    .1625±.0206(7)   .1826±.0274(7)   .7847±.0150(7)    .9551±.0061(7)

As can be seen from Table 2, LDSVR performs best on all six measures. The two variants of LDSVR, S-SVR and M-SVRp, both perform significantly worse than LDSVR. This proves the advantage LDSVR gains by solving the multivariate output problem and the probability output problem simultaneously, rather than one by one. The performance of LDSVR is also significantly better than that of the compared LDL algorithms (BFGS-LLD, IIS-LLD, AA-kNN, and CPNN). The reason might be two-fold. First, while most existing LDL algorithms seek to directly minimize the K-L divergence between the predicted and ground truth distributions, LDSVR takes advantage of the large-margin regression of a support vector machine. Second, application of the kernel trick makes it possible for LDSVR to solve the problem in a higher-dimensional, and thus more discriminative, feature space without loss of computational feasibility.

5 Conclusion and Discussion

This paper investigates the possibility of predicting the crowd opinion about a movie before it is released. The crowd opinion is represented by the distribution of ratings given by a sufficient number of people who have watched the movie. The pre-release prediction of the crowd opinion could be a crucial indicator of whether the movie will be successful or not, and thus has vast potential applications in movie production and marketing.

This paper regards the pre-release crowd opinion prediction as a Label Distribution Learning (LDL) problem, and proposes a Label Distribution Support Vector Regressor (LDSVR) for it. The nature of user rating distributions requires the output of LDSVR to be both multivariate and probabilistic. Thus, the basic idea of LDSVR is to fit a sigmoid function to each component of the rating distribution simultaneously by a multi-output support vector machine. Experiments on a data set including 7,755 movies reveal that LDSVR can perform significantly better than not only its two variants, but also four typical LDL algorithms.
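The multivariate-and-probabilistic output requirement can be illustrated as follows. This is only an illustrative sketch, not the paper's actual kernelized solver: `scores` stands in for the raw outputs of a multi-output regressor (one score per rating level), each score is squashed by a sigmoid, and the renormalization step is an assumption added here to make the components sum to 1.

```python
import numpy as np

def scores_to_distribution(scores):
    """Map raw multi-output regression scores (one per rating level)
    through a componentwise sigmoid, then renormalize so the
    components form a valid probability distribution."""
    d = 1.0 / (1.0 + np.exp(-np.asarray(scores, float)))  # sigmoid per component
    return d / d.sum()
```

Because the sigmoid is monotone and the renormalization preserves order, a higher raw score always yields a higher predicted probability for that rating level.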

One of the key ideas of LDSVR is to measure the loss by the distance needed to move in the input space to get the same output as the ground truth. This works fine for most cases, but sometimes could be risky when the ground-truth outputs of some training examples are very close to 0 or 1. In such cases, the loss of those examples will tend to be so large that they will dominate the training process. Although this is not a big problem for the movie rating distributions so long as there are a sufficient number of ratings for each movie, it might be

[Figure 4 omitted: it plots the sigmoid curve $g(z_j)$ with the points $A$, $B$, $P(z^j_i, d^j_i)$, and $P'$ marked.]

Figure 4: The relationship between $u^j_i$ and ${u'}^j_i$ in the concave part of the sigmoid.

troublesome for some other data. Therefore, a mechanism of compensating the loss of those examples with outputs close to 0 or 1 might be an important future work to make LDSVR applicable to more general cases.
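The blow-up near 0 and 1 is easy to see numerically: the input-space distance to a target output is governed by the inverse sigmoid (the logit), which diverges as the output approaches either extreme. A small sketch (the helper name `logit` is introduced here for illustration):

```python
import numpy as np

def logit(d):
    """Inverse sigmoid: the z such that 1 / (1 + exp(-z)) = d, for 0 < d < 1."""
    return -np.log(1.0 / d - 1.0)

# A fixed-size output gap costs vastly more input-space movement near 0
# than near 0.5, so near-0/1 targets dominate a distance-based loss:
mid_gap  = abs(logit(0.51)   - logit(0.50))    # gap between outputs near 0.5
edge_gap = abs(logit(0.0002) - logit(0.0001))  # gap between outputs near 0
```

Here `edge_gap` is more than an order of magnitude larger than `mid_gap`, which is exactly the domination effect the paragraph above describes.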

Appendix: Proof of Lemma 1

Proof. Firstly, consider the situation in the $(z_j, d_j)$-space ($j = 1, \ldots, c$), where $z_j$ and $d_j$ represent the $j$-th dimension of $z$ and $d$, respectively. The projections of $u_i$ and $u'_i$ in this space are denoted by $u^j_i$ and ${u'}^j_i$, respectively. Since $u_i = \sqrt{\sum_j (u^j_i)^2}$ and $u'_i = \sqrt{\sum_j ({u'}^j_i)^2}$, if we can prove ${u'}^j_i \geq 4u^j_i$ for $j = 1, \ldots, c$, then we have $u'_i \geq 4u_i$.

In the $(z_j, d_j)$-space, the sigmoid function $d_j = g(z_j)$ is symmetric about the point $(0, 0.5)$. So we only need to consider the case $z_j \geq 0$, where $g(z_j)$ is concave, as shown in Fig. 4. When the point is above the curve, like $P'$, we can always find its counterpart $P$ below the curve with the same vertical and horizontal distances to the curve. Thus, we only need to consider the case where the point is below the curve.

Since $g(z_j)$ is concave, the line segment $AB$ (blue dashed line) is always below the curve. Suppose its slope is $\theta$, and the slope of the tangent line (red dashed line) to the $g(z_j)$ curve at the point $A$ is $\theta'$; then $\theta \leq \theta'$. The derivative of $g(z_j)$ is

$$g'(z_j) = \frac{\exp(-z_j)}{(1 + \exp(-z_j))^2}. \quad (15)$$

Letting the second derivative of $g(z_j)$ equal zero yields $z_j = 0$. Thus, the maximum of $\theta'$ is $g'(0) = 1/4$. Therefore,

$$\theta = u^j_i / {u'}^j_i \leq \theta' \leq 1/4. \quad (16)$$

So ${u'}^j_i \geq 4u^j_i$, and thus $u'_i \geq 4u_i$.
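The key constant of the lemma, $g'(0) = 1/4$ in (16), is easy to check numerically as a sanity check:

```python
import numpy as np

def g(z):
    """The sigmoid d = g(z) from the proof."""
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Its derivative, Eq. (15): exp(-z) / (1 + exp(-z))^2."""
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

# The derivative peaks at z = 0 with value exactly 1/4, and never exceeds it:
zs = np.linspace(-10.0, 10.0, 10001)
assert abs(g_prime(0.0) - 0.25) < 1e-12
assert np.all(g_prime(zs) <= 0.25 + 1e-12)
```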



References

[Adomavicius and Tuzhilin, 2005] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 17(6):734–749, 2005.

[Amolochitis et al., 2014] Emmanouil Amolochitis, Ioannis T. Christou, and Zheng-Hua Tan. Implementing a commercial-strength parallel hybrid movie recommendation engine. IEEE Intelligent Systems, 29(2):92–96, 2014.

[Bresler et al., 2014] Guy Bresler, George H. Chen, and Devavrat Shah. A latent source model for online collaborative filtering. In Advances in Neural Information Processing Systems 27 (NIPS'14), pages 3347–3355, Montreal, Canada, 2014.

[Diao et al., 2014] Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proc. 20th Int'l Conf. Knowledge Discovery and Data Mining, pages 193–202, New York, NY, 2014.

[Drucker et al., 1996] Harris Drucker, Christopher J. C. Burges, Linda Kaufman, Alex J. Smola, and Vladimir Vapnik. Support vector regression machines. In Advances in Neural Information Processing Systems 9 (NIPS'96), pages 155–161, Denver, CO, 1996.

[Fernandez et al., 2004] Matilde Sánchez Fernández, Mario de Prado-Cumplido, Jerónimo Arenas-García, and Fernando Pérez-Cruz. SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems. IEEE Trans. Signal Processing, 52(8):2298–2307, 2004.

[Geng and Ji, 2013] Xin Geng and Rongzi Ji. Label distribution learning. In Proc. 13th IEEE Int'l Conf. Data Mining Workshops, pages 377–383, Dallas, TX, 2013.

[Geng and Luo, 2014] Xin Geng and Longrun Luo. Multilabel ranking with inconsistent rankers. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3742–3747, Columbus, OH, 2014.

[Geng and Xia, 2014] Xin Geng and Yu Xia. Head pose estimation based on multivariate label distribution. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3742–3747, Columbus, OH, 2014.

[Geng et al., 2010] Xin Geng, Kate Smith-Miles, and Zhi-Hua Zhou. Facial age estimation by learning from label distributions. In Proc. 24th AAAI Conf. Artificial Intelligence, pages 451–456, Atlanta, GA, 2010.

[Geng et al., 2013] Xin Geng, Chao Yin, and Zhi-Hua Zhou. Facial age estimation by learning from label distributions. IEEE Trans. Pattern Anal. Mach. Intell., 35(10):2401–2412, 2013.

[Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. A maximum entropy web recommendation system: combining collaborative and content features. In Proc. 11th Int'l Conf. Knowledge Discovery and Data Mining, pages 612–617, Chicago, IL, 2005.

[Marovic et al., 2011] Mladen Marovic, Marko Mihokovic, Mladen Miksa, Sinisa Pribil, and Alan Tus. Automatic movie ratings prediction using machine learning. In Proc. 34th Int'l Conv. Information and Communication Technology, Electronics and Microelectronics, pages 1640–1645, Opatija, Croatia, 2011.

[Nocedal and Wright, 2006] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer, New York, NY, 2nd edition, 2006.

[Perez-Cruz et al., 2000] Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Ángel Navia-Vázquez, and Antonio Artés-Rodríguez. Fast training of support vector classifiers. In Advances in Neural Information Processing Systems 13 (NIPS'00), pages 734–740, Denver, CO, 2000.

[Perez-Cruz et al., 2002] Fernando Pérez-Cruz, Gustavo Camps-Valls, Emilio Soria-Olivas, Juan José Pérez-Ruixo, Aníbal R. Figueiras-Vidal, and Antonio Artés-Rodríguez. Multi-dimensional function approximation and regression estimation. In Proc. Int'l Conf. Artificial Neural Networks, pages 757–762, Madrid, Spain, 2002.

[Pietra et al., 1997] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell., 19(4):380–393, 1997.

[Pilaszy and Tikk, 2009] István Pilászy and Domonkos Tikk. Recommending new movies: even a few ratings are more valuable than metadata. In Proc. 2009 ACM Conference on Recommender Systems, pages 93–100, New York, NY, 2009.

[Platt, 1999] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Alexander J. Smola, Peter J. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.

[Schlkopf and Smola, 2001] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2nd edition, 2001.

[Soares and Viana, 2014] Marcio Soares and Paula Viana. Tuning metadata for better movie content-based recommendation systems. Multimedia Tools and Applications, published online, 2014.

[Tsochantaridis et al., 2004] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proc. 21st Int'l Conf. Machine Learning, Banff, Canada, 2004.
