PHENICX-WD-WP1-UPF-130315- DeliverableTemplate-1.2 Page 1 of 34 PHENICX_WD_WP1_DeliverableTemplate_20120314_MTG-UPF Page 1 of 34
D5.4 Discovery algorithms and models for social communities: Algorithms for inferring community structures and models of typical digital social communities
Grant Agreement nr: 601166
Project title: Performances as Highly Enriched aNd Interactive Concert eXperiences
Project acronym: PHENICX
Start date of project (dur.): Feb 1st, 2013 (3 years)
Document reference: PHENICX-D-WP5-JKU-130731- D5.4_DiscoveryAlgorithmsAndModelsForSocialCommunities -03
Report availability: PU - Public
Document due date: July 31st, 2014
Actual date of delivery: July 31st, 2014
Leader: JKU
Reply to: Marko Tkalčič ([email protected])
Additional main contributors (author’s name / partner acr.): Markus Schedl (JKU), Cynthia Liem (TUD), Mark Melenhorst (TUD)
Document status: Final
2.1 MAIN OBJECTIVES AND GOALS
3 DATA ACQUISITION
4 USER CLUSTERING PROCEDURE
5 CLUSTERING USERS BASED ON DEMOGRAPHICS
6 CLUSTERING BASED ON MUSIC PREFERENCES
7 CLUSTERING BASED ON PREFERENCES OF SUPPORTING MULTIMEDIA MATERIAL
10.1 LIST OF AUTHORS
10.2 APPENDIX 2: USER STUDY SURVEY PAGES
PHENICX-WD-WP1-UPF-140520-DeliverableTemplate -1.2 Page 3 of 34
Executive Summary
This deliverable captures the work done on the segmentation of users within the WP5 work package (Profiling and personalisation). More specifically, the deliverable reflects the work done
in task T5.4 (Matching users at different levels of specificity).
The task T5.4 takes as input the data from task T5.1 (Mining user-related profile information),
where D5.2 (Standardized corpus of user profiling information) is reported, and provides part of
the inputs for the task T6.2 (Personalized multimodal information system).
This deliverable presents methods for the discovery of social communities, i.e. how users group
together based on specific properties. The clustering of users is central to the personalization part of the project, as it will provide the basis for the development of the
personalization algorithms.
The identification of user clusters in the feature space designed for this deliverable is of paramount importance for the design and development of the personalized multi-modal
information system.
One of the demonstrators of the personalized system will be the Use Case 1 “Digital Program
Notes”. In this demonstrator, a user will get a personalized set of supporting material (in the form of additional text, images or sounds about the performer, piece or composer). In order to develop
user modeling techniques for this kind of personalization a suitable dataset is required. As D5.1
and D5.2 showed, the consumption logs on mobile devices and the huge social media streams do not contain enough useful information to perform a satisfactory modeling of users for the
specific case of the personalized multimodal information system as foreseen in the DoW.
Hence, we carried out a user study to collect the needed features and user preferences. Based
on these data we performed user-clustering techniques to identify in which aspects users can be
grouped.
Bearing in mind that the personalization will be included in the use case “Digital Program Notes”,
where a collection of personalized supporting multimedia material will be delivered to the user based on their personal profile, we carried out a user study on Amazon Mechanical Turk. The
subjects were asked three sets of questions: (i) demographics (age, gender, music listening behaviours, personality), (ii) general music preferences (genre preferences) and (iii) preferences
about supporting multimedia material. The latter was composed of a set of 14 conditions where
users were asked to rate different aspects of supporting multimedia items of different lengths (long/short), modalities (text, images, audio) and entities (composer, piece, performer).
In order to achieve the grouping of users as requested in the DoW we performed user clustering on three types of features: demographics, general music preferences and preferences about the
supporting multimedia material.
The following steps were carried out for each type of features:
Step 1: Principal Component Analysis (PCA) to identify the variables that account for most of the variance. This step was done for two reasons: (i) dimensionality reduction and (ii) identification of the variables with the highest variance, which are the candidates for clustering.
Step 2: Visual inspection of the distribution of data points (single- and pair-wise). In this step we visually inspected the distributions of the data points within a single feature (histograms) and pair-wise. This inspection was important to visually identify the clustering of data points as well as possible correlations between variables.
Step 3: GMM clustering on the most variable features (pair-wise). In this last step we performed Gaussian Mixture Model (GMM) clustering with two clusters over pairs of features.
For the demographics features, data points did not show any natural pair-wise clusters, except
for the personality factors of extraversion and agreeableness where there appear to be two distinct clusters: (i) users with high extraversion and low agreeableness and (ii) users with low
extraversion and high agreeableness.
In the general music preferences set of features some of the genre features (avant-garde, heavy metal and rap) exhibit bimodal histograms, suggesting that users cluster into those who
like the genre and those who don’t. When we observed the genre features pairwise and performed the clustering, we found that some genre pairs tend to produce clusters (e.g. the
latin-reggae, the international-latin, the electronic-new age or the country-religious pairs).
In the feature space of preferences about the supporting multimedia material the distributions of answers to the questions for each condition follow a bell-shaped curve. Due to the large number
of possible combinations the pairwise analysis has not been concluded yet as of the writing of this deliverable.
This deliverable will be used by tasks T6.2 (Personalized multimodal information system), tasks
T7.1 and T7.2 (especially the demonstrator UC1: Digital Program Notes).
The following steps need to be taken in order to reach fully functional user models for the
personalized system:
- Perform a more in-depth cluster analysis for the supporting MM material
- Perform regression analysis to develop predictive models from independent variables (which can be acquired through initial questionnaires, past usage or cross-domain) to dependent variables (preferences about the supporting multimedia material)
1 BACKGROUND
This deliverable captures the work done on the segmentation of users within the WP5 work
package (Profiling and personalisation). More specifically, the deliverable reflects the work done in task T5.4 (Matching users at different levels of specificity).
The task T5.4 takes as input the data from task T5.1 (Mining user-related profile information),
where D5.2 (Standardized corpus of user profiling information) is reported, and provides part of the inputs for the task T6.2 (Personalized multimodal information system).
This deliverable presents methods for the discovery of social communities, i.e. how users group together based on specific properties. The clustering of users is very important for the
personalization part of the project, as it will provide the basis for the development of the
personalization algorithms.
According to the description of work, this deliverable is required to address:
- Relating users to each other based on differing profile attributes, such as demographics and taste
- Relating users to each other through users’ relation to musical items, which can consider topics such as familiarity with a musical item, and general interaction information
The expected result is a multi-faceted and flexible addressing of user-user and user-item-user similarity.
As reported in the deliverable D5.1 social media streams contain little information about classical
music usage. Also, as reported in D5.2, mobile users rarely consume classical music. Hence, in addition to the corpora reported in these two deliverables, we collected additional data in order
to gain insights into users’ preferences and consequently be able to devise methods for the discovery and modeling of social communities.
Furthermore, the personalization process will be shown in the demonstrator “Digital Program
Notes” (UC1).
In this deliverable we report the acquisition of the required data through a user study and the
analysis of the clustering of users on this dataset.
2 INTRODUCTION
2.1 Main objectives and goals
The main objective of this deliverable is to report on the data and methods employed to perform user-clustering techniques. The identification of user clusters in the feature space designed for
this deliverable is of paramount importance for the design and development of the personalized multi-modal information system.
One of the demonstrators of the personalized system will be the Use Case 1 “Digital Program Notes”. In this demonstrator, a user will get a personalized set of supporting material (in the form
of additional text, images or sounds about the performer, piece or composer). In order to develop
user modeling techniques for this kind of personalization a suitable dataset is required. As D5.1 and D5.2 showed, the consumption logs on mobile devices and the huge social media streams
do not contain enough useful information to perform a satisfactory modeling of users for the specific case of the personalized multimodal information system as foreseen in the DoW.
Hence, we carried out a user study to collect the needed features and user preferences. Based
on these data we performed user-clustering techniques to identify in which aspects users can be grouped.
In this document we first present the acquisition of the data required (Sec. 3). Then we outline
the clustering technique involved (Sec. 4). Finally, we present the outcomes of the clustering
techniques in Sects. 5, 6 and 7.
3 DATA ACQUISITION
When designing the data acquisition we had in mind the use case “Digital Program Notes”, where
a collection of personalized supporting multimedia material will be delivered to the user based on
their personal profile. Hence, a user study was conducted with 167 participants, mostly from the US. The subjects were exposed to a set of conditions that reflected combinations of three variables
with the following values (lengths):
- Textual information (long/short)
- Audio information (single/multiple)
- Pictures (single/many)
The variables were related to the following entities:
- Composer
- Orchestra
- Piece
In total we had 14 conditions:
condition ID   length   modality   entity
1              long     text       composer
2              short    text       composer
3              short    image      composer
4              long     image      composer
5              long     audio      composer
6              short    text       piece
7              long     text       piece
8              short    image      piece
9              long     image      piece
10             short    text       orchestra
11             long     text       orchestra
12             short    image      orchestra
13             long     image      orchestra
18             long     audio      piece
Furthermore we collected features that we hypothesized could be good predictors of the users'
clusters. We collected two more groups of data: (i) demographics and (ii) music preferences.
In the demographics section, we asked the users to provide us with the following information:
- Age
- Gender
- Time spent listening to classical music
- Time spent listening to non-classical music
- Time spent playing an instrument
- Musical education
- Number of attended classical concerts
- Number of attended non-classical concerts
Furthermore, we asked the subjects to fill in the Ten-Item Personality Inventory (TIPI) questionnaire [tipi] to
assess their personality profiles in terms of the Five Factor Personality Model (FFM) [Gosling et al. 2003].
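As an illustration of how TIPI answers map to the five factors, the following minimal sketch applies the scoring key published with the inventory: each factor is the mean of one regular item and one reverse-scored item on a 1 to 7 scale. The deliverable does not state how the answers were scored or rescaled, so the key, the function name `score_tipi` and the example answers are assumptions.

```python
# TIPI scoring key: trait -> [(item number, reverse-scored?)]. Items are rated 1-7.
# Assumption: the published key is used unchanged; the study may have rescaled scores.
TIPI_KEY = {
    "Extraversion": [(1, False), (6, True)],
    "Agreeableness": [(2, True), (7, False)],
    "Conscientiousness": [(3, False), (8, True)],
    "Stability": [(4, True), (9, False)],
    "Openness": [(5, False), (10, True)],
}

def score_tipi(answers):
    """answers: dict mapping item number (1-10) to a rating from 1 to 7."""
    scores = {}
    for trait, items in TIPI_KEY.items():
        # Reverse-scored items are recoded as 8 - rating before averaging.
        values = [(8 - answers[i]) if rev else answers[i] for i, rev in items]
        scores[trait] = sum(values) / len(values)
    return scores

# Hypothetical answers of one participant.
example = {1: 6, 2: 3, 3: 5, 4: 2, 5: 7, 6: 2, 7: 5, 8: 3, 9: 6, 10: 1}
tipi_scores = score_tipi(example)
print(tipi_scores)
```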
In the music preferences section we asked the subjects to express their preferences towards
some music genres. The set of genres was selected from the most popular genres at the Last.fm
website.
The user-study web survey is available here: http://bird.cp.jku.at/phenicx_us_guest/ [survey] Screenshots are in Appendix 10.2.
4 USER CLUSTERING PROCEDURE
In order to achieve the grouping of users as requested in the DoW we performed user clustering
on three types of features:
i. Demographics
ii. General music preferences
iii. Preferences about the supporting multimedia material
For each type of features, we followed the clustering method outlined in Figure 1.
Figure 1: The clustering method employed
The following steps were carried out for each type of features:
Step 1: Principal Component Analysis (PCA) to identify the variables that account for most of the variance. This step was done for two reasons: (i) dimensionality reduction and (ii) identification of the variables with the highest variance, which are the candidates for clustering.
Step 2: Visual inspection of the distribution of data points (single- and pair-wise). In this step we visually inspected the distributions of the data points within a single feature (histograms) and pair-wise. This inspection was important to visually identify the clustering of data points as well as possible correlations between variables.
Step 3: GMM clustering on the most variable features (pair-wise). In this last step we performed Gaussian Mixture Model (GMM) clustering with two clusters over pairs of features.
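Step 3 can be sketched as follows. This is a dependency-free illustration of fitting a two-component GMM to one pair of features with the EM algorithm; the deliverable does not say which implementation was used, so the function names, the initialisation strategy, the iteration count and the synthetic data are all assumptions.

```python
import math
import random

def gauss2d(p, mean, cov):
    """Density of a 2-D Gaussian with a full 2x2 covariance matrix at point p."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Quadratic form (p - mean)^T cov^-1 (p - mean) via the closed-form 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def fit_gmm2(points, n_iter=50):
    """EM for a two-component GMM over one pair of features."""
    # Assumed initialisation: the first point and the point farthest from it.
    first = points[0]
    far = max(points, key=lambda p: (p[0] - first[0]) ** 2 + (p[1] - first[1]) ** 2)
    means = [first, far]
    covs = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            dens = [weights[k] * gauss2d(p, means[k], covs[k]) for k in range(2)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([v / total for v in dens])
        # M-step: re-estimate weight, mean and covariance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mx = sum(r[k] * p[0] for r, p in zip(resp, points)) / nk
            my = sum(r[k] * p[1] for r, p in zip(resp, points)) / nk
            sxx = sum(r[k] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nk
            syy = sum(r[k] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nk
            sxy = sum(r[k] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nk
            weights[k] = nk / len(points)
            means[k] = (mx, my)
            # A small ridge on the diagonal keeps the covariance invertible.
            covs[k] = [[sxx + 1e-6, sxy], [sxy, syy + 1e-6]]
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, covs, weights, labels

# Synthetic (hypothetical) data: two groups in a two-feature space.
rng = random.Random(1)
group_a = [(rng.gauss(1.5, 0.1), rng.gauss(0.5, 0.1)) for _ in range(30)]
group_b = [(rng.gauss(0.5, 0.1), rng.gauss(1.5, 0.1)) for _ in range(30)]
means, covs, weights, labels = fit_gmm2(group_a + group_b)
print(means, weights)
```

With only two components per pair, the fitted means and mixture weights are easy to compare across feature pairs, which fits the pair-wise inspection described in Step 2.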
5 CLUSTERING USERS BASED ON DEMOGRAPHICS
The PCA analysis showed that 95 % of the variance is explained by the first four components, as
can be seen in Figure 2.
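The 95 % criterion can be illustrated with a short sketch that counts how many components are needed to reach a cumulative share of the variance. The per-component variances below are hypothetical values chosen only to resemble the shape reported for the demographics features; they are not the study's data, and the function name is an assumption.

```python
def components_for_threshold(variances, threshold=0.95):
    """Return (k, share): the smallest number of leading components whose
    cumulative share of the total variance reaches the threshold."""
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += v / total
        if cumulative >= threshold:
            return k, cumulative
    return len(variances), cumulative

# Hypothetical per-component variances (not taken from the deliverable).
hypothetical_variances = [6.0, 2.0, 1.0, 0.7, 0.3]
k, share = components_for_threshold(hypothetical_variances)
print(k, round(share, 3))
```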
[Figure 1 diagram: User features → Principal Component Analysis → Selected features → Histogram visual inspection → Pairwise GMM clustering → GMM clusters models]
Figure 2: PCA analysis of the demographics features
The first two components account for roughly 80% of the variance. The feature weights in the first two components are the following (sorted by the absolute value of the weights):
First component:
feature ID feature name weight
10 Listening-non-classical 0.927461
7 Age -0.357710
14 concert-non-clasical 0.101956
11 playing 0.032736
12 musical education 0.016817
9 Listening-classical 0.006512
8 Gender 0.004539
1 Extraversion -0.004093
2 Agreeableness -0.003092
13 concert-clasical -0.002865
4 Stability 0.002661
3 Conscientiousness 0.002059
5 Openness 0.001450
[Figure 2 plot: x-axis Principal Component (1 to 4); y-axis Variance Explained (%)]
Second component:
feature ID feature name weight
7 Age 0.931621
10 Listening-non-classical 0.360512
12 musical education 0.029036
11 playing -0.026504
9 Listening-classical 0.017113
13 concert-clasical -0.011386
14 concert-non-clasical -0.008378
5 Openness -0.005510
8 Gender -0.003827
2 Agreeableness 0.003812
4 Stability 0.003533
1 Extraversion -0.001385
3 Conscientiousness 0.000743
Based on the above tables we selected those features that are present in the first two components and have an absolute weight higher than 0.1. These features are: 7 (age), 10
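The selection rule can be checked against the two weight tables with a short sketch. The weights are copied from the tables above; the function name and data layout are assumptions, and the result is simply what the stated rule yields when applied to these numbers.

```python
# feature ID -> (feature name, weight in first component, weight in second component),
# copied from the two tables above (names spelled as in the tables).
LOADINGS = {
    1: ("Extraversion", -0.004093, -0.001385),
    2: ("Agreeableness", -0.003092, 0.003812),
    3: ("Conscientiousness", 0.002059, 0.000743),
    4: ("Stability", 0.002661, 0.003533),
    5: ("Openness", 0.001450, -0.005510),
    7: ("Age", -0.357710, 0.931621),
    8: ("Gender", 0.004539, -0.003827),
    9: ("Listening-classical", 0.006512, 0.017113),
    10: ("Listening-non-classical", 0.927461, 0.360512),
    11: ("playing", 0.032736, -0.026504),
    12: ("musical education", 0.016817, 0.029036),
    13: ("concert-clasical", -0.002865, -0.011386),
    14: ("concert-non-clasical", 0.101956, -0.008378),
}

def select_features(loadings, threshold=0.1):
    """IDs of features whose absolute weight exceeds the threshold
    in at least one of the first two components."""
    selected = []
    for fid, (name, w1, w2) in loadings.items():
        if abs(w1) > threshold or abs(w2) > threshold:
            selected.append(fid)
    return sorted(selected)

selected_ids = select_features(LOADINGS)
print([(fid, LOADINGS[fid][0]) for fid in selected_ids])
```

Feature 14 passes the threshold only narrowly (0.101956 in the first component), so the outcome is sensitive to the exact cut-off.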
Furthermore, a visual inspection and the GMM clustering technique in the space of the first two
principal components suggest that users do not tend to group in clusters, as can be seen in
Figure 3.
Figure 3: GMM clustering on the first two PCA components of the demographics features
[Figure 3 plot: GMM clusters of the first two PCA components; x-axis first PCA component; y-axis second PCA component]
The visual inspection of the histograms of the observed features (Figure 4) suggests that the data
do not form clusters.
Figure 4: Histograms of the demographics features
A visual inspection of the pairwise plot of data points suggested that the data do not group in
clusters except for the features extraversion and agreeableness (Figure 5).
[Figure 4 panels: Extraversion, Agreeableness, Conscientiousness, Stability, Openness, Age, Gender, Listening-classical, Listening-non-classical, playing, musical education, concert-clasical, concert-non-clasical]
Figure 5: GMM clusters in the agreeableness and extraversion features space
[Figure 5 plot: x-axis Extraversion; y-axis Agreeableness; plotted quantity pdf(obj,[x,y])]
In our sample there appear to be two distinct clusters: users with high extraversion and low agreeableness and users with low extraversion and high agreeableness.
6 CLUSTERING BASED ON MUSIC PREFERENCES
The features used in the clustering based on music preferences describe the
users' preferences for various genres of music. The genres that define the features are:
Avant-Garde
Blues
Classical
Country
Easy Listening
Electronic
Folk
Heavy Metal
International
Jazz
Latin
New Age
Pop/Rock
R&B
Rap
Reggae
Religious
Vocal
As can be seen in Figure 6, the variance is more equally distributed over several components.
Figure 6: PCA of the features describing user genre preferences
PCA first component:
25 latin 0.308804
23 international 0.303600
24 jazz 0.289231
30 reggae 0.283428
16 blues 0.283175
28 rnb 0.279361
15 avant garde 0.264402
32 vocal 0.259802
29 rap 0.250575
18 country 0.247058
21 folk 0.243983
26 new age 0.205037
31 religious 0.202804
19 easy listening 0.186288
20 electronic 0.166741
17 classical 0.135895
27 pop rock 0.073136
22 heavy metal -0.000545
[Figure 6 plot: x-axis Principal Component (1 to 10); y-axis Variance Explained (%)]
PCA second component:
22 heavy metal 0.521260
18 country -0.391733
19 easy listening -0.375172
31 religious -0.366743
30 reggae 0.274788
20 electronic 0.274458
29 rap 0.229055
32 vocal -0.200794
16 blues 0.124734
15 avant garde 0.114366
23 international 0.099874
25 latin -0.078235
28 rnb 0.064448
24 jazz 0.054952
27 pop rock -0.046903
21 folk 0.039674
17 classical -0.012432
26 new age 0.011101
The distribution and GMM clustering of the first two PCA components reveal a tendency to cluster
into two groups, although the two groups lie very close together, as shown in Figure 7.
Figure 7: Clustering in the space of the first two PCA components of the genre preferences space
[Figure 7 plot: GMM clusters of the first two PCA components; x-axis first PCA component; y-axis second PCA component]
The histograms are presented in Figure 8. The distributions of preferences for genres have
different patterns. While most of the genres exhibit a somewhat bell-shaped distribution, some
of the genres appear to exhibit a bimodal histogram, implying that users tend to cluster into those who like the genre and those who don’t (e.g. avant-garde, heavy metal or rap).
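The distinction between bell-shaped and bimodal histograms was made visually in this study. As a rough automated counterpart, the sketch below counts strict local peaks in a rating histogram. The 1 to 5 rating scale, the function name and both example samples are assumptions for illustration, and a peak count is only a crude indicator that can be misled by noisy or flat histograms.

```python
from collections import Counter

def histogram_modes(ratings, levels=range(1, 6)):
    """Return (histogram, modes): counts per rating level and the levels
    that are strict local maxima of the histogram."""
    counts = Counter(ratings)
    hist = [counts.get(level, 0) for level in levels]
    modes = []
    for i, c in enumerate(hist):
        left = hist[i - 1] if i > 0 else -1
        right = hist[i + 1] if i < len(hist) - 1 else -1
        if c > left and c > right:
            modes.append(levels[i])
    return hist, modes

# Hypothetical samples: one polarised genre, one with a bell-shaped distribution.
polarised = [1] * 40 + [2] * 15 + [3] * 5 + [4] * 20 + [5] * 35
bell_shaped = [1] * 5 + [2] * 20 + [3] * 50 + [4] * 20 + [5] * 5

polarised_hist, polarised_modes = histogram_modes(polarised)
bell_hist, bell_modes = histogram_modes(bell_shaped)
print(polarised_modes, bell_modes)
```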
Figure 8: Distribution of genre preferences among users
Some combinations of genre features tend to create clusters of users as can be seen in Figure
Figure 9: Clusters in the latin-reggae feature space
Figure 10: Clusters in the international-latin feature space
[Figure 9 plot: x-axis latin; y-axis reggae]
[Figure 10 plot: x-axis international; y-axis latin]
Figure 11: Clusters in the electronic-new age feature space
Figure 12: Clusters in the country-religious feature space
[Figure 11 plot: x-axis electronic; y-axis new age]
[Figure 12 plot: x-axis country; y-axis religious]
7 CLUSTERING BASED ON PREFERENCES OF SUPPORTING MULTIMEDIA MATERIAL
In terms of supporting multimedia material, we want to discover how users form clusters based
on two aspects of the supporting MM item: likeness and novelty. For each of the items in the user study, we asked the users two questions:
1. »I find the content above interesting« to measure how much the user liked the item
2. »I learned something new from the content above.« to measure if the item was novel to the user
For each item we also asked the users how much of the item they consumed. Hence, for each item (condition in the user study) we had three features: the
consumption, the likeness and the novelty. Given 14 conditions, we had a total of 42 features.
The PCA revealed that the first component explains much more variance (more than 30%) than
the other components (each less than 10 %). The threshold of 95% of the whole variance is reached with the first ten components (out of 42 variables).
The feature names were coded by merging the condition ID and the question type, separated by a hyphen. For each condition, the user was asked three questions: the two questions mentioned
above and how much of the item the user had consumed. Hence the variable name »c5-A2« means »condition 5,
answer 2«. The coding scheme is outlined in Table 1.
[Plot: x-axis Principal Component (1 to 10); y-axis Variance Explained (%)]
condition ID   length   modality   entity      consumption       question 1 (likeness)   question 2 (novelty)
1              long     text       composer    c1-consumption    c1-A1                   c1-A2
2              short    text       composer    c2-consumption    c2-A1                   c2-A2
3              short    image      composer    c3-consumption    c3-A1                   c3-A2
4              long     image      composer    c4-consumption    c4-A1                   c4-A2
5              long     audio      composer    c5-consumption    c5-A1                   c5-A2
6              short    text       piece       c6-consumption    c6-A1                   c6-A2
7              long     text       piece       c7-consumption    c7-A1                   c7-A2
8              short    image      piece       c8-consumption    c8-A1                   c8-A2
9              long     image      piece       c9-consumption    c9-A1                   c9-A2
10             short    text       orchestra   c10-consumption   c10-A1                  c10-A2
11             long     text       orchestra   c11-consumption   c11-A1                  c11-A2
12             short    image      orchestra   c12-consumption   c12-A1                  c12-A2
13             long     image      orchestra   c13-consumption   c13-A1                  c13-A2
18             long     audio      piece       c18-consumption   c18-A1                  c18-A2
Table 1: Feature names coding table
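The coding scheme of Table 1 can be expressed as a small lookup. The sketch below builds the feature names from the condition IDs and decodes a name back into its condition and measure; the condition data are copied from Table 1, while the function names and the dictionary layout are assumptions.

```python
# condition ID -> (length, modality, entity), copied from Table 1.
CONDITIONS = {
    1: ("long", "text", "composer"),
    2: ("short", "text", "composer"),
    3: ("short", "image", "composer"),
    4: ("long", "image", "composer"),
    5: ("long", "audio", "composer"),
    6: ("short", "text", "piece"),
    7: ("long", "text", "piece"),
    8: ("short", "image", "piece"),
    9: ("long", "image", "piece"),
    10: ("short", "text", "orchestra"),
    11: ("long", "text", "orchestra"),
    12: ("short", "image", "orchestra"),
    13: ("long", "image", "orchestra"),
    18: ("long", "audio", "piece"),
}

# Question code -> what it measures.
QUESTIONS = {"consumption": "consumption", "A1": "likeness", "A2": "novelty"}

def feature_name(condition_id, question):
    """Build a feature name such as 'c5-A2'."""
    return f"c{condition_id}-{question}"

def parse_feature(name):
    """Decode a feature name into its condition attributes and measure."""
    cond, question = name.split("-", 1)
    condition_id = int(cond[1:])
    length, modality, entity = CONDITIONS[condition_id]
    return {
        "condition": condition_id,
        "length": length,
        "modality": modality,
        "entity": entity,
        "measure": QUESTIONS[question],
    }

ALL_FEATURES = [feature_name(c, q) for c in CONDITIONS for q in QUESTIONS]
print(len(ALL_FEATURES), parse_feature("c5-A2"))
```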
The first PCA component is composed as follows:
34 c1-A1 0.236708
33 c1-consumption 0.230158
51 c7-A1 0.217614
46 c5-A1 0.203067
37 c2-A1 0.202655
63 c11-A1 0.200896
73 c18-A2 0.195757
60 c10-A1 0.193796
64 c11-A2 0.193683
50 c7-consumption 0.193269
35 c1-A2 0.192121
45 c5-A1 0.188850
52 c7-A2 0.184513
48 c6-A1 0.179007
62 c11-consumption 0.178515
40 c3-A1 0.172805
67 c12-A2 0.166570
41 c3-A2 0.165165
72 c18-A1 0.158511
38 c2-A2 0.152254
36 c2-consumption 0.137092
55 c8-A2 0.130973
58 c9-A2 0.128763
43 c4-A2 0.128714
61 c10-A2 0.128081
71 c18-consumption 0.123146
39 c3-consumption 0.121113
49 c6-A2 0.120396
57 c9-A1 0.118104
42 c4-consumption 0.114457
44 c5-consumption 0.114398
65 c12-consumption 0.109351
53 c8-consumption 0.105658
70 c13-A2 0.103594
54 c8-A1 0.103187
56 c9-consumption 0.099231
66 c12-A1 0.098989
68 c13-consumption 0.094569
59 c10-consumption 0.094564
47 c6-consumption 0.087937
69 c13-A1 0.082330
In the space of the first two components, the users do not appear to form distinct clusters.
The distributions of answers to the questions for each condition follow a bell-shaped curve as can
be seen in Figure 13.
[Plot: GMM clusters of the first two PCA components; x-axis first PCA component; y-axis second PCA component]
Figure 13: Histograms of the 42 features on preferences of the supporting multimodal material
Due to the large number of possible combinations, the pairwise analysis has not been concluded
yet as of the writing of this deliverable.
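The size of the pairwise search space follows directly from the number of features: 42 features give 861 unordered pairs, each of which would need its own scatter plot and GMM fit. A minimal sketch of the count, with placeholder feature names that are purely illustrative:

```python
from itertools import combinations
from math import comb

n_features = 42  # 14 conditions x 3 measures, as stated in the text

# Placeholder names; the actual names follow the coding scheme of Table 1.
features = [f"f{i}" for i in range(n_features)]
pairs = list(combinations(features, 2))

n_pairs = comb(n_features, 2)  # n * (n - 1) / 2
print(n_pairs, len(pairs))
```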
[Figure 13 panels: c1-consumption, c1-A1, c1-A2, c2-consumption, c2-A1, c2-A2, c3-consumption, c3-A1, c3-A2, c4-consumption, c4-A2, c5-consumption, c5-A1, c5-A1, c6-consumption, c6-A1, c6-A2, c7-consumption, c7-A1, c7-A2, c8-consumption, c8-A1, c8-A2, c9-consumption, c9-A1, c9-A2, c10-consumption, c10-A1, c10-A2, c11-consumption, c11-A1, c11-A2, c12-consumption, c12-A1, c12-A2, c13-consumption, c13-A1, c13-A2, c18-consumption, c18-A1, c18-A2]
8 CONCLUSION
In this deliverable, we presented the outcomes of the data collection and clustering of users
based on demographics, genre preferences and preferences about the supporting multimedia materials.
The analysis of user clusters, done using GMM, showed that users generally do not tend to form distinctive clusters, except for certain personality traits (extraversion-agreeableness) and genre
preferences (e.g. country-religious).
This deliverable will be used by tasks T6.2 (Personalized multimodal information system), tasks
T7.1 and T7.2 (especially the demonstrator UC1: Digital Program Notes).
The following steps need to be taken in order to reach fully functional user models for the personalized system:
- Perform a more in-depth cluster analysis for the supporting MM material group of features
- Apply cluster quality measures (e.g. between-variance-within-variance ratio)
- Perform regression analysis to develop predictive models from independent variables (which can be acquired through initial questionnaires, past usage or cross-domain) to dependent variables (preferences about the supporting multimedia material)
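One possible reading of the between-variance-within-variance ratio mentioned above is the ratio of the between-cluster to the within-cluster sum of squared distances. The sketch below computes this plain ratio; the exact measure to be adopted is not specified in this deliverable, so the definition, the function name and the toy data are assumptions.

```python
def variance_ratio(points, labels):
    """Between-cluster sum of squares divided by within-cluster sum of squares.
    Higher values indicate more compact, better separated clusters."""
    dim = len(points[0])

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = centroid(points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    between = sum(len(g) * sqdist(centroid(g), overall) for g in groups.values())
    within = sum(sqdist(p, centroid(g)) for g in groups.values() for p in g)
    return float("inf") if within == 0 else between / within

# Toy data: the first labelling follows the obvious split, the second cuts across it.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good_ratio = variance_ratio(toy_points, [0, 0, 1, 1])
poor_ratio = variance_ratio(toy_points, [0, 1, 0, 1])
print(good_ratio, poor_ratio)
```

Comparing this ratio across feature pairs would give a numeric complement to the visual inspection used so far.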
9 REFERENCES
9.1 Written references
Hu, R., & Pu, P. (2010). Using Personality Information in Collaborative Filtering for New Users. Recommender Systems and the Social Web, 17. Retrieved from http://www.dcs.warwick.ac.uk/~ssanand/RSWeb_files/Proceedings_RSWEB-10.pdf#page=23
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30–37. doi:10.1109/MC.2009.263
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734–749. doi:10.1109/TKDE.2005.99
Tkalcic, M., Kunaver, M., Košir, A., & Tasic, J. (2011). Addressing the new user problem with a personality based user similarity measure. Joint Proceedings of the Workshop on Decision Making and Recommendation Acceptance Issues in Recommender Systems (DEMRA 2011) and the 2nd Workshop on User Models for Motivational Systems: The Affective and the Rational Routes to Persuasion (UMMS 2011). Retrieved from http://ceur-ws.org/Vol-740/DEMRA_UMMS_2011_proceedings.pdf#page=106
Gosling, S. D., Rentfrow, P. J., & Swann, W. B. (2003). A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37(6), 504–528. doi:10.1016/S0092-