Page 1
Analysis & Knowledge Extraction of Online User Behaviour and Visual Content
for Art and Culture Events
Marco Brambilla, Tahereh Arabghalizi, Behnam Rahdari
Contacts: Marco Brambilla, @marcobrambi, [email protected], http://datascience.deib.polimi.it
UNIVERSITY OF PITTSBURGH
Page 2
Agenda
Context
Method
• Pre-processing
• Topic analysis
• User clustering
• Multimedia: Images
o concepts vs. text extraction
o color scheme and the main color pattern(s)
• Prediction of interests
Challenges & Conclusions
Page 3
Context
• Role of social media in our life
• Social media for cultural and artistic events
• Behaviour and content
• Multi-disciplinary collaboration on social media analysis and
cultural heritage
• Collaboration: Politecnico di Milano, Musei di Brescia, University
of Pittsburgh
Page 4
Research Questions
Topics of interest of visitors?
Categorization of users?
Demographics of visitors?
Engagement and online
participation?
Relation between photos, time,
location, text and the event?
Page 5
Approach
Domain-specific pipeline to profile social media users
and content in cultural or art events
Page 6
Case Study
The Floating Piers by Christo and Jeanne-Claude
Iseo Lake, Italy
June 2016
Page 10
Case Study
• $17 million
• 220,000 floating blocks
• 1.5 million visitors in 16 days
Page 12
Data Extraction
• Using Instagram and Twitter APIs
• Extract relevant tweets/posts during the event
• Extract all relevant users
o Those who tweet/post directly
o Those who like, comment, retweet, etc.
• Extract all properties
o Textual: bio, tweet/post text, hashtag, etc.
o Quantitative: #followers, #followings, etc.
o Media: photos, metadata (geotag, …)
Page 13
Collected Data (from June 10th to July 30th)

                 Twitter (tweets)   Instagram (posts)
Tweets/Posts     14,062             30,256
Users            23,916             94,666
  Authors        7,724              16,681
  Reacting       16,197             77,985
Page 14
Preprocessing
• Text normalization (NLP)
• Language identification and translation
• Gender detection
• Data cleansing
• Store clean and transformed data
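A minimal sketch of the text-normalization step, using only the Python standard library. The function name `normalize_post`, the regular expressions and the sample post are illustrative assumptions, not the pipeline used in the study; language identification, translation and gender detection are not covered here.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def normalize_post(text):
    """Return (tokens, hashtags) for one raw post or tweet."""
    # Fold accents and case so that "Città" and "citta" map to the same token.
    folded = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch)).lower()
    hashtags = HASHTAG_RE.findall(folded)
    # Remove URLs and mentions; hashtag words stay as ordinary tokens.
    cleaned = MENTION_RE.sub(" ", URL_RE.sub(" ", folded))
    tokens = TOKEN_RE.findall(cleaned)
    return tokens, hashtags


# Hypothetical post, for illustration only.
tokens, hashtags = normalize_post(
    "Walking on #TheFloatingPiers @ Lago d'Iseo! https://example.org/p/1"
)
print(tokens)
print(hashtags)
```

Keeping hashtags both as tokens and as a separate list is a design choice of this sketch: it lets later steps count them as terms and also compare them with image content.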
Page 15
Time Distribution (Twitter)
Page 16
Time series – Instagram vs. Twitter
Page 17
Instagram Likes and Comments
Page 18
Geographical Distribution (Instagram)
[Maps: Italy, Lombardy Region, Iseo Lake]
Page 19
Data Analysis Process
1. Document Term Matrix (DTM)
2. Topic Extraction
3. Dimension Reduction
4. Cluster Analysis and Validation
5. Prediction
6. Media Analysis
7. Content Network Analysis
Page 21
Document-term Matrix
A matrix that describes the frequency of terms that
occur in a collection of documents
Documents (rows) vs. terms (columns):

          Art   Travel   Italy   Design   …
Post 1    0     1        1       0
Post 2    1     2        0       1
Post 3    0     0        1       0
Post 4    1     1        3       1
…
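A document-term matrix like the one above can be built with a few lines of standard-library Python. This is a sketch under simple assumptions: posts are already tokenized, and the function name `document_term_matrix` and the sample posts are hypothetical.

```python
from collections import Counter


def document_term_matrix(docs):
    """docs: list of token lists. Returns (sorted vocabulary, matrix of counts)."""
    vocab = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical tokenized posts, for illustration only.
posts = [
    ["travel", "italy"],
    ["art", "travel", "travel", "design"],
    ["italy"],
]
vocab, dtm = document_term_matrix(posts)
print(vocab)  # ['art', 'design', 'italy', 'travel']
for row in dtm:
    print(row)
```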
Page 22
Topic Extraction
Latent Dirichlet Allocation (LDA):
documents as mixtures of topics (with probability)
Input: Document Term Matrix
Outputs: Topics, Topic Probabilities Matrix
          Topic 1   Topic 2   Topic 3   Topic 4   Topic 5   Topic 6   …
Post 1    0.19      0.16      0.27      0.14      0.11      0.13
Post 2    0.31      0.18      0.21      0.08      0.10      0.12
Post 3    0.25      0.24      0.20      0.17      0.09      0.05
Post 4    0.19      0.32      0.22      0.10      0.07      0.10
…
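To make the input and output of LDA concrete, here is a small collapsed Gibbs sampler written from scratch in standard-library Python. It is a teaching sketch, not the implementation used in the study: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all assumptions. Documents are given as lists of term ids, which carries the same information as the document-term matrix.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of lists of term ids. Returns per-document topic probabilities."""
    rng = random.Random(seed)
    n_terms = 1 + max(t for doc in docs for t in doc)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_term = [[0] * n_terms for _ in range(n_topics)]  # tokens of term w in topic k
    topic_total = [0] * n_topics                           # tokens in topic k
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_term[j][w] + beta)
                    / (topic_total[j] + n_terms * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1
    # Smoothed, normalized topic mixture per document (each row sums to 1).
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]


# Hypothetical documents as term ids, for illustration only.
theta = lda_gibbs([[0, 0, 1], [2, 3, 3], [0, 1, 1, 2]], n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each returned row plays the role of one row of the Topic Probabilities Matrix. In practice an established library implementation would normally be preferred over a hand-written sampler.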
Page 23
Dimensionality Reduction
• Hundreds of topics extracted with LDA
• Using Principal Component Analysis (PCA) to extract a smaller set
of linearly uncorrelated topics
[Plot: variance share and cumulative variance share per component; cumulative share > 0.95 marked]
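The "> 0.95" cumulative-variance criterion can be sketched as follows, assuming the per-component variances (PCA eigenvalues) are already available. The function name `components_to_keep` and the sample variances are hypothetical; the PCA decomposition itself is not shown.

```python
from itertools import accumulate


def components_to_keep(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative share exceeds threshold."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total > threshold:
            return k
    return len(ordered)


# Hypothetical per-component variances, for illustration only.
variances = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
print(components_to_keep(variances))  # 4 (cumulative shares: 0.50, 0.75, 0.90, 0.96)
```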
Page 25
Cluster Analysis
• Apply clustering algorithms over Topic Probabilities
Matrix to cluster users
• Multiple data slices
• Multiple algorithms
o K-means
o Hierarchical
o DBSCAN
[Figure: users plotted along the Topic 1, Topic 2 and Topic 3 axes]
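As an illustration of clustering users by their topic probabilities, here is a compact k-means (Lloyd's algorithm) in standard-library Python. It is a sketch only: the function name `kmeans`, the random initialization, and the four hypothetical users are assumptions, and hierarchical clustering and DBSCAN are not shown.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm. points: list of equal-length tuples. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical topic-probability rows for four users, for illustration only.
users = [(0.8, 0.1, 0.1), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1), (0.2, 0.7, 0.1)]
labels, centroids = kmeans(users, k=2)
print(labels)
```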
Page 26
Cluster Validity
• How to evaluate the “goodness” of the resulting
clusters?
• Validation Measures
– Internal: e.g. Silhouette Coefficient, Dunn’s Index,
Calinski-Harabasz index
– External: e.g. Entropy, Purity, Rand index
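As an example of an internal validation measure, the mean Silhouette Coefficient can be computed as sketched below. The function name `silhouette`, the Euclidean distance and the toy points are assumptions; singleton clusters are given a score of 0 by convention.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points, for illustration only.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(round(silhouette(points, [0, 0, 1, 1]), 2))  # well-separated grouping
print(round(silhouette(points, [0, 1, 0, 1]), 2))  # poor grouping
```

Values close to 1 indicate compact, well-separated clusters; values near or below 0 indicate that points sit closer to another cluster than to their own.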
Page 28
Cluster Labeling
Users’ Biography Word Clouds
• Travel Lovers
• Art Lovers
• Internet & Tech Lovers
Page 29
Word Network for Clusters
Page 30
Hierarchical Clustering
• Travel Lovers
• Art Lovers
• Tech Lovers
Page 31
Impact of Demographics
• Language
• Gender
Page 33
Prediction
Predict the category or the interest area of potential new users for
similar cultural or art events in the future
Decision Trees
o Prepare Required Data
o Grow Decision Tree
o Extract rules from the tree
o Predict using test data
o Evaluate
Page 34
Prediction Rules
Extracted rules:
Rule 1: if (0.36 < Bio_score < 0.37) OR (Bio_score < 0.35)
        then Travel Lover
Rule 2: if (0.35 < Bio_score < 0.36 AND Status_count > 14.5)
        OR (Bio_score > 0.37 AND Language != Italian)
        then Art Lover
Rule 3: if (Bio_score > 0.37 AND Language = Italian)
        then Tech Lover
Otherwise: Not Interested
Accuracy = 62%
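The extracted rules translate directly into code. The sketch below applies them in order and measures accuracy on labeled examples; the function names `predict_category` and `accuracy` and the four test users are hypothetical, and the 62% figure above comes from the study's own test data, not from this example.

```python
def predict_category(bio_score, status_count, language):
    """Apply Rules 1-3 in order; anything unmatched is 'Not Interested'."""
    if 0.36 < bio_score < 0.37 or bio_score < 0.35:
        return "Travel Lover"
    if (0.35 < bio_score < 0.36 and status_count > 14.5) or (
        bio_score > 0.37 and language != "Italian"
    ):
        return "Art Lover"
    if bio_score > 0.37 and language == "Italian":
        return "Tech Lover"
    return "Not Interested"


def accuracy(examples):
    """examples: list of ((bio_score, status_count, language), true_label) pairs."""
    hits = sum(predict_category(*features) == label for features, label in examples)
    return hits / len(examples)


# Hypothetical labeled users, for illustration only.
test_users = [
    ((0.30, 3, "English"), "Travel Lover"),
    ((0.355, 20, "Italian"), "Art Lover"),
    ((0.40, 8, "Italian"), "Tech Lover"),
    ((0.355, 5, "Italian"), "Art Lover"),  # predicted 'Not Interested': a miss
]
print(accuracy(test_users))  # 0.75
```

Note that the rules use strict inequalities, so a Bio_score exactly on a threshold (0.35, 0.36 or 0.37) falls through to "Not Interested" in this sketch.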
Page 37
Collected Data (from June 10th to July 30th)

                 Twitter (tweets)   Instagram (posts)
Tweets/Posts     14,062             30,256
Users            23,916             94,666
  Authors        7,724              16,681
  Reacting       16,197             77,985

Only Instagram
Page 38
Used Instagram Filters
Page 39
People in Pictures
Page 40
Visitor Analytics
• Age
• Sex: 50.4% female, 49.6% male
• Race
Bias of the medium?
Page 41
Image content analysis
Concept extraction (DNN-based third-party
service)
Comparison with hashtags / text
Image low-level feature analysis
Page 42
Object Extraction from Pictures
[Figure panels: Concepts in Pictures; Hashtags]
Users tend not to report the actual content of the photos
in their textual descriptions/hashtags
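One simple way to quantify how much the hashtags reflect the image content is the Jaccard similarity between the extracted concepts and the hashtags of a post. This is an assumed measure chosen for illustration, not necessarily the comparison used in the study; the function name `jaccard` and the sample post are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, compared case-insensitively as sets."""
    a = {x.lower() for x in a}
    b = {x.lower() for x in b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical post: concepts returned by an image tagger vs. the author's hashtags.
concepts = ["water", "people", "lake", "bridge"]
hashtags = ["TheFloatingPiers", "Christo", "lake"]
print(round(jaccard(concepts, hashtags), 3))  # 0.167: one shared item out of six
```

A low average score across posts would be consistent with the observation that users rarely describe what is actually in their photos.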
Page 43
Color Detection for Subject Identification
Main color shades among all photos
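A simple way to find the main color shades is to quantize each pixel into coarse RGB bins and count the most frequent bins. The sketch below assumes pixels are already available as (r, g, b) tuples; loading them from image files requires an imaging library and is not shown. The function name `main_color_shades`, the bin count and the sample pixels are hypothetical.

```python
from collections import Counter


def main_color_shades(pixels, levels=4, top=3):
    """pixels: iterable of (r, g, b) in 0..255. Returns the most common quantized shades."""
    step = 256 // levels
    counts = Counter()
    for r, g, b in pixels:
        # Snap each channel to the centre of its bin so similar shades are merged.
        shade = tuple((c // step) * step + step // 2 for c in (r, g, b))
        counts[shade] += 1
    return counts.most_common(top)


# Hypothetical pixels: three orange-like values and one blue-like value.
pixels = [(250, 140, 20), (245, 150, 30), (30, 90, 200), (255, 130, 10)]
print(main_color_shades(pixels))
# [((224, 160, 32), 3), ((32, 96, 224), 1)]
```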
Page 44
Objects or Colors?
Confusion Matrix
Simple techniques “good enough”?
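A confusion matrix for comparing two identification techniques can be computed as sketched below. The class names "pier" and "other" and the sample labels are hypothetical; the slide does not state which classes were used.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes, in the order of labels."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


# Hypothetical ground truth and predictions, for illustration only.
labels = ["pier", "other"]
actual = ["pier", "pier", "other", "other", "pier"]
predicted = ["pier", "other", "other", "pier", "pier"]
matrix = confusion_matrix(actual, predicted, labels)
print(matrix)  # [[2, 1], [1, 1]]

# Accuracy is the diagonal (correct predictions) over the total.
correct = sum(matrix[i][i] for i in range(len(labels)))
print(correct / len(actual))  # 0.6
```

Building one matrix per technique (object-based and color-based) on the same photos allows a like-for-like comparison.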
Page 45
Ongoing Challenges
Page 46
Future Challenges of KE
Determining exact
positioning based on
perspective
Page 47
Future Challenges of KE
Network structures
and their temporal
evolution
Max graph perturbation
Daily graph variations
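One possible way to quantify daily graph variation is the Jaccard distance between the edge sets of consecutive days, with the largest value marking the maximum perturbation. This metric is an assumption made for illustration; the slides do not say how variation was measured. The function names and the toy graphs are hypothetical.

```python
def edge_jaccard_distance(edges_a, edges_b):
    """1 - Jaccard similarity between two undirected edge sets."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


def daily_variations(daily_edges):
    """daily_edges: list of edge lists ordered by day. Distance between consecutive days."""
    return [edge_jaccard_distance(x, y) for x, y in zip(daily_edges, daily_edges[1:])]


# Hypothetical interaction graphs for three consecutive days.
days = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("c", "b"), ("c", "d")],
    [("d", "e")],
]
variations = daily_variations(days)
print([round(v, 3) for v in variations])
# Index of the day-to-day transition with the largest change.
print(max(range(len(variations)), key=variations.__getitem__))
```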
Page 48
Future Challenges
Real cross-disciplinarity
(cultural heritage, humanities,
social science)
No visitors for the cultural part of the event!
(exhibition at the museum)
[Figure: arrow pointing to the exhibit]
Page 49
Conclusions
• (Sometimes) Simple methods work just fine
• Interesting profiling and behaviour detection
• Still far from cross-disciplinary approaches
Page 50
Contacts: Marco Brambilla, @marcobrambi, [email protected]
http://datascience.deib.polimi.it
http://www.marco-brambilla.com
Analysis of Online User Behaviour for Art and Culture Events
Marco Brambilla, Tahereh Arabghalizi, Behnam Rahdari
UNIVERSITY OF PITTSBURGH