News-oriented multimedia search over multiple social networks Katerina Iliakopoulou, Symeon Papadopoulos and Yiannis Kompatsiaris 1 Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI) CBMI 2015, June 11, 2015, Prague, Czech Republic Presented by Katerina Andreadou
Transcript
Page 1: News-oriented multimedia search over multiple social networks

News-oriented multimedia search over multiple social networks

Katerina Iliakopoulou, Symeon Papadopoulos and Yiannis Kompatsiaris

Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)

CBMI 2015, June 11, 2015, Prague, Czech Republic

Presented by Katerina Andreadou

Page 2: News-oriented multimedia search over multiple social networks

The rise of Online Social Networks (OSNs)

• OSNs are increasingly popular and produce massive amounts of data
– Both text and multimedia

• Content peaks when
– A planned event takes place (e.g., Olympic Games)
– An unexpected news story breaks (e.g., an earthquake)

Journalistic practice now involves using user-generated content from OSNs to report on news stories and events

Page 3: News-oriented multimedia search over multiple social networks

The Problem

• News stories are covered in multiple OSNs
– Twitter, Facebook, Google+, Instagram, Tumblr, Flickr

• No effective means of searching over multiple OSNs
– Necessary to build appropriate queries
– Find relevant hashtags and query keywords

• Effective querying is not straightforward
– Long, complicated queries retrieve no results
– Vague queries bring back irrelevant content

Page 4: News-oriented multimedia search over multiple social networks

The Problem is also OSN-specific

• Flickr search is more flexible
– It returns results that contain all requested keywords, or a portion of them, with the appropriate ranking

• Instagram is more restrictive
– It can only handle hashtags
– It returns very few or no results to multi-keyword queries

• The order of keywords is also crucial for some OSNs

Query formulation has to be OSN-specific
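
As a rough illustration of what OSN-specific formulation can mean, the sketch below turns one keyword list into different query strings per platform. The function name `format_query` and the two behaviours (free-text for Flickr, single hashtags for Instagram) are simplifying assumptions based on the bullets above; no real platform API is called.

```python
def format_query(osn: str, keywords: list[str]) -> list[str]:
    """Turn one keyword list into query strings suited to a given OSN (hypothetical helper)."""
    if osn == "instagram":
        # Assumed hashtag-only search: issue one single-hashtag query per keyword
        return ["#" + kw.lower().replace(" ", "") for kw in keywords]
    # Assumed Flickr-like search: one free-text query containing all keywords
    return [" ".join(keywords)]


flickr_queries = format_query("flickr", ["academy", "awards"])
instagram_queries = format_query("instagram", ["Academy", "Awards"])
print(flickr_queries)
print(instagram_queries)
```

Splitting a multi-keyword query into single hashtags is one possible way to avoid the empty result sets mentioned for Instagram; other strategies (e.g., concatenated hashtags) would fit the same interface.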

Page 5: News-oriented multimedia search over multiple social networks

Content requirements


• High relevance to the topic of interest

• High quality of multimedia

• Diversity of retrieved content

• Usefulness with respect to reporting and publication

Page 6: News-oriented multimedia search over multiple social networks

Related work

• Optimization of query formulation methods utilizing terms, proximities and phrases with respect to their frequency and text position
– Markov random field models (Metzler et al., 2005)
– Positional language models (Lv et al., 2009)
– Query operations (Mishne et al., 2005)

• Improve query formulation by modelling query concepts
– Learning concept importance (Bendersky et al., 2010)
– Latent concept expansion using Markov random fields (Metzler et al., 2007)

Page 7: News-oriented multimedia search over multiple social networks

Goals and Contributions

• A novel graph-based query formulation method
– Tailored to the special characteristics of each OSN
– Captures the primary entities and their associations
– Builds numerous queries by greedy graph traversal

• A relevance classification method
– 12 features based on content (text, visual) and context (popularity, publication time)

• Evaluation of the framework on real-world events and stories

Page 8: News-oriented multimedia search over multiple social networks

Overview of the Framework


Page 9: News-oriented multimedia search over multiple social networks

Step I: Collection of highly relevant content

• Query six OSNs with a high-precision query q0 to build an initial collection M0
– news story headline
– official name of the event

• Lower the possibility of noisy content by
– discarding all retrieved material dated before the story broke

• Only some OSNs were found to contribute to the collection: Twitter, Flickr, Google+
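
The noise filter in Step I can be sketched as a simple timestamp comparison. The item layout (a dict with a `published` field), the helper name `drop_items_before` and the example dates are all made up for illustration; the slides do not describe the actual data model.

```python
from datetime import datetime, timezone


def drop_items_before(items, story_start):
    """Keep only items whose timestamp is at or after the moment the story broke."""
    return [it for it in items if it["published"] >= story_start]


# Hypothetical start time and items, for illustration only
story_start = datetime(2014, 3, 2, tzinfo=timezone.utc)
m0 = [
    {"id": "a", "published": datetime(2014, 3, 1, tzinfo=timezone.utc)},
    {"id": "b", "published": datetime(2014, 3, 3, tzinfo=timezone.utc)},
]
kept = drop_items_before(m0, story_start)
print([it["id"] for it in kept])
```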


Page 10: News-oriented multimedia search over multiple social networks

Step II: Keyword and hashtag extraction

• Extract the Named Entities from the M0 metadata

• Discard all stop-words and filter out HTML tags, web links and social network account names

• Perform stemming for keywords that are not listed as Named Entities to group keywords with similar meaning

Create a list of keywords and a list of hashtags, each associated with a frequency count
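
A minimal sketch of the filtering and counting part of this step, using only the Python standard library. Named Entity extraction and stemming are deliberately left out, and the stop-word list, the regular expressions and the function name `extract_terms` are illustrative assumptions rather than the authors' implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "an", "of", "at", "in", "on", "and", "for", "to", "is"}


def extract_terms(texts):
    """Return (keyword counts, hashtag counts) for a list of metadata strings."""
    keywords, hashtags = Counter(), Counter()
    for text in texts:
        text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
        text = re.sub(r"https?://\S+", " ", text)  # strip web links
        text = re.sub(r"@\w+", " ", text)          # strip account names
        hashtags.update(h.lower() for h in re.findall(r"#(\w+)", text))
        text = re.sub(r"#\w+", " ", text)          # hashtags are counted separately
        words = re.findall(r"[a-z]+", text.lower())
        keywords.update(w for w in words if w not in STOP_WORDS)
    return keywords, hashtags


texts = [
    "<b>Red carpet</b> at the Oscars http://example.com #oscars",
    "@someone the red carpet is ready #Oscars #redcarpet",
]
keywords, hashtags = extract_terms(texts)
print(keywords.most_common(3))
print(hashtags.most_common(3))
```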


Page 11: News-oriented multimedia search over multiple social networks

Step III: Graph construction

• Vertices: the set of selected keywords

• Edges: their pairwise adjacency relations
– adjacency is computed with respect to the text metadata

• Each edge is weighted by the frequency of appearance of the phrase composed of the edge keywords

• Only significant keywords are considered: keywords with greater frequency than the average
– elimination of noisy keywords
– cost-effectiveness
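
The construction above can be sketched as follows, assuming each metadata text has already been reduced to a token list. Treating "adjacency" as two keywords appearing consecutively, and making edges directed, are assumptions; the helper name `build_keyword_graph` is hypothetical.

```python
from collections import Counter


def build_keyword_graph(token_lists):
    """Return (vertices, edges): above-average keywords and weighted directed adjacency."""
    freq = Counter(tok for toks in token_lists for tok in toks)
    if not freq:
        return set(), Counter()
    mean = sum(freq.values()) / len(freq)
    vertices = {kw for kw, n in freq.items() if n > mean}  # significant keywords only
    edges = Counter()
    for toks in token_lists:
        for left, right in zip(toks, toks[1:]):            # consecutive token pairs
            if left in vertices and right in vertices:
                edges[(left, right)] += 1                  # frequency of the two-word phrase
    return vertices, edges


token_lists = [
    ["academy", "awards", "red", "carpet"],
    ["academy", "awards", "winner"],
    ["academy", "awards", "ceremony"],
]
vertices, edges = build_keyword_graph(token_lists)
print(vertices, dict(edges))
```

In this toy input only "academy" and "awards" exceed the mean frequency, so the pruned graph has a single edge of weight 3.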


Page 12: News-oriented multimedia search over multiple social networks

Step IV: Query building

• Query: a path from a starting node to an end node, given a maximum number of L hops

• Starting node: high out-degree or connected to heavily weighted edges

• Total score for a node

• Penalize queries with high text similarity (Jaccard coefficient)
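
A hedged sketch of greedy query building over a weighted directed keyword graph. The node score used on the slide is not available in this transcript, so the traversal below simply follows the heaviest outgoing edge; likewise, the "penalty" is simplified to a hard Jaccard threshold. The names `greedy_path`, `build_queries` and `max_similarity` are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient between two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_path(edges, start, max_hops):
    """Follow the heaviest outgoing edge to an unvisited node, up to max_hops."""
    path = [start]
    for _ in range(max_hops):
        options = [(w, v) for (u, v), w in edges.items()
                   if u == path[-1] and v not in path]
        if not options:
            break
        path.append(max(options)[1])
    return path


def build_queries(edges, max_hops=2, max_similarity=0.5):
    # Rank candidate start nodes by out-degree, then by total outgoing weight
    out_deg, out_weight = {}, {}
    for (u, _), w in edges.items():
        out_deg[u] = out_deg.get(u, 0) + 1
        out_weight[u] = out_weight.get(u, 0) + w
    starts = sorted(out_deg, key=lambda n: (out_deg[n], out_weight[n]), reverse=True)
    queries = []
    for s in starts:
        q = greedy_path(edges, s, max_hops)
        # Simplification: drop a query that is too similar to one already kept
        if all(jaccard(q, other) <= max_similarity for other in queries):
            queries.append(q)
    return queries


edges = {
    ("academy", "awards"): 5,
    ("academy", "museum"): 1,
    ("awards", "ceremony"): 3,
    ("red", "carpet"): 4,
}
queries = build_queries(edges)
print(queries)
```

Here the path starting at "awards" is discarded because it shares two of three keywords with the path starting at "academy" (Jaccard coefficient 2/3).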


Page 13: News-oriented multimedia search over multiple social networks

Example: 86th Academy Awards


Page 14: News-oriented multimedia search over multiple social networks

Step V: Relevance classification

Textual relevance is computed with respect to the high-precision query q0, using

• title & description

• tags

Feature groups: popularity, textual relevance, visual similarity, temporal proximity to the story, image dimensions
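
The slides do not state how textual relevance is measured. As one plausible stand-in, the sketch below scores a text field by the fraction of q0 terms it contains; the function name `textual_relevance`, the term-overlap measure and the example item are assumptions.

```python
def textual_relevance(q0: str, text: str) -> float:
    """Fraction of q0 terms that also appear in the given text (assumed measure)."""
    q_terms = set(q0.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


q0 = "86th Academy Awards"
# Hypothetical item metadata
item = {"title": "Red carpet at the 86th Academy Awards", "tags": "oscars redcarpet"}
title_score = textual_relevance(q0, item["title"])
tags_score = textual_relevance(q0, item["tags"])
print(title_score, tags_score)
```

Computing the score separately for title/description and for tags yields two features, so a tag set that never repeats the official event name does not hide a highly relevant title.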

Page 15: News-oriented multimedia search over multiple social networks

Evaluation

• Choose 20 events and news stories which took place up to five months before data collection
– the older the event, the more content disappears from the OSNs

• Choose events with considerable size and variety

• Set the maximum number of keyword-based queries to Mmax=20 and the maximum number of hashtag-based queries to Mmax=10

Page 16: News-oriented multimedia search over multiple social networks

Data statistics

• More than 88K images for all 20 events and news stories

• ~4.4K images per event/news story on average

• Events are associated on average with more images (5.5K) than news stories (3.3K)

Number of images collected during the first querying step

Number of images collected during the second querying step

Page 17: News-oriented multimedia search over multiple social networks

Media volume per OSN

• Flickr contributes the most (66.9%), with Twitter following (19%)

• Instagram and Google+ contribute less, but still a considerable amount

• Tumblr and Facebook contribute the least content
– Tumblr has significantly lower usage
– Facebook has very poor search API behaviour

• Increase between the two retrieval steps
– Facebook, Flickr, Tumblr: 5x
– Google+, Instagram: even higher (8.1x and 6.8x)
– Twitter: 3x

Page 18: News-oriented multimedia search over multiple social networks

Quality of formulated queries

• Evaluate the relevance and quality of the content retrieved in the second step (Mext)
– A large majority (90%) of the images retrieved in the first step (M0) were relevant
– Four human annotators

• Relevance is high (>50%) for 3 events

• Relevance is decent (>40%) for 3 news stories

• Half of the events and news stories are characterized by low-to-medium relevance (10% - 40%)

• Relevance is very low (<10%) for two events and two news stories

Page 19: News-oriented multimedia search over multiple social networks

Why is irrelevant content collected?

• Vague keyword-based queries or hashtags
– Example: British Academy Film Awards, most popular hashtag: british
– Example: Sundance Film Festival, vague query: film festival

• False keyword-based queries
– They contain keywords irrelevant to the subject
– They are left-overs from the graph pruning and should have been eliminated

Page 20: News-oriented multimedia search over multiple social networks

Relevance classification

DT: Decision Tree, RF: Random Forest, SVM: Support Vector Machine, MP: Multilayer Perceptron

Page 21: News-oriented multimedia search over multiple social networks

Relevance classification

• RF outperforms the rest in all cases

• DT is also very good

• SVM has the worst performance
– Input features are not normalized
– A few of them are quantized to a small set of possible values
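
SVMs are sensitive to feature scale, whereas tree-based models such as DT and RF are largely unaffected by it, which is consistent with the ranking above. The sketch below shows min-max scaling as one common normalization; the slides do not say whether any normalization was tried, and the feature values and the helper name `min_max_scale` are made up for illustration.

```python
def min_max_scale(rows):
    """Rescale every feature column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(x - lo) / sp if sp else 0.0 for x, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


# Hypothetical features: [popularity count, textual relevance, image width in px]
features = [[12000, 0.9, 1024], [3, 0.1, 640], [450, 0.5, 2048]]
scaled = min_max_scale(features)
for row in scaled:
    print([round(v, 3) for v in row])
```

Without such rescaling, a raw popularity count in the thousands would dominate a relevance score bounded by 1 in any distance- or kernel-based computation.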

Page 22: News-oriented multimedia search over multiple social networks

Conclusion - Contributions

• Searching for multimedia content around events and news stories over multiple OSNs is challenging!
– Collect high-quality relevant content in spite of the different behaviors and requirements of the OSNs

• We proposed a multi-step process including
– a graph-based query building method
– a relevance classification step

• We evaluated the framework on a set of 20 large-scale events and news stories of global interest


Page 23: News-oriented multimedia search over multiple social networks

Future Work

• Improve the performance of the query building method when the number of collected items in the first step is small

• Extract statistically grounded relevance features
– Take into account distribution differences in different OSNs

• Apply the method while the event evolves

• Add support for the collection of video content


Page 24: News-oriented multimedia search over multiple social networks

Thank you!

• Slides: http://www.slideshare.net/sympapadopoulos/newsoriented-multimedia-search-over-multiple-social-networks

• Get in touch
– @matzika00 / [email protected]
– @sympapadopoulos / [email protected]
