Flickr Tag Analysis
Ahmet Iscen

Flickr Tag Analysis. Ahmet Iscen. Outline: Social Media; What is Flickr?; Flickr Photos; Association Rule; Latent Semantic Analysis; Latent Dirichlet Allocation.

Dec 17, 2015


Cori Ryan
Transcript
Page 1: Flickr Tag Analysis

Ahmet Iscen

Page 2: Outline

- Social Media
- What is Flickr?
- Flickr Photos
- Association Rule
- Latent Semantic Analysis
- Latent Dirichlet Allocation
- Conclusions

Page 3: Social Media

- An important part of our daily lives today
- If Twitter were a country, it would be the 12th largest in the world
- Two new members sign up to LinkedIn every second

Page 4: What is Flickr?

- Image and video hosting
- Acquired by Yahoo! in 2005
- 51 million registered members and 80 million unique visitors as of June 2011
- 6 million photos
- Widely used by researchers

Page 5: Flickr

Page 6: Dataset

- Xirong Li's Flickr-3.5M Dataset
- 3,500,000 images
- 570,000 unique tags
- 270,000 unique user-ids
- Randomly selected 250,000 images with their tags

http://staff.science.uva.nl/~xirong/index.php?n=DataSet.Flickr3m

Page 7: Challenges

- Tags depend entirely on the user
- Can be extremely noisy
- Huge range of possible words
- Examples:
    milos tasic milosevrodjendan verjaardagmilos
    desember 2005
    tmo

Page 8: Preprocessing

- Eliminate stopwords (a, for, the, etc.)
- Eliminate extreme words (those that appear in fewer than 20 photos or in more than 80% of the photos)
- Porter Stemmer (only for association rules)
- Convert everything to lowercase
- Eliminate tags with fewer than 2 letters or more than 20 letters
- Eliminate numerical tags
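The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the stopword set is a tiny hypothetical subset, the function names `clean_tag` and `preprocess` are made up for this example, and the Porter stemming step is left out because it needs a third-party stemmer.

```python
from collections import Counter

# Illustrative subset only; the stopword list actually used is not given in the slides.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def clean_tag(tag, min_len=2, max_len=20):
    """Lowercase a tag; return None if it is a stopword, numeric, or too short/long."""
    tag = tag.lower()
    if tag in STOPWORDS or tag.isdigit():
        return None
    if not (min_len <= len(tag) <= max_len):
        return None
    return tag


def preprocess(photos, min_photos=20, max_fraction=0.8):
    """Clean every photo's tags, then drop tags that are too rare or too common."""
    cleaned = []
    for tags in photos:
        kept = {t for t in (clean_tag(tag) for tag in tags) if t is not None}
        cleaned.append(kept)

    # Document frequency: in how many photos does each tag appear?
    doc_freq = Counter(tag for tags in cleaned for tag in tags)
    limit = max_fraction * len(photos)
    vocabulary = {tag for tag, n in doc_freq.items() if min_photos <= n <= limit}

    return [sorted(tags & vocabulary) for tags in cleaned]


# Toy data with a lowered threshold so the effect is visible on four photos.
photos = [["The", "Paris", "2005", "x"], ["paris", "france"],
          ["PARIS", "eiffel"], ["france", "a"]]
print(preprocess(photos, min_photos=2, max_fraction=0.8))
```

With the thresholds from the slide (`min_photos=20`, `max_fraction=0.8`) the same function would apply to the 250,000-photo sample.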

Page 9: Association Rules Mining

RapidMiner

[york] --> [new] (confidence: 0.910) Support: 0.04

[geolat, geolon] --> [geotag] (confidence: 0.986) Support: 0.03

[hors, lotharlez] --> [caballo, cheval, hestur] (confidence: 0.846) Support: 0.03

[paard] --> [hors, lotharlenz, zirg] (confidence: 0.802) Support: 0.03

[hors, paard] --> [lotharlenz, zirg] (confidence: 0.802) Support: 0.03
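For readers unfamiliar with the two numbers attached to each rule, here is a small sketch of how support and confidence are usually defined. It assumes the common convention that a rule's support is the fraction of photos containing both sides of the rule; the slides do not say which convention RapidMiner was configured with, and the toy tag sets are invented.

```python
def support(transactions, itemset):
    """Fraction of transactions (photos) that contain every tag in itemset."""
    itemset = set(itemset)
    hits = sum(1 for tags in transactions if itemset <= set(tags))
    return hits / len(transactions)


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent --> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    confidence = both / ante if ante else 0.0
    return both, confidence


# Invented toy data: one set of tags per photo.
photos = [
    {"new", "york", "nyc"},
    {"new", "york"},
    {"york", "england"},
    {"paris", "france"},
]
sup, conf = rule_stats(photos, ["york"], ["new"])
print("[york] --> [new] (confidence: %.3f) Support: %.2f" % (conf, sup))
```

Confidence is the support of the whole rule divided by the support of its left-hand side, which is why a rule such as [york] --> [new] can have high confidence even when its support is low.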

Page 10: Association Rules Mining

- Poor results
- Probably due to noise and variance in the data
- Takes too much time to process the words and find rules
- Need to find alternative methods

Page 11: Latent Semantic Analysis

- Same as LSI (the name used in the IR field)
- SVD on the document-term matrix to reduce dimensionality
- Words are compared by taking the cosine of the angle between the vectors of any two terms in the reduced space
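The two steps above can be written out as follows. The notation is an assumption made for this note, not taken from the slides: X is the photos-by-tags (document-term) matrix, k is the number of topics kept, e_j is the j-th standard basis vector, and \hat{t}_j is the k-dimensional vector for tag j.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{t}_j = \Sigma_k V_k^{\top} e_j,
\qquad
\cos\theta_{ij} = \frac{\hat{t}_i \cdot \hat{t}_j}{\lVert \hat{t}_i \rVert \, \lVert \hat{t}_j \rVert}
```

Only the k largest singular values are kept, so tags that co-occur with similar sets of other tags end up with nearly parallel vectors and a cosine close to 1.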

Page 12: Implementation

- Gensim: topic modeling toolkit
- Python
- Tested different corpus and topic sizes

Page 13: Latent Semantic Analysis

250,000 photos, 20 topics

topic #0: 0.997*"wedding" + 0.047*"family" + 0.023*"friends" + 0.022*"party" + 0.019*"reception" + 0.013*"california" + 0.011*"ceremony" + 0.009*"india" + 0.008*"church" + 0.008*"sanfrancisco"

topic #11: 0.491*"newyork" + -0.463*"china" + 0.448*"nyc" + -0.233*"beach" + 0.174*"newyorkcity" + 0.146*"italy" + -0.132*"friends" + -0.123*"flowers" + 0.119*"new" + -0.117*"beijing"

topic #4: 0.586*"paris" + -0.524*"family" + 0.417*"france" + 0.186*"london" + 0.178*"party" + -0.169*"halloween" + 0.156*"europe" + -0.121*"japan" + 0.103*"travel" + 0.063*"birthday"

topic #1: 0.701*"halloween" + 0.588*"party" + 0.169*"friends" + 0.165*"family" + 0.157*"birthday" + 0.126*"japan" + 0.071*"christmas" + 0.059*"london" + 0.058*"travel" + 0.055*"beach"

Page 14: Latent Semantic Analysis

250,000 photos, 50 topics

topic #10: -0.655*"friends" + 0.633*"china" + 0.221*"travel" + 0.166*"beijing" + 0.136*"party" + -0.088*"beach" + 0.075*"vacation" + 0.071*"greatwall" + 0.070*"shanghai" + -0.066*"flowers"

topic #28: -0.580*"india" + -0.323*"trip" + 0.279*"nature" + 0.262*"snow" + -0.258*"dog" + -0.224*"sunset" + 0.200*"winter" .

topic #20: -0.527*"cat" + 0.511*"sunset" + 0.266*"sky" + -0.242*"california" + -0.209*"sanfrancisco" + 0.198*"clouds" + -0.167*"beach" + -0.156*"flower" + -0.149*"cats" + -0.132*"dog"

topic #17: -0.323*"california" + -0.272*"sanfrancisco" + 0.269*"cat" + 0.254*"horse" + 0.211*"pferd" + 0.207*"cheval" + 0.205*"caballo" + 0.205*"paard" + 0.204*"hest" + 0.204*"cavalo"

Page 15: Latent Semantic Analysis

250,000 photos, 100 topics

topic #29: 0.689*"australia" + 0.279*"sydney" + -0.233*"nature" + 0.220*"trip" + -0.209*"france" + -0.187*"india" + -0.175*"snow" + 0.157*"new" + 0.144*"paris" + -0.134*"winter"

topic #58: 0.401*"geotagged" + 0.385*"geolat" + 0.380*"geolon" + -0.261*"people" + 0.259*"day" + 0.198*"england" + 0.191*"newzealand" + -0.178*"canada" + 0.168*"water" + -0.144*"portrait".

topic #45: 0.406*"fall" + 0.398*"park" + 0.315*"october" + -0.291*"animals" + 0.289*"autumn" + -0.262*"art" + 0.182*"leaves" + -0.175*"zoo" + -0.163*"sky" + 0.132*"garden"

topic #85: -0.673*"hongkong" + 0.221*"florida" + 0.221*"singapore" + 0.209*"winter" + 0.174*"museum" + -0.170*"boston" + -0.165*"scotland" + -0.153*"prague" + 0.153*"cats" + -0.136*"island"

Page 16: Latent Semantic Analysis

- Notice the negative weights
- Hard to interpret
- LSA is not a probabilistic method, so the weights cannot be read as probabilities

Page 17: Latent Dirichlet Allocation

- Expectation-Maximization
- Each document is a mixture of topics
- Find the posterior for topics in the E-step:
    p(topic t | document d)
- Then update the assignment of the current word in the M-step:
    p(word w | topic t)
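To make the E-step and M-step concrete, the sketch below runs EM on a simplified topic mixture model. Note the substitution: this is plain pLSA-style EM without Dirichlet priors, chosen because it fits in a few lines; it is not LDA proper and not what Gensim runs internally (Gensim's `LdaModel` uses online variational Bayes). The function name `plsa_em`, the smoothing constant, and the toy corpus are all assumptions made for illustration.

```python
import random


def normalize(values):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa_em(docs, num_topics, iterations=50, seed=0, smoothing=1e-9):
    """Estimate p(topic | document) and p(word | topic) with EM."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})

    # Random positive starting points for both distributions.
    p_topic_doc = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
                   for _ in docs]
    p_word_topic = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
                    for _ in range(num_topics)]

    for _ in range(iterations):
        # Expected counts; the tiny smoothing value keeps every probability positive.
        doc_counts = [[smoothing] * num_topics for _ in docs]
        word_counts = [dict.fromkeys(vocab, smoothing) for _ in range(num_topics)]

        for d, doc in enumerate(docs):
            for word in doc:
                # E-step: posterior over topics for this word in this document.
                posterior = normalize([p_topic_doc[d][t] * p_word_topic[t][word]
                                       for t in range(num_topics)])
                for t, resp in enumerate(posterior):
                    doc_counts[d][t] += resp
                    word_counts[t][word] += resp

        # M-step: re-estimate both distributions from the expected counts.
        p_topic_doc = [normalize(row) for row in doc_counts]
        p_word_topic = [dict(zip(vocab, normalize([counts[w] for w in vocab])))
                        for counts in word_counts]

    return p_topic_doc, p_word_topic


# Invented toy corpus: one list of tags per photo.
photos = [["paris", "france", "travel"], ["paris", "france"],
          ["party", "friends", "halloween"], ["party", "friends"]]
p_topic_doc, p_word_topic = plsa_em(photos, num_topics=2)
for t, dist in enumerate(p_word_topic):
    top = sorted(dist.items(), key=lambda item: -item[1])[:3]
    print("topic #%d: %s" % (t, " + ".join("%.3f*%s" % (p, w) for w, p in top)))
```

Because both outputs are probability distributions, every weight is non-negative and each topic's weights sum to 1, which is what makes the LDA topic listings easier to read than the signed LSA weights.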

Page 18: Latent Dirichlet Allocation

250,000 photos, 20 topics

topic #13: 0.088*party + 0.072*halloween + 0.027*lake + 0.024*boat + 0.022*home + 0.019*park + 0.018*river + 0.016*ice + 0.015*spring + 0.014*birthday

topic #3: 0.046*trip + 0.044*vacation + 0.044*sanfrancisco + 0.040*california + 0.026*road + 0.024*cats + 0.018*school + 0.018*cruise + 0.014*ca + 0.014*old

topic #8: 0.051*paris + 0.042*france + 0.027*july + 0.027*4th + 0.025*music + 0.022*car + 0.021*rock + 0.020*dogs + 0.020*concert + 0.016*geotagged

Page 19: Latent Dirichlet Allocation

250,000 photos, 50 topics

topic #7: 0.111*sunset + 0.108*beach + 0.089*holiday + 0.047*fun + 0.029*smile + 0.028*forest + 0.023*rose + 0.020*wood + 0.019*disneyland + 0.019*costarica

topic #14: 0.141*vacation + 0.046*san + 0.037*francisco + 0.034*sports + 0.020*hockey + 0.020*top + 0.019*cake + 0.014*cafe + 0.013*biking + 0.013*ruins

topic #23: 0.112*trip + 0.070*bridge + 0.057*road + 0.048*blue + 0.048*building + 0.042*film + 0.035*orange + 0.022*university + 0.021*telephone + 0.018*sky

topic #29: 0.124*party + 0.110*friends + 0.085*christmas + 0.045*rock + 0.038*lake + 0.038*ireland + 0.031*castle + 0.026*africa + 0.025*live + 0.025*music

Page 20: Latent Dirichlet Allocation

250,000 photos, 100 topics

topic #10: 0.109*hawaii + 0.093*island + 0.060*la + 0.030*photoshop + 0.027*walk + 0.026*hdr + 0.024*maui + 0.023*us + 0.019*fountain + 0.018*beach

topic #24: 0.172*house + 0.106*architecture + 0.077*festival + 0.068*airplane + 0.038*flying + 0.029*flight + 0.026*air + 0.025*aircraft + 0.021*aviation + 0.020*airshow

topic #34: 0.231*vacation + 0.159*trip + 0.136*lake + 0.095*florida + 0.088*birds + 0.062*san + 0.051*francisco + 0.015*yellowstone + 0.015*kayak + 0.015*maltay

topic #70: 0.114*november + 0.074*thanksgiving + 0.050*soccer + 0.048*polarbear + 0.048*ski + 0.041*basketball + 0.035*safari + 0.034*bear + 0.023*wien + 0.021*flood

Page 21: Conclusions

- LSA and LDA are more useful for analyzing tags than Association Rule Mining
- There is no “best” number of topics
- Human interpretation might still be required

Page 22: Future Work

- Increase the corpus size to 1,000,000 documents
- Analyze Flickr groups as well