Flickr Tag Analysis
Ahmet Iscen
Outline
Social Media
What is Flickr?
Flickr Photos
Association Rule
Latent Semantic Analysis
Latent Dirichlet Allocation
Conclusions
Social Media
An important part of our daily lives today
If Twitter were a country, it would be the 12th largest in the world
Two new members sign up to LinkedIn every second
What is Flickr?
Image and video hosting
Acquired by Yahoo! in 2005
51 million registered members and 80 million unique visitors as of June 2011
6 billion photos
Widely used by researchers
Dataset
Xirong Li's Flickr-3.5M Dataset
3,500,000 images
570,000 unique tags
270,000 unique user-ids
Randomly selected 250,000 images with their tags
http://staff.science.uva.nl/~xirong/index.php?n=DataSet.Flickr3m
Challenges
Tags totally depend on the user
Can be extremely noisy
Huge range of possible words
Examples:
milos tasic milosevrodjendan verjaardagmilos
desember 2005
tmo
Preprocessing
Eliminate stopwords (a, for, the, etc.)
Eliminate extreme words (those that appear in fewer than 20 photos or in more than 80% of the photos)
Porter Stemmer (only for association rule mining)
Convert everything to lowercase
Eliminate tags with fewer than 2 letters or more than 20 letters
Eliminate numerical tags
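The filtering steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code used for the slides: the stopword list is a short hypothetical stand-in, the function names `clean_tags` and `filter_extremes` are invented for this example, and Porter stemming is left out.

```python
from collections import Counter

# Hypothetical, abbreviated stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_tags(tags):
    """Lowercase tags, then drop stopwords, numeric tags, and too short/long tags."""
    kept = []
    for tag in tags:
        tag = tag.lower()
        if tag in STOPWORDS:
            continue
        if tag.isdigit():                    # numerical tags such as "2005"
            continue
        if len(tag) < 2 or len(tag) > 20:    # fewer than 2 or more than 20 letters
            continue
        kept.append(tag)
    return kept

def filter_extremes(photos, min_photos=20, max_fraction=0.8):
    """Drop tags used in fewer than min_photos photos or in more than max_fraction of them."""
    # Count each tag once per photo (document frequency).
    doc_freq = Counter(tag for tags in photos for tag in set(tags))
    limit = max_fraction * len(photos)
    keep = {tag for tag, n in doc_freq.items() if min_photos <= n <= limit}
    return [[tag for tag in tags if tag in keep] for tags in photos]
```

A typical use would be `filter_extremes([clean_tags(p) for p in photos])`, where `photos` is a list of tag lists, one per image. The defaults mirror the thresholds on the slide (20 photos, 80%).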
Association Rule Mining
RapidMiner
[york] --> [new] (confidence: 0.910) Support: 0.04
[geolat, geolon] --> [geotag] (confidence: 0.986) Support: 0.03
[hors, lotharlenz] --> [caballo, cheval, hestur] (confidence: 0.846) Support: 0.03
[paard] --> [hors, lotharlenz, zirg] (confidence: 0.802) Support: 0.03
[hors, paard] --> [lotharlenz, zirg] (confidence: 0.802) Support: 0.03
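For readers unfamiliar with the two numbers attached to each rule, the sketch below shows the standard definitions of support and confidence on a toy tag set. It assumes the slide's "Support" refers to the fraction of photos containing every tag in the rule, which the slides do not state; the function names and the toy data are made up for this example and say nothing about how RapidMiner computes its output internally.

```python
def support(photos, items):
    """Fraction of photos whose tag set contains every tag in items."""
    items = set(items)
    hits = sum(1 for tags in photos if items <= set(tags))
    return hits / len(photos)

def confidence(photos, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both together) / support(antecedent)."""
    base = support(photos, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(photos, both) / base

# Toy data: five photos with their tags (hypothetical).
toy_photos = [
    ["new", "york", "nyc"],
    ["new", "york"],
    ["new", "zealand"],
    ["york", "england"],
    ["beach"],
]
rule_support = support(toy_photos, ["york", "new"])          # 2 of 5 photos
rule_confidence = confidence(toy_photos, ["york"], ["new"])  # 2 of the 3 "york" photos
```

A rule such as [york] --> [new] with confidence 0.910 would then mean that about 91% of the photos tagged "york" are also tagged "new".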
Association Rule Mining
Poor results.
Probably due to noise and variance in the data.
Takes too much time to process the words and find rules.
Need to find alternative methods.
Latent Semantic Analysis
Same as Latent Semantic Indexing (LSI), the name used in the information retrieval field
SVD on the document-term matrix to reduce dimensionality
Words are compared by taking the cosine of the angle between the reduced vectors of any two words
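The two steps above (SVD, then cosine comparison) can be sketched with NumPy on a toy example. This is an assumption-laden illustration rather than the pipeline behind the slides: the four photos, the binary weighting, the choice of k = 2, and the helper names `cosine` and `tag_similarity` are all invented here.

```python
import numpy as np

# Toy data: each photo is a list of tags (hypothetical).
photos = [
    ["paris", "france", "travel"],
    ["paris", "france", "europe"],
    ["cat", "dog", "pets"],
    ["cat", "pets"],
]
vocab = sorted({t for tags in photos for t in tags})
col = {t: j for j, t in enumerate(vocab)}

# Document-term matrix: one row per photo, one column per tag.
X = np.zeros((len(photos), len(vocab)))
for i, tags in enumerate(photos):
    for t in tags:
        X[i, col[t]] = 1.0

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tag_vectors = Vt[:k].T * s[:k]   # one k-dimensional row per tag

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tag_similarity(t1, t2):
    """Compare two tags in the reduced k-dimensional space."""
    return cosine(tag_vectors[col[t1]], tag_vectors[col[t2]])
```

On this toy data, `tag_similarity("paris", "france")` comes out close to 1 and `tag_similarity("paris", "cat")` close to 0, because the two groups of photos share no tags. The sign of each singular vector is arbitrary, which is one reason LSA topic listings mix positive and negative weights.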
Latent Semantic Analysis
250000 photos, 20 topics
topic #0: 0.997*"wedding" + 0.047*"family" + 0.023*"friends" + 0.022*"party" + 0.019*"reception" + 0.013*"california" + 0.011*"ceremony" + 0.009*"india" + 0.008*"church" + 0.008*"sanfrancisco"
topic #11: 0.491*"newyork" + -0.463*"china" + 0.448*"nyc" + -0.233*"beach" + 0.174*"newyorkcity" + 0.146*"italy" + -0.132*"friends" + -0.123*"flowers" + 0.119*"new" + -0.117*"beijing"
topic #4: 0.586*"paris" + -0.524*"family" + 0.417*"france" + 0.186*"london" + 0.178*"party" + -0.169*"halloween" + 0.156*"europe" + -0.121*"japan" + 0.103*"travel" + 0.063*"birthday"
topic #1: 0.701*"halloween" + 0.588*"party" + 0.169*"friends" + 0.165*"family" + 0.157*"birthday" + 0.126*"japan" + 0.071*"christmas" + 0.059*"london" + 0.058*"travel" + 0.055*"beach"
Latent Semantic Analysis
250000 photos, 50 topics
topic #10: -0.655*"friends" + 0.633*"china" + 0.221*"travel" + 0.166*"beijing" + 0.136*"party" + -0.088*"beach" + 0.075*"vacation" + 0.071*"greatwall" + 0.070*"shanghai" + -0.066*"flowers"
topic #28: -0.580*"india" + -0.323*"trip" + 0.279*"nature" + 0.262*"snow" + -0.258*"dog" + -0.224*"sunset" + 0.200*"winter" .
topic #20: -0.527*"cat" + 0.511*"sunset" + 0.266*"sky" + -0.242*"california" + -0.209*"sanfrancisco" + 0.198*"clouds" + -0.167*"beach" + -0.156*"flower" + -0.149*"cats" + -0.132*"dog"
topic #17: -0.323*"california" + -0.272*"sanfrancisco" + 0.269*"cat" + 0.254*"horse" + 0.211*"pferd" + 0.207*"cheval" + 0.205*"caballo" + 0.205*"paard" + 0.204*"hest" + 0.204*"cavalo"
Latent Semantic Analysis
250000 photos, 100 topics
topic #29: 0.689*"australia" + 0.279*"sydney" + -0.233*"nature" + 0.220*"trip" + -0.209*"france" + -0.187*"india" + -0.175*"snow" + 0.157*"new" + 0.144*"paris" + -0.134*"winter"
topic #58: 0.401*"geotagged" + 0.385*"geolat" + 0.380*"geolon" + -0.261*"people" + 0.259*"day" + 0.198*"england" + 0.191*"newzealand" + -0.178*"canada" + 0.168*"water" + -0.144*"portrait"
topic #45: 0.406*"fall" + 0.398*"park" + 0.315*"october" + -0.291*"animals" + 0.289*"autumn" + -0.262*"art" + 0.182*"leaves" + -0.175*"zoo" + -0.163*"sky" + 0.132*"garden"
topic #85: -0.673*"hongkong" + 0.221*"florida" + 0.221*"singapore" + 0.209*"winter" + 0.174*"museum" + -0.170*"boston" + -0.165*"scotland" + -0.153*"prague" + 0.153*"cats" + -0.136*"island"
Latent Semantic Analysis
Notice the negative weights.
They make the topics hard to interpret.
LSA is not a probabilistic method, so the weights are not probabilities.
Latent Dirichlet Allocation
Expectation-Maximization
Each document is a mixture of topics
E-step: find the posterior for topics, p(topic t | document d)
M-step: update the assignment of the current word, p(word w | topic t)
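To make the E-step / M-step alternation concrete, here is a minimal pure-Python sketch. Note the substitution: it implements EM for PLSA, a simpler non-Bayesian relative of LDA, because it exposes the same two distributions, p(topic t | document d) and p(word w | topic t), in very little code. Full LDA adds Dirichlet priors and is usually fit with variational EM or Gibbs sampling, and the slides do not say which implementation was used. The function names, the smoothing constant, and the default settings are assumptions made for this example.

```python
import random

def plsa(docs, n_topics, n_iter=50, seed=0):
    """EM for a PLSA-style topic model (a simplified stand-in for LDA)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    topics = range(n_topics)

    def normalize(weights):
        total = sum(weights.values())
        return {key: value / total for key, value in weights.items()}

    # Random initial p(word | topic) and p(topic | document).
    p_w_t = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in topics]
    p_t_d = [normalize({t: rng.random() + 0.1 for t in topics}) for _ in docs]

    for _ in range(n_iter):
        # Expected counts, with a tiny constant so nothing becomes exactly zero.
        word_counts = [{w: 1e-12 for w in vocab} for _ in topics]
        topic_counts = [{t: 1e-12 for t in topics} for _ in docs]
        for d, words in enumerate(docs):
            for w in words:
                # E-step: posterior over topics for this word occurrence.
                post = normalize({t: p_t_d[d][t] * p_w_t[t][w] for t in topics})
                for t, r in post.items():
                    word_counts[t][w] += r
                    topic_counts[d][t] += r
        # M-step: re-estimate both distributions from the expected counts.
        p_w_t = [normalize(c) for c in word_counts]
        p_t_d = [normalize(c) for c in topic_counts]
    return p_w_t, p_t_d

def top_words(p_w_t, n=3):
    """Most probable n words per topic, highest probability first."""
    return [sorted(dist, key=dist.get, reverse=True)[:n] for dist in p_w_t]
```

Calling `plsa(docs, n_topics=2)` on a list of tag lists returns one word distribution per topic and one topic distribution per photo; `top_words` then gives a listing comparable in spirit to the topic lines on the following slides. Unlike LSA, every weight here is a probability, so none is negative.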
Latent Dirichlet Allocation
250000 photos, 20 topics
topic #13: 0.088*party + 0.072*halloween + 0.027*lake + 0.024*boat + 0.022*home + 0.019*park + 0.018*river + 0.016*ice + 0.015*spring + 0.014*birthday
topic #3: 0.046*trip + 0.044*vacation + 0.044*sanfrancisco + 0.040*california + 0.026*road + 0.024*cats + 0.018*school + 0.018*cruise + 0.014*ca + 0.014*old
topic #8: 0.051*paris + 0.042*france + 0.027*july + 0.027*4th + 0.025*music + 0.022*car + 0.021*rock + 0.020*dogs + 0.020*concert + 0.016*geotagged
Latent Dirichlet Allocation
250000 photos, 50 topics
topic #7: 0.111*sunset + 0.108*beach + 0.089*holiday + 0.047*fun + 0.029*smile + 0.028*forest + 0.023*rose + 0.020*wood + 0.019*disneyland + 0.019*costarica
topic #14: 0.141*vacation + 0.046*san + 0.037*francisco + 0.034*sports + 0.020*hockey + 0.020*top + 0.019*cake + 0.014*cafe + 0.013*biking + 0.013*ruins
topic #23: 0.112*trip + 0.070*bridge + 0.057*road + 0.048*blue + 0.048*building + 0.042*film + 0.035*orange + 0.022*university + 0.021*telephone + 0.018*sky
topic #29: 0.124*party + 0.110*friends + 0.085*christmas + 0.045*rock + 0.038*lake + 0.038*ireland + 0.031*castle + 0.026*africa + 0.025*live + 0.025*music
Latent Dirichlet Allocation
250000 photos, 100 topics
topic #10: 0.109*hawaii + 0.093*island + 0.060*la + 0.030*photoshop + 0.027*walk + 0.026*hdr + 0.024*maui + 0.023*us + 0.019*fountain + 0.018*beach
topic #24: 0.172*house + 0.106*architecture + 0.077*festival + 0.068*airplane + 0.038*flying + 0.029*flight + 0.026*air + 0.025*aircraft + 0.021*aviation + 0.020*airshow
topic #34: 0.231*vacation + 0.159*trip + 0.136*lake + 0.095*florida + 0.088*birds + 0.062*san + 0.051*francisco + 0.015*yellowstone + 0.015*kayak + 0.015*maltay
topic #70: 0.114*november + 0.074*thanksgiving + 0.050*soccer + 0.048*polarbear + 0.048*ski + 0.041*basketball + 0.035*safari + 0.034*bear + 0.023*wien + 0.021*flood
Conclusions
LSA and LDA are more useful for analyzing tags than Association Rule Mining
There is no “best” number of topics
Human interpretation may still be required