«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc., CA Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Dec 26, 2015

Page 1:

«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008)

Xin Li, Lei Guo, Yihong Zhao

Yahoo! Inc., CA

Paper presentation:

Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Page 2:

Paper Outline

• Introduction
• Previous work
• Data collection / pre-processing
• Tag analysis
• System architecture
• Evaluation
• Conclusions and future work

Page 3:

Introduction

• Problem statement: discover common interests shared by users in a social network system¹

• Two approaches: user-centric (analyzing online connections between users) and object-centric (analyzing the objects users transfer, which can also be done offline)

• Paper’s approach: concentrate on user-defined tags (examining tag-URL pairs)

¹ The best-known commercial systems of this kind are:

http://del.icio.us/

http://www.facebook.com/

http://www.myspace.com/

http://www.youtube.com/

Page 4:

Why study tags: 4 key observations

• The tag vocabulary is rich and large enough

• For each URL, the number of unique tags associated with it is smaller than the number of keywords in the referenced web page

• The same URL may carry different tags; the tag and keyword vectors are, however, quite similar

• Tags carry the variation of human judgement and can therefore help identify social interests concisely and at a finer granularity

Page 5:

Previously …

• User-centric approach: relations form online (e.g. through blogging) and are difficult to extract (non-trivial)

• Object-centric approach: locates common objects that different users share through the network, but the objects are non-descriptive and implicit to users

• Tagging techniques have already been used in social networks and blogs (often under the name “collaborative tagging”). Tag frequency in such networks has also been shown to obey a power law. The novel idea here is to analyze the co-occurrence of multiple tags instead of single tags

Page 6:

Data collection/pre-processing

• Partial dump of del.icio.us database activity

• All non-HTML and non-English objects discarded, pages encoded to UTF-8

• Then pages filtered for stopwords (producing keywords)

• Then tags and keywords normalized with the Porter stemming algorithm

• Tag vocabulary size: ~300,000

• Keyword vocabulary size: ~4,000,000

Page 7:

Distribution of data

The distribution of tags (Zipfian) differs fundamentally from that of customers in online shopping systems

Page 8:

Tag analysis (1), VSM

The table suggests, as an initial observation, that user-generated tags describe the content at a higher level of abstraction and are therefore also well suited to representing web page content

1. Use of the Vector Space Model for tf and idf calculation

2. Each URL is represented by two vectors, one in tag space and the other in keyword space
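The two-vector representation can be illustrated with a small sketch. It assumes the common tf × log(N/df) weighting, since the slides do not say which tf-idf variant the paper uses; the function names and the toy URLs, tags and keywords are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map {url: [terms]} to {url: {term: tf-idf weight}}.

    Assumption: tf is the raw term count and idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per URL
    vectors = {}
    for url, terms in docs.items():
        tf = Counter(terms)
        vectors[url] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: every URL gets one term list per space.
tags = {
    "url_a": ["python", "tutorial", "python"],
    "url_b": ["python", "web"],
    "url_c": ["cooking", "recipes"],
}
keywords = {
    "url_a": ["python", "tutorial", "learn", "code"],
    "url_b": ["python", "web", "server", "http"],
    "url_c": ["cooking", "recipes", "kitchen"],
}

tag_vecs = tfidf_vectors(tags)      # URL vectors in tag space
kw_vecs = tfidf_vectors(keywords)   # URL vectors in keyword space
```

Comparing `tag_vecs[url]` with `kw_vecs[url]` via `cosine` is one way to check how closely the tags of a page track its keywords.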

Page 9:

Tag analysis (2), statistical estimators

• Tag vocabulary coverage is up to 90% of URL keywords (satisfactory)

• Tag matching per URL is almost complete (the reverse direction)

• The total number of tags that users generate for a given page is limited, no matter how popular the page is

• When multiple tags are used together, they define a topic of interest. This topic corresponds to a virtual community of users (who may have no physical or online connection with one another)

Page 10:

Proposed software architecture

Post stream p=(user, URL, tags), where (user, URL)=key
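As a rough illustration of this data model, the sketch below stores each post as an immutable record keyed by (user, URL). The class and function names are hypothetical, and keeping only the latest post per key is an assumption; the slide does not say how repeated keys are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    user: str
    url: str
    tags: frozenset

    @property
    def key(self):
        # (user, URL) identifies a post, as on the slide.
        return (self.user, self.url)


def dedupe(stream):
    """Keep one post per (user, URL) key.

    Assumption: a later post with the same key replaces the earlier one.
    """
    latest = {}
    for post in stream:
        latest[post.key] = post
    return list(latest.values())


# Hypothetical stream: alice re-tags the same URL.
stream = [
    Post("alice", "http://example.com", frozenset({"python"})),
    Post("alice", "http://example.com", frozenset({"python", "web"})),
    Post("bob", "http://example.com", frozenset({"web"})),
]
unique_posts = dedupe(stream)
```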

Page 11:

Topic Discovery (1)

• Problem: find a set of frequent tag patterns within a given set of posts (well studied in other domains e.g. supermarket transactions)

• Solution: classical association rule learning algorithms (e.g. Apriori)

• Another approach: probabilistic learning by EM algorithm (A. Plangprasopchok, K. Lerman - AAAI 2007)
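To make the association-rule idea concrete, here is a minimal level-wise sketch in the spirit of Apriori, treating each post's tag set as a transaction. It is an illustration only, not the paper's implementation; the function name, the `min_support` threshold and the toy posts are hypothetical.

```python
from itertools import combinations


def frequent_tag_sets(posts, min_support):
    """Return {frozenset(tags): support} for every tag set that occurs
    in at least min_support posts."""
    posts = [frozenset(p) for p in posts]

    # Level 1: count single tags.
    counts = {}
    for post in posts:
        for tag in post:
            single = frozenset([tag])
            counts[single] = counts.get(single, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-sets into k-set candidates and prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count support by scanning the posts once per candidate.
        counts = {c: sum(1 for p in posts if c <= p) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy posts (tag sets only).
toy_posts = [
    {"python", "web", "django"},
    {"python", "web"},
    {"python", "django"},
    {"cooking", "recipes"},
    {"python", "web", "flask"},
]
topics = frequent_tag_sets(toy_posts, min_support=2)
```

Each frequent set with more than one tag is a candidate topic in the sense used on these slides.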

Page 12:

Clustering (naive approach) (2)

Step 6 is computationally intensive. A prefix tree implementation over the merged topics can reduce complexity
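The algorithm listing this slide refers to is not part of the transcript. As a rough sketch of what a naive clustering pass could look like, the code below assumes that a topic is a set of tags and that a post belongs to a topic when its tags contain every tag of the topic. All names are hypothetical, and the step numbering of the original listing is not reproduced.

```python
from collections import defaultdict


def cluster_posts(posts, topics):
    """Group URLs and users by topic.

    posts: iterable of (user, url, tags) triples.
    topics: iterable of tag sets.
    Returns (url_clusters, user_clusters), both keyed by frozenset topic.
    """
    topics = [frozenset(t) for t in topics]
    url_clusters = defaultdict(set)
    user_clusters = defaultdict(set)
    for user, url, tags in posts:
        tag_set = frozenset(tags)
        # Naive part: every topic is tested against every post.
        for topic in topics:
            if topic <= tag_set:
                url_clusters[topic].add(url)
                user_clusters[topic].add(user)
    return dict(url_clusters), dict(user_clusters)


# Hypothetical toy input.
example_posts = [
    ("alice", "url_a", {"python", "web", "django"}),
    ("bob", "url_b", {"python", "web"}),
    ("carol", "url_c", {"cooking", "recipes"}),
]
example_topics = [{"python", "web"}, {"cooking", "recipes"}]
url_clusters, user_clusters = cluster_posts(example_posts, example_topics)
```

The inner loop costs on the order of (number of topics) × (number of posts) subset tests, which is the kind of cost a prefix tree over the topics is meant to reduce.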

Page 13:

Indexing (3)

• Kinds of queries executed by the system:

– For a given topic, a) list all URLs that contain this topic and b) list all users that are interested in this topic

– For given tags, list all topics containing the tags

– For a given URL, list all topics this URL belongs to

– For a given URL and topic, list all appropriate users
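The query types listed above map naturally onto a handful of dictionary-based inverted indexes. The sketch below is one possible in-memory layout, not the system described in the paper; the class name, method names and toy data are hypothetical, and a post is assumed to belong to a topic when its tags contain all of the topic's tags.

```python
from collections import defaultdict


class TopicIndex:
    """In-memory indexes for the topic, tag, and URL queries."""

    def __init__(self, posts, topics):
        self.topic_urls = defaultdict(set)
        self.topic_users = defaultdict(set)
        self.tag_topics = defaultdict(set)
        self.url_topics = defaultdict(set)
        self.url_topic_users = defaultdict(set)
        topics = [frozenset(t) for t in topics]
        for topic in topics:
            for tag in topic:
                self.tag_topics[tag].add(topic)
        for user, url, tags in posts:
            tag_set = frozenset(tags)
            for topic in topics:
                if topic <= tag_set:
                    self.topic_urls[topic].add(url)
                    self.topic_users[topic].add(user)
                    self.url_topics[url].add(topic)
                    self.url_topic_users[(url, topic)].add(user)

    def urls_for_topic(self, topic):
        return set(self.topic_urls.get(frozenset(topic), set()))

    def users_for_topic(self, topic):
        return set(self.topic_users.get(frozenset(topic), set()))

    def topics_for_tags(self, tags):
        """Topics that contain every one of the given tags."""
        tags = list(tags)
        if not tags:
            return set()
        result = set(self.tag_topics.get(tags[0], set()))
        for tag in tags[1:]:
            result &= self.tag_topics.get(tag, set())
        return result

    def topics_for_url(self, url):
        return set(self.url_topics.get(url, set()))

    def users_for_url_and_topic(self, url, topic):
        return set(self.url_topic_users.get((url, frozenset(topic)), set()))


# Hypothetical toy input.
index = TopicIndex(
    posts=[
        ("alice", "url_a", {"python", "web", "django"}),
        ("bob", "url_a", {"python", "web"}),
        ("carol", "url_c", {"cooking", "recipes"}),
    ],
    topics=[{"python", "web"}, {"python", "django"}, {"cooking", "recipes"}],
)
```

For example, `index.topics_for_tags(["python"])` returns both topics that contain the tag `python`, and `index.users_for_url_and_topic("url_a", {"python", "web"})` returns the users who posted `url_a` with both tags.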

Page 14:

Evaluation (1)

• Metrics: compare intra-topic with inter-topic similarity (cosine) to see how well the clusters are formed

• Tag-based topic clustering and similarity computation are simple, accurate and computationally cheap, because the dimension of the term vector space is significantly reduced

• Topic clustering is also accurate because it is based on multiple co-occurring tags
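The intra- versus inter-topic comparison can be sketched as average pairwise cosine similarity. Averaging over all pairs is an assumption here, since the slides do not specify how the paper aggregates the similarities; the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def intra_similarity(cluster):
    """Mean cosine similarity over all pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def inter_similarity(cluster_a, cluster_b):
    """Mean cosine similarity over all cross-cluster pairs."""
    pairs = list(product(cluster_a, cluster_b))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical URL vectors grouped into two topic clusters.
programming = [{"python": 1.0, "web": 1.0}, {"python": 1.0, "django": 1.0}]
food = [{"cooking": 1.0, "recipes": 1.0}, {"cooking": 1.0}]

intra_programming = intra_similarity(programming)
inter_score = inter_similarity(programming, food)
```

Well-formed clusters should show a clearly higher intra-topic value than inter-topic value, as the toy vectors do.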

Page 15:

Evaluation (2)

• The discovered topics capture almost 90% of users’ interests

• To evaluate the quality of URL clusters, a review by 4 human editors was conducted

• Cluster sizes follow a power-law distribution (a few hot topics on the internet attract a large number of users)

• Each topic usually contains no more than 5 tags

Page 16:

Conclusions

• The paper justifies the use of tags as more appropriate for representing user interest

• No information on the online or offline social connections among users was necessary

• The paper offers insight into document semantics (by comparing tags and keywords)

• The paper presents extensive statistical analysis and graphical results, and can fairly be described as a complete report

Page 17:

• Any questions?

Thank you for your attention!