

«Tag-based Social Interest Discovery», Proceedings of the 17th International World Wide Web Conference (WWW2008)

Xin Li, Lei Guo, Yihong Zhao

Yahoo! Inc., CA

Paper presentation:

Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Paper Outline

• Introduction

• Previous work

• Data collection / pre-processing

• Tag analysis

• System architecture

• Evaluation

• Conclusions and future work

Introduction

• Problem statement: discover common interests shared by users in a social network system¹

• Two approaches: user-centric (analyzing online user connections) and object-centric (analyzing the objects transferred, also offline)

• Paper’s approach: concentrate on user-defined tags (examining tag-URL pairs)

¹ The best-known commercial systems of this kind are:

http://del.icio.us/

http://www.facebook.com/

http://www.myspace.com/

http://www.youtube.com/

Why study tags: 4 key observations

• The tag vocabulary is rich and large enough

• For each URL, the number of unique tags associated with it is smaller than the number of keywords in the referenced web page

• The same URL may receive different tags; the tag and keyword vectors are nevertheless quite similar

• Tags carry the variation in human judgement and can therefore help identify social interests concisely and at a finer granularity

Previously …

• User-centric approach: relations form online (e.g. through blogging) and are difficult (non-trivial) to extract

• Object-centric approach: locate common objects that different users share through the network; however, the objects are non-descriptive and implicit to users

• Tagging techniques have already been used in social networks and blogs (often under the name “collaborative tagging”), and tag frequency in such networks has been shown to obey a power law. The novel idea here is to analyze the co-occurrence of multiple tags instead of single ones

Data collection/pre-processing

• Partial dump of del.icio.us database activity

• All non-HTML and non-English objects discarded; pages encoded to UTF-8

• Pages then filtered for stopwords (producing keywords)

• Tags and keywords then normalized with the Porter stemming algorithm

• Tag vocabulary size: ~300,000

• Keyword vocabulary size: ~4,000,000
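
The normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the tiny stopword list, the tokenizer regex and the function name `normalize_terms` are assumptions. The stemmer is pluggable and defaults to the identity function; a Porter stemmer from a library (for example `nltk.stem.PorterStemmer().stem`, if NLTK is installed) could be passed in instead.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would use a larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def normalize_terms(text, stem=lambda word: word, stopwords=STOPWORDS):
    """Lowercase and tokenize `text`, drop stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in stopwords]


# Identity stemmer: only tokenization and stopword filtering are applied.
keywords = normalize_terms("Tagging the pages of a social network")
```

Applying the same function to both tags and page text keeps the two vocabularies comparable, which the later tag/keyword comparison relies on.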

Distribution of data

The distribution of tags (Zipfian) is basically different from that of customers in online shopping systems

Tag analysis (1), VSM

The table shows, intuitively, that user-generated tags describe the content at a higher level of abstraction (initial observation) and are therefore also more appropriate for representing web page content

1. Use of the Vector Space Model for tf and idf calculation

2. Each URL is represented by two vectors, one in tag space and the other in keyword space
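
As a rough illustration of the two points above, the sketch below builds sparse tf-idf vectors for a handful of URLs and compares them with cosine similarity. The slides do not state the exact weighting, so raw term count times log(N/df) is an assumption (one common variant); the function names and sample data are hypothetical. The same code would be run once over tags and once over keywords to obtain the two vectors per URL.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a URL to its list of terms (tags or keywords).
    Returns one sparse vector (dict term -> weight) per URL."""
    n = len(docs)
    # Document frequency: in how many URLs does each term occur?
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for url, terms in docs.items():
        tf = Counter(terms)
        vectors[url] = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tag data for three URLs.
tag_docs = {
    "u1": ["python", "programming", "python"],
    "u2": ["python", "tutorial"],
    "u3": ["cooking", "recipes"],
}
tag_vectors = tfidf_vectors(tag_docs)
```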

Tag analysis (2), statistical estimators

• Tag vocabulary coverage is up to 90% of URL keywords (satisfactory)

• Tag matching by URL is almost complete (the opposite)

• The total number of tags that users generate for a given page is limited, no matter how popular the page is

• When multiple tags are used together, they define a topic of interest. This topic corresponds to a virtual community of users (they may have no physical or online connection in the real world)

Proposed software architecture

Post stream p=(user, URL, tags), where (user, URL)=key
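
One possible in-memory representation of such a post stream is sketched below. The `Post` type and `latest_posts` helper are hypothetical names, and the rule that a later post with the same (user, URL) key replaces an earlier one is an assumption; the slides only say that the pair acts as the key.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Tuple


class Post(NamedTuple):
    user: str
    url: str
    tags: FrozenSet[str]


def latest_posts(stream: Iterable[Post]) -> Dict[Tuple[str, str], Post]:
    """Keep one post per (user, URL) key; a later post overwrites an earlier one."""
    posts: Dict[Tuple[str, str], Post] = {}
    for post in stream:
        posts[(post.user, post.url)] = post
    return posts


# Hypothetical stream: alice re-tags the same URL, bob tags it once.
stream = [
    Post("alice", "http://example.com", frozenset({"python"})),
    Post("alice", "http://example.com", frozenset({"python", "tutorial"})),
    Post("bob", "http://example.com", frozenset({"web"})),
]
posts = latest_posts(stream)
```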

Topic Discovery (1)

• Problem: find a set of frequent tag patterns within a given set of posts (well studied in other domains e.g. supermarket transactions)

• Solution: classical association rule learning algorithms (e.g. Apriori)

• Another approach: probabilistic learning by EM algorithm (A. Plangprasopchok, K. Lerman - AAAI 2007)
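
To make the frequent-pattern idea concrete, here is a small Apriori-style sketch that finds tag sets occurring in at least `min_support` posts. It is a textbook level-wise search under simplifying assumptions (everything in memory, support counted by a full scan), not the implementation used in the paper; the function name and the sample posts are hypothetical.

```python
from itertools import combinations


def frequent_tag_sets(posts, min_support):
    """Return {tag set: support} for every tag set contained in at least
    `min_support` posts. `posts` is a list of tag collections."""
    posts = [frozenset(p) for p in posts]

    def support(candidate):
        return sum(1 for tags in posts if candidate <= tags)

    # Level 1: frequent single tags.
    singles = {frozenset([t]) for tags in posts for t in tags}
    current = {c: support(c) for c in singles}
    current = {c: s for c, s in current.items() if s >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unite pairs of frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent


sample_posts = [
    {"python", "tutorial"},
    {"python", "tutorial", "web"},
    {"python", "web"},
    {"cooking"},
]
topics = frequent_tag_sets(sample_posts, min_support=2)
```

With this sample, the pairs {python, tutorial} and {python, web} are frequent, while {tutorial, web} appears only once and so the three-tag set is pruned without being counted.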

Clustering (naive approach) (2)

Step 6 is computationally intensive. A prefix tree implementation over the merged topics can reduce complexity
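
The algorithm listing from the slide is not reproduced in this transcript, so the following is only a guess at what a naive clustering pass might look like: every post is checked against every topic, and a post whose tags contain all tags of a topic contributes its URL and user to that topic's cluster. All names are hypothetical. The nested loop costs on the order of (number of posts) × (number of topics), which is the kind of cost a prefix tree over the topics could reduce.

```python
def cluster_by_topic(posts, topics):
    """posts: iterable of (user, url, tags); topics: iterable of tag sets.
    Returns {topic: {"urls": set, "users": set}}."""
    clusters = {frozenset(t): {"urls": set(), "users": set()} for t in topics}
    for user, url, tags in posts:
        tags = set(tags)
        # Naive step: test every topic against this post.
        for topic, cluster in clusters.items():
            if topic <= tags:
                cluster["urls"].add(url)
                cluster["users"].add(user)
    return clusters


example_posts = [
    ("alice", "u1", {"python", "tutorial"}),
    ("bob", "u1", {"python", "tutorial", "web"}),
    ("carol", "u2", {"python", "web"}),
]
example_topics = [{"python", "tutorial"}, {"python", "web"}]
clusters = cluster_by_topic(example_posts, example_topics)
```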

Indexing (3)

• Kinds of queries executed by the system:

– For a given topic, a) list all URLs that contain this topic and b) list all users that are interested in this topic

– For given tags, list all topics containing the tags

– For a given URL, list all topics this URL belongs to

– For a given URL and topic, list all appropriate users
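
The queries listed above map naturally onto a few inverted indexes. The sketch below is one way to organize them in memory; the class `TopicIndex`, its attribute names and the sample data are assumptions for illustration and say nothing about how the system in the paper stores its indexes.

```python
from collections import defaultdict


class TopicIndex:
    """In-memory inverted indexes for the topic, tag and URL queries."""

    def __init__(self):
        self.urls_by_topic = defaultdict(set)
        self.users_by_topic = defaultdict(set)
        self.topics_by_tag = defaultdict(set)
        self.topics_by_url = defaultdict(set)
        self.users_by_url_topic = defaultdict(set)

    def add(self, topic, user, url):
        """Record that `user` posted `url` under the tag set `topic`."""
        topic = frozenset(topic)
        self.urls_by_topic[topic].add(url)
        self.users_by_topic[topic].add(user)
        self.topics_by_url[url].add(topic)
        self.users_by_url_topic[(url, topic)].add(user)
        for tag in topic:
            self.topics_by_tag[tag].add(topic)

    def topics_for_tags(self, tags):
        """Topics that contain every one of the given tags."""
        tags = list(tags)
        if not tags:
            return set()
        result = set(self.topics_by_tag.get(tags[0], set()))
        for tag in tags[1:]:
            result &= self.topics_by_tag.get(tag, set())
        return result


index = TopicIndex()
index.add({"python", "tutorial"}, "alice", "u1")
index.add({"python", "tutorial"}, "bob", "u1")
index.add({"python", "web"}, "carol", "u2")
```

Each query then becomes a dictionary lookup, for example `index.urls_by_topic[topic]` for the URLs of a topic or `index.users_by_url_topic[(url, topic)]` for the users of a URL within a topic.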

Evaluation (1)

• Metrics: compare intra-topic with inter-topic similarity (cosine) to see how well the clusters are formed

• Tag-based topic clustering and similarity computation are simple, accurate, and computationally cost-effective, because the dimension of the term vector space is significantly reduced

• Topic clustering is also accurate because it is based on multiple co-occurring tags
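
The intra- versus inter-topic comparison can be illustrated as below. The slides do not say how the similarities are aggregated, so the mean pairwise cosine is an assumption, and the function names and toy vectors are hypothetical. A well-formed clustering should give a clearly higher intra-topic value than inter-topic value.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dict term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(pairs):
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def intra_similarity(cluster):
    """Mean cosine over all pairs of URL vectors inside one cluster."""
    return mean_similarity(combinations(cluster, 2))


def inter_similarity(cluster_a, cluster_b):
    """Mean cosine over all pairs drawn from two different clusters."""
    return mean_similarity(product(cluster_a, cluster_b))


# Toy URL vectors for two topics.
cluster_a = [{"python": 1.0, "code": 1.0}, {"python": 1.0, "tutorial": 1.0}]
cluster_b = [{"recipe": 1.0, "food": 1.0}, {"recipe": 1.0, "vegan": 1.0}]
intra_a = intra_similarity(cluster_a)
inter_ab = inter_similarity(cluster_a, cluster_b)
```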

Evaluation (2)

• The discovered topics capture almost 90% of users’ interests

• To evaluate the quality of URL clusters, a review by 4 human editors was conducted

• Cluster sizes follow a power-law distribution (a few hot topics on the Internet capture a large number of users)

• Each topic usually contains no more than 5 tags

Conclusions

• Paper justifies use of tags as more appropriate for representing user interest

• No information on the online or offline social connection among users was necessary

• Paper provides insight into document semantics (by comparing tags and keywords)

• Paper presents extensive statistical and graphical results and can easily be characterized as a complete report

• Any questions?

Thank you for your attention!
