Blogosphere: Research Issues, Tools, and Applications
Nitin Agarwal and Huan Liu
Sunil Bandla, INF384H – Fall 2011
Dec 06, 2014
Agenda
- Introduction
- Research issues
- Tools and Methods
- Case Study
- Blogosphere and Social Networks
Web 2.0
- It is the reason behind the surge of interest in online communities
- Former consumers are now producers
- Collaborative environment
- User-generated content
- Collective wisdom
- Web 2.0 services:
  - Blogs, wikis, social networking sites, social tagging
  - WordPress, Wikipedia, Facebook, YouTube, Twitter, Yelp
Social Networks
- “A social network is a social structure made up of individuals connected by one or more types of interdependency, such as friendship, common interest…” – Wikipedia
- Web 2.0 is enabling virtual social networks
- Size and connectedness vary across networks
- Examples:
  - Friendship networks (Facebook, Myspace)
  - Media sharing (Flickr, YouTube)
Source: The New York Times
“The site, chock full of advertising, is a moneymaking machine – so much so that Ms. Armstrong and her husband have both quit their regular jobs.” The reason? The advertisers are eager to influence her 850,000 readers.
Arnold Kim, founder and senior editor of MacRumors.com.
“The site places MacRumors No. 2 on a list of the ‘25 most valuable blogs,’ …” What is the potential value? “Two of the other tech-oriented blogs on its list, …, were sold earlier this year, reportedly for sums in excess of $25 million.”
Slide Credit: Liu & Agarwal
Blogosphere
- Blog sites
- Bloggers
- Blog posts
- Blogroll
- Permalinks
- Low barrier to publication
- Readers can comment instantly, which gives the blogger a feeling of satisfaction
- Individual vs. community blogs
Blogosphere
- Complex social networks
- Bloggers/blog posts/blog sites become nodes
- Relationships are represented by edges between nodes
- Inlinks & outlinks
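As a minimal sketch of the node-and-edge view above, a blog link graph can be held as two adjacency maps, one for outlinks and one for inlinks. The post names and the `build_link_index` helper are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (source_post, target_post) means source links to target.
links = [("post_a", "post_b"), ("post_c", "post_b"), ("post_b", "post_d")]

def build_link_index(edges):
    """Return (outlinks, inlinks) adjacency maps for a directed edge list."""
    outlinks, inlinks = defaultdict(set), defaultdict(set)
    for source, target in edges:
        outlinks[source].add(target)
        inlinks[target].add(source)
    return outlinks, inlinks

outlinks, inlinks = build_link_index(links)
print(sorted(inlinks["post_b"]))   # posts that link to post_b
print(sorted(outlinks["post_b"]))  # posts that post_b links to
```

The same structure works whether the nodes are bloggers, blog posts, or blog sites; only the meaning of an edge changes.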
Modeling the Blogosphere
Key differences between Web and Blogosphere:

Web                                     | Blogosphere
Web models assume dense graph structure | Blogosphere has a very sparse hyperlink structure
Not much interaction                    | Interaction in the form of comments and replies
Static web pages                        | Dynamic blog posts
Conventional web pages do not have tags | Blog posts have tags and categories

- Modeling helps generate artificial datasets to compare algorithms
- Modeling lets us study patterns that could explain community discovery, spam blogs, influence, etc.
Modeling the Blogosphere
- Web models:
  - Random graph
  - Preferential attachment graph models
  - Hybrid graph models
- Blogosphere models:
  - To study temporal patterns of the blogosphere, such as how often people create blog posts and how they are linked
  - Blogrolls to create a network of connected posts
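To make the preferential attachment idea concrete, here is a small generator in which each new node links to existing nodes with probability proportional to their current degree. This is a generic textbook-style sketch, not the specific model used in any of the cited work; the function name and parameters are assumptions.

```python
import random

def preferential_attachment(n_nodes, m_links, seed=0):
    """Grow a graph where each new node attaches to m_links existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a small fully connected core of m_links + 1 nodes.
    edges = [(i, j) for i in range(m_links + 1) for j in range(i)]
    # Each node appears once per incident edge, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new_node in range(m_links + 1, n_nodes):
        chosen = set()
        while len(chosen) < m_links:
            chosen.add(rng.choice(endpoints))
        for target in chosen:
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges

edges = preferential_attachment(n_nodes=50, m_links=2)
print(len(edges), "edges generated")
```

Graphs generated this way can serve as the kind of artificial dataset on which algorithms are compared, although a blogosphere-specific model would also need the sparsity and temporal behavior noted above.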
Blog Clustering
- Automatic organization of the content
- Helps readers focus on interesting categories
- Keyword based:
  - Brooks and Montanez 2006 pick the top 3 keywords to cluster blog posts
  - Li et al. 2007 assign different weights to the title, body, and comments of blog posts
- Collective wisdom based:
  - Agarwal et al. 2008 use a category relation graph to merge categories and cluster blogs
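The keyword-based idea of weighting title, body, and comments differently can be sketched as a weighted term vector plus cosine similarity. The weights below are hypothetical and are not the values used by Li et al.; the similarity scores would then feed any standard clustering routine.

```python
import math
from collections import Counter

# Hypothetical weights: title terms count most, comments least.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0, "comments": 0.5}

def weighted_vector(post):
    """Build a term-weight vector from a dict of field name -> text."""
    vector = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term in post.get(field, "").lower().split():
            vector[term] += weight
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

phone_a = weighted_vector({"title": "iphone review", "body": "battery and camera", "comments": "nice phone"})
phone_b = weighted_vector({"title": "iphone camera", "body": "photo samples", "comments": ""})
recipe = weighted_vector({"title": "pasta recipe", "body": "tomato sauce", "comments": "tasty"})
print(cosine(phone_a, phone_b) > cosine(phone_a, recipe))
```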
Blog Mining
- Valuable resources to track:
  - Consumers’ beliefs and opinions
  - Initial reaction to a launch
  - Trends and buzzwords
- Blog conversations provide insights into how information flows and how opinions are shaped and influenced
- Pulse uses a Naïve Bayes classifier trained on annotated sentences to classify unlabeled data
- Attardi and Simi 2006 use opinionated words acquired from WordNet to improve blog retrieval
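For readers unfamiliar with Naïve Bayes, the sketch below trains a tiny multinomial classifier with add-one smoothing on labeled sentences and applies it to unlabeled text. It illustrates the general technique only; the training sentences are invented and nothing here reflects how Pulse itself is implemented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_sentences):
    """labeled_sentences: list of (text, label). Returns the fitted counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_sentences:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training sentences, for illustration only.
model = train_naive_bayes([
    ("great battery and lovely screen", "positive"),
    ("love the new design", "positive"),
    ("terrible battery and poor screen", "negative"),
    ("hate the slow update", "negative"),
])
print(classify("love the screen", model))
```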
Community Discovery
- Content analysis and text analysis of the blog posts to identify communities
- Kleinberg et al. cluster all the expert communities together as authorities using an authority-based approach
- Kumar et al. extend it to include co-citations to extract all communities on the web
- Some researchers studied community extraction using newsgroups and discussion boards
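The authority-based approach is commonly associated with hub and authority scoring (HITS). Assuming that is what the slide refers to, a minimal iteration looks like the sketch below; the toy edge list and node names are hypothetical.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking to a page.
        authority = {node: 0.0 for node in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        # Hub: sum of authority scores of the pages a page links to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        # Normalise so the scores stay bounded across iterations.
        for scores in (authority, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, authority

toy_edges = [("h1", "a1"), ("h2", "a1"), ("h3", "a1"), ("h1", "a2")]
hub, authority = hits(toy_edges)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```

Nodes with high authority scores, together with the hubs pointing at them, form the densely linked cores that this family of methods treats as communities.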
Influence in Blogs
- Influential bloggers:
  - Are potential market-movers
  - Sway opinions in political campaigns
  - Troubleshoot the problems of peer consumers
  - Are useful for “word-of-mouth” advertising of products
- Finding influential blog sites is different from identifying influential bloggers
- Agarwal et al. studied the influence of a blogger by modeling the blog site as a graph
Trust and Reputation
- Overwhelming amount of collective wisdom
- Difficult for the reader to decide whom to trust
- Assess the reputation of influential members in the community
- Not much work deals with trust in the Blogosphere
- Kale et al. 2007 mined sentiments about the cited blog post using a window of words around the links
- They compute trust in a network of blog sites
- Use comments on the blog post to judge a blogger’s trust
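The "window of words around the links" idea can be illustrated with a very small lexicon-based scorer. This is a simplified sketch under stated assumptions: the word lists, the `LINK` placeholder token, and the window size are all hypothetical and are not taken from Kale et al.

```python
# Hypothetical mini-lexicons; a real system would use a much larger resource.
POSITIVE = {"great", "insightful", "agree", "excellent"}
NEGATIVE = {"wrong", "misleading", "disagree", "terrible"}

def link_sentiment(tokens, link_index, window=5):
    """Score the words within `window` tokens on either side of a link."""
    start = max(0, link_index - window)
    context = tokens[start:link_index] + tokens[link_index + 1:link_index + 1 + window]
    score = 0
    for word in context:
        word = word.lower().strip(".,!?")
        score += (word in POSITIVE) - (word in NEGATIVE)
    return score

text = "I strongly disagree with this misleading post LINK even if the charts are pretty"
tokens = text.split()
print(link_sentiment(tokens, tokens.index("LINK")))
```

A negative score on a link from one blog to another could then be read as a sign of distrust on that edge, and a positive score as a sign of trust.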
Filtering Spam Blogs
- Splogs == Spam blogs
- Degrade search quality and waste network resources
- Initial researchers used web spam detection techniques
- Kolari et al. 2006 use content and hyperlinks to train an SVM-based classifier to classify a blog post as spam
- Content on blog sites is dynamic, so content-based spam filters are ineffective
- Lin et al. propose a self-similarity-based splog detection algorithm based on patterns in posting times of splogs, content similarity, and similar links in splogs
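As a rough illustration of self-similarity signals, the sketch below measures how alike consecutive posts are and how regular the gaps between posting times are. It is a simplification with hypothetical helper names, not the algorithm of Lin et al.; a real detector would combine such features in a trained classifier.

```python
from statistics import pstdev

def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def self_similarity(posts):
    """Mean content similarity between consecutive posts of one blog."""
    pairs = list(zip(posts, posts[1:]))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def posting_gap_spread(timestamps):
    """Standard deviation of the gaps between posts; a value near zero
    suggests machine-like regularity."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return pstdev(gaps) if gaps else 0.0

splog_posts = ["buy cheap pills online now", "buy cheap pills online today", "buy cheap pills online now"]
blog_posts = ["my trip to the coast", "notes on graph models", "a recipe for bread"]
print(self_similarity(splog_posts), self_similarity(blog_posts))
print(posting_gap_spread([0, 3600, 7200, 10800]))
```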
Tools and APIs
- Tools to simulate social networks to study their properties
- Multi-agent simulation tools
- Analysis of social networks
- Visualization of social networks
- APIs:
  - Facebook
  - StumbleUpon
  - Del.icio.us
Methodologies
- Centrality measures
- Content analysis
- Link analysis
- Decision theoretic approaches
- Agent-based modeling
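Two of the simplest centrality measures, degree and closeness, can be computed directly from an adjacency map. The sketch below assumes an undirected, connected toy graph; the node names are hypothetical.

```python
from collections import deque

def degree_centrality(graph):
    """graph: dict node -> set of neighbours. Degree divided by n - 1."""
    n = len(graph)
    if n <= 1:
        return {}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

def closeness_centrality(graph, source):
    """Reachable nodes divided by the sum of shortest-path lengths, via BFS."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0

# A star: "c" is connected to every other node.
star = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
print(degree_centrality(star)["c"], closeness_centrality(star, "a"))
```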
Datasets
- Nielsen Buzzmetrics dataset
  - About 14M blog posts from 3M blog sites
  - Annotated with 1.7M blog-blog links
  - Up to half of the blog outlinks are missing
  - Only 51% of the total blog posts are in English
- Enron Email dataset
  - Emails from about 150 users at Enron
  - 0.5M messages
  - Social networks between users were studied based on link construction
  - Email senders and recipients are used to construct links
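Link construction from senders and recipients can be sketched as counting directed sender-to-recipient pairs. The record layout and field names below are hypothetical; the real dataset's format may differ.

```python
from collections import Counter

# Hypothetical records; the real dataset's field names may differ.
emails = [
    {"sender": "alice", "recipients": ["bob", "carol"]},
    {"sender": "bob", "recipients": ["alice"]},
    {"sender": "alice", "recipients": ["bob"]},
]

def build_email_links(messages):
    """Count directed sender -> recipient links across all messages."""
    link_counts = Counter()
    for message in messages:
        for recipient in message["recipients"]:
            link_counts[(message["sender"], recipient)] += 1
    return link_counts

email_links = build_email_links(emails)
print(email_links.most_common(1))
```

The counts can be used as edge weights, so that frequent correspondents are connected more strongly than one-off contacts.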
Experiments and Performance Metrics
- Concepts like influence, trust, etc. in the Blogosphere are socio-psychological and subjective
- Evaluating them is non-trivial
- Hard to compare different approaches since there is no ground truth!
- Search engines’ rankings serve as the baseline for most of the existing works
- A Web 2.0 application, namely Digg, was used to evaluate influence in the blogosphere
Finding influential bloggers
- “A blogger can be influential if s/he has more than one influential blog post”
- Properties that represent influential blog posts:
  - Recognition: an influential blog post is recognized by many (inlinks)
  - Activity Generation: number of comments received and amount of discussion initiated
  - Novelty: number of outlinks
  - Eloquence: length of a post
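To show how the four properties could be combined, here is a deliberately simplified scoring sketch. It is not the model from the case study: the weights, the length cap, the choice to treat outlinks as a penalty, and the choice to score a blogger by their best post are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Post:
    author: str
    inlinks: int    # recognition
    comments: int   # activity generation
    outlinks: int   # novelty (assumed here: more outlinks means less novel)
    length: int     # eloquence, in words

# Hypothetical weights, chosen only for illustration.
W_IN, W_COMMENTS, W_OUT = 1.0, 0.5, 0.25

def post_influence(post, long_post_words=300):
    """Weighted link and comment score, scaled by a length factor capped at 1."""
    length_factor = min(1.0, post.length / long_post_words)
    raw = W_IN * post.inlinks + W_COMMENTS * post.comments - W_OUT * post.outlinks
    return length_factor * raw

def blogger_influence(posts):
    """Score each blogger by their single highest-scoring post (an assumption)."""
    scores = {}
    for post in posts:
        current = scores.get(post.author, float("-inf"))
        scores[post.author] = max(current, post_influence(post))
    return scores

sample_posts = [
    Post("ann", inlinks=10, comments=4, outlinks=2, length=300),
    Post("ann", inlinks=1, comments=0, outlinks=0, length=50),
    Post("ben", inlinks=2, comments=2, outlinks=8, length=600),
]
print(blogger_influence(sample_posts))
```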
Data Collection
- The Unofficial Apple Weblog (TUAW)
- Crawled 10,000 posts
Results
- Top 5 bloggers according to TUAW and the proposed model
- Some bloggers are both active and influential
- Some are active but not influential
- Some influential bloggers are not active
- Some bloggers are inactive and non-influential
Verification
- Challenges:
  - No testing and training data
  - Absence of ground truth
- Use another Web 2.0 site, Digg, to provide a reference point
- A more liked post will have a higher score on Digg
- Digg returns the top 100 voted posts
- Intersection of the Digg 100 and the top 20 from their model
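The verification step amounts to a set intersection between a model's top-ranked posts and a reference list. In the sketch below, the post IDs are placeholders rather than real TUAW or Digg data, and `overlap_with_reference` is a hypothetical helper name.

```python
def overlap_with_reference(model_ranking, reference_posts, top_k=20):
    """Return the posts in the model's top_k that also appear in the reference list."""
    reference = set(reference_posts)
    return [post for post in model_ranking[:top_k] if post in reference]

# Placeholder post IDs standing in for real model output and Digg results.
model_ranking = [f"post_{i}" for i in range(1, 41)]   # best first
digg_top_100 = [f"post_{i}" for i in range(11, 111)]  # 100 reference posts

shared = overlap_with_reference(model_ranking, digg_top_100)
print(len(shared), "of the model's top 20 also appear in the reference list")
```

A larger overlap suggests the model's notion of influence agrees more closely with what readers actually voted for, which is the reference point used in the absence of ground truth.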
Verification
- Importance of each parameter
- Inlinks > comments > outlinks > blog post length, in decreasing order of importance to influence estimation
Blogosphere and Social Networks

Blogosphere                                     | Social Networks
Influential nodes have “been influencing”       | Influential nodes “could influence”
To share ideas or opinions                      | To stay in touch or make friends
Reputation is based on previous responses       | Reputation is based on the number of connections
Person-to-group interaction                     | Person-to-person interaction
Community experience                            | Friendship experience
Loosely defined graph                           | Strictly defined graph
Nodes could be bloggers, blog posts, blog sites | Nodes are members
Implicit links                                  | Predefined links
Directed graph                                  | Undirected graph
Conclusion
- Virtual communities and the low barrier to publication are helping the growth of the blogosphere
- A lot is yet to be done in terms of research specific to the blogosphere
- Need accurate ground truth data
- Experiments and an evaluation plan should be devised to allow objective analysis of different algorithms
Thank you!
References
- http://www.sigkdd.org/explorations/issues/10-1-2008-07/V10N1-Blogosphere.pdf
- http://videolectures.net/kdd08_liu_briat/