A Short Walk in the Blogistan David de-Vilder Jonny Foss Kevin Gundooa

Mar 28, 2015

Aidan Sullivan

A Short Walk in the Blogistan

David de-Vilder

Jonny Foss

Kevin Gundooa

Introduction

• "Blogistan" is a word used to describe the collection of blogs, or blogspace.

• The paper's contributions:

1. How emerging interests and patterns can be extracted by tracking a collection of blogs over time

2. The size and nature of the Blogistan, based on a recent collection of blogs

3. Inferences and observations on identifying blogs, spam problems in blogs, and how blogs are accessed

Defining a Blog

• Essentially long queues: additions appear at the top of the page and older material moves down the queue.

• Adds many-to-many interaction to the internet: rather than one journalist writing for many readers, many writers now talk to many readers.

• Fastest-growing part of the WWW in the past 2 years: 100,000 new blogs are created and 1.3 million posts are made every day [BBC News, 8th November 2006]

Blogs vs. Webpages

• Blogs are often a single-page site

• Blogs are generally authored by a single person

• Navigation through blogs is typically easier: cross-links and backtracking

• Active blogs are generally updated more frequently than traditional webpages

• "Blogroll" structure enables easy reading of newly added information (HTTP Range request)

Data Gathering/Filtering

• The authors used their own URL-examining techniques to produce a seed collection of 10,000 blogs of varying popularity

• Each blog was visited 5 times a day for over a month (August – September 2003).

• The meta-information (via HEAD) and body (via GET) of each blog URL were retrieved.

• Duplicate (and inactive/automatic) blogs were removed from the set of 10,000.

• This produced a set of 8,679 working blogs
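
The gathering and filtering steps above could be sketched roughly as follows. This is only an illustration, not the authors' crawler: the function names, the injectable `opener` parameter, and the use of a SHA-256 body hash as the duplicate test are all assumptions.

```python
import hashlib
import urllib.request


def fetch_snapshot(url, opener=urllib.request.urlopen, timeout=10):
    """Retrieve meta-information via HEAD and the body via GET for one URL.

    `opener` is injectable (an assumption of this sketch) so the function
    can be exercised without touching the network.
    """
    head = urllib.request.Request(url, method="HEAD")
    with opener(head, timeout=timeout) as resp:
        meta = dict(resp.headers.items())
    get = urllib.request.Request(url, method="GET")
    with opener(get, timeout=timeout) as resp:
        body = resp.read()
    return meta, body


def drop_duplicates(bodies_by_url):
    """Keep one URL per distinct body.

    Hypothetical duplicate test: two blogs are duplicates when their
    bodies hash to the same value. URLs are visited in sorted order so
    the result is deterministic.
    """
    seen, kept = set(), []
    for url in sorted(bodies_by_url):
        digest = hashlib.sha256(bodies_by_url[url]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(url)
    return kept
```

In a real crawl, `fetch_snapshot` would be called for every seed URL on each of the five daily visits, and the filtering pass would run over the collected bodies afterwards.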

Analysis - Emergence Patterns

• URLs that are first referenced by a blog after the study's measurement start time are termed new URLs. All other URLs are referred to as old URLs.

• Interesting URLs are identified by the number of references made to them; this is known as multiplicity

• Removing duplicates has a significant impact on the "interest ranking" of the top new URLs

• Complications with blog-hosting sites and duplicate removal
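
The new/old split and the multiplicity ranking described above might be computed along these lines. This is a minimal sketch: the `(time, blog, url)` input shape and the function names are hypothetical, and multiplicity is taken here as the number of distinct blogs referencing a URL, which is an assumption about how references are counted.

```python
from collections import defaultdict


def classify_and_count(observations, start_time):
    """Label each URL 'new' or 'old' and compute its multiplicity.

    observations: iterable of (time, blog, url) sightings of a reference.
    A URL is 'new' if its earliest sighting is strictly after `start_time`
    (the measurement start); otherwise it is 'old'.
    Multiplicity = number of distinct referencing blogs (assumption).
    """
    first_seen = {}
    blogs = defaultdict(set)
    for t, blog, url in observations:
        first_seen[url] = min(t, first_seen.get(url, t))
        blogs[url].add(blog)
    result = {}
    for url, t in first_seen.items():
        kind = "new" if t > start_time else "old"
        result[url] = (kind, len(blogs[url]))
    return result


def top_new_urls(result, n=10):
    """Rank new URLs by multiplicity, highest first (ties broken by URL)."""
    new = [(url, m) for url, (kind, m) in result.items() if kind == "new"]
    return sorted(new, key=lambda item: (-item[1], item[0]))[:n]
```

Running the ranking before and after duplicate blogs are removed is one way to see how much duplicates distort the "interest ranking".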

Distribution of Multiplicity of all URLs and of all new URLs

Traditional static hyperlink analysis unsuitable for analysing blogs

Number of references for new URLs is considerably lower than for old URLs at the time they are useful to mine

Lifespan Distributions

Distribution of lifespans of new and old URLs

New URL references are much more short-lived than old URL references

Only 25% of new-URL references lasted longer than 20 days, compared with 80% of old-URL references
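
A lifespan statistic of this kind can be reproduced with a short calculation. In the sketch below, the lifespan of a reference is assumed to be the gap between the first and last day on which a blog was seen linking to a URL; that definition, the input shape, and the function names are illustrative assumptions.

```python
def lifespans(sightings):
    """Return {(blog, url): lifespan_in_days}.

    sightings: iterable of (day, blog, url), one per visit on which the
    blog was observed referencing the URL. Lifespan is last day minus
    first day (an assumed definition).
    """
    first, last = {}, {}
    for day, blog, url in sightings:
        key = (blog, url)
        first[key] = min(day, first.get(key, day))
        last[key] = max(day, last.get(key, day))
    return {key: last[key] - first[key] for key in first}


def fraction_longer_than(spans, days=20):
    """Fraction of references whose lifespan exceeds `days`."""
    if not spans:
        return 0.0
    return sum(1 for v in spans.values() if v > days) / len(spans)
```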

The Blogistan

• Estimated size: 1-4 million blogs

• More than a third of blogspace is not actively changing

• Each blog page was given a unique key to reduce duplicates

• This reduced the number of distinct blog URLs by 40%
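
The slides do not say how the unique key was built. One plausible approach, shown purely as an assumption, is to canonicalise each URL so that trivially different spellings of the same blog address collapse to a single key; `blog_key` and `distinct_blogs` are hypothetical names.

```python
from urllib.parse import urlsplit


def blog_key(url):
    """Hypothetical canonical key for a blog URL.

    Assumed rules: lower-case host, drop a leading 'www.', drop a
    trailing index page, drop trailing slashes, ignore scheme and query.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    for index in ("index.html", "index.htm", "index.php"):
        if path.endswith("/" + index):
            path = path[: -len(index)]
    return host + path.rstrip("/")


def distinct_blogs(urls):
    """Keep the first URL seen for each key."""
    by_key = {}
    for url in urls:
        by_key.setdefault(blog_key(url), url)
    return list(by_key.values())
```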

Blog Domain Attributes

• Around 180,000 unique domains and 14,000 second-level domains

• The large gap between the number of unique domains and the number of second-level domains is partly explained by blog-hosting sites

• DNS lookup of the domain names produced around 12,000 unique IP addresses

• Surprisingly few IP addresses for such a large space
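
Counts like these can be derived from a URL list as sketched below. The second-level domain is approximated naively as the last two labels of the host name, which is an assumption and is wrong for suffixes such as `co.uk`; the `resolve` parameter is a hypothetical hook for a DNS lookup function.

```python
from urllib.parse import urlsplit


def second_level_domain(host):
    """Naive second-level domain: the last two labels of the host name.

    Assumption for illustration only; it mishandles multi-part public
    suffixes such as 'co.uk'.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def domain_stats(urls, resolve=None):
    """Count unique hosts, second-level domains and resolved IP addresses.

    resolve: optional callable host -> IP string; hosts that fail to
    resolve (OSError) are skipped.
    """
    hosts = {urlsplit(u).hostname for u in urls}
    hosts.discard(None)
    slds = {second_level_domain(h) for h in hosts}
    ips = set()
    if resolve is not None:
        for host in hosts:
            try:
                ips.add(resolve(host))
            except OSError:
                pass
    return {"domains": len(hosts), "second_level": len(slds), "ips": len(ips)}
```

Passing `socket.gethostbyname` from the standard library as `resolve` would perform real DNS lookups. Many blogs hosted under one provider then collapse to a single second-level domain and often to a single IP address.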

HTTP Protocol Issues

• Only the newest entries need to be downloaded; these are always at the top of the page because of the 'blogroll' structure.

• Whether a page has been modified can be checked using the Last-Modified header of the HTTP HEAD response.

• Partial download can be achieved using the HTTP Range request

• Unfortunately not all web servers support the Range request (only 40% in the test)
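
The two HTTP techniques above, a HEAD check on Last-Modified followed by a Range request for the top of the page, might look like this in practice. The function names, the 4096-byte default, and the injectable `opener` are assumptions made for the sketch.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def modified_since(url, last_seen, opener=urllib.request.urlopen):
    """HEAD the URL and compare Last-Modified with `last_seen`.

    last_seen: timezone-aware datetime of the previous visit.
    Returns True when the header is missing or unparsable, so the
    caller falls back to downloading.
    """
    req = urllib.request.Request(url, method="HEAD")
    with opener(req) as resp:
        value = resp.headers.get("Last-Modified")
    if value is None:
        return True
    try:
        return parsedate_to_datetime(value) > last_seen
    except (TypeError, ValueError):
        return True


def fetch_top(url, nbytes=4096, opener=urllib.request.urlopen):
    """Request only the first `nbytes` bytes, where the newest entries sit.

    Returns (body, partial). `partial` is True only when the server
    answered 206 Partial Content; a server that ignores Range replies
    200 with the full body.
    """
    headers = {"Range": f"bytes=0-{nbytes - 1}"}
    req = urllib.request.Request(url, headers=headers)
    with opener(req) as resp:
        return resp.read(), resp.status == 206
```

Checking the status code matters because a server without Range support silently returns the whole page, which is the case the 40% figure refers to.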

Inferences from Analysis

• Only preliminary observations based on the study; they do not necessarily hold true.

• Identifying a Blog

– Three sets of URLs used to test the hypotheses

– Popular websites have more references than blogs

– Blogs have more unique references than less popular websites

– Blogs have more self references than webpages
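
The hypotheses above rest on three per-page counts: total references, unique references and self references. A rough way to compute them from a page's links is sketched here; treating a "self reference" as a link to the same host is an assumption, and `reference_features` is a hypothetical name.

```python
from urllib.parse import urljoin, urlsplit


def reference_features(page_url, hrefs):
    """Count total, unique and self references among a page's links.

    hrefs: link targets as found in the page (relative or absolute).
    A self reference is assumed to be any link whose host equals the
    host of the page itself.
    """
    page_host = urlsplit(page_url).hostname
    absolute = [urljoin(page_url, h) for h in hrefs]
    self_refs = sum(1 for u in absolute if urlsplit(u).hostname == page_host)
    return {
        "references": len(absolute),
        "unique": len(set(absolute)),
        "self": self_refs,
    }
```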

Inferences from Analysis

• Anti-Spam Measures

– Replicating pages and inserting links to attract undue attention

– Blogs inadvertently provide free space for spammers (Referrer field)

– Many blogs allow others to place comments, another target for spamming

– Spammers can place any links of their choice, which boosts ranking on search engines

– Potential solution: automated distinguishing of spam and genuine references

• Server Logs and Popular Blogs

– Web logs of two very popular, anonymous blogs

– Most popular request was to the top-level blog URL

– A third of all external references were from a single site, a news aggregator

– Search engine crawler tested on sites: no distinction between blogs and non-blogs

– Only two blogs tested; extensive testing needed to obtain meaningful results

Applications & Further Work

• The methodology used to identify emerging interests can be applied as a general approach to mining evolving interconnection networks.

• This can have applications beyond the Blogistan.

• By cancelling out repeated patterns, it is possible to identify new ones.

• An example application in a different realm is the study of ISP-level netflow data over time.

• This could enable identification of bot attacks on hosts, detection of new worms and prediction of flash crowds.

Referenced Papers

Rate of Change and Other Metrics: a Live Study of the WWW by F. Douglis et al.

• Investigated the utility of web server caches

• Mainly focused on large companies' websites, since there were not many large community websites in 1997

• When pages change frequently, there's little point in a web server cache...

Rate of Change...

• Dynamic pages

– True user interaction, e.g. Amazon, eBay etc.

• Semi-Dynamic pages

– Frequently updated pages such as blogs

• Static pages

– Simple HTML pages – rarely updated

Rate of Change varies with...

• Content-Type

– e.g. is it HTML, .doc or .jpg?

• Top Level domain

– e.g. http://www.warwick.ac.uk will probably change more than: http://www2.warwick.ac.uk/insite/newsandevents/notices/xmascard/

Web Server Caches

• Almost useless for true dynamic pages

• Very useful for static pages

• Limited use for blogs.

Rate of Change... concludes that web server developers should consider whether individual pages should be cached

Referenced Papers

A Large-Scale Study of the Evolution of Web Pages by Fetterly et al. (2003)

• Monitored 150,836,209 pages over 11 weeks. Took an MD5 sum of the page contents, plus feature vectors, to monitor whether or not a page had changed recently
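
Checksum-based change detection of this kind is simple to sketch. The code below covers only the MD5 part; the feature vectors mentioned above are not modelled, and the function names and data shapes are assumptions.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Hex MD5 digest of a page body."""
    return hashlib.md5(content).hexdigest()


def changed_pages(previous, current):
    """Compare this crawl's bodies against last crawl's checksums.

    previous: {url: md5 hex digest} from the last crawl.
    current:  {url: body bytes} from this crawl.
    Returns (sorted list of changed URLs, new checksum table).
    Pages with no previous checksum are recorded but not reported
    as changed.
    """
    checksums = {url: md5_of(body) for url, body in current.items()}
    changed = sorted(
        url
        for url, digest in checksums.items()
        if url in previous and previous[url] != digest
    )
    return changed, checksums
```

A checksum only says that something changed, not how much, which is why a study at this scale pairs it with a similarity measure.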

Referenced Papers

A Large-Scale Study of the Evolution of Web Pages by Fetterly et al. (2003)

Concluded:

• Larger pages change more often than smaller pages!

– Something which the Rate of Change... paper explicitly said didn't happen! (6 years ago)

Referenced Papers

On the Bursty Evolution of Blogspace by R. Kumar et al. (Proc. WWW 2003)

• Studied 750,000 links between 25,000 blogs

• Introduced new tools which:

– Created time graphs based on when the links between blogs appear, and how blogs interact with each other.
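
The idea of a time graph, a link graph whose edges carry the time at which they appeared, can be illustrated with a small sketch. This is not the tooling from the Kumar et al. paper; the edge representation and function names are assumptions.

```python
from collections import Counter


def graph_at(edges, t):
    """Links present at time t.

    edges: iterable of (source_blog, target_blog, time_first_observed).
    Returns the set of (source, target) pairs that appeared at or before t.
    """
    return {(src, dst) for src, dst, when in edges if when <= t}


def new_links_per_period(edges, period):
    """Count links first appearing in each period of length `period`.

    A period with an unusually high count hints at a burst of linking.
    """
    first = {}
    for src, dst, when in edges:
        first[(src, dst)] = min(when, first.get((src, dst), when))
    return Counter(when // period for when in first.values())
```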

Recent Developments

Characterising the Splogosphere (WWE 06)

• Splogs (spam blogs) are now inundating blog search engines

• Their purpose is to host ads or raise the PageRank of target sites

• A system that detects splogs with up to 90% accuracy is presented

• The aim is to facilitate development of effective techniques to weed out splogs from the blogosphere

Recent Developments

Latent Weblog Communities (WWE 06)

• Latent Weblog Community (LBC) concept proposed by Kazunari Ishida (Tokyo University)

• "Weak Pair" algorithm to find connected clusters of blog posts

• Some success in identifying "whimsical links" and multiple blogs by the same author

• The aim is to organise and categorise these groups into a catalog (as a search engine alternative)

Recent Developments

Categorising Key Bloggers (WWE 05)

• 3 key blogger types identified by Shinsuke Nakajima (NAIST)

• Topic-finders, Agitators and Summarisers

• The aim is to use influential bloggers to complement mainstream websites and television.

• 500k blogs with 10m entries being tracked

Recent Developments

Blog Language Spread (Technorati Analysis Apr 06)

• 37.3M blogs tracked by Technorati

• The blogosphere is multilingual and deeply international

• English had fallen to less than a third of all blog posts by April 2006

• Japanese and Chinese language blogging has grown significantly

Thank you

Any Questions?