Sampling online communities: using triplets as a basis for a (semi-)automated hyperlink web crawler
Yoann VENY, Université Libre de Bruxelles (ULB) - GERME, [email protected]
This research is funded by the FRS-FNRS
Paper presented at the 15th General Online Research Conference, 4-6 March, Mannheim
• What is an online community?
– “social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationships in cyberspace” (Rheingold 2000)
– long-term involvement (Jones 2006)
– sense of community (Blanchard 2008)
– temporal perspective (Lin et al. 2006)
• These criteria are probably important … but the first step should be to take the ‘hyperlink environment’ into account
Graph analysis issue / SNA issue
Online Communities – A graphical definition (1)
• Community = more ties among members than with non-members
• Three general classes of ‘community’ definition in graph partitioning algorithms (Fortunato 2010):
– a local definition: focus on sub-graphs (e.g. cliques, n-cliques (Luce, 1950), k-plexes (Seidman & Foster, 1978), lambda sets (Borgatti et al., 1990), …)
– a global definition: focus on the graph as a whole (is the observed graph significantly different from a random graph, e.g. an Erdős-Rényi graph?)
– vertex similarity: focus on actors (e.g. Euclidean distance & hierarchical clustering, max-flow/min-cut (Elias et al., 1956; Flake et al., 2000))
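The basic criterion above (more ties among members than with non-members) can be checked directly on a small directed link list. The following is only a minimal sketch: the function names `tie_counts` and `is_community` and the toy data `example_links` are illustrative assumptions, not part of the presentation.

```python
from typing import Dict, Set, Tuple

def tie_counts(links: Dict[str, Set[str]], members: Set[str]) -> Tuple[int, int]:
    """Count directed ties inside a candidate community and ties crossing its boundary."""
    internal = 0
    external = 0
    for source, targets in links.items():
        for target in targets:
            if source in members and target in members:
                internal += 1  # both endpoints are members
            elif source in members or target in members:
                external += 1  # exactly one endpoint is a member
    return internal, external

def is_community(links: Dict[str, Set[str]], members: Set[str]) -> bool:
    """Hypothetical test: more ties among members than with non-members."""
    internal, external = tie_counts(links, members)
    return internal > external

# Toy hyperlink data (assumed for illustration): site -> sites it links to.
example_links = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a"},
    "d": {"a"},
}
print(tie_counts(example_links, {"a", "b", "c"}))
print(is_community(example_links, {"a", "b", "c"}))
```

This corresponds to the simplest reading of the criterion; the local, global and vertex-similarity definitions listed above are stricter and are not implemented here.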
Online communities – graphical definition (2)
• 2 main problems of graph partitioning in a hyperlink environment:
• 1) network size and form (e.g. tree structure)
• 2) edge direction
• Better to discover communities with an efficient web crawler
Web crawling - Generalities
• The general idea for a web crawling process:
Source: Jacomi & Ghitalla (2007)
- We have a number of starting blogs (seeds)
- All hyperlinks are retrieved from these seed blogs
- For each new website discovered, decide whether this new site is accepted or refused
- If the site is accepted, it becomes a seed and the process is reiterated on this site.
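The four steps above can be sketched as a simple queue-based loop. This is a minimal illustration, not the author's crawler: `crawl`, `get_links`, `accept` and the in-memory `TOY_WEB` are assumed names, and a real crawler would fetch and parse pages where this sketch looks up a dictionary.

```python
from collections import deque
from typing import Callable, Iterable, List, Set

def crawl(seeds: List[str],
          get_links: Callable[[str], Iterable[str]],
          accept: Callable[[str], bool]) -> Set[str]:
    """Generic crawl loop: accepted sites become seeds in turn."""
    accepted = set(seeds)
    refused: Set[str] = set()
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        for target in get_links(site):      # retrieve all hyperlinks of the seed
            if target in accepted or target in refused:
                continue                    # decision already taken
            if accept(target):              # accept or refuse the new site
                accepted.add(target)
                queue.append(target)        # the accepted site becomes a seed
            else:
                refused.add(target)
    return accepted

# Toy hyperlink environment (assumed for illustration).
TOY_WEB = {
    "blog1": ["blog2", "news"],
    "blog2": ["blog3"],
    "blog3": ["blog1"],
    "news": ["shop"],
}
found = crawl(["blog1"],
              get_links=lambda site: TOY_WEB.get(site, []),
              accept=lambda site: site.startswith("blog"))
print(sorted(found))
```

In a manual crawler the `accept` decision is taken by a person for every discovered site; here it is a placeholder rule on the site name.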
Web crawling – constraint-based web crawler (1)
• Two problems of a manual crawler:
– Number and quality of decisions
– Closure?
• A solution: taking advantage of the local structural properties of a network:
Assume that a network is the outcome of the aggregation of local social processes:
– Examples in SNA:
• General philosophy of ERG models (see e.g. Robins et al. 2007)
• Local clustering coefficient (see e.g. Watts & Strogatz, 1998)
Constrain the crawler to identify local social structures (e.g. triangles, mutual dyads, transitive triads, …)
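Two of the local structures mentioned above can be computed from a plain link dictionary. The sketch below is an illustration under assumptions: the function names and `example_links` are hypothetical, and the clustering coefficient is computed on the undirected version of the graph, which is one possible choice for hyperlink data.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set

def mutual_dyads(links: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Return the unordered pairs of sites that link to each other."""
    return {frozenset((a, b))
            for a, targets in links.items()
            for b in targets
            if a != b and a in links.get(b, set())}

def local_clustering(links: Dict[str, Set[str]], node: str) -> float:
    """Share of pairs of neighbours of `node` that are themselves linked,
    ignoring edge direction."""
    neighbours: Dict[str, Set[str]] = {}
    for a, targets in links.items():
        for b in targets:
            if a != b:
                neighbours.setdefault(a, set()).add(b)
                neighbours.setdefault(b, set()).add(a)
    around = neighbours.get(node, set())
    k = len(around)
    if k < 2:
        return 0.0  # no pair of neighbours, so no triangle is possible
    closed = sum(1 for u, v in combinations(around, 2) if v in neighbours[u])
    return closed / (k * (k - 1) / 2)

# Toy hyperlink data (assumed for illustration): site -> sites it links to.
example_links = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": set(),
    "d": set(),
}
print(mutual_dyads(example_links))
print(local_clustering(example_links, "a"))
```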
Web crawling – constraint-based web crawler (2)
An example of a constrained web crawler based on identification of triangles
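Let $G = (V, E)$ be the general graph of the whole hyperlink environment, where $V$ are the vertices of the graph and $E$ the edges of the graph.
Let $C = (V_C, E_C)$ be the graph of the community, where $V_C \subseteq V$ are the vertices of the graph and $E_C \subseteq E$ the edges of the graph.
For each $a \in V_C$ {
For each element $b$ in the neighborhood of $a$, defined as $N(a) = \{ b \in V : (a, b) \in E \}$ {
Define a new subgraph of $G$: $C' = G[V_C \cup \{b\}]$
Calculate: $s(C)$, the local network statistics vector in $C$; $s(C')$, the local network statistics vector in $C'$
If (any($s(C') > s(C)$)) { Set $C = C'$ }
}}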
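A minimal sketch of a triangle-constrained crawler in the spirit of the pseudocode above. Several points are assumptions and not taken from the slides: the names `count_triangles`, `triangle_crawl` and `TOY_WEB`, the use of a single statistic (the number of undirected triangles), and the naive recount of triangles at every step, which is only workable on small examples.

```python
from collections import deque
from itertools import combinations
from typing import Callable, Iterable, List, Set

def count_triangles(members: Set[str],
                    get_links: Callable[[str], Iterable[str]]) -> int:
    """Number of triangles in the subgraph induced by `members`,
    ignoring edge direction (assumed local statistic)."""
    edges = set()
    for a in members:
        for b in get_links(a):
            if b in members and b != a:
                edges.add(frozenset((a, b)))
    return sum(1 for trio in combinations(sorted(members), 3)
               if all(frozenset(pair) in edges for pair in combinations(trio, 2)))

def triangle_crawl(seeds: List[str],
                   get_links: Callable[[str], Iterable[str]]) -> Set[str]:
    """Accept a neighbour b only if adding it raises the triangle count."""
    community = set(seeds)
    queue = deque(seeds)
    while queue:
        a = queue.popleft()
        for b in get_links(a):                   # neighbourhood of a
            if b in community:
                continue
            candidate = community | {b}          # new candidate subgraph
            before = count_triangles(community, get_links)
            after = count_triangles(candidate, get_links)
            if after > before:                   # statistic increased
                community = candidate
                queue.append(b)                  # b is explored in turn
    return community

# Toy hyperlink environment (assumed for illustration).
TOY_WEB = {
    "s1": ["s2", "x"],
    "s2": ["s1", "y", "x"],
    "x": ["s1"],
    "y": ["z"],
    "z": [],
}
result = triangle_crawl(["s1", "s2"], lambda site: TOY_WEB.get(site, []))
print(sorted(result))
```

In this toy example `x` is accepted because it closes a triangle with the two seeds, while `y` is refused because it is tied to only one member. A refused site is evaluated again whenever another member links to it, so the result can depend on the order in which sites are visited.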
Generalisation
Experimental results - method
Y is the n x n adjacency matrix of a binary network with elements: