Mapping French Open Data actors on the web with Common Crawl @glebourg


Dec 18, 2014


data publica

 
Page 1: Mapping french open data actors on the web with common crawl

Mapping French Open Data actors on the web with Common Crawl @glebourg

Page 2: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling

Page 3: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable

Page 4: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful to get meta information on known entities

Page 5: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD

Using a third party makes your life easier!

Page 6: Mapping french open data actors on the web with common crawl

From a crawl to a map

Goal: build a map of the French open data actors on the web

● As a graph
● Showing websites

Page 7: Mapping french open data actors on the web with common crawl

From a crawl to a map

Using Common Crawl
● Large web crawl archives, fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs
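To illustrate the MapReduce style of job mentioned above, here is a minimal sketch in plain Python that counts crawled pages per host. It only shows the shape of a map step and a reduce step; the function names `map_page` and `reduce_counts` and the sample URLs are hypothetical, and no real Common Crawl or AWS API is used.

```python
from collections import Counter
from urllib.parse import urlsplit


def map_page(url):
    """Map step: emit (host, 1) for one crawled page URL."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each host."""
    totals = Counter()
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


# Hypothetical sample of crawled URLs (not real Common Crawl records).
urls = [
    "http://www.example.fr/a",
    "http://www.example.fr/b",
    "http://data.example.org/",
]
pairs = [pair for url in urls for pair in map_page(url)]
pages_per_host = reduce_counts(pairs)
```

The per-host page totals produced this way are the denominators a per-website score needs.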

Page 8: Mapping french open data actors on the web with common crawl

From a crawl to a map

Working on the French web
● The .fr TLD is not a relevant criterion for detection
● Detecting page language
● Giving websites a "frenchness" score
  ○ Sw = number of French pages / total number of pages
  ○ Cutoff chosen manually by testing on French websites
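The "frenchness" score can be sketched as follows, assuming the language of each page has already been detected by some other tool. The input format, the `CUTOFF` value and the sample data are assumptions for illustration; the slides only say the cutoff was chosen manually.

```python
from collections import defaultdict


def frenchness_scores(pages):
    """pages: iterable of (site, language) pairs, one pair per crawled page.

    Returns Sw = number of French pages / total number of pages, per site.
    """
    total = defaultdict(int)
    french = defaultdict(int)
    for site, lang in pages:
        total[site] += 1
        if lang == "fr":
            french[site] += 1
    return {site: french[site] / total[site] for site in total}


CUTOFF = 0.5  # assumed value; the real cutoff was tuned by hand

# Hypothetical pages with pre-detected languages.
pages = [
    ("a.example", "fr"),
    ("a.example", "fr"),
    ("a.example", "en"),
    ("b.example", "en"),
]
scores = frenchness_scores(pages)
french_sites = {site for site, score in scores.items() if score >= CUTOFF}
```

Scoring at the website level rather than the page level lets a mostly French site keep a few pages in other languages without being discarded.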

Page 9: Mapping french open data actors on the web with common crawl

From a crawl to a map

Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page talks about Open Data
● Giving websites an "opendataness" score
  ○ Sw = number of Open Data pages / total number of pages
  ○ Cutoff chosen manually by testing on Open Data websites
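A minimal sketch of the "opendataness" score, assuming a page counts as an Open Data page when it contains at least one vocabulary term. The vocabulary below, the `min_hits` threshold and the sample pages are hypothetical; the actual vocabulary and detection rule are not given in the slides.

```python
# Hypothetical vocabulary; the real one is not listed in the slides.
VOCABULARY = {"open data", "opendata", "données ouvertes", "données publiques"}


def is_open_data_page(text, vocabulary=VOCABULARY, min_hits=1):
    """Flag a page when at least min_hits vocabulary terms occur in its text."""
    lowered = text.lower()
    hits = sum(1 for term in vocabulary if term in lowered)
    return hits >= min_hits


def opendataness(page_texts):
    """Sw = number of Open Data pages / total number of pages for one site."""
    if not page_texts:
        return 0.0
    flagged = sum(1 for text in page_texts if is_open_data_page(text))
    return flagged / len(page_texts)


# Hypothetical page texts for a single website.
site_pages = [
    "Portail Open Data de la ville",
    "Mentions légales",
    "Jeux de données publiques",
    "Contact",
]
score = opendataness(site_pages)
```

A plain substring match is the simplest possible detector; a text-mining approach would be more robust.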

Page 10: Mapping french open data actors on the web with common crawl

From a crawl to a map

Building graph
● Inside our subset
  ○ Inlinks
  ○ Outlinks
● Generating two files
  ○ nodes.csv (list of websites with an id)
  ○ edges.csv (directed links between websites)

[Diagram: node A with its inlinks and outlink]
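The two files can be produced along these lines. This is a sketch under assumptions: the column names (`Id`, `Label`, `Source`, `Target`) and the sample sites are illustrative, since the slides only name the files and describe their content.

```python
import csv
import io


def build_graph(links, subset):
    """Keep links whose two endpoints are both in the subset, and assign ids."""
    node_ids = {site: i for i, site in enumerate(sorted(subset))}
    edges = sorted({
        (node_ids[src], node_ids[dst])
        for src, dst in links
        if src in node_ids and dst in node_ids and src != dst
    })
    return node_ids, edges


def to_csv(header, rows):
    """Serialize rows as CSV text (write it to nodes.csv or edges.csv)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical subset of websites and the directed links found between pages.
subset = {"a.example", "b.example", "c.example"}
links = [
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("b.example", "a.example"),
    ("a.example", "outside.example"),  # dropped: target is not in the subset
    ("a.example", "b.example"),        # duplicate, collapsed into one edge
]
node_ids, edges = build_graph(links, subset)
nodes_csv = to_csv(["Id", "Label"], sorted((i, s) for s, i in node_ids.items()))
edges_csv = to_csv(["Source", "Target"], edges)
```

Collapsing duplicate links keeps one directed edge per pair of websites; keeping a count as an edge weight would be an equally reasonable choice.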

Page 11: Mapping french open data actors on the web with common crawl

From a crawl to a map

Building graph
● Links tell a lot about websites
  ○ Authorities
  ○ Hubs
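One common way to compute hub and authority scores from a link graph is the HITS algorithm. The slides do not say which method was used, so the following is only an illustrative sketch; the edge list is hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Iteratively compute (hub, authority) scores for a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the sites linking to the node.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {node: v / norm for node, v in auth.items()}
        # Hub: sum of the authority scores of the sites the node links to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {node: v / norm for node, v in hub.items()}
    return hub, auth


# Hypothetical directed links between websites.
edges = [
    ("blog1", "portal"),
    ("blog2", "portal"),
    ("directory", "portal"),
    ("directory", "blog1"),
    ("directory", "blog2"),
]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph the site receiving the most links comes out as the top authority, and the site linking out to the others comes out as the top hub.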

Page 12: Mapping french open data actors on the web with common crawl

From a crawl to a map

Visualizing graph using Gephi
● Load graph
● Spatialize graph
  ○ links between websites create "attraction", making linked sites appear near each other
  ○ the more inlinks, the bigger the node (= authority)
  ○ categorizing websites for better understanding (one color per category)
    ■ Companies, Non-profits/blogs, Government agencies
  ○ communities can now appear!

Page 13: Mapping french open data actors on the web with common crawl

From a crawl to a map

Page 14: Mapping french open data actors on the web with common crawl

From a crawl to a map

Visualizing graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity

Page 15: Mapping french open data actors on the web with common crawl

Analyze

● The final graph is a good way to understand interactions between actors
  ○ Open Data was definitely initiated by a non-profit movement
  ○ Companies are beginning to work on the subject
  ○ The French state has only had sporadic initiatives so far
● This graph is to be generated again in the near future, to see changes in this ecosystem

Page 16: Mapping french open data actors on the web with common crawl

Results

● Large scale crawl made easy
  ○ Easy to focus on mining the results instead of finding/storing the data
● Nice workflow from raw data to an understandable visualisation
● The final graph is a good way to understand interactions between actors

Page 17: Mapping french open data actors on the web with common crawl

Feedback

● Common Crawl
  ○ Common Crawl doesn't have an exhaustive crawl of the French web for now
  ○ Data is not as fresh as it could be
  ○ It is missing an index to access at least domains, and maybe pages, in O(1)
● Methodology
  ○ Opendataness scoring can set aside websites that are relevant but not focused enough on open data

Page 18: Mapping french open data actors on the web with common crawl

Resources

● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
  ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
  ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
  ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
  ○ Project host page

Page 19: Mapping french open data actors on the web with common crawl

Mapping French Open Data actors on the web with Common Crawl @glebourg