Crawling the Web, Fabrizio Celli, Rome, 25th September 2014

SemaGrow demonstrator: “Web Crawler + AgroTagger”

Nov 12, 2014


The webinar presents the SemaGrow demonstrator “Web Crawler + AgroTagger” in order to collect feedback, ideas and comments on the status of the development and on how the demonstrator helps to overcome data problems.

SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission, aiming at developing algorithms, infrastructures and methodologies to cope with large data volumes and real time performance.

In this context, FAO is providing a component that can be used to crawl the Web and give meaning to the discovered resources through the AgroTagger, which assigns AGROVOC URIs to the resources gathered by a Web crawler.

The demonstrator is publicly available at https://github.com/agrisfao/agrotagger.
Transcript
Page 1: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Crawling the Web

Fabrizio Celli

Rome, 25th September 2014

Page 2: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Outline

• Purpose of this Webinar
• The Web Crawler
• The AgroTagger
• The AGRIS use case
– What’s next?

Page 3: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Purpose of this Webinar

• SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission
• Algorithms, infrastructures and methodologies to cope with large data volumes and real time performance
• http://www.semagrow.eu
• One of the SemaGrow demonstrators is the “Web Crawler + AgroTagger” component, the subject of this Webinar

Page 4: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The demonstrator

• It is based on two command line applications (no user interface):
– Web Crawler
– AgroTagger
• Goal:
– discover resources on the Web
– tag resources with AGROVOC URIs
– keep only resources about agriculture and interlink them to AGRIS

Page 5: SemaGrow demonstrator: “Web Crawler + AgroTagger”

What we expect from the Webinar

• Comments, suggestions, opinions
• Other real case scenarios for the demonstrator
• You can send your feedback to [email protected]

Page 6: SemaGrow demonstrator: “Web Crawler + AgroTagger”

THE WEB-CRAWLER

Page 7: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Apache Nutch

• http://nutch.apache.org/
• Highly extensible and scalable open source Web crawler
• Configurable
• Input: a list of pre-selected URLs
• Output: a list of discovered URLs

Page 8: SemaGrow demonstrator: “Web Crawler + AgroTagger”

How it works

• The user defines a list of Web sites (URLs)
• Each URL is a ROOT
• The user defines the “depth”: the number of "hops" a discovered link is away from the ROOT
– Links very "far away" from the ROOT are unlikely to hold much information
• Start to crawl the Web!
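The depth rule described on this slide can be illustrated with a minimal breadth-first sketch. It is not how Apache Nutch is implemented; it only shows the idea of stopping after a fixed number of hops. The function names, the toy link graph and the choice of counting the ROOT as depth 0 are assumptions made for the illustration.

```python
from collections import deque

def crawl(roots, get_outlinks, max_depth):
    """Breadth-first crawl: map every URL within max_depth hops of a root to its depth."""
    depth_of = {root: 0 for root in roots}
    queue = deque(depth_of)
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_outlinks(url):
            if link not in depth_of:  # the first visit is the shortest path from a root
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of

# Toy link graph standing in for real Web pages (hypothetical names).
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

found = crawl(["ROOT"], lambda url: LINKS.get(url, []), max_depth=3)
for url, depth in sorted(found.items(), key=lambda item: item[1]):
    print(depth, url)
```

With max_depth = 3 the link that sits four hops from the ROOT is never visited, which is the effect the "depth" parameter is meant to have.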

Page 9: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Example: depth = 3

[Diagram: a link tree rooted at ROOT (URL)]
• depth = 1: URL_1_1, URL_1_2, …, URL_1_n
• depth = 2: URL_2_2_1, …, URL_2_2_m
• depth = 3: URL_3_2_1_1, …, URL_3_2_1_p

Page 10: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The application

• https://github.com/agrisfao/agrotagger/tree/master/crawler/application
• Command line application
• Provided with bash scripts to run in Linux environments
• Example of usage:
– depth = 5
– output directory = work/output
– directory with source URLs = work/urls

crawler_exec.sh 5 work/output work/urls

Page 11: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The output

URL:: http:/
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
 outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
 outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
 outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
 outlink: toUrl: http://www.aaea.org/ anchor: AAEA
 outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
 outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
 outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
 outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
...
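A dump like the one on this slide can be read back into a simple structure. The sketch below assumes one record per line, with page lines starting with "URL:: " and link lines of the form "outlink: toUrl: <url> anchor: <text>"; that line layout is inferred from the slide and is an assumption, as are the function and variable names.

```python
import re

# Assumed layout of a link line: "outlink: toUrl: <url> anchor: <anchor text>"
OUTLINK = re.compile(r"^\s*outlink: toUrl: (\S+) anchor:\s*(.*)$")

def parse_crawl_dump(text):
    """Return {page_url: [(outlink_url, anchor_text), ...]} from a crawler dump."""
    pages = {}
    current = None
    for line in text.splitlines():
        if line.startswith("URL:: "):
            current = line[len("URL:: "):].strip()
            pages.setdefault(current, [])
        else:
            match = OUTLINK.match(line)
            if match and current is not None:
                pages[current].append((match.group(1), match.group(2).strip()))
    return pages

SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
 outlink: toUrl: http://www.aaea.org/ anchor: AAEA
 outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
"""

pages = parse_crawl_dump(SAMPLE)
for page, links in pages.items():
    print(page, len(links))
```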

Page 12: SemaGrow demonstrator: “Web Crawler + AgroTagger”

THE AGROTAGGER

Page 13: SemaGrow demonstrator: “Web Crawler + AgroTagger”

AGROVOC

• FAO multilingual vocabulary
• Over 32 000 concepts in up to 21 languages
• Part of the LOD cloud
• Extensively used by cataloguers for indexing data in agricultural information systems
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc

Page 14: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The AgroTagger

• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the content of a set of URLs
• Or, more precisely, to extract URIs
• It is based on MAUI

Page 15: SemaGrow demonstrator: “Web Crawler + AgroTagger”

MAUI

• Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits

• Maui automatically identifies main topics in text documents

• It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)

• https://code.google.com/p/maui-indexer

Page 16: SemaGrow demonstrator: “Web Crawler + AgroTagger”

How it works

• Input:
– A text file with a list of URLs
– The output file of an Apache Nutch crawler
• Output:
– A set of triples: <URL> dcterms:subject <AGROVOC_URI>
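As an illustration of the output shape, the sketch below writes such triples in N-Triples syntax. Choosing N-Triples is an assumption about the serialization, the function name is made up for the example, and the concept identifiers c_0001 and c_0002 are placeholders rather than looked-up AGROVOC concepts. The dcterms prefix is expanded to its standard namespace, http://purl.org/dc/terms/.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(url, agrovoc_uris):
    """Serialize <url> dcterms:subject <concept> statements, one N-Triples line each.

    Simplification: IRIs are written as given, without escaping special characters.
    """
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]

# Placeholder page and concept URIs, for illustration only.
lines = to_ntriples(
    "http://example.org/paper.html",
    [
        "http://aims.fao.org/aos/agrovoc/c_0001",
        "http://aims.fao.org/aos/agrovoc/c_0002",
    ],
)
print("\n".join(lines))
```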

Page 17: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The algorithm

• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC
– Create a set of triples
• Multi-threaded
• Currently, MAUI is trained only for English
– It can be trained in other languages that use Latin characters
– Other solutions are needed for Chinese, Arabic, Russian, etc.
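The loop on this slide can be sketched as a small multi-threaded pipeline. The download and tagging steps below are hypothetical stand-ins (canned text and a toy keyword lookup), not the real HTTP fetch or the MAUI indexer, and the concept identifiers are placeholders. Only the overall structure (download, tag, emit triples, in parallel) reflects the slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the Web and for a trained vocabulary.
FAKE_WEB = {
    "http://example.org/a": "Maize and rice yields",
    "http://example.org/b": "Football results",
}
TOY_VOCAB = {"maize": "agrovoc:c_0001", "rice": "agrovoc:c_0002"}

def download(url):
    """Stand-in for fetching a resource: returns canned text instead of using the network."""
    return FAKE_WEB.get(url, "")

def tag(text):
    """Stand-in for the MAUI indexer: a plain keyword lookup in the toy vocabulary."""
    words = set(text.lower().split())
    return sorted(uri for term, uri in TOY_VOCAB.items() if term in words)

def process(url):
    """Download one resource, tag it, and build its triples."""
    return [(url, "dcterms:subject", uri) for uri in tag(download(url))]

def run(urls, workers=4):
    """Process URLs in a thread pool; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_url = list(pool.map(process, urls))
    return [triple for triples in per_url for triple in triples]

triples = run(["http://example.org/a", "http://example.org/b"])
for triple in triples:
    print(triple)
```

Threads suit this kind of job because most of the time is spent waiting for downloads; a resource with no matching concept simply yields no triples.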

Page 18: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The application

• https://github.com/agrisfao/agrotagger
• Command line application
• Entirely based on Java
• Provided with bash scripts
• Example of usage:
– directory with source files = work/source
– output directory = work/output
– type of source files = nutchOutput
– output format = rdfnt

taggerDir.sh work/source work/output nutchOutput rdfnt

Page 19: SemaGrow demonstrator: “Web Crawler + AgroTagger”

The output

[Diagram: Input → AgroTagger → Output]

Page 20: SemaGrow demonstrator: “Web Crawler + AgroTagger”

THE AGRIS USE CASE

Page 21: SemaGrow demonstrator: “Web Crawler + AgroTagger”

AGRIS

• http://agris.fao.org
• A collection of more than 7.8 million bibliographic references in agriculture
• AGRIS records come with AGROVOC descriptors
• An RDF-aware system
– the AGRIS database is publicly exposed as RDF
– AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)

Page 22: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Page 23: SemaGrow demonstrator: “Web Crawler + AgroTagger”

SemaGrow demonstrator

• The core idea is to harvest the Web
– Input: pre-selected sources of information about agriculture
• Crawl and assign AGROVOC URIs
– Store triples in the “crawler” database
• Definition of combinations between the “crawler” database and the AGRIS database
• New widget in AGRIS mashup pages!

Page 24: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Related resources available on the Web

• http://...
• https://...

Page 25: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Current status

• The Web Crawler gathers data from the Web
• The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
• A “crawler” triplestore is ready for computations

Page 26: SemaGrow demonstrator: “Web Crawler + AgroTagger”

What’s next

• Processing phase
• Discover meaningful combinations between the AGRIS core database and the “crawler” database
• A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
• Evaluation of the quality of the widget
• What does “meaningful combinations” mean?

Page 27: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Naïve Algorithm

• Just for testing purposes
• Meaningful combinations = at least N common AGROVOC URIs
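The naïve rule can be written down in a few lines. In this sketch each crawled URL and each AGRIS record is represented by the set of AGROVOC concepts attached to it; the data, identifiers and function name are invented for the illustration. The pairwise comparison is quadratic, so it only demonstrates the rule and would not scale to the full databases as written.

```python
def meaningful_combinations(crawled, agris, n):
    """Pair each crawled URL with each AGRIS record sharing at least n AGROVOC URIs."""
    pairs = []
    for url, url_concepts in crawled.items():
        for record, record_concepts in agris.items():
            common = url_concepts & record_concepts  # set intersection
            if len(common) >= n:
                pairs.append((url, record, len(common)))
    return pairs

# Invented data: short labels stand in for AGROVOC URIs.
crawled = {
    "http://example.org/a": {"c1", "c2", "c3", "c4"},
    "http://example.org/b": {"c1", "c9"},
}
agris = {
    "record-1": {"c1", "c2", "c3"},
    "record-2": {"c3", "c4", "c5"},
}

print(meaningful_combinations(crawled, agris, n=3))
print(meaningful_combinations(crawled, agris, n=2))
```

Raising N makes the filter stricter, so the number of associations can only stay the same or shrink for a fixed pair of databases.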

Page 28: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Example

• http://ageconsearch.umn.edu/
• 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
• ~1 million triples generated by the AgroTagger (“crawler” database)

N = number of common AGROVOC URIs between AGRIS and the output of the Crawler

Number of AGRIS records    N    Number of associations
900 K                      3    17 MLN
900 K                      4    3.2 MLN
1 MLN                      5    0.6 MLN

Page 29: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Your feedback

• Comments, suggestions, other real case scenarios
• Ideas about the meaning of “meaningful combinations”
• If you test the application, any comments that could improve it
• Can the demonstrator help to overcome data problems?
• You can send your feedback to [email protected]

Page 30: SemaGrow demonstrator: “Web Crawler + AgroTagger”

Thank you (谢谢, σας ευχαριστώ, Gracias)