Cláudio Baptista, UFCG http://lsi.dsc.ufcg A Model for Geographic Knowledge Extraction on Web Documents Cláudio E. C. Campelo and Cláudio de Souza Baptista University of Campina Grande Computer Science Department Information Systems Laboratory http://www.lsi.dsc.ufcg.edu.br SECOGIS – ER 2009 Gramado – RS- Brazil, 13th November 2009

Dec 20, 2015


Agenda

Introduction
Main Challenges
Detection of Geographic References
The Geographic Scope
GeoSEn Prototype
  Architecture
  GUI
Experiments
Conclusion and Future Work

Introduction

Web: there is a need for search that takes the geographic context into account;

Traditional search engines: search is based on keywords only;

Example:
  A Web document: "...With the arrival of the industry in Gramado, one thousand new jobs for Java programmers will be created...";
  User query: "Java programmer jobs Brazil";
  The document will not be retrieved by this query, because its text never mentions Brazil!
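The retrieval gap in this example can be illustrated with a small sketch. The one-entry gazetteer `ANCESTORS`, the crude plural stripping in `normalize`, and both matching functions are hypothetical and are not taken from GeoSEn; the sketch only shows why expanding a detected place to its enclosing places lets the query match.

```python
# Hypothetical mini-gazetteer (assumed data): place -> enclosing places.
ANCESTORS = {"gramado": ["rio grande do sul", "brazil"]}


def normalize(word):
    """Lowercase, strip punctuation, and crudely drop a plural 's'."""
    word = word.lower().strip(".,;:!?\"'")
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def terms(text):
    """Return the set of normalized words in a text."""
    return {normalize(w) for w in text.split()} - {""}


def keyword_match(query, document):
    """Keyword-only matching: every query term must occur in the text."""
    return terms(query) <= terms(document)


def geo_aware_match(query, document):
    """Also accept query terms that name an enclosing place of a
    place mentioned in the document (assumed expansion rule)."""
    doc_terms = terms(document)
    expanded = set(doc_terms)
    for term in doc_terms:
        expanded.update(ANCESTORS.get(term, []))
    return terms(query) <= expanded


DOC = ("With the arrival of the industry in Gramado, one thousand "
       "new jobs for Java programmers will be created")
QUERY = "Java programmer jobs Brazil"
```

With these definitions, `keyword_match(QUERY, DOC)` is false while `geo_aware_match(QUERY, DOC)` is true, since "Gramado" is expanded to "brazil".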

Introduction

What is the geographic context of a Web document?
  The place where the information was created?
  The places mentioned in the document content?
  The places where the people most interested in the information are located?
  etc.

Many documents have such a context. A study in Portugal that considered only occurrences of the names of Portuguese cities (308 in total) reported:
  About 4 million pages analyzed;
  An occurrence rate of 2.2 references per document;
  4% of the submitted queries contained a reference to one of those cities.

Main Challenges

Detection of geographic references in the documents;

Modeling of geographic scope of documents;

Relevance ranking according to geographic context;

Need for efficient indexing techniques that cope with both the textual and the spatial dimensions;

Development of usable user interfaces that deal with both dimensions.

Detection of Geographic References

Aim: to identify document features that may be mapped to a geographic place name;
Challenge: elimination of ambiguities, e.g.:
  A place with the name of a thing (e.g. Gramado, Canela);
  A place with the name of a person (e.g. Garibaldi);
  Places with the same name and the same type (e.g. Cachoeirinha-PE and Cachoeirinha-RS);
  Places with the same name and different types (e.g. the city of Rio de Janeiro and the state of Rio de Janeiro);
  Places and gentilics with the same name (e.g. the city of Paulista-PE and "paulista", a person born in São Paulo).

Detection of Geographic References

Another example of ambiguity:
  São Paulo as a state
  São Paulo as a city
  São Paulo as a football team
  São Paulo as the name of a hospital
  São Paulo as the saint!
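These ambiguity cases suggest that looking up a name should return every candidate reading rather than a single place. A minimal sketch, assuming a hypothetical in-memory table `CANDIDATES`; the entries mirror the São Paulo example, and the `is_place` flag (whether a reading is a place at the city-to-region levels) is an assumption, not GeoSEn's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str       # e.g. "state", "city", "football team"
    is_place: bool  # assumed flag: a city-to-region place or not


# Hypothetical candidate table built from the example above.
CANDIDATES = {
    "são paulo": [
        Candidate("São Paulo", "state", True),
        Candidate("São Paulo", "city", True),
        Candidate("São Paulo", "football team", False),
        Candidate("São Paulo", "hospital", False),
        Candidate("São Paulo", "saint", False),
    ],
}


def lookup(name):
    """Return every known reading of a name (empty list if unknown)."""
    return CANDIDATES.get(name.lower(), [])


def place_readings(name):
    """Keep only the readings that denote a place."""
    return [c for c in lookup(name) if c.is_place]
```

A detector built this way still has to choose among the remaining readings (state or city here), which is where per-feature confidence measures come in.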

Detection of Geographic References

Parts of the page examined: page content, page title, URL;
Types of detected places: all levels of the spatial hierarchy (from city to region);
Types of detected references: place names, postal codes, telephone area codes, gentilics.

Definitions

Confidence Rate (CR): represents the probability that a given reference is a valid place name.

Confidence Factor (CF): a measure associated with each feature analyzed during the detection of a geographic reference.

[Diagram: CR linked to CF, with labels 1 and N]

Confidence Factor

CFST: analyzes the occurrence of special terms (STs) associated with geographic references;
  Examples of STs: "in" (e.g. "in Gramado"); "city" (e.g. "city of São Paulo"); "ZIP" (e.g. "ZIP: 58109-000");
  Each special term is stored with:
    Term;
    Type of geographic reference (ZIP code, telephone area code, place name, etc.);
    Type of place (city, state, region);
    Minimum distance (DMIN);
    Maximum distance (DMAX);
    Maximum confidence grade (CMAX).
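The slides list the fields stored for each special term but not how a confidence value is derived from them. The sketch below is one possible reading, under loud assumptions: distance is counted in words between the special term and the candidate reference, the confidence equals CMAX at DMIN and decays linearly up to DMAX, and it is zero outside that range. The `SpecialTerm` record, the `cf_st` function and the numbers in `CITY` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialTerm:
    term: str            # e.g. "city"
    reference_type: str  # e.g. "place name", "zip code"
    place_type: str      # e.g. "city", "state", "region"
    dmin: int            # minimum distance (DMIN)
    dmax: int            # maximum distance (DMAX)
    cmax: float          # maximum confidence grade (CMAX)


def cf_st(special_term, distance):
    """Assumed rule: CMAX at DMIN, linear decay up to DMAX, else 0."""
    if not special_term.dmin <= distance <= special_term.dmax:
        return 0.0
    span = special_term.dmax - special_term.dmin + 1
    remaining = special_term.dmax - distance + 1
    return special_term.cmax * remaining / span


# Hypothetical values: in "city of São Paulo" the place name starts
# two words after the special term "city".
CITY = SpecialTerm("city", "place name", "city", dmin=2, dmax=4, cmax=0.75)
```

Under these assumptions `cf_st(CITY, 2)` yields the full CMAX, larger distances yield progressively smaller values, and distances outside DMIN..DMAX yield zero.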

Confidence Factor

CFTS: considers the probability that a term is a geographic reference, using a traditional search engine;

Confidence Factor

CFCROSS: analyzes the occurrence of cross-references based on topological relationships (inside, contains, etc.);

CFFMT: evaluates the syntax used to describe geographic references:
  Abbreviation of place names (R. de Janeiro, RJ);
  Use of uppercase in place names;
  Telephone format: (083)-999-3456;
  Postal code format: 58.104-867.
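The format checks can be sketched with regular expressions. The two patterns below are assumptions derived only from the examples shown in these slides (postal codes such as 58109-000 and 58.104-867, a telephone number such as (083)-999-3456); they are not GeoSEn's actual rules and do not cover every Brazilian format.

```python
import re

# Assumed pattern: five digits (optional dot after the second),
# a hyphen, then three digits.
POSTAL_CODE = re.compile(r"\b\d{2}\.?\d{3}-\d{3}\b")

# Assumed pattern: area code in parentheses, then the local number.
# The capturing group keeps only the area code.
TELEPHONE = re.compile(r"\(\s*(\d{2,3})\)-?\s*\d{3,4}-\d{4}")


def find_postal_codes(text):
    """Return every substring that looks like a postal code."""
    return POSTAL_CODE.findall(text)


def find_area_codes(text):
    """Return the area codes of telephone numbers found in the text."""
    return TELEPHONE.findall(text)
```

A detected postal code or area code could then be mapped to a place through a lookup table, which this sketch leaves out.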

Modeling of the Geographic Scope

A document may be associated with one or more places;
A geographic scope may include places that are not mentioned directly in the document (geographic expansion);
Each place in the scope has an associated relevance value.
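One way to picture geographic expansion is to propagate each mentioned place's relevance to its enclosing places. The sketch below is illustrative only: the `PARENT` table is a simplified hierarchy that skips the intermediate levels, and both the halving factor `DECAY` and the rule of keeping the maximum contribution are assumptions, not the relevance model of the authors.

```python
# Simplified, assumed containment table: place -> enclosing place.
PARENT = {
    "Gramado": "Rio Grande do Sul",
    "Canela": "Rio Grande do Sul",
    "Rio Grande do Sul": "South",
}

DECAY = 0.5  # assumed: relevance halves at each level upward


def build_scope(mentions):
    """Expand direct mentions (place -> relevance) into a scope that
    also contains enclosing places, keeping the largest contribution."""
    scope = {}
    for place, relevance in mentions.items():
        current, value = place, relevance
        while current is not None:
            scope[current] = max(scope.get(current, 0.0), value)
            current = PARENT.get(current)
            value *= DECAY
    return scope


scope = build_scope({"Gramado": 0.8, "Canela": 0.4})
```

Here the scope contains "Rio Grande do Sul" and "South" even though neither is mentioned in the document, each with a lower relevance than the places mentioned directly.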

Geographic Dispersion Rate

[Figure: two panels, (a) and (b)]

Another factor used in the composition of the geographic relevance value;

Hypothesis: dispersed references may characterize regions that share common features (e.g. cultural, economic, social).

GeoSEn – an overview

Geographic search engine:
  Indexes a subset of the Brazilian Web;
  Deals with 6,291 places in Brazil, organized in a five-level hierarchy, from city to region:
    Region: e.g. South
    State: e.g. Rio Grande do Sul
    Mesoregion: e.g. Metropolitana de Porto Alegre
    Microregion: e.g. Gramado-Canela
    Municipality: e.g. Gramado
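The five-level hierarchy can be represented as a chain of records that each point to their enclosing place. This is a minimal sketch using the example places from the slide; the `Place` class and the `ancestors` helper are hypothetical and say nothing about how GeoSEn actually stores its 6,291 places.

```python
from dataclasses import dataclass
from typing import Optional

# Levels from the most specific to the most general.
LEVELS = ["Municipality", "Microregion", "Mesoregion", "State", "Region"]


@dataclass(frozen=True)
class Place:
    name: str
    level: str
    parent: Optional["Place"] = None


south = Place("South", "Region")
rs = Place("Rio Grande do Sul", "State", south)
meso = Place("Metropolitana de Porto Alegre", "Mesoregion", rs)
micro = Place("Gramado-Canela", "Microregion", meso)
gramado = Place("Gramado", "Municipality", micro)


def ancestors(place):
    """Return the enclosing places, nearest first."""
    chain = []
    current = place.parent
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain
```

Walking up from `gramado` visits the microregion, mesoregion, state and region in that order.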

GeoSEn - Architecture

Query Example

Example of a query using a user-defined area of interest:

SELECT plc1.id
FROM places plc1
WHERE within(plc1.geometry, specified_geometry)
  AND NOT EXISTS (
    SELECT plc2.id
    FROM places plc2
    WHERE plc2.id <> plc1.id  -- a geometry is within itself, so skip plc1
      AND within(plc2.geometry, specified_geometry)
      AND within(plc1.geometry, plc2.geometry)
  )
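The query appears to select the places that lie inside the area of interest and are not themselves contained in another place inside that area, i.e. the top-most places. The same filter can be sketched in Python with axis-aligned boxes standing in for real geometries; the `Box` type, the `within` test and the sample data are assumptions made for illustration, and a place is never compared with itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def within(inner, outer):
    """True if `inner` lies entirely inside `outer` (boxes only)."""
    return (outer.xmin <= inner.xmin and inner.xmax <= outer.xmax
            and outer.ymin <= inner.ymin and inner.ymax <= outer.ymax)


def top_level_places(places, area):
    """Return the ids of places inside `area` that are not contained
    in any other place inside `area`."""
    inside = {pid: box for pid, box in places.items() if within(box, area)}
    return sorted(
        pid for pid, box in inside.items()
        if not any(other != pid and within(box, other_box)
                   for other, other_box in inside.items())
    )


# Hypothetical data: a state, a city inside it, a city outside the
# state, and a state that sticks out of the area of interest.
AREA = Box(0, 0, 10, 10)
PLACES = {
    "state": Box(1, 1, 9, 9),
    "city": Box(2, 2, 3, 3),
    "lone_city": Box(0, 0, 0.5, 0.5),
    "other_state": Box(5, 5, 15, 15),
}
```

For this data the result keeps "state" and "lone_city" and drops "city", which is covered by "state", and "other_state", which is not fully inside the area.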

Experiments

Experiments used 66,531 indexed documents;
5 classes: .edu, .gov, blogs, tourism, arts;
Detection of terms:
  Documents from the Web, manually analyzed;
  Documents with strong ambiguities, created for the test bed.

Conclusion

We have presented a heuristic-based approach to implementing a geographic information retrieval (GIR) system.

The techniques presented may be combined with others already known.

Precomputed relevance values may be used to simplify the search process.

Future Work

Retrieval of georeferenced images and videos;
Recognition of other kinds of places;
Integration of other data sources;
Evaluation using large data set collections.

Thank you very much!

Questions?