Geographical Information Retrieval. Instituto Superior Técnico - INESC-ID, Data Management and Information Retrieval Group (DMIR) - TagusPark. By Bruno Martins ([email protected])

Apr 01, 2015


Jamel Burgher
Geographical Information Retrieval

Instituto Superior Técnico - INESC-ID
Data Management and Information Retrieval Group (DMIR) - TagusPark

By Bruno Martins ([email protected])

Motivation for Geographic IR

• Geo-information associates things and events with places.

• Geo-information is abundant on the Web and in digital libraries.
– Collections of geo-referenced photographs.
– Newsfeeds.
– General databases of geo-referenced information.
– Around 80% of Web pages contain references to places.

• Many information needs are related to a given geographical context.
– Find me the nearest restaurants.
– Find me news about Lisboa.
– Find me photographs taken in Sintra.
– ...
– Around 20% of Web searches are “local” in nature.

• Geographic information is part of our everyday lives!

Existing Geographical IR Systems

• Web search engines with “local search”
– Yahoo! Local, Google Local, ...
– Integration with navigation mechanisms.
– Mostly explore “Yellow-pages” information.

• Web-based GIS platforms (virtual globes)
– Google Earth, ...
– Explore databases of georeferenced info.
– OGC standards for Web-GIS.

• Photo repositories with “local search”
– Flickr geo-tagging interface, ...
– Explore automatic “GPS” geo-referencing.

• Many more location-based services
– Advertisement, discussion communities, ...
– Location is everywhere in information systems.

Challenges for Geographical IR

• Very few systems explore information on the Web directly.

– They instead use databases of georeferenced information.

• Geographic context embedded in natural language descriptions.

– This presents problems to automated processing.

– Place names are ambiguous and get confused with names of organizations, people, buildings and streets.

• Web queries depend on exact match of text terms.

– Handling structured queries (e.g. “concept, relation, location”).

– Intelligent interpretation of spatial relationships (“near”, “west” etc).

– Ranking results against some measure of geographic relevance.

Geographical Information Retrieval (GIR)

Geographic information retrieval (GIR) is concerned with the retrieval of geographically referenced information objects.

Information objects can be maps, images, digital geographic data or even textual (web) documents.

A new multidisciplinary field that combines techniques from database systems, information retrieval, digital libraries, user interfaces, geographical information systems, ...

[Figure: diagram relating Geographic Information Systems, Information Retrieval, Knowledge Management, and Geographic IR.]

The difference between GIR and GIS

• GIS is concerned with exact spatial representations and complex analysis at the level of the individual spatial object or field.
– Users are experts, information is structured and unambiguous!

• GIR is concerned with retrieving geo-referenced information resources that may be relevant to a geographic query region.
– Unstructured and ambiguous information, everyday applications!

• Similar to the difference between search engines and relational database systems!

Geo-referencing and GIR

• Information objects can be geo-referenced either by place names or by geographic coordinates (i.e. longitude & latitude).
– Geographic coordinates represent an exact physical location.
– Place names are ambiguous (the main problem of GIR).

• Spatial relations may be either:
– Geometric: distance and direction measured on a continuous scale.
– Topological: spatially related but not directly measurable.
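The two kinds of spatial relation can be sketched in a few lines. This is only an illustration, assuming footprints are either latitude/longitude points or bounding boxes of the form `(min_lon, min_lat, max_lon, max_lat)`; the function names and the approximate coordinates are assumptions made for the example and do not come from the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Geometric relation: great-circle distance in km between two points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bbox_contains(outer, inner):
    """Topological relation: True if box `outer` fully contains box `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# Approximate coordinates, assumed for illustration only.
lisboa = (38.72, -9.14)
porto = (41.15, -8.61)
print(round(haversine_km(*lisboa, *porto)), "km")  # a value on a continuous scale

# Rough boxes (min_lon, min_lat, max_lon, max_lat), also assumed.
country_box = (-10.0, 36.0, -6.0, 42.5)
city_box = (-9.3, 38.6, -9.0, 38.8)
print(bbox_contains(country_box, city_box))  # a yes/no relation
```

The distance function returns a measurement, while the containment test only says whether the relation holds, which mirrors the geometric/topological split above.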


The typical steps involved in GIR

Anatomy of a Geographical IR System

[Figure: architecture of a geographical IR system. Labelled components: user interface, broker, query disambiguation, ontology (a.k.a. gazetteer), geo-tagging, text indexing, textual and spatial indexes, search engine, relevance ranking, mapping. Labelled data flows: information resources (textual and spatial), document footprints, query footprint, search request + query footprint, unranked results, ranked results.]

Gazetteers / Geographic Ontology

• Database containing placenames, the spatial relationships among them and the associated geographical footprints.

• Supports geo-referencing based on the place names found in text.

• Many problems in using traditional gazetteers for GIR.
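A toy in-memory gazetteer makes the three ingredients above concrete: names, relationships, and footprints. The class and field names are hypothetical, the only relationship modelled is "part-of", and the bounding boxes are rough illustrative values rather than authoritative data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


@dataclass
class Place:
    place_id: str
    name: str
    place_type: str
    alt_names: List[str] = field(default_factory=list)
    parent_id: Optional[str] = None   # "part-of" relationship
    footprint: Optional[BBox] = None  # coarse bounding box


class Gazetteer:
    def __init__(self) -> None:
        self.places: Dict[str, Place] = {}
        self.by_name: Dict[str, List[str]] = {}

    def add(self, place: Place) -> None:
        """Register a place under its main name and all variant names."""
        self.places[place.place_id] = place
        for n in [place.name, *place.alt_names]:
            self.by_name.setdefault(n.lower(), []).append(place.place_id)

    def lookup(self, name: str) -> List[Place]:
        """Return every place known under this name (possibly several)."""
        return [self.places[i] for i in self.by_name.get(name.lower(), [])]

    def ancestors(self, place_id: str) -> List[Place]:
        """Follow the part-of chain upwards (assumes the chain has no cycles)."""
        chain = []
        current = self.places[place_id].parent_id
        while current is not None and current in self.places:
            chain.append(self.places[current])
            current = self.places[current].parent_id
        return chain


gaz = Gazetteer()
# Rough, illustrative footprints.
gaz.add(Place("pt", "Portugal", "country", footprint=(-9.6, 36.9, -6.1, 42.2)))
gaz.add(Place("lis", "Lisboa", "city", alt_names=["Lisbon"], parent_id="pt",
              footprint=(-9.3, 38.6, -9.0, 38.8)))

print([p.name for p in gaz.lookup("Lisbon")])
print([p.name for p in gaz.ancestors("lis")])
```

Because `lookup` returns a list, the same structure can hold several places sharing one name, which is where the ambiguity problems discussed later come from.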

Roles of the Gazetteer in GIR

[Figure: the gazetteer shown alongside the components it supports: user interface, query disambiguation, query expansion (query footprint), geo-tagging, metadata extraction (from the document collection to document footprints), spatial index, search component, and relevance ranking.]

Challenges to using Gazetteers in GIR

• To be useful in GIR the gazetteer should support
– Different locations and boundary changes, integrating data from multiple sources.
– Synonymous and variant names with differing locations for the same entity.
– Different relationships among concepts.
– Names in multiple languages.
– “Fuzzy” regions and intra-urban place names.

• More than gazetteers, we need an ontology!

Existing Gazetteer Systems/Services

• Alexandria Digital Library (ADL) Gazetteer.
– ~6 million entries.
– Has tried to standardize the format, description, and distribution of gazetteer data.
– Has a published, detailed schema.
– Basis for OGC standard.

• Geonames website.
– Integrates information from multiple sources.
– Publishes OWL ontology.
– ~6 million entries.

• EuroGeoNames project.

GeoTagging = GeoParsing+GeoCoding

• Geo-parsing: recognizing geographic references, ignoring non-geographic uses of place terminology.

• Geo-coding: attaching a unique quantitative location (footprint) to the extracted geographic references.

GeoParsing Textual Documents

• The presence of placenames can be recognised with the help of gazetteers/geo-ontologies (i.e. lists of names).

• Some types of place references given over text:

– the name of the place: Coimbra

– an address: INESC-ID, Rua Alves Redol, 9, Lisboa

– an address fragment: “Manuel lived near Largo do Rato in Lisboa”

– a postcode / zip code: 2840-137

– a phone number: most Lisbon phone numbers start with +351 21
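A deliberately naive geo-parser can be sketched from the list above: it matches known names from a small list and recognises postcodes with a regular expression. The name list, the function name, and the postcode pattern (four digits, a hyphen, three digits, matching the slide's example) are assumptions for illustration; addresses and phone numbers are not handled.

```python
import re

# Tiny stand-in for a gazetteer name list (illustrative only).
PLACE_NAMES = ["Coimbra", "Lisboa", "Largo do Rato"]

# Pattern for postcodes shaped like the example "2840-137".
POSTCODE_RE = re.compile(r"\b\d{4}-\d{3}\b")


def geoparse(text, names=PLACE_NAMES):
    """Return (start_offset, matched_text, kind) tuples in document order."""
    # Try longer names first so multi-word names win over shorter ones.
    ordered = sorted(names, key=len, reverse=True)
    name_re = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    found = [(m.start(), m.group(0), "placename") for m in name_re.finditer(text)]
    found += [(m.start(), m.group(0), "postcode") for m in POSTCODE_RE.finditer(text)]
    return sorted(found)


for hit in geoparse("Manuel lived near Largo do Rato in Lisboa, 2840-137"):
    print(hit)
```

Pure list lookup like this cannot tell a place from a person or a business that shares its name, so it would need an additional disambiguation step.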

Ambiguity in GeoParsing Documents

Examples of false place references:

• Personal names: Smedes York, Jack London

• Business names: Dorchester Hotel, York Properties, ...

• Street names: Oxford Street, London Road, ...

• Common words: bath, battle, derby, over, well, ...

• Approach for handling ambiguity:

– Look for patterns in surrounding context!!!

– One reference per discourse.
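The two heuristics above can be illustrated with a small rule-based sketch. The cue word lists are invented for the example, and "one reference per discourse" is read here as: once a name has been interpreted one way in a document, undecided mentions of the same name take that interpretation. Both are assumptions, not the method of any particular system.

```python
from collections import Counter

# Hypothetical cue lists, chosen to cover the examples on the slide.
NON_PLACE_NEXT = {"Street", "Road", "Hotel", "Properties"}
GIVEN_NAMES = {"Jack", "Smedes"}
PLACE_PREV = {"in", "near", "at", "from", "to"}


def classify(tokens, i):
    """Label the candidate at tokens[i] using its immediate neighbours."""
    prev_tok = tokens[i - 1] if i > 0 else ""
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
    if next_tok in NON_PLACE_NEXT or prev_tok in GIVEN_NAMES:
        return "not-place"
    if prev_tok.lower() in PLACE_PREV:
        return "place"
    return "unknown"


def propagate(labels):
    """Give undecided mentions the most common decided label in the document."""
    decided = Counter(label for label in labels if label != "unknown")
    if not decided:
        return list(labels)
    winner = decided.most_common(1)[0][0]
    return [winner if label == "unknown" else label for label in labels]


tokens = "We flew to London . London was rainy".split()
labels = [classify(tokens, i) for i, t in enumerate(tokens) if t == "London"]
print(labels)
print(propagate(labels))
```

Real systems tend to learn such context patterns from annotated text rather than hand-code them, but the division of labour is the same: local context first, then document-level consistency.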

GeoCoding place references in text

Many different places share the same name (referent ambiguity): Newport, Cambridge, Springfield, Lisboa, ...

• Use context to decide: references to parent or nearby places.
• Choose the most important one: by population or place type.

• Optional step taken by some GIR approaches: finding a document’s encompassing geographic scope.
– Combine all place references given in the document.
– Use heuristics to guide the process.
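As a rough sketch of these rules: prefer a candidate whose parent place is also mentioned in the document, otherwise fall back to the most important candidate. The candidate records, the `importance` field (a stand-in for population or place type, with made-up relative values), and the scope heuristic (most frequent parent) are all assumptions for illustration.

```python
from collections import Counter

# Illustrative candidates; `importance` is a made-up relative rank.
CANDIDATES = [
    {"id": "cambridge-uk", "name": "Cambridge", "parent": "England", "importance": 2},
    {"id": "cambridge-us", "name": "Cambridge", "parent": "Massachusetts", "importance": 1},
    {"id": "oxford-uk", "name": "Oxford", "parent": "England", "importance": 2},
]


def geocode(name, context_names, candidates=CANDIDATES):
    """Pick one referent for `name`, using other names seen in the document."""
    options = [c for c in candidates if c["name"] == name]
    if not options:
        return None
    # Rule 1: candidates whose parent is mentioned in the context win.
    supported = [c for c in options if c["parent"] in context_names]
    # Rule 2: otherwise (or to break ties) take the most important one.
    pool = supported or options
    return max(pool, key=lambda c: c["importance"])


def document_scope(resolved):
    """One possible heuristic: the parent shared by most resolved references."""
    parents = Counter(c["parent"] for c in resolved)
    return parents.most_common(1)[0][0] if parents else None


print(geocode("Cambridge", {"Massachusetts", "Boston"})["id"])
print(geocode("Cambridge", set())["id"])
resolved = [geocode("Cambridge", set()), geocode("Oxford", set())]
print(document_scope(resolved))
```

The scope function is the simplest thing that combines all references; the talk only says that heuristics guide this step, so other choices (common ancestor, footprint union) would be equally consistent with it.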

Document Indexing for Geographic IR

• Different indexing strategies are possible:

– Index documents based on gazetteer ids.
– Use document scopes to create document footprints (point, bounding rectangle, ...) and use the footprints to index documents.

• Strategy for handling queries:
– Convert the query to a query footprint/gazetteer id.
– Match the query footprint to document footprints/ids.
– Rank documents according to “relevance”.
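The three query-handling steps can be sketched with bounding-rectangle footprints. Everything concrete here is an assumption: the toy gazetteer, the abstract coordinate units, and the choice of "fraction of the query footprint covered" as the relevance measure. Point footprints have zero area and would need a different measure (for example distance).

```python
# Boxes are (min_x, min_y, max_x, max_y) in arbitrary toy units.
TOY_GAZETTEER = {"lisboa": (0.0, 0.0, 10.0, 10.0)}  # hypothetical entry


def area(box):
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def overlap_area(a, b):
    """Area shared by two rectangles (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def geo_search(place_name, doc_footprints, gazetteer=TOY_GAZETTEER):
    # Step 1: convert the query to a query footprint.
    query_box = gazetteer[place_name.lower()]
    q_area = area(query_box)
    if q_area <= 0:
        raise ValueError("query footprint must have positive area")
    # Step 2: match the query footprint against document footprints.
    hits = []
    for doc_id, box in doc_footprints.items():
        shared = overlap_area(query_box, box)
        if shared > 0:
            hits.append((doc_id, shared / q_area))
    # Step 3: rank by the (assumed) geographic relevance score.
    return sorted(hits, key=lambda h: h[1], reverse=True)


docs = {
    "A": (2.0, 2.0, 4.0, 4.0),      # small box inside the query region
    "B": (5.0, 5.0, 15.0, 15.0),    # large box partly overlapping it
    "C": (20.0, 20.0, 30.0, 30.0),  # far away, no overlap
}
print(geo_search("Lisboa", docs))
```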

Handling queries in GIR systems

Data structures for indexing in GIR

• Typical strategy is to have separate indexes.
– Inverted index for text.
– R-tree for footprints.

• Access spatial index with query footprint/gazetteer id.

• Access text index with query terms.

• Merge results and find the intersection.

[Figure: an R-tree (root R with nodes R1 to R4) indexing document footprints D1 to D16, shown next to an inverted index mapping terms to documents: Term1: D1, D2, D23, …; Term2: D9, D11, D100, …; Term3: D27, D85, ...]
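The separate-index strategy can be sketched as follows. To keep the example self-contained, the spatial index is a plain linear scan over bounding boxes standing in for an R-tree; a real system would use an actual R-tree implementation. Document texts, footprints, and function names are invented for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def text_lookup(index, terms):
    """Documents containing all query terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_lookup(footprints, query_box):
    """Linear scan used here in place of an R-tree (an assumption of this sketch)."""
    return {d for d, box in footprints.items() if boxes_intersect(box, query_box)}


def combined_search(index, footprints, terms, query_box):
    """Access both indexes, then keep only documents found in both."""
    return text_lookup(index, terms) & spatial_lookup(footprints, query_box)


docs = {
    "D1": "restaurants in lisboa",
    "D2": "restaurants in porto",
    "D3": "museums in lisboa",
}
# Toy footprints (min_x, min_y, max_x, max_y).
footprints = {"D1": (0, 0, 1, 1), "D2": (5, 5, 6, 6), "D3": (0, 0, 1, 1)}

index = build_inverted_index(docs)
print(combined_search(index, footprints, ["restaurants"], (0, 0, 2, 2)))
```

Keeping the two indexes separate means each can use the data structure best suited to it, at the cost of an explicit merge step at query time.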

Ranking search results in GIR

• Spatial similarity can indicate relevance.
– Documents whose spatial content is more similar to the spatial content of the query should appear first.

• But we need to consider both thematic and geographic relevance:
– Thematic relevance: BM25, TF-IDF, ...
– Geographic relevance: proximity, containment, ...
   • Geometric (e.g. distance) and non-geometric (e.g. topology).
– Other importance metrics: PageRank.

• The state of the art consists of a linear combination of these scores.
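A linear combination of the two relevance scores can be written in a few lines. The weight `alpha`, the assumption that both scores are already normalised to the range 0 to 1, and the sample scores are all illustrative; the talk does not specify how the weights are chosen.

```python
def combined_score(thematic, geographic, alpha=0.5):
    """Weighted sum: alpha * thematic + (1 - alpha) * geographic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * thematic + (1.0 - alpha) * geographic


def rank(scores, alpha=0.5):
    """`scores` maps doc id -> (thematic, geographic); returns ids, best first."""
    return sorted(scores,
                  key=lambda d: combined_score(*scores[d], alpha=alpha),
                  reverse=True)


# Made-up normalised scores: (thematic, geographic).
scores = {"A": (0.9, 0.1), "B": (0.5, 0.8), "C": (0.2, 0.2)}

print(rank(scores, alpha=0.5))  # balanced weighting
print(rank(scores, alpha=1.0))  # thematic relevance only
```

Varying `alpha` shows how the ordering shifts between a purely textual ranking and one dominated by geographic fit. A further term (for example a PageRank-style importance score) could be added to the sum in the same way.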

Existing GIR systems : MetaCarta

The MetaCarta system

– Pioneer system addressing all aspects given in this talk.

– Conducts geo-parsing and geo-coding of text documents, and sends back possible location references with relative strength scores.

– Uses Natural Language Processing (NLP) to find possible location references.

– Contains a gazetteer of ~14 million entries.

Other GIR Systems: Research projects

• Prototype system from the SPIRIT EU project
– Spatially-aware information retrieval on the Internet.
– Geo-tagging of Web documents based on a geo-ontology.

• Alexandria Digital Library
– Digital library of geo-referenced materials.
– Focus on development of a large gazetteer.

• GREASE, GIPSY, Web-a-Where, GeoXWalk, ...
– Many more research projects addressing GIR aspects individually.
– GeoCLEF evaluation contest, similar to TREC.

• Project DIGMAP, under development at IST
– Digital library for old maps and historical cartography resources.
– Indexing metadata records for geographic retrieval.

Current Challenges in Geographic IR

Improve “conventional GIR” components and methods

Geo-tagging, spatio-textual indexing and geo-relevance ranking.

Improved understanding of spatial natural language terminology.

Principled approaches for integration and evaluation of GIR.

Better user interfaces for exploration of GIR results.

Integration of geographical with temporal aspects.

Everything we do happens in space and time!

Creation of rich place ontologies with world-wide coverage.

Fuzzy regions and intra-urban placenames present challenges

Open GeoInformation Web services and Geospatial Semantic Web.

Where To Find More Information

• Georeferencing: The Geographic Associations of Information
– By Linda L. Hill (Author), MIT Press

• Proceedings of the Workshops on Geographical IR
– Edited by Chris Jones and Ross Purves (4th edition in 2007, Lisbon)

• Talk to me using the email address [email protected]