Geographical Information Retrieval
Instituto Superior Técnico - INESC-ID
Data Management and Information Retrieval Group (DMIR) - TagusPark
By Bruno Martins ([email protected])
Motivation for Geographic IR
• Geo-information associates things and events with places.
• Geo-information is abundant on the Web and in digital libraries.
– Collections of geo-referenced photographs.
– Newsfeeds.
– General databases of geo-referenced information.
– Around 80% of Web pages contain references to places.
• Many information needs are related to a given geographical context.
– Find me the nearest restaurants.
– Find me news about Lisboa.
– Find me photographs taken in Sintra.
– ...
– Around 20% of Web searches are “local” in nature.
• Geographic information is part of our everyday lives!
Existing Geographical IR Systems
• Web search engines with “local search”: Yahoo! Local, Google Local, ...
– Integration with navigation mechanisms.
– Mostly explore “Yellow-pages” information.
• Web-based GIS platforms (virtual globes): Google Earth, ...
– Explore databases of georeferenced info.
– OGC standards for Web-GIS.
• Photo repositories with “local search”: Flickr geo-tagging interface, ...
– Explore automatic “GPS” geo-referencing.
• Many more location-based services: advertisement, discussion communities, ...
– Location is everywhere in information systems.
Challenges for Geographical IR
• Very few systems explore information on the Web directly.
– They instead use databases of georeferenced information.
• Geographic context embedded in natural language descriptions.
– This presents problems to automated processing.
– Place names are ambiguous and get confused with names of organizations, people, buildings and streets.
• Web queries depend on exact match of text terms.
– Handling structured queries (e.g. “concept, relation, location”).
– Intelligent interpretation of spatial relationships (“near”, “west” etc).
– Ranking results against some measure of geographic relevance.
Geographical Information Retrieval (GIR)
Geographic information retrieval (GIR) is concerned with the retrieval of geographically referenced information objects.
Information objects can be maps, images, digital geographic data or even textual (web) documents.
GIR is a new multidisciplinary field that combines techniques from database systems, information retrieval, digital libraries, user interfaces, geographical information systems, ...
[Figure: diagram relating Geographic Information Systems, Information Retrieval, Knowledge Management, and Geographic IR]
The difference between GIR and GIS
• GIS is concerned with exact spatial representations and complex analysis at the level of the individual spatial object or field.
– Users are experts, information is structured and unambiguous!
• GIR is concerned with retrieving geo-referenced information resources that may be relevant to a geographic query region.
– Unstructured and ambiguous information, everyday applications!
• Similar to the difference between search engines and relational database systems!
Geo-referencing and GIR
• Information objects can be geo-referenced by either place names or by geographic coordinates (i.e. longitude & latitude).
– Geographic coordinates represent exact physical location.
– Placenames are ambiguous (main problem of GIR).
• Spatial relations may be either:
– Geometric: distance and direction measured on a continuous scale.
– Topological: spatially related but not directly measurable.
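The two kinds of spatial relation can be illustrated with a minimal sketch. It assumes a spherical Earth and footprints given as (latitude, longitude) points or (south, west, north, east) boxes; the function names and the sample coordinates for Lisboa and Sintra are illustrative and approximate, not taken from any specific GIR system.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Geometric relation: great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Geometric relation: compass bearing (0 = north, 90 = east) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def contains(box, lat, lon):
    """Topological relation: is the point inside the (south, west, north, east) box?"""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


# Approximate, illustrative coordinates.
LISBOA = (38.72, -9.14)
SINTRA = (38.80, -9.39)

if __name__ == "__main__":
    print(round(haversine_km(*LISBOA, *SINTRA), 1), "km")
    print(round(initial_bearing_deg(*LISBOA, *SINTRA)), "degrees")
```

Distance and bearing give values on a continuous scale, while `contains` only answers yes or no, which is the distinction the slide draws.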
The typical steps involved in GIR
Anatomy of a Geographical IR System
[Figure: architecture diagram with these components: user interface (mapping, query disambiguation), broker, search engine, textual and spatial indexes, relevance ranking, ontology a.k.a. gazetteer, geo-tagging, text indexing, information resources, and document footprints. Labelled flows: search request + query footprint, unranked results, ranked results.]
Gazetteers / Geographic Ontology
• Database containing placenames, the spatial relationships among them and the associated geographical footprints.
• Supports geo-referencing based on the place names found in text.
• Many problems in using traditional gazetteers for GIR.
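As a rough illustration of what such a database holds, the sketch below models a tiny in-memory gazetteer with names, variant names, a parent relationship, and a point footprint. The `Place` and `Gazetteer` classes, the identifiers, and the coordinates are all hypothetical; real gazetteers use richer schemas and footprints.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Place:
    place_id: str
    name: str
    place_type: str
    footprint: Tuple[float, float]          # (lat, lon) point footprint
    parent_id: Optional[str] = None         # "part of" relationship
    alt_names: List[str] = field(default_factory=list)


class Gazetteer:
    def __init__(self) -> None:
        self._by_id: Dict[str, Place] = {}
        self._by_name: Dict[str, List[str]] = {}

    def add(self, place: Place) -> None:
        self._by_id[place.place_id] = place
        # Index the primary name and every variant, case-insensitively.
        for name in [place.name, *place.alt_names]:
            self._by_name.setdefault(name.lower(), []).append(place.place_id)

    def lookup(self, name: str) -> List[Place]:
        """Return every place known under this name (possibly several)."""
        return [self._by_id[i] for i in self._by_name.get(name.lower(), [])]

    def ancestors(self, place_id: str) -> List[str]:
        """Follow parent links upwards, nearest ancestor first."""
        chain = []
        parent = self._by_id[place_id].parent_id
        while parent is not None:
            chain.append(parent)
            parent = self._by_id[parent].parent_id
        return chain


# Illustrative entries with approximate coordinates.
gaz = Gazetteer()
gaz.add(Place("pt", "Portugal", "country", (39.5, -8.0)))
gaz.add(Place("pt-lisboa", "Lisboa", "city", (38.72, -9.14),
              parent_id="pt", alt_names=["Lisbon"]))
gaz.add(Place("pt-sintra", "Sintra", "town", (38.80, -9.39), parent_id="pt"))
```

Because `lookup` returns a list, the same structure can hold several places that share one name, which is where the ambiguity problems start.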
Roles of the Gazetteer in GIR
[Figure: diagram linking the gazetteer to: user interface, query disambiguation, geo-tagging / metadata extraction (document collection to document footprints), spatial index (document footprints), search component, query expansion (query footprint), and relevance ranking.]
Challenges to using Gazetteers in GIR
• To be useful in GIR the gazetteer should support:
– Different locations and boundary changes, integrating data from multiple sources.
– Synonymous and variant names with differing locations for the same entity.
– Different relationships among concepts.
– Names in multiple languages.
– “Fuzzy” regions and intra-urban place names.
• More than gazetteers, we need an ontology!
Existing Gazetteer Systems/Services
• Alexandria Digital Library (ADL) Gazetteer.
– ~6 million entries.
– Has tried to standardize the format, description, and distribution of gazetteer data.
– Has a published, detailed schema.
– Basis for OGC standard.
• Geonames website.
– Integrates information from multiple sources.
– Publishes OWL ontology.
– ~6 million entries.
• EuroGeoNames project.
GeoTagging = GeoParsing+GeoCoding
• Geo-parsing: recognizing geographic references, ignoring non-geographic uses of place terminology.
• Geo-coding: attaching a unique quantitative location (footprint) to the extracted geographic references.
GeoParsing Textual Documents
• The presence of placenames can be recognised with the help of gazetteers/geo-ontologies (i.e. lists of names).
• Some types of place references found in text:
– the name of the place: Coimbra
– an address: INESC-ID, Rua Alves Redol, 9, Lisboa
– an address fragment: “Manuel lived near Largo do Rato in Lisboa”
– a postcode / zip code: 2840-137
– a phone number: most Lisbon phone numbers start with +351 21
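A minimal sketch of how some of these reference types might be spotted, assuming a toy list of single-word place names and simple regular expressions for Portuguese-style postcodes and Lisbon phone prefixes. The patterns and the `geoparse` function are illustrative only; multi-word names and full addresses would need more than this.

```python
import re

# Toy name list standing in for a gazetteer (single-word names only).
PLACE_NAMES = {"coimbra", "lisboa", "sintra"}

POSTCODE_RE = re.compile(r"\b\d{4}-\d{3}\b")           # e.g. 2840-137
PHONE_RE = re.compile(r"\+351\s?21(?:\s?\d){7}")       # e.g. +351 21 123 4567
WORD_RE = re.compile(r"[^\W\d_]+")                     # runs of letters


def geoparse(text):
    """Return (kind, matched_text, start_offset) tuples in document order."""
    refs = []
    for m in POSTCODE_RE.finditer(text):
        refs.append(("postcode", m.group(), m.start()))
    for m in PHONE_RE.finditer(text):
        refs.append(("phone", m.group(), m.start()))
    for m in WORD_RE.finditer(text):
        if m.group().lower() in PLACE_NAMES:
            refs.append(("placename", m.group(), m.start()))
    return sorted(refs, key=lambda r: r[2])


if __name__ == "__main__":
    sample = ("Manuel lived near Largo do Rato in Lisboa, "
              "postcode 2840-137, phone +351 21 123 4567.")
    for kind, value, offset in geoparse(sample):
        print(offset, kind, value)
```

The character offsets are kept so that later steps can inspect the words around each candidate.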
Ambiguity in GeoParsing Documents
Examples of false place references:
• Personal names: Smedes York, Jack London
• Business names: Dorchester Hotel, York Properties, ...
• Street names: Oxford Street, London Road, ...
• Common words: bath, battle, derby, over, well, ...
• Approach for handling ambiguity:
– Look for patterns in surrounding context!!!
– One reference per discourse.
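Looking for patterns in the surrounding context can be sketched with a few hand-written cues. The word lists and the `classify_candidate` function below are hypothetical and far from complete; a real system would typically learn such patterns or use a much larger rule set.

```python
# Illustrative cue lists; these are assumptions, not an established resource.
PERSON_TITLES = {"mr", "mrs", "ms", "dr", "prof"}
NON_PLACE_SUFFIXES = {"street", "road", "hotel", "properties"}
SPATIAL_CUES = {"in", "near", "at", "from", "to"}


def classify_candidate(tokens, i):
    """Label the candidate place name at tokens[i] using its neighbours.

    Returns "not_place", "place", or "unknown".
    """
    prev_tok = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
    next_tok = tokens[i + 1].lower().rstrip(".,") if i + 1 < len(tokens) else ""
    if prev_tok in PERSON_TITLES:
        return "not_place"      # "Mr. York"
    if next_tok in NON_PLACE_SUFFIXES:
        return "not_place"      # "Dorchester Hotel", "Oxford Street"
    if prev_tok in SPATIAL_CUES:
        return "place"          # "in York"
    return "unknown"


if __name__ == "__main__":
    sentence = "She lives in York".split()
    print(classify_candidate(sentence, 3))
```

Candidates labelled "unknown" still need another source of evidence, for example other mentions of the same name elsewhere in the document.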
GeoCoding place references in text
• Many different places have the same name (referent ambiguity): Newport, Cambridge, Springfield, Lisboa, ...
• Use context to decide: references to parent or nearby places.
• Choose most important one: by population or place type.
• Optional step taken by some GIR approaches: finding a document’s encompassing geographic scope.
– Combine all place references given in the document.
– Use heuristics to guide the process.
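The two disambiguation heuristics above (context first, then importance) can be sketched as follows. The candidate table, its identifiers, and the population figures are made up for illustration; the `geocode` function is a hypothetical name, not part of any existing system.

```python
# Hypothetical candidate referents; populations are rough illustrative values.
CANDIDATES = {
    "cambridge": [
        {"id": "cambridge-uk", "parent": "england", "population": 120000},
        {"id": "cambridge-ma", "parent": "massachusetts", "population": 100000},
    ],
}


def geocode(name, context_names, candidates=CANDIDATES):
    """Pick one referent for an ambiguous place name.

    1. Prefer candidates whose parent place is also mentioned in the document.
    2. Otherwise fall back to the most populous candidate.
    Returns the chosen id, or None if the name is unknown.
    """
    options = candidates.get(name.lower(), [])
    if not options:
        return None
    context = {c.lower() for c in context_names}
    supported = [o for o in options if o["parent"] in context]
    pool = supported or options
    return max(pool, key=lambda o: o["population"])["id"]


if __name__ == "__main__":
    print(geocode("Cambridge", ["Boston", "Massachusetts"]))
    print(geocode("Cambridge", []))
```

Population is only one possible notion of importance; place type (capital, city, village) could be used in the same position.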
Document Indexing for Geographic IR
• Different indexing strategies are possible:
– Index documents based on gazetteer ids.
– Use document scopes to create document footprints (point, bounding rectangle, ...) and use the footprints to index documents.
• Strategy for handling queries:
– Convert query to a query footprint/gazetteer id.
– Match query footprint to document footprints/ids.
– Rank documents according to “relevance”.
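A minimal sketch of the footprint-based strategy, assuming bounding-rectangle footprints stored as (south, west, north, east) tuples and ignoring rectangles that cross the antimeridian. The helper names and coordinates are illustrative.

```python
def bounding_box(points):
    """Build a (south, west, north, east) footprint from (lat, lon) points."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (min(lats), min(lons), max(lats), max(lons))


def intersects(a, b):
    """True if two (south, west, north, east) rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def match_documents(doc_footprints, query_footprint):
    """Return ids of documents whose footprint intersects the query footprint."""
    return sorted(doc_id for doc_id, box in doc_footprints.items()
                  if intersects(box, query_footprint))


if __name__ == "__main__":
    # A document mentioning Lisboa and Sintra (approximate coordinates).
    doc_box = bounding_box([(38.72, -9.14), (38.80, -9.39)])
    query_box = (38.6, -9.5, 38.9, -9.0)   # hypothetical "Lisbon area" query
    print(match_documents({"D1": doc_box}, query_box))
```

Matching only filters candidates; ordering them by relevance is a separate step.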
Handling queries in GIR systems
Data structures for indexing in GIR
• Typical strategy is to have separate indexes.
– Inverted index for text.
– R-tree for footprints.
• Access spatial index with query footprint/gazetteer id.
• Access text index with query terms.
• Merge results and find the intersection.
[Figure: R-tree over document footprints D1 to D16, grouped into regions R1 to R4 under root R, alongside an inverted index (Term1: D1, D2, D23, …; Term2: D9, D11, D100, …; Term3: D27, D85, …)]
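The separate-index strategy can be sketched as below. To stay self-contained, the spatial side uses a plain linear scan over bounding rectangles as a stand-in for an R-tree; the class names and the sample documents are hypothetical.

```python
from collections import defaultdict


class TextIndex:
    """Inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, terms):
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()


class SpatialIndex:
    """Linear scan over (south, west, north, east) boxes.

    A stand-in for an R-tree, which would avoid testing every footprint.
    """

    def __init__(self):
        self.boxes = {}

    def add(self, doc_id, box):
        self.boxes[doc_id] = box

    def search(self, q):
        return {d for d, b in self.boxes.items()
                if b[0] <= q[2] and q[0] <= b[2] and b[1] <= q[3] and q[1] <= b[3]}


def geo_search(text_index, spatial_index, terms, query_box):
    """Query both indexes and keep only documents present in both results."""
    return text_index.search(terms) & spatial_index.search(query_box)


if __name__ == "__main__":
    text_idx, spatial_idx = TextIndex(), SpatialIndex()
    text_idx.add("D1", "seafood restaurants")
    spatial_idx.add("D1", (38.70, -9.20, 38.75, -9.10))
    text_idx.add("D2", "seafood restaurants")
    spatial_idx.add("D2", (41.10, -8.70, 41.20, -8.60))
    print(geo_search(text_idx, spatial_idx, ["restaurants"], (38.6, -9.5, 38.9, -9.0)))
```

Keeping the indexes separate means each can be queried and tuned independently, at the cost of a merge step at query time.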
Ranking search results in GIR
• Spatial similarity can indicate relevance.
– Documents whose spatial content is more similar to the spatial content of the query should appear first.
• But we need to consider:
– Thematic relevance: BM25, TF-IDF, ...
– Geographic relevance: proximity, containment, ... This can be geometric (e.g. distance) or non-geometric (e.g. topology).
– Other importance metrics: PageRank.
• The state of the art combines these scores linearly.
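A linear combination of thematic and geographic relevance can be sketched as follows. The sketch assumes thematic scores already normalised to [0, 1] and uses an exponential distance decay as the geographic score; the weight `alpha`, the decay scale, and the function names are illustrative choices, not values from any particular system.

```python
import math


def geo_score(distance_km, scale_km=50.0):
    """Map distance to a (0, 1] score: 1.0 at distance 0, decaying with distance."""
    return math.exp(-distance_km / scale_km)


def combined_score(thematic, geographic, alpha=0.5):
    """Linear combination; alpha weights thematic against geographic relevance."""
    return alpha * thematic + (1.0 - alpha) * geographic


def rank(results, alpha=0.5):
    """Rank (doc_id, thematic_score, distance_km) tuples, best first."""
    scored = [(doc_id, combined_score(t, geo_score(d), alpha))
              for doc_id, t, d in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # D1 matches the text better; D2 is much closer to the query location.
    results = [("D1", 0.9, 200.0), ("D2", 0.6, 5.0)]
    for doc_id, score in rank(results, alpha=0.5):
        print(doc_id, round(score, 3))
```

With `alpha=1.0` the ranking reduces to purely thematic ordering, which makes the effect of the geographic term easy to compare.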
Existing GIR systems : MetaCarta
The MetaCarta system
– Pioneer system addressing all aspects given in this talk.
– Conducts geo-parsing and geo-coding of text documents, and sends back possible location references with relative strength scores.
– Uses Natural Language Processing (NLP) to find possible location references.
– Contains a gazetteer of ~14 million entries.
Other GIR Systems : Research projects
• Prototype system from the SPIRIT EU project
– Spatially-aware information retrieval on the Internet.
– Geo-tagging of Web documents based on a geo-ontology.
• Alexandria Digital Library
– Digital library of geo-referenced materials.
– Focus on development of a large gazetteer.
• GREASE, GIPSY, Web-a-Where, GeoXWalk, ...
– Many more research projects addressing GIR aspects individually.
– GeoCLEF evaluation contest similar to TREC.
• Project DIGMAP under development at IST
– Digital library for old maps and historical cartography resources.
– Indexing metadata records for geographic retrieval.
Current Challenges in Geographic IR
• Improve “conventional GIR” components and methods.
– Geo-tagging, spatio-textual indexing and geo-relevance ranking.
– Improved understanding of spatial natural language terminology.
– Principled approaches for integration and evaluation of GIR.
– Better user interfaces for exploration of GIR results.
• Integration of geographical with temporal aspects.
– Everything we do happens in space and time!
• Creation of rich place ontologies with world-wide coverage.
– Fuzzy regions and intra-urban placenames present challenges.
• Open GeoInformation Web services and Geospatial Semantic Web.
Where To Find More Information
• Georeferencing: The Geographic Associations of Information
– By Linda L. Hill (Author), MIT Press
• Proceedings of the Workshops on Geographical IR
– Edited by Chris Jones and Ross Purves (4th edition in 2007, Lisbon)
• Talk to me using the email address [email protected]