Geographic Information Retrieval From Disparate Data Sources

Post on 10-May-2015


Geographic Information Retrieval from Disparate Data Sources

Ian Turton, Anuj Jaiswal, Mark Gahegan

GeoVISTA Center, School of Geography, Pennsylvania State University

{ijt1, arj135, mng1}@psu.edu

Summary

Information Retrieval?
Geographic?
Disparate Data Sources?
Does it work?
Semantics and Ontologies, do they help?
Further work?
Conclusions

Information Retrieval

Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.

Wikipedia

OR more simply

Is there some way I can avoid reading all 19,000 of those articles about measles and still sound like I know what I’m talking about at the next conference?

Geography

Well, we all know that geography is important. Depending on who you ask, more than 80% of all information contains a geographic element.

Explicit: Has a map coordinate

Implicit: Has a place name

Disparate Data Sources

Large collections of text containing implicit geographic references about Avian Flu and Measles:
PubMed abstracts
News Feeds (RSS)
WHO incident reports

Building the System

Acquire data
Extract geographic information
Extract semantic and ontological information
Present in a form that allows easy exploration by users.

Acquire Data

First extract abstracts from PubMed
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
((avian OR bird) AND (influenza OR flu)) OR H5N1
Returns a structured XML file with citation data and abstract for selected papers.
Process XML into PostGIS database
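As a rough illustration of this step, the sketch below builds E-utilities request URLs for the query above and pulls citation fields out of the returned XML. It assumes the standard `esearch.fcgi` and `efetch.fcgi` endpoints under the base URL on the slide; the function names, the `retmax` default, and the trimmed `SAMPLE` document (with a placeholder PMID) are illustrative assumptions, not the authors' code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL as given on the slide; the endpoint names below are the usual
# E-utilities ones and are an assumption about how the talk's system called them.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
QUERY = "((avian OR bird) AND (influenza OR flu)) OR H5N1"

# Trimmed, made-up response in the general shape of PubMed efetch XML.
SAMPLE = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    "<PMID>12345</PMID><Article><ArticleTitle>Example title</ArticleTitle>"
    "<Abstract><AbstractText>First part.</AbstractText>"
    "<AbstractText>Second part.</AbstractText></Abstract>"
    "</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)


def esearch_url(term, retmax=100):
    """URL that asks PubMed for the ids of articles matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """URL that fetches full XML records for a list of PubMed ids."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


def parse_articles(xml_text):
    """Return (pmid, title, abstract) tuples from efetch-style XML."""
    root = ET.fromstring(xml_text)
    rows = []
    for art in root.iter("PubmedArticle"):
        pmid = art.findtext(".//PMID")
        title = art.findtext(".//ArticleTitle")
        # An abstract may be split into several AbstractText sections.
        abstract = " ".join(t.text or "" for t in art.iter("AbstractText"))
        rows.append((pmid, title, abstract))
    return rows
```

The tuples returned by `parse_articles` are what would then be loaded into the database, keyed by article id.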

Extract Geographic Entities

Use FactXtractor (http://julian.mine.nu/snedemo.html)

Uses GATE to detect and extract Named Entities and Entity Relationships

Usually finds People, Places and Organizations

Returned as an OWL encoded ontology
In this case we just make use of places

<rdf:RDF xml:base="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <owl:Class rdf:ID="Organization"/>
  <owl:Class rdf:ID="Person"/>
  <owl:DatatypeProperty rdf:ID="counts"/>
  <Location rdf:ID="Africa">
    <counts>1</counts>
    <mentioned_in>
      <_Article rdf:ID="InputString0">
      </_Article>
    </mentioned_in>
  </Location>
  <Location rdf:ID="Asia">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
  <Location rdf:ID="Europe">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
</rdf:RDF>
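Since only the places are used, one way to read them out of output like this is sketched below. The slide's fragment omits its namespace declarations, so the `xmlns` attributes in `SAMPLE` (including treating the ontology URI as the default namespace) are assumptions added to make the XML parseable; `extract_places` is a hypothetical helper, and treating a missing `counts` element as 0 is also an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed default namespace; the slide only shows it as xml:base.
ONT_NS = "http://ist.psu.edu/sna/ontology#"

# Trimmed copy of the slide's output with namespace declarations added.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="{ONT_NS}" xml:base="{ONT_NS}">
  <owl:Class rdf:ID="Location"/>
  <Location rdf:ID="Africa"><counts>1</counts></Location>
  <Location rdf:ID="Asia"><counts>1</counts></Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
</rdf:RDF>"""


def extract_places(rdf_text):
    """Map each Location's name to its mention count."""
    root = ET.fromstring(rdf_text)
    places = {}
    for loc in root.iter(f"{{{ONT_NS}}}Location"):
        name = loc.get(f"{{{RDF_NS}}}ID")
        if name is None:
            continue
        counts = loc.findtext(f"{{{ONT_NS}}}counts")
        # Underscores in rdf:ID stand in for spaces in the place name;
        # a Location without a counts element is recorded as 0 here.
        places[name.replace("_", " ")] = int(counts) if counts else 0
    return places
```

Only elements in the ontology namespace named `Location` are visited, so the `owl:Class` declaration with `rdf:ID="Location"` is skipped.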

GeoLocation

Converting a place name into a location
State College, PA -> (40.7934, -77.86)
Call the GeoNames web service to carry out a gazetteer lookup on the name.
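A gazetteer lookup of this kind might look like the sketch below, which assumes the GeoNames `searchJSON` endpoint and its `geonames`, `name`, `countryCode`, `lat`, `lng`, and `fcl` response fields. The helper names and the `demo` username are placeholders, and `SAMPLE` is a hand-written response that reuses the coordinates from the slide rather than real service output.

```python
import json
from urllib.parse import urlencode

# Hand-written response in the shape of a GeoNames search result;
# the coordinates are the ones quoted on the slide.
SAMPLE = json.dumps({
    "geonames": [
        {"name": "State College", "countryCode": "US",
         "lat": "40.7934", "lng": "-77.86", "fcl": "P"}
    ]
})


def geonames_search_url(place, username, max_rows=5):
    """Build a search URL; `username` is the caller's GeoNames account."""
    params = {"q": place, "maxRows": max_rows, "username": username}
    return "http://api.geonames.org/searchJSON?" + urlencode(params)


def parse_candidates(json_text):
    """Turn a search response into a list of candidate locations."""
    data = json.loads(json_text)
    return [
        {
            "name": g["name"],
            "country": g.get("countryCode"),
            # GeoNames returns coordinates as strings.
            "lat": float(g["lat"]),
            "lon": float(g["lng"]),
            "feature_class": g.get("fcl"),
        }
        for g in data.get("geonames", [])
    ]
```

A name that matches several gazetteer entries comes back as several candidates, which is where disambiguation comes in.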

Disambiguation

Which London did you mean?

Types of Ambiguity

Geo/Geo
London, UK vs London, Ontario
South Wales, UK vs New South Wales, Au
Paris, France vs Paris, Texas

Geo/Non Geo
Washington, DC vs George Washington
Van, Turkey vs delivery van
West Nile, Egypt vs West Nile Virus

Sort of Ambiguous
avian A/Mallard/Pennsylvania/10218/84 (H5N2) influenza virus strains

Disambiguating Multiple Places

Choose A if A is a Political Entity and B is not,
Choose B if B is a Political Entity and A is not,
Choose A if A is a Region and B is not,
Choose B if B is a Region and A is not,
Choose A if A is an Ocean and B is not,
Choose B if B is an Ocean and A is not,
Choose A if A is a Populated Place and B is not,
Choose B if B is a Populated Place and A is not,
Choose A if A's population is greater than B's,
Choose B if B's population is greater than A's,
Choose A if A is an Administrative Area and B is not,
Choose B if B is an Administrative Area and A is not,
Choose A if A is a Water Feature and B is not,
Choose B if B is a Water Feature and A is not,
Choose A.
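The rule list above is an ordered cascade, and a minimal sketch of it in code might look like this. The `Candidate` class, the short kind labels ("political", "region", and so on), and the `disambiguate` helper are hypothetical names chosen for illustration; how the real system represented feature types is not stated in the slides.

```python
from dataclasses import dataclass
from functools import reduce

# Feature kinds checked before and after the population comparison,
# in the order the rules list them. The labels are illustrative.
KINDS_BEFORE_POPULATION = ("political", "region", "ocean", "populated")
KINDS_AFTER_POPULATION = ("admin", "water")


@dataclass(frozen=True)
class Candidate:
    name: str
    kinds: frozenset
    population: int = 0


def _by_kind(a, b, kind):
    """Return whichever candidate alone has `kind`, else None."""
    a_has, b_has = kind in a.kinds, kind in b.kinds
    if a_has != b_has:
        return a if a_has else b
    return None


def choose(a, b):
    """Apply the rules in order; fall through to A when nothing decides."""
    for kind in KINDS_BEFORE_POPULATION:
        winner = _by_kind(a, b, kind)
        if winner is not None:
            return winner
    if a.population != b.population:
        return a if a.population > b.population else b
    for kind in KINDS_AFTER_POPULATION:
        winner = _by_kind(a, b, kind)
        if winner is not None:
            return winner
    return a


def disambiguate(candidates):
    """Reduce a non-empty list of candidates pairwise to one winner."""
    return reduce(choose, candidates)
```

With two populated places that differ only in population, for example `Candidate("Paris, France", frozenset({"populated"}), 2_000_000)` and `Candidate("Paris, Texas", frozenset({"populated"}), 25_000)` (round illustrative figures, not gazetteer data), `choose` returns the larger one regardless of argument order.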

Solving Geo/Non Geo Ambiguity

Stop word lists – hand crafted by experience

Province, valley, way, hill, Children, Children's, new, cross, red, clinic, general, côte, ii, iii, bas, pays, chem, northern region, eastern region, central region, southern region, region, off, square, census, islands, city, district, park, USA, State, Virology, Microbiology, Immunology, Medical, Science, Employee, Surveillance, Disease, Biochemistry, Prevention, for, and, mail, natl, dept, dev, agr, Rural, inst, mil, med, coll, Internal, Publ, Bur, Hosp, Jude, Childrens, Chai, yan, Virol, Dis, Div, Enter, Cent, lab, Univ, res, ist, prevent, roc, prod, Roche, vet, castle, peak, stat, garden, Atl, Anim, mar, queen, central, Director, LAT, AC-EIA, register, north, east, south, west, northern, southern, eastern, western
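Applying such a list can be as simple as the sketch below, which drops any extracted name that matches a stop word. It uses only a small subset of the list above, and matching the whole name case-insensitively is an assumption; the slides do not say how the comparison was done.

```python
# A small subset of the hand-crafted list, lower-cased for comparison.
STOP_WORDS = {
    w.lower()
    for w in (
        "Province", "valley", "way", "hill", "new", "cross", "red",
        "clinic", "general", "Virology", "Univ", "lab",
        "north", "east", "south", "west",
    )
}


def drop_stop_words(names):
    """Keep only candidate place names that are not stop words."""
    return [n for n in names if n.strip().lower() not in STOP_WORDS]
```

Under this whole-name matching, `drop_stop_words(["Vietnam", "Virology", "West", "West Nile"])` keeps "Vietnam" and "West Nile", so a case like West Nile Virus still needs separate handling.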

Concept Extraction

Automatically extract keywords or tags from article abstracts by:
Selecting keywords which exceed a preset frequency.
Passing text through the Yahoo! tagging service, which returns key phrases using latent semantic indexing.
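The frequency-based option could be sketched as follows. The tokenizer, the short `ignore` list of common words, and the `threshold` parameter are illustrative assumptions; the slides do not give the actual cut-off or how words were split.

```python
import re
from collections import Counter

# Tiny illustrative list of words never treated as keywords.
COMMON_WORDS = frozenset({"the", "of", "and", "in", "a", "to"})


def frequent_keywords(text, threshold=1, ignore=COMMON_WORDS):
    """Return words that occur more than `threshold` times, sorted."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in ignore)
    return sorted(w for w, n in counts.items() if n > threshold)
```

For the text "H5N1 virus found in ducks. The H5N1 virus spread. Ducks carry virus." with the default threshold, this returns `["ducks", "h5n1", "virus"]`.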

Store everything in a big database

Open up PostGIS and stuff in all the data keyed by article id.

Article: Citation data – authors, title, abstract, journal, volume, issue, etc.
Places: Name, Country, Latitude, Longitude, etc.
Concepts: Key phrase or word
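To make the three-table layout concrete, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for PostGIS, with places stored as plain latitude and longitude columns rather than geometries. The table and column names, the demo rows, and the `tags_for_place` query are all assumptions for illustration; the actual schema is not shown in the slides.

```python
import sqlite3

SCHEMA = """
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT, journal TEXT);
CREATE TABLE place (article_id INTEGER REFERENCES article(id),
                    name TEXT, country TEXT, latitude REAL, longitude REAL);
CREATE TABLE concept (article_id INTEGER REFERENCES article(id), phrase TEXT);
"""


def demo_connection():
    """In-memory database with a few made-up rows keyed by article id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO article (id, title) VALUES (?, ?)",
                     [(1, "demo one"), (2, "demo two"), (3, "demo three")])
    conn.executemany("INSERT INTO place (article_id, name) VALUES (?, ?)",
                     [(1, "Vietnam"), (2, "Vietnam"), (3, "Europe")])
    conn.executemany("INSERT INTO concept (article_id, phrase) VALUES (?, ?)",
                     [(1, "h5n1"), (1, "poultry"), (2, "h5n1"), (3, "measles")])
    return conn


def tags_for_place(conn, place_name):
    """Concepts from articles that mention a place, most frequent first."""
    cursor = conn.execute(
        """SELECT c.phrase, COUNT(DISTINCT c.article_id) AS n
           FROM concept c JOIN place p ON p.article_id = c.article_id
           WHERE p.name = ?
           GROUP BY c.phrase
           ORDER BY n DESC, c.phrase""",
        (place_name,),
    )
    return cursor.fetchall()
```

Because every table carries the article id, a tag cloud restricted to one place reduces to a join on that key, as in `tags_for_place(demo_connection(), "Vietnam")`.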

Provide Intuitive Front End for Users

Tag Cloud

Popularized on many Web 2.0 sites such as Flickr, del.icio.us, CiteULike.org, etc.

Place Cloud

Author Cloud

Choose a tag

Choose a place

Select a child of the place

Tag limited by place

Implementation

Initially implemented as a Java servlet using a JDBC link to PostGIS

Reimplemented in the last week using Ruby on Rails, with ActiveRecord connecting to PostGIS

In-page mapping: OpenLayers WMS map client to GeoServer over PostGIS.

Semantics and Ontologies

Geographic ontology is provided by GeoNames semantic web service.

A query allows the lookup of parent, child, and nearby features for most features.

Results are cached in PostGIS database to save processing time and load on server.
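The caching idea can be sketched independently of any particular service. In the example below an in-memory dictionary stands in for the PostGIS cache table, and the remote call is passed in as a function; `CachedLookup`, the `relation` labels, and the `fetch` signature are hypothetical names for illustration.

```python
class CachedLookup:
    """Remember results of hierarchy lookups so each is fetched once."""

    def __init__(self, fetch):
        # `fetch(relation, feature_id)` performs the remote lookup,
        # e.g. relation "parent", "children" or "nearby".
        self._fetch = fetch
        self._cache = {}
        self.remote_calls = 0

    def get(self, relation, feature_id):
        key = (relation, feature_id)
        if key not in self._cache:
            # Cache miss: go to the remote service and store the answer.
            self.remote_calls += 1
            self._cache[key] = self._fetch(relation, feature_id)
        return self._cache[key]
```

Repeated requests for the same relation and feature are then answered locally, which is what saves processing time and load on the server.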

WordNet Ontology

Conclusions

It is possible to construct a useful system to ingest arbitrary text and extract place names.

A sufficiently good automated location disambiguation system can be built for a specific domain to process 80-90% of places correctly.

Semantic expansion and narrowing of searches appears useful in early experiments.

Providing users with a familiar, and highly linked, interface seems to aid exploration of the document space.
