Geographic Information Retrieval from Disparate Data Sources Ian Turton, Anuj Jaiswal, Mark Gahegan GeoVISTA Center, School of Geography, Pennsylvania State University ijt1,arj135,[email protected]
May 10, 2015

Ian Turton
Page 1: Geographic Information Retrieval From Disparate Data Sources

Geographic Information Retrieval from Disparate Data Sources

Ian Turton, Anuj Jaiswal, Mark Gahegan

GeoVISTA Center, School of Geography, Pennsylvania State University

ijt1,arj135,[email protected]

Page 2: Geographic Information Retrieval From Disparate Data Sources

Summary

Information Retrieval?
Geographic?
Disparate Data Sources?
Does it work?
Semantics and Ontologies, do they help?
Further work?
Conclusions

Page 3: Geographic Information Retrieval From Disparate Data Sources

Information Retrieval

Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.

Wikipedia

Page 4: Geographic Information Retrieval From Disparate Data Sources

Or, more simply

Is there some way I can avoid reading all 19,000 of those articles about measles and still sound like I know what I’m talking about at the next conference?

Page 5: Geographic Information Retrieval From Disparate Data Sources

Geography

Well, we all know that geography is important. Depending on who you ask, more than 80% of all information contains a geographic element.

Explicit: Has a map coordinate

Implicit: Has a place name

Page 6: Geographic Information Retrieval From Disparate Data Sources

Disparate Data Sources

Large collections of text containing implicit geographic references about Avian Flu and Measles:
PubMed abstracts
News Feeds (RSS)
WHO incident reports

Page 7: Geographic Information Retrieval From Disparate Data Sources

Building the System

Acquire data
Extract geographic information
Extract semantic and ontological information
Present in a form that allows easy exploration by users.

Page 8: Geographic Information Retrieval From Disparate Data Sources

Acquire Data

First extract abstracts from PubMed: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
Query: ((avian OR bird) AND (influenza OR flu)) OR H5N1
Returns a structured XML file with citation data and abstract for selected papers.
Process XML into PostGIS database.
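The acquisition step can be sketched roughly as follows. This is a minimal Python sketch, not the authors' code: it assumes the standard E-utilities `esearch.fcgi` and `efetch.fcgi` endpoints under the base URL from the slide, and the `SAMPLE` record is a hand-written stand-in trimmed to the few fields used here. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL as given on the slide; NCBI now serves E-utilities over HTTPS.
EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
QUERY = "((avian OR bird) AND (influenza OR flu)) OR H5N1"


def esearch_url(term, retmax=100):
    """URL that asks PubMed for the IDs of articles matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """URL that fetches full XML records for the given PubMed IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS + "efetch.fcgi?" + urlencode(params)


def parse_articles(xml_text):
    """Pull (pmid, title, abstract) tuples out of an efetch XML response."""
    rows = []
    for art in ET.fromstring(xml_text).iter("PubmedArticle"):
        pmid = art.findtext("MedlineCitation/PMID")
        title = art.findtext("MedlineCitation/Article/ArticleTitle", default="")
        parts = art.findall("MedlineCitation/Article/Abstract/AbstractText")
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        rows.append((pmid, title, abstract))
    return rows


# Hand-written stand-in for an efetch response (illustrative values only).
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>0</PMID><Article><ArticleTitle>Example title</ArticleTitle>
<Abstract><AbstractText>H5N1 was detected in Vietnam.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

if __name__ == "__main__":
    print(esearch_url(QUERY))
    print(parse_articles(SAMPLE))
```

The parsed tuples are what would then be written into the database, keyed by article id.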

Page 9: Geographic Information Retrieval From Disparate Data Sources

Extract Geographic Entities

Use FactXtractor (http://julian.mine.nu/snedemo.html)

Uses GATE to detect and extract Named Entities and Entity Relationships

Usually finds People, Places and Organizations

Returned as an OWL encoded ontology

In this case we just make use of places

Page 10: Geographic Information Retrieval From Disparate Data Sources

<rdf:RDF xml:base="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <owl:Class rdf:ID="Organization"/>
  <owl:Class rdf:ID="Person"/>
  <owl:DatatypeProperty rdf:ID="counts"/>
  <Location rdf:ID="Africa">
    <counts>1</counts>
    <mentioned_in>
      <_Article rdf:ID="InputString0">
      </_Article>
    </mentioned_in>
  </Location>
  <Location rdf:ID="Asia">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
  <Location rdf:ID="Europe">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
</rdf:RDF>
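Pulling just the places out of output like this might look like the sketch below. It is an illustration under assumptions: the excerpt on the slide omits namespace declarations, so the `xmlns` attributes in `SAMPLE` (including the default namespace) are guesses added to make the XML parseable, and `extract_places` is a hypothetical helper, not part of FactXtractor.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"
# Assumption: the ontology's default namespace matches the xml:base on the slide.
ONT_NS = "http://ist.psu.edu/sna/ontology#"

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:owl="{OWL_NS}" xmlns="{ONT_NS}">
  <owl:Class rdf:ID="Location"/>
  <Location rdf:ID="Africa"><counts>1</counts></Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
</rdf:RDF>"""


def extract_places(owl_text):
    """Map each Location's name to its mention count (0 when none is given)."""
    places = {}
    root = ET.fromstring(owl_text)
    for loc in root.iter(f"{{{ONT_NS}}}Location"):
        name = loc.get(f"{{{RDF_NS}}}ID")
        if name is None:
            continue
        counts = loc.find(f"{{{ONT_NS}}}counts")
        # rdf:ID values cannot contain spaces, so underscores stand in for them.
        places[name.replace("_", " ")] = int(counts.text) if counts is not None else 0
    return places


if __name__ == "__main__":
    print(extract_places(SAMPLE))
```

Person and Organization individuals are simply ignored, which matches the "just make use of places" choice above.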

Page 11: Geographic Information Retrieval From Disparate Data Sources

GeoLocation

Converting a place name into a location:
State College, PA -> (40.7934, -77.86)
Call the GeoNames web service to carry out a gazetteer lookup on the name.
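A gazetteer lookup of this kind can be sketched as below. The slide does not say which GeoNames endpoint was called; this sketch assumes the public `searchJSON` service, which requires a registered `username` (the value here is a placeholder). The `SAMPLE` response is hand-written around the coordinates on the slide so the example runs without network access.

```python
import json
from urllib.parse import urlencode


def geonames_search_url(place, username, max_rows=5):
    """Build a GeoNames full-text search URL for a place name."""
    params = {"q": place, "maxRows": max_rows, "username": username}
    return "http://api.geonames.org/searchJSON?" + urlencode(params)


def first_location(response_text):
    """Return (lat, lng) of the top hit, or None when nothing matched."""
    hits = json.loads(response_text).get("geonames", [])
    if not hits:
        return None
    # GeoNames returns coordinates as strings, so convert them.
    return (float(hits[0]["lat"]), float(hits[0]["lng"]))


# Hand-written stand-in for a searchJSON response (illustrative only).
SAMPLE = '{"geonames": [{"name": "State College", "lat": "40.7934", "lng": "-77.86"}]}'

if __name__ == "__main__":
    print(geonames_search_url("State College, PA", "your_username"))
    print(first_location(SAMPLE))
```

Taking the first hit is the naive choice; names with several candidate matches need a further disambiguation step.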

Page 12: Geographic Information Retrieval From Disparate Data Sources

Disambiguation

Which London did you mean?

Page 13: Geographic Information Retrieval From Disparate Data Sources

Types of Ambiguity

Geo/Geo
  London, UK vs London, Ontario
  South Wales, UK vs New South Wales, Au
  Paris, France vs Paris, Texas

Geo/Non Geo
  Washington, DC vs George Washington
  Van, Turkey vs delivery van
  West Nile, Egypt vs West Nile Virus

Sort of Ambiguous
  avian A/Mallard/Pennsylvania/10218/84 (H5N2) influenza virus strains

Page 14: Geographic Information Retrieval From Disparate Data Sources

Disambiguating Multiple Places

Choose A if A is a Political Entity and B is not,
Choose B if B is a Political Entity and A is not,
Choose A if A is a Region and B is not,
Choose B if B is a Region and A is not,
Choose A if A is an Ocean and B is not,
Choose B if B is an Ocean and A is not,
Choose A if A is a Populated Place and B is not,
Choose B if B is a Populated Place and A is not,
Choose A if A's population is greater than B's,
Choose B if B's population is greater than A's,
Choose A if A is an Administrative Area and B is not,
Choose B if B is an Administrative Area and A is not,
Choose A if A is a Water Feature and B is not,
Choose B if B is a Water Feature and A is not,
Choose A.
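The rule cascade above can be expressed compactly as an ordered list of feature checks with the population comparison in the middle. The following is a sketch only: the `Candidate` class, the kind labels and the population figures are hypothetical, and how the original system mapped gazetteer feature codes onto these categories is not stated.

```python
from dataclasses import dataclass

# Feature checks in the order the rules list them; population sits between.
KINDS_BEFORE_POPULATION = ["political", "region", "ocean", "populated"]
KINDS_AFTER_POPULATION = ["administrative", "water"]


@dataclass(frozen=True)
class Candidate:
    name: str
    kinds: frozenset = frozenset()  # hypothetical labels, e.g. {"populated"}
    population: int = 0


def _prefer_by_kind(a, b, kinds):
    """Return the candidate that alone has the first distinguishing kind."""
    for kind in kinds:
        if (kind in a.kinds) != (kind in b.kinds):
            return a if kind in a.kinds else b
    return None


def choose(a, b):
    """Pick between two gazetteer candidates for the same place name."""
    winner = _prefer_by_kind(a, b, KINDS_BEFORE_POPULATION)
    if winner is not None:
        return winner
    if a.population != b.population:
        return a if a.population > b.population else b
    winner = _prefer_by_kind(a, b, KINDS_AFTER_POPULATION)
    return winner if winner is not None else a  # final rule: choose A


if __name__ == "__main__":
    # Population values are rough placeholders for illustration.
    london_uk = Candidate("London, UK", frozenset({"populated"}), 7_000_000)
    london_on = Candidate("London, Ontario", frozenset({"populated"}), 350_000)
    print(choose(london_on, london_uk).name)
```

Because each rule is symmetric in A and B, the order in which candidates arrive only matters when every rule ties.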

Page 15: Geographic Information Retrieval From Disparate Data Sources

Solving Geo/Non Geo Ambiguity

Stop word lists, hand crafted by experience:

Province, valley, way, hill, Children, Children's, new, cross, red, clinic, general, côte, ii, iii, bas, pays, chem, northern region, eastern region, central region, southern region, region, off, square, census, islands, city, district, park, USA, State, Virology, Microbiology, Immunology, Medical, Science, Employee, Surveillance, Disease, Biochemistry, Prevention, for, and, mail, natl, dept, dev, agr, Rural, inst, mil, med, coll, Internal, Publ, Bur, Hosp, Jude, Childrens, Chai, yan, Virol, Dis, Div, Enter, Cent, lab, Univ, res, ist, prevent, roc, prod, Roche, vet, castle, peak, stat, garden, Atl, Anim, mar, queen, central, Director, LAT, AC-EIA, register, north, east, south, west, northern, southern, eastern, western
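Applying such a list is a simple filter over the candidate names returned by entity extraction. The sketch below uses only a subset of the list, and it assumes whole-name, case-insensitive matching, which the slide does not specify.

```python
# A subset of the hand-crafted stop words, lower-cased for comparison.
STOP_WORDS = {
    word.lower()
    for word in [
        "Province", "valley", "way", "hill", "new", "cross", "red",
        "clinic", "general", "Virology", "Univ", "Roche",
        "north", "east", "south", "west",
    ]
}


def drop_stop_words(candidates):
    """Discard candidate place names that exactly match a stop word."""
    return [c for c in candidates if c.strip().lower() not in STOP_WORDS]


if __name__ == "__main__":
    print(drop_stop_words(["Vietnam", "Virology", "Hill", "Hong Kong", "Roche"]))
```

Matching whole names rather than substrings keeps real places such as "New South Wales" even though "new" and "south" are on the list.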

Page 16: Geographic Information Retrieval From Disparate Data Sources

Concept Extraction

Automatically extract keywords or tags from article abstracts by:
Selecting keywords which exceed a preset frequency.
Passing text through the Yahoo! tagging service, which returns key phrases using latent semantic indexing.
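The frequency-based option can be sketched in a few lines. The tokenizer, the small list of common words and the threshold value are all assumptions made for illustration; the slides do not give the actual settings.

```python
import re
from collections import Counter

# Tiny illustrative list of words too common to be useful as tags.
COMMON_WORDS = {"the", "of", "and", "in", "a", "to", "was", "were", "is",
                "for", "with", "from", "among"}


def frequent_keywords(text, threshold=1):
    """Return words occurring more than `threshold` times, sorted A to Z."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS and len(w) > 2)
    return sorted(word for word, n in counts.items() if n > threshold)


if __name__ == "__main__":
    abstract = ("Avian influenza virus was isolated from poultry. "
                "The virus spread among poultry in the region.")
    print(frequent_keywords(abstract))
```

On short abstracts a low threshold is needed; over a whole corpus the same counting could be done per collection rather than per article.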

Page 17: Geographic Information Retrieval From Disparate Data Sources

Store everything in a big database

Open up PostGIS and stuff in all the data keyed by article id.
Article: citation data (authors, title, abstract, journal, volume, issue, etc.)
Places: name, country, latitude, longitude, etc.
Concepts: key phrase or word
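A schema along these lines might look like the sketch below. To keep the example runnable it uses Python's built-in SQLite as a stand-in for PostGIS, and all table and column names are hypothetical; in PostGIS the latitude and longitude would more likely be held in a geometry column.

```python
import sqlite3

SCHEMA = """
CREATE TABLE article (
    id INTEGER PRIMARY KEY, authors TEXT, title TEXT, abstract TEXT,
    journal TEXT, volume TEXT, issue TEXT
);
CREATE TABLE place (
    article_id INTEGER REFERENCES article(id),
    name TEXT, country TEXT, latitude REAL, longitude REAL
);
CREATE TABLE concept (
    article_id INTEGER REFERENCES article(id),
    phrase TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding one illustrative article."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO article (id, title) VALUES (1, 'Example article')")
    db.execute("INSERT INTO place VALUES "
               "(1, 'State College', 'United States', 40.7934, -77.86)")
    db.execute("INSERT INTO concept VALUES (1, 'avian influenza')")
    return db


def concepts_for_place(db, place_name):
    """Concepts from every article that mentions the given place."""
    rows = db.execute(
        "SELECT DISTINCT c.phrase FROM concept c "
        "JOIN place p ON p.article_id = c.article_id "
        "WHERE p.name = ? ORDER BY c.phrase",
        (place_name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(concepts_for_place(build_demo_db(), "State College"))
```

Keying places and concepts by article id is what makes joins like this one cheap, so tags can be limited by place and places by tag.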

Page 18: Geographic Information Retrieval From Disparate Data Sources

Provide Intuitive Front End for Users

Tag Cloud

Popularized on many Web 2.0 sites such as Flickr, del.icio.us, citeUlike.org, etc.
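A tag cloud needs some rule for turning tag counts into font sizes. The slides do not say which rule was used; the sketch below assumes a common choice, logarithmic scaling between a minimum and maximum pixel size, with hypothetical parameter names and made-up counts.

```python
import math


def tag_sizes(counts, min_px=10, max_px=32):
    """Map tag counts (each >= 1) to font sizes on a logarithmic scale."""
    if not counts:
        return {}
    logs = {tag: math.log(n) for tag, n in counts.items()}
    low, high = min(logs.values()), max(logs.values())
    span = high - low
    sizes = {}
    for tag, value in logs.items():
        # When every tag has the same count, draw them all at full size.
        weight = (value - low) / span if span else 1.0
        sizes[tag] = round(min_px + weight * (max_px - min_px))
    return sizes


if __name__ == "__main__":
    print(tag_sizes({"h5n1": 100, "poultry": 10, "vaccine": 1}))
```

Log scaling keeps a handful of very frequent tags from shrinking everything else to the minimum size.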

Page 19: Geographic Information Retrieval From Disparate Data Sources

Place Cloud

Page 20: Geographic Information Retrieval From Disparate Data Sources

Author Cloud

Page 21: Geographic Information Retrieval From Disparate Data Sources

Choose a tag

Page 22: Geographic Information Retrieval From Disparate Data Sources

Choose a place

Page 23: Geographic Information Retrieval From Disparate Data Sources

Select a child of the place

Page 24: Geographic Information Retrieval From Disparate Data Sources

Tag limited by place

Page 25: Geographic Information Retrieval From Disparate Data Sources

Implementation

Initially implemented as a Java servlet using a JDBC link to PostGIS.

Reimplemented in the last week using Ruby on Rails, with ActiveRecord connecting to PostGIS.

In-page mapping uses an OpenLayers WMS map client talking to GeoServer over PostGIS.

Page 26: Geographic Information Retrieval From Disparate Data Sources

Semantics and Ontologies

Geographic ontology is provided by the GeoNames semantic web service.

A query allows the lookup of parent, child and nearby features for most features.

Results are cached in the PostGIS database to save processing time and load on the server.
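The caching idea can be sketched as a thin wrapper that checks a table before calling the remote service. This is an illustration under assumptions: SQLite stands in for PostGIS, the `HierarchyCache` class and its table layout are hypothetical, and the remote call is injected as a function so that the example runs offline.

```python
import json
import sqlite3


class HierarchyCache:
    """Cache parent/children/nearby lookups so each is fetched only once."""

    def __init__(self, fetch):
        # fetch(relation, feature_id) -> list; in practice a web service call.
        self.fetch = fetch
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE lookup (relation TEXT, feature_id INTEGER, "
            "result TEXT, PRIMARY KEY (relation, feature_id))"
        )

    def get(self, relation, feature_id):
        row = self.db.execute(
            "SELECT result FROM lookup WHERE relation = ? AND feature_id = ?",
            (relation, feature_id),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no remote call
        result = self.fetch(relation, feature_id)
        self.db.execute(
            "INSERT INTO lookup VALUES (?, ?, ?)",
            (relation, feature_id, json.dumps(result)),
        )
        return result


if __name__ == "__main__":
    calls = []

    def fake_fetch(relation, feature_id):
        # Stand-in for the remote service; records how often it is called.
        calls.append((relation, feature_id))
        return ["Child A", "Child B"]

    cache = HierarchyCache(fake_fetch)
    cache.get("children", 1)
    cache.get("children", 1)
    print(len(calls))
```

Since place hierarchies change rarely, a cache like this can be kept for a long time without much risk of serving stale results.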

Page 27: Geographic Information Retrieval From Disparate Data Sources

WordNet Ontology

Page 28: Geographic Information Retrieval From Disparate Data Sources

Conclusions

It is possible to construct a useful system to ingest arbitrary text and extract place names.

A sufficiently good automated location disambiguation system can be built for a specific domain to process 80-90% of places correctly.

Semantic expansion and narrowing of searches appear useful in early experiments.

Providing users with a familiar, highly linked interface seems to aid exploration of the document space.