Finding Haystacks with Needles: Ranked Search for Data Using Geospatial and Temporal Characteristics V.M. Megler David Maier Portland State University Acknowledgements: This work is supported by NSF award OCE-0424602. We thank the staff of the Center for Coastal Margin Observation and Prediction for their support. We also thank the students and professionals who were willing to take part in the user study.
2
Haystacks
• Many environmental sensors deployed in last decade
• Each sensor collects environmental observations
  – Sometimes many per second
• Each observation has:
  – a time
  – a location
  – observed variables
• Observational data stored in many formats, many datasets
3
Needles
• Scientists at CMOP name “finding data relevant to their research” as one of their biggest problems²
• Example query:
  – “Any observations near the Astoria bridge in June 2009”
2. Center for Coastal Margin Observation and Prediction RIG Meeting, July 15, 2010
[Figure labels: “Original Observations”, “Bounding Box”, “Needle”, “May … June”]
4
Problem: Finding Haystacks that Contain Needles
• Problem: Which datasets contain relevant data?
  – Many scientific datasets have no metadata
1. Create hierarchical metadata to represent dataset contents
2. Query over metadata
3. Rank query results
5
Current Approaches / Related Work (1)
• Search via data visualization
  – Given a specific dataset and data ranges, display the (large amount of) data
  – Most common approach so far
• But: How does the scientist identify relevant datasets and ranges for visualization?
[Figure: example of visualization approach [Howe et al. 2009]]
6
Current Approaches / Related Work (2)
• Metadata search
  – Text search of manually-added metadata
    • E.g. “Salinity, Columbia River”
  – Boolean search on time and location (rare)
    • Some advanced geoportals provide spatial tests:
      – E.g. dataset intersects or completely contains query area
• But:
  – Boolean search: No matches: no results (1)
  – Search results not ranked (2)
[Figure: example panels (1) and (2)]
7
Current Approaches / Related Work (3)
• In Information Retrieval:
  – Ranked retrieval of unstructured text documents
• But text retrieval techniques are not suited to searching the contents of scientific datasets
[Figure: information retrieval architecture, divided into “Asynchronous Indexing” and “Interactive Query”. Components shown: Documents, Document Cache, Parsing, Feature Extraction, Indexes, User Query Interface, Scoring and Ranking, Ranked Results]
8
Research Questions
How can we rank datasets?
Does the ranking approach resonate with users?
What features should we extract from scientific datasets …
… that would allow us to perform real-time search over the extracted features?
Spatial and temporal features selected for initial case study
9
Research Contributions
Proposed a mental model of how scientists perceive dataset similarity for space and time characteristics
Tested mental model in a user study
Developed hierarchical metadata to represent dataset contents
Extracting features at multiple granularities
Developed a prototype query engine with real-time response
10
Space-Time Ranking: Mental Model
• Example Query: “Observations within ½ km of point ‘P’, in June 2009”
• Each dataset A, B, … represented by its time extent A(t), B(t), … and its geospatial extent A(g), B(g), …
• Relative “weight” of space to time given by the “range” of each query term
[Figure: mental model of space-time closeness. Time axis: months January through November, with Query T and dataset time extents including A(t) through F(t). Space axis: Query G centered on point P with radius r, dataset geospatial extents including A(g) through K(g), and distance ticks at 1.5 km, 2.5 km, 3.5 km, and 4.5 km on either side. Closeness bands run outward from the query: Here, Quite close, Close, Not close, Far, Too far.]
11
Scoring Datasets (1)
• Score each dataset using formulae that quantify the model
• Given a geospatial query G, calculate spatial-relevance score $d_{Gs}$ for dataset d
• Spatial relevance is approximated by:
  – ½ (min distance + max distance) / radius
  – Apply scoring function to the result
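[Figure: Query G with point P and radius r. For example datasets D(g) and K(g), the min distance and max distance are marked, each with a point X and the resulting $d_{Gs}$; dataset A(g) is also shown.]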
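The distance measure on this slide, ½ (min distance + max distance) / radius, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes planar coordinates (for example kilometres), a footprint given as a polyline of vertices, and distances measured from the query point P. The function names are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def normalized_spatial_distance(p, footprint, radius):
    """Return 0.5 * (min distance + max distance) / radius.

    Assumption: distances are measured from the query point p to a
    polyline footprint (a list of (x, y) vertices).
    """
    if not footprint or radius <= 0:
        raise ValueError("need at least one vertex and a positive radius")
    # The farthest point of a polyline from p is always one of its vertices.
    d_max = max(math.hypot(p[0] - x, p[1] - y) for x, y in footprint)
    segments = list(zip(footprint, footprint[1:]))
    if segments:
        d_min = min(point_segment_distance(p, a, b) for a, b in segments)
    else:  # single-vertex footprint
        d_min = d_max
    return 0.5 * (d_min + d_max) / radius

# Footprint from (3, 0) to (3, 4), query at the origin with r = 0.5:
# min distance 3, max distance 5, so the result is 0.5 * 8 / 0.5 = 8.0
print(normalized_spatial_distance((0.0, 0.0), [(3.0, 0.0), (3.0, 4.0)], 0.5))
```

The result is a distance expressed in multiples of the query radius; the scoring function is then applied to it.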
12
Scoring Datasets (2)
• Given a time query T, calculate a time-relevance score $d_{Ts}$ for dataset d
• Calculated scores can range from 100 for an exact match to query terms to negative numbers for datasets “too far” from query
[Figure: Scoring Function S. Score (axis ticks at 50 and 100) plotted against distance from Query Q (ticks at 5r, 10r, and 15r on either side of 0). Example time scores: A(t) 100, B(t) 95, F(t) 75, D(t) 25, E(t) -25.]
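The slide does not give the scoring function S in closed form. The sketch below assumes a linear shape that matches the stated range (100 for an exact match, negative for datasets “too far”) and the axis ticks (score 50 at a distance of 5r). Both the linearity and the slope of 10 points per r are assumptions, and the names `score` and `points_per_range` are hypothetical.

```python
def score(normalized_distance, points_per_range=10.0):
    """Map a distance, expressed in multiples of the query range r, to a score.

    Assumed linear: 100 at distance 0, dropping points_per_range per r,
    so the score reaches 0 at 10r and turns negative beyond that.
    """
    return 100.0 - points_per_range * abs(normalized_distance)

for distance_in_r in (0, 5, 10, 15):
    print(distance_in_r, score(distance_in_r))
```

A different slope or a non-linear curve would change only this one function, which is where the relevance-measure tuning would take place.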
13
Ranking Datasets
• Overall relevance score $d_{score}$ for each dataset d is composed using the geospatial and temporal scores:
    $d_{score} = (d_{Gs} + d_{Ts}) / 2$
• Datasets are then ranked by decreasing relevance score.
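As a small sketch of this step, the code below averages a geospatial and a temporal score per dataset and sorts by decreasing result. The class name, the dataset names, and the example scores are hypothetical; the combination is taken to be the simple mean of the two scores.

```python
from typing import List, NamedTuple

class ScoredDataset(NamedTuple):
    name: str
    geo_score: float   # geospatial relevance score, d_Gs
    time_score: float  # temporal relevance score, d_Ts

    @property
    def overall(self) -> float:
        """Overall relevance: the mean of the two component scores."""
        return (self.geo_score + self.time_score) / 2

def rank(datasets: List[ScoredDataset]) -> List[ScoredDataset]:
    """Order datasets by decreasing overall relevance score."""
    return sorted(datasets, key=lambda d: d.overall, reverse=True)

# Hypothetical scores, for illustration only.
candidates = [
    ScoredDataset("E", geo_score=60.0, time_score=-25.0),
    ScoredDataset("A", geo_score=100.0, time_score=100.0),
    ScoredDataset("B", geo_score=80.0, time_score=95.0),
]
for d in rank(candidates):
    print(d.name, d.overall)
```

Because the two scores are averaged, a dataset that is far away in time can still rank above one that matches in neither dimension, which a Boolean filter would not allow.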
14
Ranking
• Tested relevance ranking with a user study:
  – Proposed relevance measure appears to approximate user expectations
  – Relevance-measure “tuning” may further improve match with user expectations
    • “Closest edge” has more weight than “centroid” or “farthest edge”
• Scoring/ranking approach assumes appropriate indexes over which to operate
  – Query terms should relate to indexed features
  – Features represent metadata used to describe dataset content
15
Creating Metadata: Extracting Features for Space and Time
DNH Metadata Table

(record name) | Geometry | Mintime | Maxtime | Parent
May 2009, Point Sur | Polygon [bounding box] | 5/19/2009 | 6/10/2009 | <null>
May 2009, Point Sur, 2009-05-19 | Polyline(p1, p2, p3, p4) | 5/19/2009, 00:00 | 5/19/2009, 23:59 | May 2009, Point Sur
May 2009, Point Sur, 2009-05-19, Segment 1 | Line(p1, p2) | 5/19/2009, 00:00 | 5/19/2009, 06:14 | May 2009, Point Sur, 2009-05-19
May 2009, Point Sur, 2009-05-19, Segment 2 | Line(p2, p3) | 5/19/2009, 06:15 | 5/19/2009, 14:23 | May 2009, Point Sur, 2009-05-19
May 2009, Point Sur, 2009-05-19, Segment 3 | Line(p3, p4) | 5/19/2009, 14:24 | 5/19/2009, 15:01 | May 2009, Point Sur, 2009-05-19
….
• Transform observations into features
  – Extract at multiple granularities
  – Model features as “footprints”
  – E.g.: 1 million observations over 3 weeks
[Figure: original cruise observations and the features derived from them: a bounding box, a line per day, and individual line segments, labeled “May … June”]
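The extraction of features at multiple granularities can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: observations are (time, x, y) tuples, the `Feature` record and `extract_features` function are hypothetical names, and only the first two levels (bounding box and one polyline per day) are derived. Per-day time extents here come from the first and last observation of the day, and the third level of individual line segments is omitted.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby
from typing import List, Optional, Tuple

Observation = Tuple[datetime, float, float]  # (time, x, y): assumed layout

@dataclass
class Feature:
    name: str
    geometry: Tuple[str, List[Tuple[float, float]]]  # (kind, coordinates)
    mintime: datetime
    maxtime: datetime
    parent: Optional[str]  # name of the parent record, None at the top level

def extract_features(dataset_name: str,
                     observations: List[Observation]) -> List[Feature]:
    """Derive a bounding-box record plus one polyline record per day."""
    obs = sorted(observations)  # tuples sort by time first
    if not obs:
        raise ValueError("no observations")
    xs = [x for _, x, _ in obs]
    ys = [y for _, _, y in obs]
    bbox = [(min(xs), min(ys)), (max(xs), min(ys)),
            (max(xs), max(ys)), (min(xs), max(ys))]
    features = [Feature(dataset_name, ("Polygon", bbox),
                        obs[0][0], obs[-1][0], None)]
    # groupby needs sorted input; each group is one calendar day.
    for day, group in groupby(obs, key=lambda o: o[0].date()):
        day_obs = list(group)
        features.append(Feature(
            f"{dataset_name}, {day.isoformat()}",
            ("Polyline", [(x, y) for _, x, y in day_obs]),
            day_obs[0][0], day_obs[-1][0], dataset_name))
    return features

sample = [
    (datetime(2009, 5, 19, 6, 14), 1.0, 2.0),
    (datetime(2009, 5, 19, 0, 0), 0.0, 0.0),
    (datetime(2009, 5, 20, 9, 30), 3.0, 1.0),
]
for f in extract_features("May 2009, Point Sur", sample):
    print(f.name, f.geometry[0], f.parent)
```

Each record carries a pointer to its parent, so a query can score the coarse bounding box first and descend to finer footprints only where that looks promising.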
16
Metadata: Adaptive Hierarchy
[Figure: adaptive metadata hierarchy. Levels labeled: “Parent (lifetime) metadata record”; “Second level of metadata hierarchy”; “Data files (directly downloadable); bottom level of metadata hierarchy”. Date labels in the figure range from single months (e.g. 2008-02, 2009-06, 2010-08 (part)) through multi-month ranges (2008-02-19 – 2008-08-20, 2009-05-17 – 2009-11-21) to a full span (2007-10-30 ... 2010-08-12).]
• Multiple depths of hierarchy are accommodated simultaneously
• Curation decision(s) made once per kind of data/dataset
Ranking scientific datasets in response to a spatio-temporal query
Automatically extracting hierarchical metadata from scientific datasets …
… and searching over the extracted features
Providing real-time response times for queries over ¼ billion observations in a multi-terabyte data repository
21
Current Research
Evaluation of metadata scalability
Add elevation / depth: 4-dimensional search 2+1+1 versus 3+1
Add additional search criteria: Observational variables … “with oxygen below 3 mg/liter, where Myrionecta rubra are present”
Backup Material
23
References
1. Geospatial One Stop (GOS), http://gos2.geodata.gov/wps/portal/gos.
2. Global Change Master Directory Web Site, http://gcmd.nasa.gov/.
3. The Google Maps Javascript API V3, http://code.google.com/apis/maps/documentation/javascript/.
4. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press New York (1999).
5. Douglas, D.H., Peucker, T.K.: Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica. 10, 2, 112–122 (1973).
6. Egenhofer, M.J.: Toward the semantic geospatial web. Proceedings of the 10th ACM international symposium on Advances in geographic information systems. pp. 1–4 (2002).
7. Evans, M.P.: Analysing Google rankings through search engine optimization data. Internet Research. 17, 1, 21–37 (2007).
8. Goodchild, M.F., Zhou, J.: Finding geographic information: Collection-level metadata. GeoInformatica. 7, 2, 95–112 (2003).
9. Goodchild, M.F.: The Alexandria Digital Library Project: Review, Assessment, and Prospects, http://www.dlib.org/dlib/may04/goodchild/05goodchild.html, (2004).
10. Goodchild, M.F. et al.: Sharing Geographic Information: An Assessment of the Geospatial One-Stop. Annals of the AAG. 97, 2, 250–266 (2007).
11. Grossner, K.E. et al.: Defining a digital earth system. Transactions in GIS. 12, 1, 145–160 (2008).
12. Herring, J.R. ed: OpenGIS® Implementation Standard for Geographic information - Simple feature access - Part 1: Common architecture, (2010).
13. Hey, T., Trefethen, A.: e-Science and its implications. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences. 361, 1809, 1809 (2003).
14. Hey, T., Trefethen, A.E.: The Data Deluge: An e-Science Perspective. Grid Computing: Making the Global Infrastructure a Reality (eds F. Berman, G. Fox and T. Hey). pp. 809–824 John Wiley & Sons, Ltd, Chichester, UK (2003).
15. Hill, L.L. et al.: Collection metadata solutions for digital library applications. J. of the American Soc. for Information Science. 50, 13, 1169–1181 (1999).
16. Howe, B. et al.: Scientific Mashups: Runtime-Configurable Data Product Ensembles. Scientific and Statistical Database Management. pp. 19–36 (2009).
17. Kobayashi, M., Takeda, K.: Information retrieval on the web. ACM Comput. Surv. 32, 144–173 (2000).
18. Lewandowski, D.: Web searching, search engines and Information Retrieval. Information Services and Use. 25, 3, 137–147 (2005).
19. Lord, P., Macdonald, A.: e-Science Curation Report, http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf, (2003).
20. Manning, C.D. et al.: An introduction to information retrieval. Cambridge University Press (2008).
21. Maron, M.E., Kuhns, J.L.: On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM). 7, 3, 216–244 (1960).
22. Miller, C.C.: A Beast in the Field: The Google Maps mashup as GIS/2. Cartographica. 41, 3, 187–199 (2006).
23. Miller, H.J., Wentz, E.A.: Representation and Spatial Analysis in Geographic Information Systems. Annals of the AAG. 93, 3, 574–594 (2003).
24. Montello, D.: The geometry of environmental knowledge. Theories and methods of spatio-temporal reasoning in geographic space. 136–152 (1992).
25. Perlman, E. et al.: Data Exploration of Turbulence Simulations Using a Database Cluster. Proceedings of the 2007 ACM/IEEE conference on Supercomputing-Volume 00. pp. 1–11 (2007).
26. Sharifzadeh, M., Shahabi, C.: The spatial skyline queries. Proc. of VLDB. p. 762 (2006).
27. Stolte, E., Alonso, G.: Efficient exploration of large scientific databases. Proc. of VLDB. p. 633 (2002).
24
Example Sensor Types and Associated Data Characteristics
Water Samples
  Time: “point in time”
  Location: x,y,z point
  Quantity: hundreds
  Observations per: 1
Cruises
  Time: weeks
  Location: hundreds of miles
  Quantity: ~ 4 per year
  Observations per: millions
Gliders, Autonomic Unmanned Vehicles
  Time: hours / days
  Location: miles; x,y,z
  Quantity: 10s per year
  Observations per: million