Top Banner
1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.
33

1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

Mar 26, 2015

Download

Documents

Natalie Bell
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

1

There Goes the Neighborhood!Spatial (or N-Dimensional) Search

in a Relational World

Jim Gray, Microsoft

Alex Szalay, Johns Hopkins U.

Page 2: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

2

Page 3: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

3

Background• I have been working with

Astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB

• A great Web Services app• The work here

joint with Alex Szalay

• SkyServer.Sdss.Org is first installment,

• SkyQuery.Netis second installment (federated web services)

Page 4: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

4

Outline

• How to do spatial lookup:– The old way: HTM– The new way: zoned lookup

• A load workflow system

• Embedded documentation:literate programming for SQL DDL

Page 5: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

5

Spatial Data Access – SQL extension

Szalay, Kunszt, Brunner http://www.sdss.jhu.edu/htm

• Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins

• Every object has a 20-deep Mesh ID• Given a spatial definition,

routine returns up to 10 covering triangles

• Spatial query is then up to 10 range queries

• Fast: 1,000 triangles / second / Ghz2,2

2,1

2,0

2,3

2,3,0

2,3,12,3,2 2,3,3

2,2

2,1

2,0

2,32,2

2,1

2,0

2,3

2,3,0

2,3,12,3,2 2,3,3

2,3,0

2,3,12,3,2 2,3,3

Page 6: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

6

A typical call-- find objects within 1 arcminute of (60,20)select objID, ra, dec from PhotoObj as p, fHtmCover(60,20,1) as triangle where p.htmID between triangle.startHtmID and triangle.endHtmID and <geometry test on (ra,dec) – (60,20) < 1 arcmin>

-- or better yetselect objID, ra, dec, distance from dbo.fGetNearbyObjEq(60,20,1)

careful distance test rejects

false positives

Coarse distance test

Coarse filterCoarse filter Correct filterCorrect filter

Page 7: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

7

Integration with CLR Makes it Nicer

• Peter Kukol converted 500 lines of external stored procedure “glue code” to 50 lines of C# code.

• Now we are converting library to C#

• Also, Cross Apply is VERY useful select objID, count(*)from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)

Page 8: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

8

But…

• Wanted a faster way to do this: some computations were taking toooooo long (see below).

• Wanted to define areas in relational form.

• Wanted a portable way that works on any relational system.

• So, developed a “constraint database” approach – see below.

Page 9: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

9

The Idea:Equations Define Subspaces

• For (x,y) above the lineax+by > c

• Reverse the space by-ax + -by > -c

• Intersect 3 half-spaces: a1x + b1y > c1

a2x + b2y > c2

a3x + b3y > c3

x

y

x=c/a

y=c/b

ax + by = c

x

y

Page 10: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

10

The Idea:Equations Define Subspaces

a1x + b1y > c1

a2x + b2y > c2

a3x + b3y > c3x

y

select count(*)from convex where a*@x + b*@y < c

3

2

22

1 1

1

select count(*)from convex where a*@x + b*@y > c

x

y

0

1

11

2 2

2

Page 11: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

11

Domain is Union of Convex Hulls

• Simple volumes are unions of convex hulls.

• Higher order curves also work

• Complex volumes have holes and their holes have holes. (that is harder).

Not a convex hull

+

Page 12: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

12

Now in Relational Termscreate table HalfSpace (

domainID int not null -- domain name foreign key references Domain(domainID), convexID int not null, -- grouping a set of ½ spaces halfSpaceID int identity(), -- a particular ½ space a float not null, -- the (a,b,..) parameters b float not null, -- defining the ½ space c float not null, -- the constraint (“c” above) primary key (domainID, convexID, halfSpaceID)

(x,y) inside a convex if it is inside all lines of the convex(x,y) inside a convex if it is NOT OUTSIDE ANY line of the convex

Convexes containing point (@x,@y):select convexID -- return the convex hullsfrom HalfSpace -- from the constraintswhere (@x * a + @y * b) < c -- point outside the line?group by all convexID -- insist no line of convexhaving count(*) = 0 -- is outside (count outside ==

0)

Page 13: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

13

All Domains Containing this Point

• The group by is supported by the domain/convex index, so it’s a sequential scan (pre-sorted!).

select distinct domainID -- return domains

from HalfSpace -- from constraints

where (@x * a + @y * b) < c -- point outside

group by all domainID, convexID -– never happens having count(*) = 0 -- count outside == 0

Page 14: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

14

The Algebra is Simple (Boolean)@domainID = spDomainNew (@type varchar(16), @comment varchar(8000))@convexID = spDomainNewConvex (@domainID int)@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @a float, @b float, @c float)@returnCode = spDomainDrop(@domainID)

select * from fDomainsContainPoint(@x float, @y float) Once constructed they can be manipulated with the Boolean operations.@domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))@domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))@domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))

Page 15: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

15

What! No Bounding Box?

• Bounding box limits search.A subset of the convex hulls.

• If query runs at 3M half-space/sec then no need for bounding box, unless you have more than 10,000 lines.

• But, if you have a lot of half-spaces then bounding box is good.

Page 16: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

16

OK: “solved” Areas Contain Point? What about: Points near point?

• Table-valued function find points near a point

– Select * from fGetNearbyEq(ra,dec,r)• Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/

– Space filling curve, bounding triangles…– Standard approach

• 13 ms/call… So 70 objects/second.• Too slow, so pre-compute neighbors:

Materialized view.• At 70 objects/sec: takes 6 months

to compute materialized view on billion objects.

Page 17: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

17

Zone Based Spatial Join• Divide space into zones• Key points by Zone, offset

(on the sphere this need wrap-around margin.)

• Point search look in a few zonesat a limited offset: ra ± ra bounding box that has

1-π/4 false positives• All inside the relational engine• Avoids “impedance mismatch” • Can “batch” all-all comparisons• 33x faster and parallel

6 days, not 6 months!

r ra-zoneMax

√(r2+(ra-zoneMax)2)cos(radians(zoneMax))

zoneMax

x

Ra ± x

Page 18: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

18

In SQL: points near point

select o1.objID -- find objectsfrom zone o1 -- in the zoned tablewhere o1.zoneID between -- where zone #

floor((@dec-@r)/@zoneHeight) and -- overlaps the circleceiling((@dec+@r)/@zoneHeight)

and o1.ra between @ra - @r and @ra + @r -- quick filter on ra and o1.dec between @dec-@r and @dec+@r -- quick filter on dec and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))))

< @r -- careful filter on distance

Eliminates the ~ 21% = 1-π/4False positives

Bounding box

Page 19: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

19

Quantitative Evaluation: 7x faster than external stored proc:

(linkage is expensive)time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate.

time vs. best time @ various radius. A zoneHeight of 4” is near-optimal

Rows vs elapsed timefit is 1.46+2.2e-4*r^2 ms/asec

1

10

100

1000

10 100 1000

r (asec)

tim

e (

se

c)

7.5 asec

15 asec30 asec1 amin

2 amin4 amin

64 aminr 2̂ fit

Relative time vs zone height (asec)4 minute zone is near optimal

2 & 8 minute are slower

1.00

1.10

1.20

1.30

1.40

1.50

1.60

1.70

1.80

1.90

2.00

10 100 1000r (asec)

tim

e vs

bes

t

7 asec15 asec30 asec1 amin2 amin4 amin16 amin1 degree

Page 20: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

20

All Neighbors of All points(can Batch Process the Joins)

• A 5x additional speedup (35x in total)for @deltaZone in {-1, 0, 1} example ignores some spherical geometry details in paper

insert neighbors -- insert one zone's neighbors select o1.objID as objID, -- object pairs

o2.objID as NeighborObjID, .. other fields elided

from zone o1 join zone o2 -- join 2 zones on o1.zoneID-@deltaZone = o2.zoneID -- using zone number and ra

and o2.ra between o1.ra - @r and o1.ra + @r -- points near rawhere -- elided margin logic, see paper. and o2.dec between o1.dec-@r and o1.dec+@r -- quick filter on dec and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2))

< @r -- careful filter on distance

Page 21: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

21

Spatial Stuff Summary• Easy

– Point in polygon– Polygons containing points– (instance and batch)

• Works in higher dimensions

• Side note: Spherical polygons are – hard in 2-space– Easy in 3-space

Page 22: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

22

Spatial Stuff Summary• Constraint databases are in

– Streams (data is query, query is in DB)– Notification: subscription in DB, data is query– Spatial: constraints in DB, data is query

• You can express constraints as rows

• Then You – Can evaluate LOTS of predicates per second– Can do set algebra on the predicates.

• Benefits from SQL parallelism

• SQL == Prolog // DataLog?

Page 23: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

23

References

• Representing Polygon Areas and Testing Point-in-Polygon

Containment in a Relational Database http://research.microsoft.com/~Gray/papers/Polygon.doc

• A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc

Page 24: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

24

Outline

• How to do spatial lookup:– The old way: HTM– The new way: zoned lookup

• A load workflow system

• Embedded documentation:literate programming for SQL DDL

Page 25: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

25

Loading consists of Many Tasks• Very simply:

capture, analyze to produce catalog info, convert to sql,

validate, import,

index,

• Each of these steps has many sub-steps• We learned from the TerraServer that

(1) LOADING IS WHERE THE TIME GOES.(2) You get to load it again when you discover better data you discover a bug in the data

you discover a better design.(3) Essentially, you are always loading or preparing to load.

• Everyone “knows this” but you have to experience it to grasp it.

TelescopeObservation

GenerateCatalogs

Validatecatalogs

Load toDatabase

Page 26: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

26

The SkyServer Load Manager• Built a workflow engine with SQL Agent

(our batch job scheduler) and DTS• State machine is in database

visible to all worker nodes• Workers at each node pull work • Step is a stored procedure

– logs to load monitor database – 3 levels of logging: job, step, phase– Logs and reporting are VERY useful

• Automatic some manual backout to fix problems

• Demo http://skyserver.pha.jhu.edu/admin/tasklist.asp?all=1

Page 27: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

27

Outline

• How to do spatial lookup:– The old way: HTM– The new way: zoned lookup

• A load workflow system

• Embedded documentation:literate programming for SQL DDL

Page 28: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

28

How do you document your schema?

Page 29: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

29

Knuth’s Literate Programming

• Put documentation in the program• Tool generates manual from program..

In the new world: tool generates online hypertext from program.

• We Annotated every table, view, function, .. with tags:unitscommentreference to other tables (flags, star schema).

• Program scans DDL, generates web site.• http://skyserver.sdss.org/en/help/docs/browser.asp

http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?userid=1BHUJQ3VNI&isbn=0937073806&itm=1

Page 30: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

30

CREATE FUNCTION fGetNearbyObjXYZ (@nx float, @ny float, @nz float, @r float)---------------------------------------------------------------/H Returns table of primary objects within @r arcmins of an xyz point (@nx,@ny, @nz).---------------------------------------------------------------/T There is no limit on the number of objects returned, but there are about 40 per sq arcmin.--/T <p>returned table: --/T <li> objID bigint PRIMARY KEY, -- Photo primary object identifier--/T <li> run int NOT NULL, -- run that observed this object --/T <li> camcol int NOT NULL, -- camera column that observed the object--/T <li> field int NOT NULL, -- field that had the object--/T <li> rerun int NOT NULL, -- computer processing run that discovered the object--/T <li> type int NOT NULL, -- type of the object (3=Galaxy, 6= star, see PhotoType in DBconstants)--/T <li> cx float NOT NULL, -- x,y,z of unit vector to this object--/T <li> cy float NOT NULL,--/T <li> cz float NOT NULL,--/T <li> htmID bigint, -- Hierarchical Triangular Mesh id of this object--/T <li> distance float -- distance in arc minutes to this object from the ra,dec.--/T <br> Sample call to find PhotoObjects within 5 arcminutes of xyz -.0996,-.1,0--/T <br><samp>--/T <br>select *--/T <br>from dbo.fGetNearbyObjXYZ(-.996,-.1,0,5) --/T </samp> --/T <br>see also fGetNearbyObjEq, fGetNearestObjXYZ, fGetNearestObjXYZ------------------------------------------------------------- RETURNS @proxtab TABLE ( objID bigint PRIMARY KEY,

Page 31: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

31

CREATE TABLE Chunk (---------------------------------------------------------------------------------/H Contains basic data for a Chunk----/T A Chunk is a unit for SDSS data export. --/T It is a part of an SDSS stripe, a 2.5 degree wide cylindrical segment --/T aligned at a great circle between the survey poles. --/T A Chunk has had both strips completely observed. Since --/T the SDSS camera has gaps between its 6 columns of CCDs, each stripe has --/T to be scanned twice (these are the strips) resulting in 12 slightly --/T overlapping narrow observation segments. <P>--/T Only those parts of a stripe are ready for export where the observation --/T is complete, hence the declaration of a chunk, usually consisting of 2 runs. -------------------------------------------------------------------------------

chunkNumber int NOT NULL, --/D Unique chunk identifierstartMu int NOT NULL , --/D Starting mu value --/U arcsecendMu int NOT NULL , --/D Ending mu value --/U arcsec[stripe] int NOT NULL , --/D Stripe numberexportVersion varchar(32) NOT NULL , --/D Export Version

)

Page 32: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

32

SkyServer Object Browserhttp://skyserver.sdss.org/en/help/docs/browser.asp

• VB program scans DDL, generates web site.

• It is VERY usefulFree Text Search VERY useful

Page 33: 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

33

Outline• How to do spatial lookup:

– The old way: HTM– The new way: zoned lookup

• A load workflow system

• Embedded documentation:literate programming for SQL DDL

• Not shown: A C# web service (http://skyservice.pha.jhu.edu/SdssCutout) or better yet!

http://skyservice.pha.jhu.edu/dr1/imgcutout/chart.asp