
222p187 Bentley

May 29, 2018

  • 8/9/2019 222p187 Bentley

    1/11

    K-d Trees for Semidynamic Point Sets

    Jon Louis Bentley

    AT&T Bell Laboratories, Murray Hill, NJ 07974

    ABSTRACT

    A K-d tree represents a set of N points in K-dimensional space. Operations on a semidynamic tree may delete and undelete points, but may not insert new points. This paper shows that several operations that require O(log N) expected time in general K-d trees may be performed in constant expected time in semidynamic trees. These operations include deletion, undeletion, nearest neighbor searching, and fixed-radius near neighbor searching (the running times of the first two are proved, while the last two are supported by experiments and heuristic arguments). Other new techniques can also be applied to general K-d trees: simple sampling reduces the time to build a tree from O(KN log N) to O(KN + N log N), and more advanced sampling builds a robust tree in the same time. The methods are straightforward to implement, and lead to a data structure that is significantly faster and less vulnerable to pathological inputs than ordinary K-d trees.

    1. Introduction

    A K-dimensional binary search tree (abbreviated K-d tree) represents a set of points in K-dimensional space. Many kinds of searches can be performed on a K-d tree, including exact-match, partial-match, and range queries. The discussion of K-d trees in many textbooks concentrates on these query types, which are important in database applications (see, for instance, Mehlhorn [1984, Section 2.1], Preparata and Shamos [1985, Section 2.3.2], and Sedgewick [1988, Chapter 26]).

    This paper is concerned with K-d trees that support two kinds of proximity queries:

    Nearest neighbor query. Report the nearest neighbor to a given query point, under a specified metric.

    Fixed-radius near neighbor query. Report all points within a fixed radius of the given query point, under a specified metric.

    Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

    1990 ACM 0-89791-362-0/90/0006/0187 $1.50

    Additionally, we will focus on semidynamic point sets. An original set of N points is known when the K-d tree is built; subsequent operations may delete points from the set or undelete points that have been previously deleted. New points may not be inserted.

    Restricting K-d trees to proximity queries on semidynamic point sets leads to a particularly efficient implementation. Because the points are known in advance, bottom-up algorithms reduce the expected time for operations on a tree from O(log N) to O(1). Bentley [1990b] uses the data structure to solve a number of closeness problems in computational geometry, such as computing minimum spanning trees, matchings, and many kinds of traveling salesman tours.

    Section 2 of this paper reviews previous work on K-d trees. Section 3 defines the abstract data type implemented by the algorithms described later. Section 4 introduces bottom-up algorithms for the simple delete and undelete operations, and Sections 5 and 6 describe bottom-up searches. The algorithms in Section 7 for building K-d trees are also applicable to general K-d trees. Conclusions are offered in Section 8.

    2. Previous Work

    The multidimensional binary search tree introduced by Bentley [1975] generalizes the standard one-dimensional binary search tree. The first part of this section will review the variant described by Friedman, Bentley and Finkel [1977], which distinguishes between two kinds of nodes: internal nodes partition the space by a cut plane defined by a value in one of the K dimensions, and external nodes (or buckets) store the points in the resulting hyperrectangles of the partition. Figure 2.1 shows the partition given by a 2-d tree representing 300 points drawn uniformly from the unit square (each bucket has at most 3 points).

    A node in a K-d tree can be represented by the following C++ structure (for information on C++, see Stroustrup [1986]):

        struct kdnode {
            int bucket;
            int cutdim;
            float cutval;
            kdnode *loson, *hison;
            int lopt, hipt;
        };

    The variable bucket is 1 if the node is a bucket and 0 if it is an internal node. In an internal node, cutdim gives the dimension being partitioned, cutval is a value in that dimension, and loson and hison are pointers to its two subtrees (containing, respectively, points not greater than and not less than cutval in that dimension). In a bucket node, lopt and hipt are indices into the global permutation vector perm[n]; the hipt-lopt+1 integers in perm[lopt..hipt] give the indices of the points in the bucket (thus perm is just a convenient set representation).

    Figure 2.1. A 2-d tree of 300 points.

    As the tree is being built, two pointers into perm represent a subset of the points. The tree is built by this code:

        for (i = 0; i < n; i++)
            perm[i] = i;
        root = build(0, n-1);

    The main work is done by the recursive function build, shown in Program 2.1. The code is written in C++; for brevity, declarations have been omitted. The function findmaxspread returns the dimension with largest difference between minimum and maximum among the points in perm[l..u]. The function px(i,j) accesses the j-th coordinate of point perm[i]. The function select permutes perm[l..u] such that perm[m] contains a point that is not greater in the p->cutdim-th dimension than any point to its right, and is similarly not less than the points to its left. Function build runs in O(KN log N) time.

        kdnode *build(int l, int u)
        {   p = new kdnode;
            if (u-l+1 <= cutoff) {
                p->bucket = 1;
                p->lopt = l;
                p->hipt = u;
            } else {
                p->bucket = 0;
                p->cutdim = findmaxspread(l, u);
                m = (l+u)/2;
                select(l, u, m, p->cutdim);
                p->cutval = px(m, p->cutdim);
                p->loson = build(l, m);
                p->hison = build(m+1, u);
            }
            return p;
        }

    Program 2.1. Building a K-d tree.

        extern int nntarget, nnptnum;
        extern float nndist;

        int nn(int j)
        {   nntarget = j;
            nndist = HUGE;
            rnn(root);
            return nnptnum;
        }

        void rnn(kdnode *p)
        {   if (p->bucket) {
                for (i = p->lopt; i <= p->hipt; i++) {
                    thisdist = dist(perm[i], nntarget);
                    if (thisdist < nndist) {
                        nndist = thisdist;
                        nnptnum = perm[i];
                    }
                }
            } else {
                val = p->cutval;
                thisx = x(nntarget, p->cutdim);
                if (thisx < val) {
                    rnn(p->loson);
                    if (thisx + nndist > val)
                        rnn(p->hison);
                } else {
                    rnn(p->hison);
                    if (thisx - nndist < val)
                        rnn(p->loson);
                }
            }
        }

    Program 2.2. Nearest neighbor searching.

    The nn function in Program 2.2 computes the nearest neighbor to a point. The external (global) variables are used by the recursive procedure rnn, which does the work. At a bucket node, rnn performs a sequential nearest neighbor search. At an internal node, the search first proceeds down the closer son, and then searches the farther son only if necessary. (The function x(i,j) accesses the j-th dimension of point i.) These functions are correct for any metric in which the difference between point coordinates in any single dimension does not exceed the metric distance; the Minkowski L1, L2 and L∞ metrics all display this property.
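    Programs 2.1 and 2.2 omit their declarations. The following is a minimal runnable sketch of the same two routines, not the author's implementation. It assumes K = 2, the Euclidean metric, a global point vector pts, and a bucket cutoff of at least 1. The names Point, pts, kdselect (a std::nth_element stand-in for the paper's select) and maketree are hypothetical, and the test that skips the query point itself is an addition so that nn never returns its own argument.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>
#include <vector>

const int K = 2;                       // assumed dimension
struct Point { float c[K]; };
struct kdnode {
    int bucket = 0, cutdim = 0;
    float cutval = 0;
    kdnode *loson = nullptr, *hison = nullptr;
    int lopt = 0, hipt = 0;
};

std::vector<Point> pts;                // the point set (assumed global)
std::vector<int> perm;                 // permutation vector
kdnode *root = nullptr;
int cutoff = 5;                        // maximum bucket size, must be >= 1
int nntarget, nnptnum;
float nndist;

float x(int i, int j)  { return pts[i].c[j]; }        // j-th coordinate of point i
float px(int i, int j) { return pts[perm[i]].c[j]; }  // same, for point perm[i]

float dist(int i, int j) {             // Euclidean distance between points i and j
    float s = 0;
    for (int d = 0; d < K; d++) { float t = x(i, d) - x(j, d); s += t * t; }
    return std::sqrt(s);
}

int findmaxspread(int l, int u) {      // dimension with the largest max - min
    int best = 0;
    float bestspread = -1;
    for (int j = 0; j < K; j++) {
        float lo = px(l, j), hi = lo;
        for (int i = l + 1; i <= u; i++) {
            lo = std::min(lo, px(i, j));
            hi = std::max(hi, px(i, j));
        }
        if (hi - lo > bestspread) { bestspread = hi - lo; best = j; }
    }
    return best;
}

void kdselect(int l, int u, int m, int dim) {   // stand-in for the paper's select
    std::nth_element(perm.begin() + l, perm.begin() + m, perm.begin() + u + 1,
        [dim](int a, int b) { return pts[a].c[dim] < pts[b].c[dim]; });
}

kdnode *build(int l, int u) {
    kdnode *p = new kdnode;            // nodes are never freed in this sketch
    if (u - l + 1 <= cutoff) {
        p->bucket = 1; p->lopt = l; p->hipt = u;
    } else {
        p->cutdim = findmaxspread(l, u);
        int m = (l + u) / 2;
        kdselect(l, u, m, p->cutdim);
        p->cutval = px(m, p->cutdim);
        p->loson = build(l, m);        // points not greater than cutval
        p->hison = build(m + 1, u);    // points not less than cutval
    }
    return p;
}

void maketree() {                      // hypothetical helper: fill perm, then build
    perm.resize(pts.size());
    for (int i = 0; i < (int)perm.size(); i++) perm[i] = i;
    root = build(0, (int)perm.size() - 1);
}

void rnn(kdnode *p) {
    if (p->bucket) {
        for (int i = p->lopt; i <= p->hipt; i++) {
            if (perm[i] == nntarget) continue;   // added: skip the query itself
            float thisdist = dist(perm[i], nntarget);
            if (thisdist < nndist) { nndist = thisdist; nnptnum = perm[i]; }
        }
    } else {
        float val = p->cutval, thisx = x(nntarget, p->cutdim);
        if (thisx < val) {             // closer son first, farther son if needed
            rnn(p->loson);
            if (thisx + nndist > val) rnn(p->hison);
        } else {
            rnn(p->hison);
            if (thisx - nndist < val) rnn(p->loson);
        }
    }
}

int nn(int j) {                        // returns -1 if there is no other point
    nntarget = j; nndist = FLT_MAX; nnptnum = -1;
    rnn(root);
    return nnptnum;
}
```

    After filling pts, a caller would run maketree() once and then nn(j) for any point index j.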

    The performance of searching algorithms in K-d trees has proven difficult to analyze. Lee and Wong [1977] give a worst-case bound on the cost of region searching. Friedman, Bentley and Finkel [1977] experimentally investigate the run time of nearest neighbor searching in K-d trees. Section 5 presents several of the experiments. They observe that the expected run time, for fixed K, grows as O(log N) for many input distributions, and give heuristic arguments suggesting why that should be the case. Zolnowsky [1978] presents a variant of K-d tree where the worst-case cost of searching for all nearest neighbors in a set of N points in K-space is $O(N \log^K N)$, and gives a point set that realizes this time bound. Zolnowsky's result implies that the amortized cost of a nearest neighbor search is $O(\log^K N)$.

    Sproull [1988] describes several ways in which the K-d tree algorithms and data structure can be improved. The nearest neighbor search function rnn already incorporates one of Sproull's ideas: it removes the bounds array that consumed most of the CPU time in Friedman, Bentley and Finkel's [1977] program (we will return to this topic in Section 5). For the common case of the Euclidean metric, Sproull also observes that one need not compute the (expensive) true distance to the nearest neighbor. Computing only the square of the distance is clearly sufficient within a bucket. This code fragment shows how the square suffices at internal nodes:

        diff = x(nntarget, dim) - p->cutval;
        if (diff < 0) {
            rnn(p->loson);
            if (nndist2 >= diff*diff) rnn(p->hison);
        } else {
            rnn(p->hison);
            if (nndist2 >= diff*diff) rnn(p->loson);
        }

    The new global variable nndist2 represents the square of nndist.
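    To see why the square suffices, note that the far side of a cut plane needs to be searched exactly when the distance to the plane, |diff|, does not exceed the current nearest-neighbor distance, and squaring both (nonnegative) sides preserves that comparison. The small sketch below states both forms side by side; the function names farsideneeded, farsideneeded2 and dist2 are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Decision made with the true distance: search the far son if the
// nearest-neighbor ball reaches the cut plane.
bool farsideneeded(float diff, float nndist) {
    return nndist >= std::fabs(diff);
}

// The same decision made with the squared distance; no square root
// or absolute value is needed.
bool farsideneeded2(float diff, float nndist2) {
    return nndist2 >= diff * diff;
}

// Squared Euclidean distance between two k-dimensional points, which is
// all a bucket scan needs in order to find the closest point.
float dist2(const float *a, const float *b, int k) {
    float s = 0;
    for (int d = 0; d < k; d++) {
        float t = a[d] - b[d];
        s += t * t;
    }
    return s;
}
```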

    Sproull describes several other improvements to K-d trees. While standard K-d trees use partition planes that are orthogonal to a coordinate axis, Sproull uses arbitrary partition planes. In particular, he uses a plane found from the principal eigenvector of the covariance matrix of the current subset of points. Sproull also offers a number of "coding tricks" that speed up a program implementing K-d trees (two of his suggestions were mentioned earlier).

    3. The Abstract Data Type

    The basic operations on a semidynamic K-d tree for proximity problems are defined in this C++ class:

        class kdtree {
        public:
                 kdtree(pointset *ptset);
            void deletept(int ptnum);
            void undeletept(int ptnum);
            int  nn(int ptnum);
        };

    The first operation creates a tree given a point set. Function deletept removes point ptnum from the set; its dual undeletept returns it to the set. Function nn returns the index of the nearest neighbor to its argument (not the argument itself); ties are broken arbitrarily. In later sections we will add other operations to the class.

    Many closeness problems in computational geometry can be solved using these primitive operations. Here, for instance, is a C++ function to store a nearest neighbor traveling salesman tour of a set in the vector tour[n]:

        void nntour(pointset *ptset,
                    int startpt,
                    int *tour)
        {   tree = new kdtree(ptset);
            tour[0] = startpt;
            tree->deletept(tour[0]);
            for (i = 1; i < n; i++) {
                tour[i] = tree->nn(tour[i-1]);
                tree->deletept(tour[i]);
            }
            delete tree;
        }
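    The interface can be exercised without any K-d tree at all. The sketch below is a hypothetical brute-force stand-in that offers the same three operations at O(N) cost per nn call, so that nntour can be compiled and run. It is not the data structure of this paper; the types Pt and pointset and the use of a deleted-flag vector are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { float x, y; };                 // assumed 2-d point type
typedef std::vector<Pt> pointset;          // assumed point set type

// Brute-force stand-in with the interface of the kdtree class above.
class kdtree {
public:
    explicit kdtree(pointset *ptset)
        : pts(ptset), deleted(ptset->size(), false) {}
    void deletept(int ptnum)   { deleted[ptnum] = true; }
    void undeletept(int ptnum) { deleted[ptnum] = false; }
    // Nearest undeleted point other than ptnum; -1 if none remains.
    int nn(int ptnum) const {
        int best = -1;
        float bestd2 = 0;
        for (std::size_t i = 0; i < pts->size(); i++) {
            if (deleted[i] || (int)i == ptnum) continue;
            float dx = (*pts)[i].x - (*pts)[ptnum].x;
            float dy = (*pts)[i].y - (*pts)[ptnum].y;
            float d2 = dx * dx + dy * dy;
            if (best < 0 || d2 < bestd2) { best = (int)i; bestd2 = d2; }
        }
        return best;
    }
private:
    pointset *pts;
    std::vector<bool> deleted;
};

// Nearest neighbor tour, following the function in the text.
void nntour(pointset *ptset, int startpt, int *tour) {
    int n = (int)ptset->size();
    kdtree *tree = new kdtree(ptset);
    tour[0] = startpt;
    tree->deletept(tour[0]);
    for (int i = 1; i < n; i++) {
        tour[i] = tree->nn(tour[i - 1]);
        tree->deletept(tour[i]);
    }
    delete tree;
}
```

    Swapping the stand-in for a real semidynamic K-d tree leaves nntour unchanged, which is the point of defining the operations as an abstract data type.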

    Here is a (partial) list of other closeness problems that Bentley [1990b] solves using the primitive operations in this class:

    Minimum spanning trees: true MST, degree-constrained MST.

    Matchings: greedy matching, 2-opting a matching.

    Traveling salesman problem heuristics: nearest neighbor, minimum spanning tree, approximate Christofides' heuristic, multiple fragment (greedy), nearest insertion, farthest insertion, random insertion, nearest addition, farthest addition, random addition.

    Local improvements to traveling salesman tours: two-opt, two-and-a-half-opt, three-opt.

    Most of these algorithms have (experimentally observed) running times of O(N log N); the K-d trees of this paper play a crucial role in the rapid implementations.

    4. Bottom-Up Deletion and Undeletion

    To support semidynamic point sets, we will add the new field empty to each node in the tree. The field is 1 if all points in the subtree have been deleted and is 0 otherwise. Search procedures are modified to return immediately when they visit an empty node. To delete a point from the tree we first delete it from its bucket and then (as needed) turn on empty bits on the path to the root. To delete a point from a bucket we swap that point in the perm vector with the current hipt, and decrement the latter. This strategy does not attempt to rebalance the tree as elements are deleted; we will see in Section 5 that deletions do not have a strong negative effect on search time.

    A recursive top-down deletion algorithm starts at the root, searches down to the proper bucket, swaps and decrements, and (as needed) turns on empty flags on the way up. There is, however, the bothersome complication that a point with a value equal to the discriminator might be in either son. Not only does this clutter the code, but deleting one of M identical points must sometimes involve investigating all M points.

    A bottom-up version of the function is cleaner and faster. It requires two changes to data structures: each node in the tree contains a father pointer, and the array element bucketptr[i] points to the bucket that contains point i. Both of these changes are straightforward to incorporate into the build function. Here is the resulting deletion code:


        void deletept(int pointnum)
        {   p = bucketptr[pointnum];
            j = p->lopt;
            while (perm[j] != pointnum)
                j++;
            permswap(j, (p->hipt)--);
            if (p->lopt > p->hipt) {
                p->empty = 1;
                while ((p = p->father)
                        && p->loson->empty
                        && p->hison->empty)
                    p->empty = 1;
            }
        }

    Any single deletion requires at most O(log N) time, because the tree has that depth. The total cost of sequentially deleting every point in a set is O(N); the analysis is isomorphic to that of building a heap bottom-up (see Theorem 3.5 of Aho, Hopcroft and Ullman [1974]). The amortized cost of a deletion is therefore O(1). (The linear-time function deleteall removes all points in a set with less overhead by traversing the tree.)

    The undelete function has a similar structure: it performs a sequential search within the bucket, increments hipt and performs a perm swap, then finally moves up the tree, turning off as many empty flags as necessary. As with deletion, a single undeletion requires at most O(log N) time, and the total cost of sequentially undeleting every point in a set is O(N), for constant amortized cost.
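    The deletion code and the description of undeletion can be combined into one small runnable sketch. It is an illustration under stated assumptions rather than the author's code: because neither operation looks at coordinates, the hypothetical build below simply halves ranges of perm instead of choosing cut planes, maketree is a made-up helper, and both operations assume they are called on a point in the right state (present for deletept, deleted for undeletept).

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct kdnode {
    int bucket = 0, empty = 0;
    kdnode *loson = nullptr, *hison = nullptr, *father = nullptr;
    int lopt = 0, hipt = 0;
};

std::vector<int> perm;               // permutation vector
std::vector<kdnode*> bucketptr;      // bucketptr[i] = bucket holding point i
int cutoff = 2;                      // maximum bucket size, must be >= 1

// Simplified build: splits perm[l..u] in half and records father and
// bucketptr. Coordinates are ignored; nodes are never freed.
kdnode *build(int l, int u, kdnode *father) {
    kdnode *p = new kdnode;
    p->father = father;
    if (u - l + 1 <= cutoff) {
        p->bucket = 1; p->lopt = l; p->hipt = u;
        for (int i = l; i <= u; i++) bucketptr[perm[i]] = p;
    } else {
        int m = (l + u) / 2;
        p->loson = build(l, m, p);
        p->hison = build(m + 1, u, p);
    }
    return p;
}

kdnode *maketree(int n) {            // hypothetical helper
    perm.resize(n);
    bucketptr.assign(n, nullptr);
    for (int i = 0; i < n; i++) perm[i] = i;
    return build(0, n - 1, nullptr);
}

void deletept(int pointnum) {        // assumes pointnum is currently in the set
    kdnode *p = bucketptr[pointnum];
    int j = p->lopt;
    while (perm[j] != pointnum) j++;
    std::swap(perm[j], perm[p->hipt]);
    p->hipt--;
    if (p->lopt > p->hipt) {         // bucket became empty: mark ancestors
        p->empty = 1;
        while ((p = p->father) && p->loson->empty && p->hison->empty)
            p->empty = 1;
    }
}

void undeletept(int pointnum) {      // assumes pointnum is currently deleted
    kdnode *p = bucketptr[pointnum];
    int j = p->hipt + 1;             // deleted points sit just past hipt
    while (perm[j] != pointnum) j++;
    p->hipt++;
    std::swap(perm[j], perm[p->hipt]);
    while (p && p->empty) {          // clear empty flags on the way up
        p->empty = 0;
        p = p->father;
    }
}
```

    Both loops over the ancestors stop at the first node whose flag is already correct, which is what keeps the amortized cost of a long run of deletions constant.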

    Some sequences of O(N) deletions and undeletions can require time proportional to N log N. The first N/2 operations delete all the elements in one subtree of the root in O(N) time, then N subsequent pairs of operations delete and undelete an arbitrary element of that subtree; each of the operations must proceed from the node to the root, so the cost of each is proportional to log N. Fortunately, many sequences of operations that occur in geometric algorithms can be shown to require O(N) time.

    5. Bottom-Up Searching

    Before we consider bottom-up search algorithms, we will briefly study the run time of the top-down nearest neighbor search algorithm (given by functions nn and rnn in Section 2). Figure 5.1 summarizes 100 experiments with N ranging from 100 to 100,000, uniformly on a logarithmic scale. For each value, we generated N points uniformly on the unit square, built a 2-d tree for the point set (with the bucket cutoff set to 5), and performed a search to find the nearest neighbor of each point in the set. The circles in the graph plot the average number of distance calculations per search made within the buckets, and the crosses plot the average number of internal nodes visited by each search.

    [Figure 5.1. All nearest neighbors, top-down search, bucket cutoff 5. Horizontal axis: N, from 100 to 100,000 on a logarithmic scale; plotted series: Nodes, Dists.]

    Figure 5.1 shows several aspects of top-down nearest neighbor search. Both variables display a cyclic behavior, periodic at powers of two. Because the bucket cutoff is fixed (5 in this case), the average number of points in a bucket increases and then decreases periodically at powers of two, which is the rate at which new levels are added to the tree. Notice the tradeoff between distance calculations and nodes visited: when there are more points in each bucket, the algorithm makes more distance calculations and visits fewer nodes. The line at lg N + 4 shows that the number of nodes visited is growing logarithmically. The number of distance calculations appears to be approaching oscillation around a constant near ten.

    Our next set of experiments uses an inefficient variant of K-d tree to simplify analysis of the data: the bucket cutoff is set to one. For each N a power of two from $2^5 = 32$ to $2^{17} = 131072$, we generated 10 sets of N points at random on the unit square, built a 2-d tree, and then searched for all nearest neighbors in the point set. The bucket cutoff of one decreases the number of distance calculations but increases the number of nodes visited. Figure 5.2 shows that the average number of distance calculations appears to approach a constant near 5. The number of nodes visited appears to approach the line on the graph at lg N + 14.

    [Figure 5.2. All nearest neighbors, top-down search, bucket cutoff 1. Horizontal axis: N, from 100 to 100,000 on a logarithmic scale; plotted series: Nodes, Dists.]


        int nn(int j)
        {   nntarget = j;
            nndist2 = HUGE;
            p = bucketptr[nntarget];
            rnn(p);
            while (1) {
                lastp = p;
                p = p->father;
                if (p == 0) break;
                diff = x(nntarget, p->cutdim)
                        - p->cutval;
                if (nndist2 >= diff*diff) {
                    if (lastp == p->loson)
                        rnn(p->hison);
                    else
                        rnn(p->loson);
                }
                if (ball2inbounds(p->bnds,
                        nntarget, nndist2))
                    break;
            }
            return nnptnum;
        }

    Program 5.1. Bottom-up nearest neighbor searching.

    These experiments show us how a top-down search typically proceeds: it visits exactly lg N internal nodes on the way to the correct bucket, rummages around a constant number of neighboring buckets, then recursively returns to the root. Friedman, Bentley and Finkel [1977] give probabilistic arguments to support this description. The bottom-up search algorithm in Program 5.1 reduces the O(log N) time to O(1) by starting at the correct bucket. It then works its way up the tree, using the recursive rnn function to search as many "farther" sons as needed. To halt the upward climb we store at each node in the tree a bounds array bnds that describes the hyperrectangle that the node represents (defined by cut planes above it in the tree; details are in Friedman, Bentley and Finkel [1977]). We stop the search when the bounds for the node contain the current nearest-neighbor ball (defined to be a ball centered at the search point with radius equal to the distance to the nearest neighbor). Function ball2inbounds is passed a bounds array, a point index, and a distance (squared). In O(K) time, it returns 1 if the ball centered at the point with that (squared) radius is contained in the hyperrectangle represented by the bounds array.
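    The containment test can be sketched in a few lines. The paper does not give the body of ball2inbounds or the layout of a bounds array, so everything concrete here is an assumption: K = 2, bounds stored as (low, high) pairs per dimension, the Euclidean metric, and a pointer to the point's coordinates in place of the point index that the paper's function takes. A Euclidean ball lies inside an axis-parallel box exactly when its center is inside and, in every dimension, the gap to each face is at least the radius.

```cpp
#include <cassert>

const int K = 2;   // assumed dimension

// Assumed layout: bnds[2*d] is the low bound and bnds[2*d+1] the high
// bound of the hyperrectangle in dimension d. pt holds the K coordinates
// of the ball's center; dist2 is the squared radius.
bool ball2inbounds(const float *bnds, const float *pt, float dist2) {
    for (int d = 0; d < K; d++) {
        float lo = pt[d] - bnds[2 * d];       // gap to the low face
        float hi = bnds[2 * d + 1] - pt[d];   // gap to the high face
        if (lo < 0 || hi < 0) return false;   // center outside the box
        if (lo * lo < dist2 || hi * hi < dist2)
            return false;                     // ball pokes through a face
    }
    return true;                              // O(K) work in total
}
```

    Comparing squared gaps against the squared radius keeps the test free of square roots, in the same spirit as the nndist2 variable of Section 2.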

    Figure 5.3 shows the next set of experiments: bottom-up nearest neighbor searches on exactly the trees summarized in Figure 5.2. A simple analysis (which we'll see shortly) indicated that the number of nodes visited might grow as $a - b N^{-1/2}$, for appropriate constants a and b. A weighted non-linear least squares program was used to fit the number of nodes visited to the model $a + b N^c + d \lg N$, where the logarithmic term can account for work going up the tree. The estimates were a = 19.17, b = -25.88, c = -.386, and d = -.0016. Because the standard error of the estimate of d is .0546 (an order of magnitude larger than the estimate), we may safely assume that d is zero. A second fit with d = 0 showed that the number of nodes visited is accurately described by the function $19.14 - 26.01 N^{-.39}$, which is plotted on the graph. A similar weighted least squares fit showed that the number of distance calculations is approximately $5.11 - 6.18 N^c$ for a negative fitted exponent c, which is also plotted in Figure 5.3 (the values are about one percent greater than the number used by the top-down algorithm).

    [Figure 5.3. All nearest neighbors, bottom-up search. Horizontal axis: N, from 100 to 100,000 on a logarithmic scale; plotted series: Nodes, Dists.]

    The data supports this conjecture.

    Conjecture 5.1: The expected running time of a bottom-up nearest neighbor search in a K-d tree for a point set uniform on the unit square is O(1).

    The conjecture is also supported (but not proved) by several analytical arguments.

    1. Bentley, Weide and Yao [1980] show that nearest neighbor search in a cell data structure requires constant time for many distributions of points. As Figure 2.1 suggests, K-d trees yield a cell-like partition for uniform data; thus their theorem supports the conjecture.

    2. Consider the root of the K-d tree. Arguments in Section 3 of Bentley, Weide and Yao [1980] show that with high probability, bottom-up searches for at most $O(N^{1/2} \log N)$ of the points will have to proceed as high as the root of the tree. If those searches require O(log N) time, then the total cost of performing all N searches obeys the recurrence

        $T(N) = 2T(N/2) + O(N^{1/2} \log^2 N)$

    which has solution T(N) = O(N), and the amortized cost of a search is constant.

    3. We turn now to a more precise analysis of nodes visited. Other arguments of Bentley, Weide and Yao [1980] suggest that, on the average, bottom-up searches for only $2\sqrt{N}$ of the points will proceed as high as the root. If we assume that the cost of each of those searches is lg N, then the total cost of N searches is given by the recurrence

        $T(N) = 2T(N/2) + 2\sqrt{N} \lg N$

    with the boundary condition T(1) = 1. We can investigate the behavior of this recurrence at powers of two by defining $C_i = T(2^i)/2^i$; $C_i$ can be viewed as the average cost of a bottom-up nearest neighbor search in a tree of height i. The recurrence for T becomes

        $C_i = C_{i-1} + i \, 2^{1-i/2}$

    with the boundary $C_0 = 1$. This recurrence can be telescoped into the sum

        $C_n = 1 + 2 \sum_{1 \le i \le n} i \, (2^{-1/2})^i$

    Using the identity $\sum_{i \ge 0} i x^i = x/(1-x)^2$, the series converges to

        $C_\infty = 1 + \sqrt{2}/(1 - 1/\sqrt{2})^2 = 17.4853$

    Figure 5.4 shows the growth of the recurrence, together with the predictor function $19.14 - 26.01 N^{-.39}$ from Figure 5.3, which is shown as a solid line. The recurrence is consistently about two less than the predictor, which accurately describes the experimental data. The difference of two could be adjusted by using a different boundary condition, such as T(32) = 12.3.

    [Figure 5.4. Solution of the recurrence $T(N) = 2T(N/2) + 2\sqrt{N} \lg N$, T(1) = 1. Horizontal axis: N, from 100 to 100,000 on a logarithmic scale; plotted series: $19 - 26 N^{-.4}$ (solid line), T(N)/N (crosses).]

    These arguments are only heuristic, but they reinforce the experimental observations.
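    The limit quoted in the third argument is easy to check numerically. The sketch below iterates the recurrence for the average cost, $C_i = C_{i-1} + i \, 2^{1-i/2}$ with $C_0 = 1$, and compares it with the closed form $1 + \sqrt{2}/(1 - 1/\sqrt{2})^2$. The function names averagecost and limitcost are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Average search cost C_height in a tree of the given height, obtained by
// iterating C_i = C_{i-1} + i * 2^(1 - i/2) from the boundary C_0 = 1.
double averagecost(int height) {
    double c = 1.0;
    for (int i = 1; i <= height; i++)
        c += i * std::pow(2.0, 1.0 - i / 2.0);
    return c;
}

// Closed-form limit from the identity sum of i*x^i = x / (1-x)^2,
// evaluated at x = 2^(-1/2).
double limitcost() {
    double r = std::sqrt(2.0);
    double gap = 1.0 - 1.0 / r;
    return 1.0 + r / (gap * gap);
}
```

    Under these definitions averagecost(17), the height that corresponds to N = 131072, is already close to the limit, which is consistent with the flat right-hand end of the curve described for Figure 5.4.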

    All experiments reported so far have been conducted on static point sets. To study the effect of deletions, Figure 5.5 shows the performance of the bottom-up search when applied to the nearest neighbor tour (function nntour) of Section 3. The nearest neighbor tour was computed for ten point sets uniform on the unit square at each power of two from 32 to 131072. The bucket cutoff was one. Both curves in the graph have the same character as those in Figure 5.3, and the differences are indeed slight: the nearest neighbor tour visits about 6% more K-d tree nodes than computing all nearest neighbors, but uses about 20% fewer distance calculations. Weighted least squares regressions showed that the number of nodes visited was $20.41 - 37.87 N^c$ and the number of distance calculations was $4.22 - 8.70 N^{c'}$, with negative fitted exponents c and c'; both functions are plotted in the graph. Bentley [1990b] observes that similar functions characterize the cost of nearest neighbor searching in many geometric algorithms.

    [Figure 5.5. Nearest neighbor TSP tour, bottom-up search. Horizontal axis: N, from 100 to 100,000 on a logarithmic scale; plotted series: Nodes, Dists.]

    Calling the ball2inbounds function at every node visited during the search is relatively expensive. We therefore store a pointer to a bounds array only at nodes in the tree with depth congruent to zero modulo the variable bndslevel (three was discovered to be an effective choice, and is the value used in experiments henceforth in this paper). We then modify function nn to perform the ball2inbounds test only if the bnds pointer is nonzero. Because a bounds array is represented by 2K floating-point numbers, this also reduces the space required by the tree.

    We will now turn our attention to CPU times. Figure 5.6 shows the results of an experiment on 100 uniform point sets, with N varying from 100 to 100000. The programs were implemented in C++ on a VAX 8550. The circles show the CPU time (microseconds per point) for constructing a nearest neighbor tour using bottom-up deletion and searching; the crosses show the CPU time for the top-down operations (they do not include the time to build the K-d tree). For all experiments, the bucket cutoff was set to 5. The least squares regression line through the top-down operations is 34.2 lg N + 51 microseconds. A least squares regression for the bottom-up times fit the data to the model $a + b N^c$; the resulting fit of $363 - 538 N^{-.29}$ is also plotted.

    [Figure 5.6. Nearest neighbor TSP tour, microseconds per point. Horizontal axis: N, from 100 to 100,000 on a logarithmic scale; plotted series: Top-Down, Bottom-Up.]

    So far we have studied the performance of the bottom-up search algorithm only on data uniform on the unit square. Some algorithms that perform well on uniform data sets perform poorly in real applications; K-d trees were designed to avoid this problem by adapting to the input data. To see how well they accomplish this, we will study an experiment using ten input distributions that are very nonuniform. The distributions are described in this table (the abbreviation U[0,1] signifies data uniform on the closed interval [0,1] and Normal(s) signifies normally distributed data with mean 0 and standard deviation s):

        NAME        DESCRIPTION
        uni         uniform within the unit square (U[0,1]^2)
        annulus     uniform on the edge of a circle (width zero annulus)
        arith       x0 = 0, 1, 4, 9, 16, ... (arithmetic differences); x1 = 0
        ball        uniform inside a circle
        clusnorm    Normal(0.05) at 10 points from U[0,1]^2
        cubediam    x0 = x1 = U[0,1]
        cubeedge    x0 = U[0,1]; x1 = 0
        corners     U[0,1]^2 at (0,0), (2,0), (0,2), (2,2)
        grid        N points chosen from a square grid of 1.3N points
        normal      each dimension from Normal(1)
        spokes      N/2 points at (U[0,1], 1/2); N/2 at (1/2, U[0,1])

    Some of these distributions have served as worst-case counterexamples in papers on geometric algorithms, while others model inputs in various geometric applications.
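    Several rows of the table translate directly into generators. The sketch below covers six of them (uni, annulus, arith, cubediam, cubeedge and spokes) as an illustration of how such test inputs might be produced; it is not the generator used for the experiments. The function name generate, the Pt type, the seeding, and the choice of a unit circle centered at the origin for annulus are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <string>
#include <vector>

struct Pt { double x, y; };   // assumed 2-d point type

// Returns n points drawn from the named distribution. Unknown names
// trip the assertion at the end of the chain.
std::vector<Pt> generate(const std::string &name, int n, unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    const double twopi = 2.0 * std::acos(-1.0);
    std::vector<Pt> pts;
    for (int i = 0; i < n; i++) {
        Pt p = {0.0, 0.0};
        if (name == "uni") {                 // U[0,1]^2
            p.x = U(rng); p.y = U(rng);
        } else if (name == "annulus") {      // on the edge of a circle
            double t = twopi * U(rng);
            p.x = std::cos(t); p.y = std::sin(t);
        } else if (name == "arith") {        // x0 = 0, 1, 4, 9, ...; x1 = 0
            p.x = (double)i * i;
        } else if (name == "cubediam") {     // x0 = x1 = U[0,1]
            p.x = p.y = U(rng);
        } else if (name == "cubeedge") {     // x0 = U[0,1]; x1 = 0
            p.x = U(rng);
        } else if (name == "spokes") {       // two perpendicular segments
            if (i < n / 2) { p.x = U(rng); p.y = 0.5; }
            else           { p.x = 0.5;    p.y = U(rng); }
        } else {
            assert(false && "distribution not covered by this sketch");
        }
        pts.push_back(p);
    }
    return pts;
}
```

    A call such as generate("spokes", 10000) produces an input of the size used in the experiment that follows; the remaining distributions in the table would need their own branches.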

    Figure 5.7 shows the efficiency of bottom-up searching in computing nearest neighbor tours on point sets drawn from these distributions. Five sets of size 10,000 were drawn from each distribution, and the nearest neighbor tour was computed for each. For all experiments, the bucket cutoff was set to 5. Figure 5.7 reports the average number of distance calculations per search (circles) and the average number of nodes visited (crosses) per search. The uniform distribution (leftmost on the graph) provides a benchmark: most distributions have very similar performance, the one-dimensional distributions are more efficient (annulus, arith, cubeedge and cubediam), while one distribution, spokes, is noticeably slower. In Section 7 we'll see how a modified K-d tree gracefully handles the spokes distribution. It is comforting, though, to see that the performance of the algorithms is not substantially slower on nonuniform data. Bentley [1990b] reports that bottom-up searching displays similar performance when employed in a variety of geometric algorithms.

    6. Spherical Queries

    In this section we will examine two types of queries that deal with spheres. The first type is a fixed-radius near neighbor query: it asks which points in the set are contained in a given sphere. The second type is the dual query: each point in the set has an associated radius (and thereby represents a sphere), and a query asks which spheres contain a given point.

    A fixed-radius near neighbor search is defined by an additional member of the C++ kdtree class:

        typedef void (*PFIV)(int i);
        void frnn(int ptnum, float rad, PFIV f);
        void setfrnnrad(float smallerrad);

    The frnn function performs the search: it calls the function f for each point within radius rad of point ptnum. The function parameter uses the declaration PFIV; the abbreviation stands for Pointer to Function of Integer returning Void. The function setfrnnrad shrinks the radius in the middle of a search; shrinking the radius to zero stops the search.

    [Figure 5.7. Nearest neighbor TSP tour, N = 10,000. Horizontal axis: distribution; plotted series: Nodes and Dists.]
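    The calling convention can be illustrated without a tree. The sketch below is a brute-force stand-in for frnn and setfrnnrad, not the bottom-up algorithm of Program 6.1; the globals pts, frnn_rad2, frnn_stopped and found and the callbacks collect and stop_at_first are hypothetical names introduced only for this illustration.

```cpp
#include <vector>

typedef void (*PFIV)(int i);   // Pointer to Function of Integer returning Void

struct Pt { float x, y; };

// Hypothetical globals standing in for members of the kdtree class.
static std::vector<Pt> pts;
static float frnn_rad2 = 0.0f;      // current squared search radius
static bool frnn_stopped = false;   // set when the radius shrinks to zero

// Shrink the radius in mid-search; a radius of zero stops the search.
void setfrnnrad(float smallerrad) {
    frnn_rad2 = smallerrad * smallerrad;
    if (smallerrad <= 0.0f) frnn_stopped = true;
}

// Brute-force stand-in: call f for each point within rad of point ptnum.
// The target itself lies at distance zero and is therefore reported too.
void frnn(int ptnum, float rad, PFIV f) {
    frnn_rad2 = rad * rad;
    frnn_stopped = false;
    for (int i = 0; i < (int)pts.size() && !frnn_stopped; i++) {
        float dx = pts[i].x - pts[ptnum].x;
        float dy = pts[i].y - pts[ptnum].y;
        if (dx * dx + dy * dy <= frnn_rad2) (*f)(i);
    }
}

// Two example callbacks of type PFIV.
static std::vector<int> found;
void collect(int i) { found.push_back(i); }
void stop_at_first(int i) { found.push_back(i); setfrnnrad(0.0f); }
```

    Passing collect gathers every point in the sphere, while stop_at_first shows how a callback can end the search early by shrinking the radius to zero.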

    Program 6.1 gives the recursive function rfrnn for top-down near neighbor searching; it is very similar to the rnn function in Section 2. The function always visits the nearer son first, and thus tends to report the points in increasing order of distance from the target. The bottom-up algorithm frnn in Program 6.1 uses the rfrnn function; it is quite similar to the bottom-up nearest neighbor search function in Section 5. The global variables are the integer nntarget, the floats nndist and nndist2, and the function nnfunc.

    The running time of the search algorithm depends, of course, on many factors. In some applications, though, the query spheres tend to include some constant number of points. Bentley [1990a] describes how this condition holds for 2-opting a traveling salesman tour; it also holds for searching for the M nearest neighbors to a query point. In such applications the algorithm appears to require constant time; its operation is similar to the bottom-up nearest neighbor search in the previous section.

    We will turn now to the dual problem of "ball searching": each point in the set has an associated radius (and thereby represents a sphere), and a query asks which spheres contain a given point. We augment the kdtree class definition with these functions:

        void setrad(int ptnum, float rad);
        void ballsearch(int ptnum, PFIV f);

    Bentley [1990b] uses these functions to implement the "Nearest Insertion", "Farthest Insertion" and "Random Insertion" traveling salesman heuristics.

    To implement a ball search we will add to each bucket node a linked list of all balls that intersect it. The ballsearch function proceeds to its bucket (in constant time) and checks all balls on the list to see whether they contain the query point; it calls the function f on all balls that do. The setrad function makes two passes: the first pass deletes the old radius from the lists that contain it, and the second pass inserts the new radius in the appropriate


        void frnn(int ptnum, float rad, PFIV f)
        {   kdnode *p, *lastp;
            float diff;
            nntarget = ptnum;
            nndist = rad;
            nndist2 = rad*rad;
            nnfunc = f;
            p = bucketptr[nntarget];
            rfrnn(p);                    // search the target's own bucket
            while (1) {                  // then climb toward the root
                lastp = p;
                p = p->father;
                if (p == 0) break;
                diff = x(nntarget, p->cutdim)
                         - p->cutval;
                if (lastp == p->loson) { // search the other son if the
                    if (nndist >= -diff) // ball crosses the cut plane
                        rfrnn(p->hison);
                } else {
                    if (nndist >= diff)
                        rfrnn(p->loson);
                }
                if ((p->bnds != 0) &&    // stop once the ball lies
                    ballinbounds(p->bnds, // within this node's bounds
                            nntarget, nndist))
                    break;
            }
        }

        void rfrnn(kdnode *p)
        {   int i;
            float diff;
            if (p->empty) return;
            if (p->bucket) {
                for (i = p->lopt; i <= p->hipt; i++)
                    if (distsqrd(perm[i], nntarget)
                            <= nndist2)
                        (*nnfunc)(perm[i]);
            } else {
                diff = x(nntarget, p->cutdim)
                         - p->cutval;
                if (diff < 0.0) {        // visit the nearer son first
                    rfrnn(p->loson);
                    if (nndist >= -diff)
                        rfrnn(p->hison);
                } else {
                    rfrnn(p->hison);
                    if (nndist >= diff)
                        rfrnn(p->loson);
                }
            }
        }

    Program 6.1. Bottom-up fixed-radius near neighbor searching.

    lists. The work is done by the top-down recursive procedure in Program 6.2. The bottom-up version of this function is similar to the bottom-up versions of previous functions.
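    The per-bucket ball lists can be sketched independently of the tree. In the sketch below a uniform g x g grid over the unit square stands in for the bucket nodes of the K-d tree; that substitution, the class name BallIndex, the helpers clampi and forcells, and the callback record are all assumptions made for illustration, and a ball is listed in every cell that its bounding box overlaps.

```cpp
#include <algorithm>
#include <vector>

typedef void (*PFIV)(int i);

struct BallIndex {
    int g;                                // grid of g x g cells over [0,1]^2
    std::vector<float> x, y, rad;         // coordinates; rad < 0 means no ball
    std::vector<std::vector<int>> cell;   // per-bucket list of ball numbers

    BallIndex(int g_, std::vector<float> xs, std::vector<float> ys)
        : g(g_), x(xs), y(ys), rad(xs.size(), -1.0f), cell(g_ * g_) {}

    // Cell coordinate of v, clamped to the grid.
    int clampi(float v) const {
        return std::max(0, std::min(g - 1, (int)(v * g)));
    }

    // Apply fn to the list of every cell overlapped by the bounding box
    // of the ball of radius r centered at point i.
    template <class F> void forcells(int i, float r, F fn) {
        for (int cx = clampi(x[i] - r); cx <= clampi(x[i] + r); cx++)
            for (int cy = clampi(y[i] - r); cy <= clampi(y[i] + r); cy++)
                fn(cell[cx * g + cy]);
    }

    void setrad(int ptnum, float r) {
        if (rad[ptnum] >= 0.0f)           // pass 1: delete the old ball
            forcells(ptnum, rad[ptnum], [&](std::vector<int> &l) {
                l.erase(std::remove(l.begin(), l.end(), ptnum), l.end());
            });
        rad[ptnum] = r;                   // pass 2: insert the new ball
        forcells(ptnum, r, [&](std::vector<int> &l) { l.push_back(ptnum); });
    }

    // Go to the query point's bucket and test only the balls listed there.
    void ballsearch(int ptnum, PFIV f) const {
        const std::vector<int> &l =
            cell[clampi(x[ptnum]) * g + clampi(y[ptnum])];
        for (int b : l) {
            float dx = x[b] - x[ptnum], dy = y[b] - y[ptnum];
            if (dx * dx + dy * dy <= rad[b] * rad[b]) (*f)(b);
        }
    }
};

static std::vector<int> hits;             // filled by the callback below
void record(int i) { hits.push_back(i); }
```

    The trade-off is the same as in the tree version: setrad pays for touching every bucket the ball overlaps so that ballsearch needs to inspect only one list.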

    The running time of these algorithms depends on many factors. In applications in which the spheres tend to include a constant number of points, both the setrad and ballsearch functions appear to require constant time. Several alternative schemes were tested, but increased the running time of the functions. One technique, for instance, tested nodes to ensure that their bounds intersected the ball in the current setrad operation. This decreased the total number of nodes visited, but increased the cost of visiting each node; the overall run time was slightly higher.

        void rsetrad(kdnode *p)
        {   if (p->bucket) {
                // insert or delete point
                // number in p->balllist
            } else {
                diff = x(setptnum, p->cutdim)
                         - p->cutval;
                if (diff < 0.0) {
                    rsetrad(p->loson);
                    if (setradius >= -diff)
                        rsetrad(p->hison);
                } else {
                    rsetrad(p->hison);
                    if (setradius >= diff)
                        rsetrad(p->loson);
                }
            }
        }

    Program 6.2. Top-down radius adjustment.

    7. Building The Tree

    In this section we will apply sampling techniques to algorithms for building K-d trees. The first two applications reduce the time for building the tree, while the third application leads to a tree that is more efficient to search. These techniques apply to general K-d trees, as well as to trees for semidynamic point sets.

    The recursive build function in Section 2 uses the function findmaxspread to find the dimension with largest spread among the points in perm[l..u]. When the build function is called on N points, findmaxspread takes O(KN) time while the rest of build requires O(N) time. The function build therefore has the recurrence

        T(N) = 2T(N/2) + O(KN)

    with the boundary T(1) = O(K); the solution is T(N) = O(KN log N). We will reduce the time of build by finding the dimension of maximum spread in a sample of the point set of size √N; the cost of partitioning after the sample is O(N). The recurrence then becomes

        T(N) = 2T(N/2) + O(K√N) + O(N)

    with the same boundary T(1) = O(K); the solution is therefore T(N) = O(KN + N log N).
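    Both solutions can be checked by unrolling the recurrences level by level. The following is a sketch under simplifying assumptions: N is a power of two, the hidden constants are taken to be 1, and level i of the recursion has 2^i subproblems of size N/2^i.

```latex
% Original build: each of the log_2 N internal levels costs KN,
% and the N leaves cost K each.
\[
T(N) = \sum_{i=0}^{\log_2 N - 1} 2^i \cdot K\,\frac{N}{2^i} \;+\; N \cdot K
     = KN \log_2 N + KN = O(KN \log N).
\]

% Sampled build: the spread term becomes a geometric series.
\[
\sum_{i=0}^{\log_2 N - 1} 2^i \cdot K \sqrt{\frac{N}{2^i}}
  = K\sqrt{N} \sum_{i=0}^{\log_2 N - 1} 2^{i/2}
  = K\sqrt{N}\,\frac{\sqrt{N} - 1}{\sqrt{2} - 1}
  = O(KN).
\]

% The partitioning term still costs N per level; the leaves cost K each.
\[
\sum_{i=0}^{\log_2 N - 1} 2^i \cdot \frac{N}{2^i} = N \log_2 N,
\qquad
T(N) = O(KN) + O(N \log N) + N \cdot O(K) = O(KN + N \log N).
\]
```

    The factor K is thus paid only once over the whole tree rather than once per level.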

    To test the effect of this change, an experiment built a K-d tree for ten sets of 100,000 points generated uniformly on the unit square. The average time required to build the tree dropped from 32 seconds to 20 seconds (a decrease of 35%, which would be larger for larger values of K). Profiling showed that in the original program, 14.0 seconds were spent in computing spreads, 15.5 seconds were spent in selecting the median, and 2.5 seconds were spent in the build function itself. Computing the spread on a sample reduced the 14.0 seconds to 2.0 seconds. (The average time for a bottom-up nearest neighbor search in the slightly less balanced tree increased by about 1 percent.)
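    A sampled spread computation might look like the sketch below. It is an illustration under stated assumptions rather than the program that was timed: the name findmaxspread_sampled, the point layout pts[i][j], and drawing the sample with replacement from perm[l..u] are all choices made here.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <random>
#include <vector>

// Returns the dimension of largest spread in a random sample of about
// sqrt(n) of the points perm[l..u]; pts[i][j] is coordinate j of point i.
// Assumption: the sample is drawn with replacement.
int findmaxspread_sampled(const std::vector<std::vector<float>> &pts,
                          const std::vector<int> &perm, int l, int u,
                          std::mt19937 &gen)
{
    int n = u - l + 1;
    int k = (int)pts[0].size();
    int m = std::min(n, std::max(2, (int)std::sqrt((double)n)));
    std::uniform_int_distribution<int> pick(l, u);

    const float inf = std::numeric_limits<float>::infinity();
    std::vector<float> lo(k, inf), hi(k, -inf);
    for (int s = 0; s < m; s++) {                 // O(K sqrt(n)) work
        const std::vector<float> &p = pts[perm[pick(gen)]];
        for (int j = 0; j < k; j++) {
            lo[j] = std::min(lo[j], p[j]);
            hi[j] = std::max(hi[j], p[j]);
        }
    }
    int best = 0;
    for (int j = 1; j < k; j++)
        if (hi[j] - lo[j] > hi[best] - lo[best]) best = j;
    return best;
}
```

    The sampled spread underestimates the true spread in every dimension, so the chosen dimension can occasionally differ from the exact one; this is consistent with the slightly less balanced tree noted above.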

    The bulk of the time of building a tree is now spent in the select function (which uses a median-of-three partition, which is a sample of size 3). We could reduce that time by using the selection algorithm of Floyd and Rivest [1975], which uses a sample of size (roughly) √N to find the median in (roughly) 3N/2 comparisons. Instead, we will compute the true median of a sample of √N elements, and partition around that value (which approximates the median of the set) in just N comparisons. In building ten trees for uniform point sets of size 100,000, this reduces the time to build a tree from 20 seconds to 12 seconds. (The experiments also showed, however, that the average search time increased by about 2 percent; since many applications spend much more time searching than building the tree, the default behavior of the program is not to use this kind of sampling.)
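    Partitioning around the median of a sample can be sketched as follows. For brevity the sketch partitions a plain array of values rather than the perm array keyed on one coordinate; the name sample_partition and the use of std::nth_element and std::partition are assumptions of this illustration, not the select function of the program.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Partition vals[l..u] around the true median of a random sample of about
// sqrt(n) elements. Returns m such that vals[l..m-1] < pivot <= vals[m..u].
int sample_partition(std::vector<float> &vals, int l, int u,
                     std::mt19937 &gen)
{
    int n = u - l + 1;
    int s = std::max(1, (int)std::sqrt((double)n));
    std::uniform_int_distribution<int> pick(l, u);

    std::vector<float> sample(s);
    for (float &v : sample) v = vals[pick(gen)];   // drawn with replacement
    std::nth_element(sample.begin(), sample.begin() + s / 2, sample.end());
    float pivot = sample[s / 2];                   // median of the sample

    // One pass over the data: n applications of the predicate.
    auto mid = std::partition(vals.begin() + l, vals.begin() + u + 1,
                              [pivot](float v) { return v < pivot; });
    return (int)(mid - vals.begin());
}
```

    The split point m is only near the middle of the range, which is the source of the slight imbalance (and the small increase in search time) reported above.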

    So far we have concentrated on faster ways of building the same kind of K-d tree; we will now consider building a better K-d tree. The trees are adaptive in choosing the dimension in which to cut the point set; they always, however, choose to cut near the median. The "spokes" distribution of Section 5 showed why median cuts can be a poor idea. In that distribution, N/2 points are uniform between (1/2, 0) and (1/2, 1) and N/2 points are uniform between (0, 1/2) and (1, 1/2). In other words, the points are distributed uniformly over a plus sign "+" centered in the unit square. If the root cut of the K-d tree is a vertical line (that is, cutdim=0 and cutval=1/2), then the N/2 nearest neighbor searches for points along that line must all proceed to the root; a similar situation occurs for a horizontal cut line. Each of the N/2 searches has logarithmic cost, and the average search cost for the point set grows as log N.

    To find a good cut plane we will use a technique inspired by Bentley and Shamos [1976] (also described by Preparata and Shamos [1985, Section 5.4]). Their algorithm finds all pairs of points within δ of one another in a sparse point set, using the following definition and theorem.

    Definition 7.1: A point set is (δ, C)-sparse if no rectilinearly oriented hypercube of edge length 2δ contains more than C points.

    Theorem 7.2: [Bentley and Shamos] Given a (δ, C)-sparse set of N points in K-space, there exists a cut plane P perpendicular to a coordinate axis with these properties:

    1. At least N/(4K) points are on each side of P.

    2. There are at most KCN^(1-1/K) points within distance δ of P.

    Bentley and Shamos's divide-and-conquer algorithm uses the theorem to find a cut plane, recursively finds near neighbor pairs that are on the same side of the plane, and then finds the relatively few pairs that are on opposite sides of the plane. Its run time is O(N log N), for fixed K.

    We will now use their technique to build K-d trees that are more robust to pathological inputs. Papadimitriou and Bentley [1980] show that some point sets do not have good nearest neighbor cut planes, so this technique does not have worst-case guarantees; it does, however, perform well on the pathological point sets described by Bentley and Shamos [1976] and Zolnowsky [1978]. We will choose a cut plane by processing a sample, S, of M points in the original set of size N. We first recursively build a K-d tree for S, then find for each point in S its nearest neighbor in S; this defines a collection of M balls in space. We then consider each of the K dimensions, and try to choose a cut plane that intersects relatively few balls. Each ball is represented in the projection by three scalars: its center and two endpoints (plus and minus the radius). After sorting the 3M values, we scan through them, keeping track of the number of balls currently intersected; this requires O(KM log M) time.†
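    The scan over the 3M sorted values can be sketched for a single dimension as follows. This is a simplified illustration: it returns the ball center crossed by the fewest balls and leaves out the balance constraint and the penalty term. The names Event and best_cut_1d are hypothetical, as is the tie-breaking rule that treats the projected intervals as closed.

```cpp
#include <algorithm>
#include <vector>

// kind: 0 = left endpoint, 1 = center (candidate cut), 2 = right endpoint.
struct Event { float pos; int kind; int ball; };

// Given the projections of M balls onto one axis, return the index of the
// ball whose center is intersected by the fewest balls (itself included).
int best_cut_1d(const std::vector<float> &center,
                const std::vector<float> &radius)
{
    int m = (int)center.size();
    std::vector<Event> ev;
    for (int i = 0; i < m; i++) {
        ev.push_back({center[i] - radius[i], 0, i});
        ev.push_back({center[i],             1, i});
        ev.push_back({center[i] + radius[i], 2, i});
    }
    // Sort the 3M values; at equal positions open intervals first and
    // close them last, so that intervals are treated as closed.
    std::sort(ev.begin(), ev.end(), [](const Event &a, const Event &b) {
        if (a.pos != b.pos) return a.pos < b.pos;
        return a.kind < b.kind;
    });
    int open = 0, best = -1, bestcount = m + 1;
    for (const Event &e : ev) {
        if (e.kind == 0) open++;                 // a ball starts
        else if (e.kind == 2) open--;            // a ball ends
        else if (open < bestcount) {             // candidate cut at a center
            bestcount = open;
            best = e.ball;
        }
    }
    return best;
}
```

    Running the scan once per dimension gives the O(KM log M) bound quoted above, since the sort dominates each pass.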

    To test variable-cut planes, an experiment generated ten uniform point sets and ten spoke point sets, each of size 10,000. This table reports the average CPU times (in seconds) for building and searching the trees.

                  MEDIAN CUTS              VARIABLE CUTS
    INPUT         Build  Search  Total     Build  Search  Total
    Uniform       1.01   2.97    3.98      1.68   3.03    4.71
    Spokes        1.00   6.49    7.49      1.61   1.75    3.36

    Variable-cut trees require about 65% more CPU time to build than median-cut trees. For uniform distributions, the search time in a variable-cut tree is, as expected, the same as in a median tree. For the spokes distribution, though, the search time drops from being very bad for median trees to being very good for variable-cut trees.

    8. Conclusions

    The algorithms in this paper are easy to implement and are efficient in practice. Although the descriptions in the body of the paper have been for the planar case (K=2), Appendix 2 describes experiments for higher dimensions. The techniques underlying the algorithms appear to be of general interest.

    Semidynamic Data Structures. Point sets that support deletions and undeletions (but not insertions) are general enough to be useful in a broad class of applications but specialized enough to allow efficient operations.

    Bottom-Up Operations. Both searches and maintenance operations (deletion and undeletion) can be performed bottom-up by augmenting the tree data structure with a bucketptr vector and father pointers.

    Sampling. Section 7 describes three applications of sampling in building trees. The run time was reduced by finding the spread of a sample and by partitioning around the median of a sample. The tree was made more robust by finding a cut plane that effectively divides a sample. All the techniques are applicable to general trees, not just trees that represent semidynamic sets. The technique in Section 5 of storing the bounds array only every few levels in the tree can be viewed as choosing a sample of levels.

    † A few implementation details about variable-cut planes. The planes are found only if the subset is sufficiently large (N~1000); the sample is of size M = 10N*'*. The cut plane is chosen so that at least N/(4K) points are on each side of it. We choose to cut at the point with minimum score, where the score of a point is defined to be the number of balls intersected plus the number of points between that point and the median times the penalty factor of N-,,x.


    Caching. Appendix 1 describes how caches reduce the run time of two operations that are associated with K-d trees: building the trees and computing distances.

    Acknowledgments

    I am grateful for the helpful comments of Ken Clarkson, David Johnson, Brian Kernighan, Colin Mallows, Doug McIlroy, Sally McKee, Ravi Sethi, and Chris Van Wyk.

    References

    Aho, A. V., J. E. Hopcroft and J. D. Ullman [1974]. The Design and Analysis of Computer Algorithms, Addison-Wesley.

    Bentley, J. L. [1975]. "Multidimensional binary search trees used for associative searching", Communications of the ACM 18, 9, September 1975, pp. 509-517.

    Bentley, J. L. [1990a]. "Experiments on traveling salesman heuristics", Proceedings First Symposium on Discrete Algorithms, pp. 91-99.

    Bentley, J. L. [1990b]. "Fast algorithms for geometric traveling salesman problems", in preparation.

    Bentley, J. L. and M. I. Shamos [1976]. "Divide and conquer in multidimensional space", Proceedings Eighth ACM Symposium on the Theory of Computing, pp. 220-230.

    Bentley, J. L., B. W. Weide and A. C. Yao [1980]. "Optimal expected-time algorithms for closest point problems", ACM Transactions on Mathematical Software 6, 4, pp. 563-580.

    Floyd, R. W. and R. L. Rivest [1975]. "Expected time bounds for selection", Communications of the ACM 18, 3, March 1975, pp. 165-172.

    Friedman, J. H., J. L. Bentley and R. A. Finkel [1977]. "An algorithm for finding best matches in logarithmic expected time", ACM Transactions on Mathematical Software 3, 3, pp. 209-226.

    Lee, D. T. and C. K. Wong [1977]. "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees", Acta Informatica 9, pp. 23-27.

    Mehlhorn, K. [1984]. Data Structures and Algorithms 3: Multi-dimensional Searching and Computational Geometry, Springer-Verlag.

    Papadimitriou, C. H. and J. L. Bentley [1980]. "A worst-case analysis of nearest neighbor searching by projection", in Carnegie-Mellon University Computer Science Report CMU-CS-80-109.

    Preparata, F. P. and M. I. Shamos [1985]. Computational Geometry, Springer-Verlag.

    Sedgewick, R. [1988]. Algorithms, Second Edition, Addison-Wesley.

    Sproull, R. L. [1988]. "Refinements to nearest-neighbor searching in k-d trees", Sutherland, Sproull and Associates SSA PP #184, July 1988.

    Stroustrup, B. [1986]. The C++ Programming Language, Addison-Wesley.

    Zolnowsky, J. E. [1978]. Topics in Computational Geometry, Ph. D. Thesis, Stanford University, May 1978, STAN-CS-78-659, 53 pp.

    Appendix 1: Caching

    This appendix describes two kinds of caching that are useful in many programs that operate on K-d trees. For concreteness, we will assume that a point set is represented by a C++ class with (at least) these operations:

        class pointset {
        public:
            pointset(char *filename);
            pointset(pointset *ptset,
                     int subsetn,
                     int *subsetvec);
            float x(int i, int j);
            float dist(int i, int j);
            float distsqrd(int i, int j);
            kdtree *grabtree();
            void releasetree();
        };

    A pointset is created either by reading from a file (specified by a string) or by taking a subset of an existing pointset. Note that once a pointset is created, there is no way to change any coordinate of any point. The function x(i, j) accesses the j-th coordinate of point i. The function dist(i, j) returns the distance from point i to point j, and distsqrd returns the square of the distance.

    Straightforward implementations of many algorithms rebuild the same K-d tree several times for a given point set. For instance, a traveling salesman program might use one tree to compute a starting tour and then use another tree to improve the tour by 2-opting. A pointset associates a kdtree with each point set. The tree is seized by the grabtree function (which ensures that all points are undeleted) and is returned by releasetree. This cached tree reduces the time to rebuild a tree from O(KN + N log N) to O(N/B), where B is the bucket size.

    A bottleneck in many applications is computing a distance function. Because there are N(N-1)/2 interpoint distances, it is usually impractical to store them all. Let M be the smallest power of 2 greater than or equal to N; note that M < 2N. This cached distance function uses M integers and M floating point numbers:

        int cachesig[m];     // entries must start at a value that
        float cacheval[m];   // matches no point number

        float dist(int i, int j)
        {   int ind;
            if (i > j) swap(&i, &j);
            ind = i ^ j;
            if (cachesig[ind] != i) {
                cachesig[ind] = i;
                cacheval[ind] = computedist(i, j);
            }
            return cacheval[ind];
        }

    The values i and j are first normalized so that i ≤ j, and are then combined by exclusive or (implemented by the ^ operator) to produce the hash index ind. The integer cachesig[ind] is a signature of the (i, j) pair represented in cacheval[ind] (because ind = i^j, we can recover j by ind^i). The cache algorithm therefore checks the signature entry cachesig[ind] and, if necessary, updates cachesig[ind] and recomputes the value entry cacheval[ind]. It finally returns the correct value.

    [Figure A2.1. All nearest neighbors, bottom-up search, K = 3. Horizontal axis: N (100 to 100,000); plotted series: Nodes and Dists.]

    This cache will prove effective if the underlying algorithm displays a great deal of locality in its use of the distance function. In 2-opting large traveling salesman tours, for instance, the cache hit rate is typically near 75%. Accessing the cache takes just one third the time of computing the distance from scratch, so the total time spent in computing distances is halved.
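    The cache can be packaged as a small self-contained class for experimentation. The sketch below follows the scheme described in this appendix, but the struct name DistCache, the planar Euclidean computedist, the hit and miss counters, and the use of -1 as the initial signature are assumptions added for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct DistCache {
    std::vector<float> px, py;       // point coordinates (planar case)
    std::vector<int> cachesig;       // signature: the smaller index i
    std::vector<float> cacheval;     // cached distance for (i, ind ^ i)
    long hits = 0, misses = 0;       // counters added for illustration

    DistCache(std::vector<float> xs, std::vector<float> ys)
        : px(xs), py(ys)
    {
        std::size_t m = 1;
        while (m < px.size()) m *= 2;     // smallest power of 2 >= N
        cachesig.assign(m, -1);           // -1 matches no point number
        cacheval.assign(m, 0.0f);
    }

    float computedist(int i, int j) const {
        return std::hypot(px[i] - px[j], py[i] - py[j]);
    }

    float dist(int i, int j) {
        if (i > j) std::swap(i, j);       // normalize so that i <= j
        int ind = i ^ j;                  // hash index, always below M
        if (cachesig[ind] != i) {         // miss: recompute and store
            misses++;
            cachesig[ind] = i;
            cacheval[ind] = computedist(i, j);
        } else {
            hits++;
        }
        return cacheval[ind];
    }
};
```

    With the figures quoted above, and assuming a hit costs one third of a computation while a miss costs about one full computation, the average cost per call is roughly 0.75 x 1/3 + 0.25 x 1 = 0.5 of the uncached cost, which matches the halving reported.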

    Appendix 2: Experiments in Higher Dimensions

    All experiments in the body of the paper were for K=2; the algorithms have similar performance for higher dimensions. In this section we briefly report on their performance when K=3.

    Figure A2.1 summarizes the first set of experiments for K=3. For each N a power of two from 2^5 = 32 to 2^17 = 131072, we generate 10 sets of N points at random on the unit cube, build a 3-d tree with bucket size of one, and then search for all nearest neighbors in the point set. The circles show the average number of distance calculations and the crosses show the average number of internal nodes visited. Weighted non-linear least squares fits showed that the number of nodes visited is approximately 49.14 - 66.84N - '= and the number of distance calculations is approximately 12.63 - 18.66N^-.33; both functions are plotted on the graph. Notice that the functions approach their asymptotic values much more slowly for K=3 than for K=2; the growth is slower yet for larger K, which is why Figure A2.1 stops at K=3.

    We will turn next to an experiment on the highly nonuniform distributions described in this table, which are the multidimensional generalizations of the distributions in Section 5:

    NAME        DESCRIPTION
    uni         uniform within the hypercube (U[0,1]^K)
    annulus     x0, x1 uniform on a circle; x2, ..., x(K-1) from U[0,1]
    arith       x0 = 0, 1, 4, 9, 16, ...; x1, ..., x(K-1) = 0
    ball        uniform inside sphere
    clusnorm    Normal(0.05) at ten points on U[0,1]^K
    cubediam    x0 = x1 = ... = x(K-1) = U[0,1]
    cubeedge    x0 = U[0,1]; x1 = ... = x(K-1) = 0
    corners     U[0,1]^2 at (0,0), (2,0), (0,2), (2,2); x2, ..., x(K-1) from U[0,1]
    grid        N points from a grid hypercube with 1.3N points
    normal      each dimension from Normal(1)
    spokes      N/K at (U[0,1], 1/2, ..., 1/2); N/K at (1/2, U[0,1], ..., 1/2); ...

    Figure A2.2 shows the efficiency of bottom-up searching in computing nearest neighbor tours on point sets drawn from these distributions for K=3. Five sets of size N=10,000 were drawn from each distribution, and the nearest neighbor tour was computed for each. For all experiments, the bucket cutoff was set to 5 and variable-cut planes were employed. The graph reports the average number of distance calculations (circles) and the average number of nodes visited (crosses). The graph is similar to Figure 5.7, except the nasty behavior of the spokes distribution has been tamed (by variable-cut planes) and the grid distribution suffers more in higher dimensions. Once again, the performance of the algorithms is not substantially slower on highly nonuniform data.

    [Figure A2.2. Nearest neighbor TSP tour, N = 10,000, K = 3. Horizontal axis: distribution; plotted series: Nodes and Dists.]