Efficient Processing of k Nearest Neighbor Joins using MapReduce.
Post on 22-Dec-2015
Transcript
Efficient Processing of k Nearest Neighbor Joins using MapReduce

INTRODUCTION
• k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it.
• As a combination of the k nearest neighbor (kNN) query and the join operation, kNN join is an expensive operation.
• Most of the existing work relies on a centralized indexing structure such as the B+-tree or the R-tree, which cannot be directly accommodated in such a distributed and parallel environment.
AN OVERVIEW OF KNN JOIN USING MAPREDUCE
• Basic strategy: R = ∪_{1≤i≤N} R_i, where R_i ∩ R_j = ∅ for i ≠ j; each subset R_i is distributed to a reducer, and the entire S has to be sent to each reducer to be joined with R_i; finally R ∝ S = ∪_{1≤i≤N} (R_i ∝ S). Communication cost: |R| + N·|S|.
• H-BRJ: splits both R and S into √n subsets, R = ∪_{1≤i≤√n} R_i and S = ∪_{1≤j≤√n} S_j, and joins every pair (R_i, S_j) on one of the n reducers.
• Better strategy: find for each R_i a subset S_i ⊆ S such that R_i ∝ S = R_i ∝ S_i, so that R ∝ S = ∪_{1≤i≤N} (R_i ∝ S_i). Communication cost: |R| + α·|S|, where α is the replication factor of S.
• In summary, for the purpose of minimizing the join cost, we need to:
1. find a good partitioning of R;
2. find the minimal set S_i for each R_i, given a partitioning of R.
※ The minimal set is S_i = ∪_{1≤j≤|R_i|} KNN(r_j, S). However, it is impossible to find the k nearest neighbors of every r_j a priori.
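The definition above can be illustrated with a naive single-machine sketch (the function names and the 1-D toy data are illustrative, not from the paper; the paper's contribution is distributing exactly this computation):

```python
import heapq

def knn(r, S, k, dist):
    """k nearest neighbors of a single object r among S."""
    return heapq.nsmallest(k, S, key=lambda s: dist(r, s))

def knn_join(R, S, k, dist):
    """Baseline kNN join: pair each r in R with its k closest objects in S."""
    return {r: knn(r, S, k, dist) for r in R}

# 1-D toy data with absolute-difference distance.
d = lambda a, b: abs(a - b)
result = knn_join([0, 10], [1, 2, 9, 12], k=2, dist=d)
```

This brute force costs |R|·|S| distance computations, which is what the partitioning strategies on this slide try to avoid shipping and computing in full.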
AN OVERVIEW OF KNN JOIN USING MAPREDUCE
HANDLING KNN JOIN USING MAPREDUCE
DATA PREPROCESSING
• A good partitioning of R for optimizing kNN join should cluster objects based on their proximity.
• Random Selection
• Farthest Selection
• k-means Selection
※ It is not easy to find pivots.
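As one example of the pivot-selection options listed above, farthest selection can be sketched as follows (a minimal illustration; the function name, seed handling, and toy data are my own, and real use would run on a sample of R):

```python
import random

def farthest_pivots(points, t, dist, seed=0):
    """Farthest selection: pick a random first pivot, then repeatedly add
    the point whose minimum distance to the already-chosen pivots is largest."""
    rng = random.Random(seed)
    pivots = [rng.choice(points)]
    while len(pivots) < t:
        nxt = max(points, key=lambda p: min(dist(p, q) for q in pivots))
        pivots.append(nxt)
    return pivots

d = lambda a, b: abs(a - b)
pivs = farthest_pivots([0, 1, 2, 50, 51, 100], 3, d)
```

The greedy max-min step tends to spread pivots apart, which is why it clusters objects by proximity better than random selection.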
First MapReduce Job
• perform data partitioning and collect some statistics for each partition.
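A compact sketch of this first job, assuming Voronoi-style partitioning by nearest pivot (the per-partition statistics I collect here — count, and min/max distance to the pivot, i.e. L(P_i) and U(P_i) — follow the quantities used on the later slides; the shuffle is simulated with a dict):

```python
from collections import defaultdict

def map_to_partition(o, pivots, dist):
    """Map phase: emit (nearest-pivot id, (object, distance to that pivot))."""
    i = min(range(len(pivots)), key=lambda j: dist(o, pivots[j]))
    return i, (o, dist(o, pivots[i]))

def partition_stats(objects, pivots, dist):
    """Simulated shuffle + reduce: group objects by partition and collect,
    per partition, the count and the min/max pivot distance (L and U)."""
    parts = defaultdict(list)
    for o in objects:
        i, rec = map_to_partition(o, pivots, dist)
        parts[i].append(rec)
    return {i: {"count": len(v),
                "L": min(dd for _, dd in v),
                "U": max(dd for _, dd in v)}
            for i, v in parts.items()}

d = lambda a, b: abs(a - b)
stats = partition_stats([1, 2, 9, 12], pivots=[0, 10], dist=d)
```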
Second MapReduce Job
• Distance bound of kNN:
ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|, for s ∈ P_j^S
θ_i = max_{∀s ∈ KNN(P_i^R, S)} ub(s, P_i^R)  ①
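Equation ① can be sketched as below. One assumption on my part: I read KNN(P_i^R, S) here as the k objects of S with the smallest upper bounds, so θ_i becomes the k-th smallest ub — then at least k objects of S are guaranteed within θ_i of every r in P_i^R:

```python
import heapq

def theta(k, U_PiR, pivot_i, S_parts, dist):
    """Equation ①: a bound theta_i on the kNN distance of every r in P_i^R.
    S_parts maps each S-pivot p_j to the objects of partition P_j^S.
    ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|."""
    ubs = [U_PiR + dist(pivot_i, pj) + dist(pj, s)
           for pj, objs in S_parts.items() for s in objs]
    # theta_i = the k-th smallest upper bound.
    return max(heapq.nsmallest(k, ubs))

d = lambda a, b: abs(a - b)
th = theta(k=2, U_PiR=2.0, pivot_i=0, S_parts={0: [1, 2], 10: [9, 12]}, dist=d)
```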
Second MapReduce Job
• Finding S_i for R_i:
lb(s, P_i^R) = max{0, |p_i, p_j| − U(P_i^R) − |s, p_j|}  ②
if lb(s, P_i^R) > θ_i, then s ∉ KNN(P_i^R, S)  ③
LB(P_j^S, P_i^R) = |p_i, p_j| − U(P_i^R) − θ_i
if |s, p_j| ≥ LB(P_j^S, P_i^R), then s may belong to KNN(P_i^R, S);
hence only objects with |s, p_j| ∈ [LB(P_j^S, P_i^R), U(P_j^S)] are sent as candidates.
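The LB pruning rule can be sketched directly (toy numbers; partitions are plain lists and distances are 1-D absolute differences, purely for illustration):

```python
def LB(dist_pi_pj, U_PiR, theta_i):
    """LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - theta_i."""
    return dist_pi_pj - U_PiR - theta_i

def candidates(S_part, pivot_j, dist_pi_pj, U_PiR, theta_i, dist):
    """Keep only objects s in P_j^S with |s, p_j| >= LB(P_j^S, P_i^R);
    by rule ③ the rest cannot be among the kNN of any r in P_i^R."""
    lb = LB(dist_pi_pj, U_PiR, theta_i)
    return [s for s in S_part if dist(s, pivot_j) >= lb]

d = lambda a, b: abs(a - b)
# S-partition around pivot 100, R-partition around pivot 0:
# LB = 100 - 2 - 4 = 94, so only objects at distance >= 94 from
# pivot 100 (i.e. on the side facing P_i^R) survive.
cand = candidates([99, 101, 5], pivot_j=100, dist_pi_pj=100,
                  U_PiR=2.0, theta_i=4.0, dist=d)
```

Note how the rule keeps the object that lies far from p_j but close to P_i^R, and prunes the ones clustered around the distant pivot.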
Second MapReduce Job
• In this way, objects in each partition of R and their potential k nearest neighbors are sent to the same reducer. By parsing the key-value pair (k2, v2), the reducer can derive the partition P_i^R and the subset S_i that consists of P_{j1}^S, . . . , P_{jM}^S.
• ∀r ∈ P_i^R, in order to reduce the number of distance computations, we first sort the partitions of S_i by the distances from their pivots to pivot p_i in ascending order.
※ compute θ_i ← max_{∀s ∈ KNN(P_i^R, S)} ub(s, P_i^R)
※ θ_i can be refined during the join, but I think it is useless.
Second MapReduce Job
• Define the distance from o to the generalized hyperplane HP(p_i, p_j):
d(o, HP(p_i, p_j)) = (|o, p_i|² − |o, p_j|²) / (2·|p_i, p_j|)
if d(o, HP(p_i, p_j)) > θ, then ∀q ∈ P_i^R, |o, q| > θ
Only objects o with max{L(P_i^S), |p_i, q| − θ} ≤ |p_i, o| ≤ min{U(P_i^S), |p_i, q| + θ} can satisfy |q, o| ≤ θ.
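The hyperplane distance above reduces to a short formula in Euclidean space (a sketch assuming points are coordinate tuples; the 2-D example is mine):

```python
import math

def hp_dist(o, p_i, p_j):
    """Signed distance from o to the generalized hyperplane HP(p_i, p_j),
    positive on p_j's side: (|o,p_i|^2 - |o,p_j|^2) / (2 * |p_i,p_j|)."""
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return (d2(o, p_i) - d2(o, p_j)) / (2 * math.sqrt(d2(p_i, p_j)))

# Point (7, 0) between pivots (0,0) and (10,0): the hyperplane is x = 5,
# so the point sits 2 units past it on p_j's side.
dd = hp_dist((7.0, 0.0), (0.0, 0.0), (10.0, 0.0))
```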
MINIMIZING REPLICATION OF S
• s is sent to a reducer only if |s, p_j| ≥ LB(P_j^S, P_i^R) => a larger LB(P_j^S, P_i^R) keeps out more objects with a small |s, p_j|
=> splitting R into a finer granularity makes the bound of the kNN distances for all objects in each partition of R tighter.
• Group the partitions of R: R = ∪_{1≤i≤N} G_i, G_i ∩ G_j = ∅, i ≠ j.
s is assigned to S_i only if |s, p_j| ≥ LB(P_j^S, G_i),
where LB(P_j^S, G_i) = min_{∀P_i^R ∈ G_i} LB(P_j^S, P_i^R).
RP(S) = Σ_{∀G_i} Σ_{∀P_j^S} |{s | s ∈ P_j^S ∧ |s, p_j| ≥ LB(P_j^S, G_i)}|
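The replication count RP(S) can be computed directly from the formula above (a sketch: S-partitions are represented just by their objects' distances to their pivots, and the LB table is made-up toy data):

```python
def replication(S_parts, LB_of, groups):
    """RP(S) = sum over groups G_i and S-partitions P_j^S of the number
    of objects s in P_j^S with |s, p_j| >= LB(P_j^S, G_i).
    S_parts: {j: list of |s, p_j| values}; LB_of(j, Gi) gives the bound."""
    return sum(sum(1 for d_spj in dists if d_spj >= LB_of(j, Gi))
               for Gi in groups for j, dists in S_parts.items())

# Two S-partitions and two groups; illustrative LB values only.
lb_table = {("a", 0): 0.5, ("b", 0): 3.0, ("a", 1): 2.0, ("b", 1): 0.0}
rp = replication({"a": [0.2, 1.0, 2.5], "b": [1.0, 2.0]},
                 lambda j, Gi: lb_table[(j, Gi)], groups=[0, 1])
```

A grouping that raises the LB values lowers RP(S), which is exactly what the grouping strategies on the next slide aim for.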
MINIMIZING REPLICATION OF S
• Geometric Grouping
• Greedy Grouping: at each step add the P_j^R that minimizes RP(S, G_i ∪ {P_j^R}) − RP(S, G_i);
but evaluating this exactly is rather costly, so replication is approximated at partition granularity, assuming ∃s ∈ P_j^S with |s, p_j| ≥ LB(P_j^S, G_i) whenever LB(P_j^S, G_i) ≤ U(P_j^S):
RP(S, G_i) ≈ |{P_j^S ⊂ S | LB(P_j^S, G_i) ≤ U(P_j^S)}|
EXPERIMENTAL EVALUATION
The End! Thanks!