Cheng, Chen, Chen, Xie Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data Reynold Cheng (University of Hong Kong) Lei Chen (Hong Kong University of Science &Tech) Jinchuan Chen (Hong Kong Polytechnic University) Xike Xie (University of Hong Kong) International Conference on Extending Database Technology 2009

Dec 16, 2015


Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data

Reynold Cheng (University of Hong Kong)
Lei Chen (Hong Kong University of Science & Tech)
Jinchuan Chen (Hong Kong Polytechnic University)
Xike Xie (University of Hong Kong)

International Conference on Extending Database Technology 2009

Agenda

1. Introduction
2. Problem Definition
3. Basic Solution
4. Efficient Solution
5. Results

Data Uncertainty

Inherent in various applications:
- Location-based services (e.g., using GPS, RFID) [TDRP98, SSDBM99]
- Natural habitat monitoring with sensor networks [VLDB04a]
- Biomedical and biometric databases [ICDE06, ICDE07]

Attribute Uncertainty Model [TDRP98, ISSD99, VLDB04b]

[Figure: a precise location in the (x, y) plane next to a cloaked location, which is modeled as an uncertainty region U with a probability density function (pdf).]

We represent an uncertainty pdf as a histogram.

k-NN Queries

k-NN Query over Precise Data
- application in LBS [VLDB03]
- natural habitat monitoring system [VLDB04a]
- network traffic analysis [ICDCS07]
- pattern matching in CAM [VLDB04c]

k-NN over Uncertain Objects
- [VLDB08a] ranks objects by the probability that each is the NN of the query point.
- [ICDE07a] uses expected distance and does not discuss the probability.

Agenda

1. Introduction
2. Problem Definition
3. Basic Solution
4. Efficient Solution
5. Results

Probability Threshold k-Nearest-Neighbor Query (T-k-PNN)

INPUT
1. A query point q, parameter k, threshold T
2. A set of n objects with uncertainty regions and pdfs: $D = \{O_1, O_2, \ldots, O_n\}$

OUTPUT
A set of k-subsets: $\{S \mid S \subseteq D \wedge |S| = k \wedge p(S) \ge T\}$, where p(S) is the qualification probability of the k-subset S.

Example of a k-PNN query (k=3)

[Figure: query point q surrounded by uncertain objects O1 to O8; two possible answers are {O1, O2, O3} and {O1, O2, O4}.]

Example of a k-PNN query (k=3)

[Figure: query point q, objects O1 to O8, and the k-bound.]

Without filtering: $C_8^3 = 56$ candidate 3-subsets: {O1, O2, O3}, {O1, O2, O4}, ..., {O6, O7, O8}
With the k-bound: $C_6^3 = 20$ candidate 3-subsets: {O1, O2, O3}, {O1, O2, O4}, ..., {O4, O5, O6}

k-bound Filtering (k=3)

[Figure: query point q, objects O1 to O8, the distances f1, f2, f3, and the k-bound.]

fk (k-bound): the k-th smallest of the objects' maximum distances from q.

Since min(r7) > f3, O7 cannot be a 3-NN of q, because there are always 3 objects with distances smaller than f3.

We apply k-bound filtering on an index (e.g., R-tree) to prune unqualified objects.
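The filtering rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each object is reduced to a hypothetical (minimum distance, maximum distance) pair from q, the numbers are invented, and a linear scan stands in for the R-tree traversal mentioned on the slide.

```python
from math import comb

def k_bound_filter(objects, k):
    """objects maps a name to (min_dist, max_dist) from the query point q.

    Returns (f_k, kept), where f_k is the k-th smallest maximum distance and
    kept holds the objects whose minimum distance does not exceed f_k.
    """
    f_k = sorted(max_d for _, max_d in objects.values())[k - 1]
    kept = {name for name, (min_d, _) in objects.items() if min_d <= f_k}
    return f_k, kept

# Hypothetical distance intervals, chosen so that O7 and O8 fall outside f3.
objects = {
    "O1": (0.5, 1.0), "O2": (0.8, 1.5), "O3": (1.0, 2.0), "O4": (1.2, 2.5),
    "O5": (1.5, 3.0), "O6": (1.8, 3.5), "O7": (2.2, 4.0), "O8": (3.0, 5.0),
}
f3, kept = k_bound_filter(objects, k=3)
before = comb(len(objects), 3)   # number of 3-subsets without filtering
after = comb(len(kept), 3)       # number of 3-subsets after filtering
```

With these made-up intervals the candidate count drops from 56 to 20, the same reduction as in the k=3 example.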

Agenda

1. Introduction
2. Problem Definition
3. Basic Solution
4. Efficient Solution
5. Results

Basic solution for a T-k-PNN query (k=3, T=0.1)

[Figure: query point q, objects O1 to O6, and the k-bound.]

Step 1: k-bound filtering
Step 2: QP calculation
Step 3: Accept S if qp(S) ≥ T

3-subset       QP
{O1, O2, O3}   0.2
{O1, O2, O4}   0.1
{O1, O3, O4}   0.1
{O2, O3, O4}   0.1
{O1, O2, O5}   0.05
{O1, O3, O5}   0.05
{O2, O3, O5}   0.05
……

Too many k-subsets!

Exact QP is expensive to compute!

Symbol   Meaning
ri       |oi − q|
di(r)    pdf of ri (distance pdf)
Di(r)    cdf of ri (distance cdf)
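The QP calculation and threshold test can be illustrated with a brute-force sketch. It rests on a simplification that the slides do not make: each object's distance from q takes only a few discrete values with known probabilities, so the qualification probability can be summed over all value combinations. The name `basic_tkpnn` and the toy data are hypothetical, distance ties are assumed not to occur, and k-bound filtering is left out.

```python
from collections import defaultdict
from itertools import product

def basic_tkpnn(dists, k, T):
    """dists maps a name to a list of (distance, probability) pairs.

    Enumerates every combination of distances, finds the k nearest objects
    in each, and accumulates that combination's probability as the QP of
    the resulting k-subset. Returns the k-subsets whose QP is at least T.
    """
    names = sorted(dists)
    qp = defaultdict(float)
    for world in product(*(dists[n] for n in names)):
        prob = 1.0
        for _, p in world:
            prob *= p
        ranked = sorted(zip(names, world), key=lambda item: item[1][0])
        qp[frozenset(name for name, _ in ranked[:k])] += prob
    return {subset: p for subset, p in qp.items() if p >= T}

# Toy data: A is either the nearest or the farthest object.
dists = {
    "A": [(1.0, 0.5), (4.0, 0.5)],
    "B": [(2.0, 1.0)],
    "C": [(3.0, 1.0)],
}
answers = basic_tkpnn(dists, k=2, T=0.5)
```

The number of combinations grows exponentially with the number of objects, which is one way to see why exact QP is expensive to compute.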

Agenda

1. Introduction
2. Problem Definition
3. Basic Solution
4. Efficient Solution
5. Results

Efficient Solution Framework (GVR)

GVR: k-subsets Generation, Verification, Refinement

k-subset Generation
1. k-bound Filtering (produces the candidate objects)
2. Probabilistic Candidate Selection (produces the k-subsets)

k-subset Verification and Refinement
3. Verification with a lower bound and an upper bound (produces accepted and rejected k-subsets)
4. Refinement

Probabilistic Candidates Selection

[Figure: query point q, objects O1 to O6, and the k-bound; the cutoff probabilities 0.5, 0.2, and 0.1 are marked.]

Cutoff probability of Oi: Pr(ri ≤ fk)

$cp(S) = \prod_{O_i \in S} \Pr(r_i \le f_k)$

$qp(S) \le cp(S)$

$cp(S) \le cp(S'), \quad S' \subseteq S$

Example: S1 = {O4, O5, O6}, cp(S1) = 0.5 * 0.2 * 0.1 = 0.01; S2 = {O4, O5}, cp(S2) = 0.5 * 0.2 = 0.1.

Given T = 0.2, cp(S2) < T, so qp(S1) ≤ cp(S1) ≤ cp(S2) < T, and S1 can be pruned.

Probabilistic Candidates Selection (T=0.2, k=3)

1-subset   CP
{O1}       1
{O2}       1
{O3}       1
{O4}       0.5
{O5}       0.2
{O6}       0.1

2-subset   CP
{O1,O2}    1
{O1,O3}    1
{O1,O4}    0.5
{O1,O5}    0.2
{O2,O3}    1
{O2,O4}    0.5
{O2,O5}    0.2
{O3,O4}    0.5
{O3,O5}    0.2
{O4,O5}    0.1

3-subset       CP
{O1, O2, O3}   1
{O1, O2, O4}   0.5
{O1, O2, O5}   0.2
{O1, O3, O4}   0.5
{O1, O3, O5}   0.2
{O1, O4, O5}   0.1
{O2, O3, O4}   0.5
{O2, O3, O5}   0.2
{O2, O4, O5}   0.1
{O3, O4, O5}   0.1
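The level-by-level tables suggest an Apriori-style generation, sketched below. This is one possible reading of the slide rather than the authors' algorithm: a subset of size j is generated only if all of its (j-1)-subsets survived and its own product of cutoff probabilities is at least T. The function name `pcs` is hypothetical; the cutoff probabilities are the ones in the 1-subset table.

```python
from itertools import combinations
from math import prod

def pcs(cutoff, k, T):
    """cutoff maps an object name to its cutoff probability Pr(r_i <= f_k).

    Returns the k-subsets (as sorted tuples) whose product of cutoff
    probabilities is at least T, built one size at a time.
    """
    level = [(name,) for name in sorted(cutoff) if cutoff[name] >= T]
    singles = [s[0] for s in level]
    for size in range(2, k + 1):
        kept = set(level)
        next_level = []
        for s in level:
            for name in singles:
                if name <= s[-1]:
                    continue  # extend in sorted order to avoid duplicates
                cand = s + (name,)
                # Every (size-1)-subset must have survived the previous level.
                if all(sub in kept for sub in combinations(cand, size - 1)) \
                        and prod(cutoff[o] for o in cand) >= T:
                    next_level.append(cand)
        level = next_level
    return level

cutoff = {"O1": 1.0, "O2": 1.0, "O3": 1.0, "O4": 0.5, "O5": 0.2, "O6": 0.1}
candidates = pcs(cutoff, k=3, T=0.2)
```

On this input, {O6} is dropped at size 1 and {O4,O5} at size 2, so the three 3-subsets containing both O4 and O5 are never generated and seven candidates remain.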

Storage Efficient Compression

Subsets are sorted in descending order of their CPs.

Original subsets:

2-subset   CP
{O1,O2}    1
{O1,O3}    1
{O2,O3}    1
{O1,O4}    0.5
{O2,O4}    0.5
{O3,O4}    0.5
{O1,O5}    0.2
{O2,O5}    0.2
{O3,O5}    0.2

Compressed subsets (Size-2 Set): {O1,O5}, {O2,O5}, {O3,O5}

For each group of subsets that share a prefix, store the common prefix and the last element of the subset with the minimum product of cutoff probabilities that is at least T.
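A small sketch of this compression, assuming the objects are ordered by descending cutoff probability and each subset is a tuple in that order. Under that assumption the surviving last elements for a given prefix form a contiguous run, so storing only the final one is enough to rebuild the rest. The function names and the exact encoding are illustrative, not taken from the paper.

```python
def compress(subsets, order):
    """Keep, for each prefix, only the subset whose last element comes
    latest in `order` (objects sorted by descending cutoff probability)."""
    rank = {obj: i for i, obj in enumerate(order)}
    last = {}
    for s in subsets:
        prefix, tail = s[:-1], s[-1]
        if prefix not in last or rank[tail] > rank[last[prefix]]:
            last[prefix] = tail
    return sorted(prefix + (tail,) for prefix, tail in last.items())

def decompress(compressed, order):
    """Rebuild every subset: for each stored entry, the last element ranges
    from just after the prefix up to the stored last element."""
    rank = {obj: i for i, obj in enumerate(order)}
    out = []
    for s in compressed:
        prefix, tail = s[:-1], s[-1]
        start = rank[prefix[-1]] + 1 if prefix else 0
        out.extend(prefix + (obj,) for obj in order[start:rank[tail] + 1])
    return sorted(out)

order = ["O1", "O2", "O3", "O4", "O5", "O6"]
two_subsets = [
    ("O1", "O2"), ("O1", "O3"), ("O1", "O4"), ("O1", "O5"),
    ("O2", "O3"), ("O2", "O4"), ("O2", "O5"),
    ("O3", "O4"), ("O3", "O5"),
]
size2 = compress(two_subsets, order)
restored = decompress(size2, order)
```

Nine 2-subsets shrink to three stored entries, and decompression recovers the original nine.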

Storage Efficient Compression (T=0.2, k=3)

The CP tables for the 1-subsets, 2-subsets, and 3-subsets are the same as on the Probabilistic Candidates Selection slide. Their compressed forms are:

Size-1 Set: {O1}, {O2}, {O3}, {O4}, {O5}
Size-2 Set: {O1,O5}, {O2,O5}, {O3,O5}
Size-3 Set: {O1,O2,O5}, {O1,O3,O5}, {O2,O3,O5}

Seeds Pruning (k=3)

[Figure: query point q and objects O1 to O4, with max(r1) = f1, max(r2) = f2, max(r3) = f3, and min(r4) marked.]

Seeds: o1, o2, o3

min(r4) > f2 > f1, so r4 > r2 and r4 > r1.

If o4 belongs to a 3-NN set S, o1 and o2 must also belong to S.

For example, we can prune the set {o1, o3, o4} according to the above rule.

- No CP calculation is needed.
- Can prune more candidate k-sets.
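The seeds rule only compares distance bounds, as the following sketch shows. It assumes each object is summarized by hypothetical minimum and maximum distances from q, and that the seeds are the k objects with the smallest maximum distances; the function names and numbers are invented for illustration.

```python
def seeds(max_dist, k):
    """The k objects with the smallest maximum distance from q."""
    return sorted(max_dist, key=max_dist.get)[:k]

def pruned_by_seeds(subset, min_dist, max_dist, k):
    """True if some member of `subset` is always farther from q than a seed
    that is missing from `subset`; such a subset cannot be a k-NN set."""
    members = set(subset)
    for seed in seeds(max_dist, k):
        if seed in members:
            continue
        if any(min_dist[o] > max_dist[seed] for o in members):
            return True
    return False

# Hypothetical bounds with min(r4) > f2 > f1, as in the example above.
max_dist = {"o1": 1.0, "o2": 2.0, "o3": 3.0, "o4": 5.0}
min_dist = {"o1": 0.5, "o2": 1.0, "o3": 1.5, "o4": 2.5}

drop = pruned_by_seeds(("o1", "o3", "o4"), min_dist, max_dist, k=3)
keep = pruned_by_seeds(("o1", "o2", "o4"), min_dist, max_dist, k=3)
```

Here {o1, o3, o4} is pruned because it contains o4 but lacks o2, while {o1, o2, o4} survives, and no probability is computed along the way.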

Verifiers: Upper and Lower Bounds (T=0.2)

[Figure: the candidate k-subsets S1, S2, S3 (after PCS) pass through a verifier, which gives each one a lower and an upper bound on its QP within [0, 1]; a classifier compares the bounds with T, and incremental refinement handles the subsets whose status is still unknown (?).]
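The classifier step can be sketched as a comparison of each subset's QP bounds with T. The bound values below are invented for illustration and are not the ones in the figure; `classify` is a hypothetical name.

```python
def classify(lower, upper, T):
    """Decide a k-subset's status from bounds on its qualification probability."""
    if lower >= T:
        return "accept"    # QP is certainly at least T
    if upper < T:
        return "reject"    # QP is certainly below T
    return "unknown"       # bounds straddle T; needs refinement

# Hypothetical (lower, upper) bounds for three candidate k-subsets.
bounds = {"S1": (0.25, 0.6), "S2": (0.1, 0.5), "S3": (0.03, 0.18)}
status = {name: classify(lo, hi, T=0.2) for name, (lo, hi) in bounds.items()}
```

Only subsets left as "unknown" would go on to incremental refinement, where tighter bounds are computed until the status is decided.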

Verification and Refinement

[Figure, left ("Partitions"): the distances r1, r2, r3 from q are cut at the end-points e1 to e5 into the partitions P1 to P4, with f2 marked. Figure, right ("Stair-Case Model"): the distance cdf value of each object at each end-point, e.g. D1(e4).]

- Extended from the probabilistic verifiers in [ICDE08b].
- Divide the range [min(r1), fk] into a series of partitions.
- Build a data structure, i.e. the stair-case model, to store the distance cdf of each object.
- Derive the lower and upper bounds of a k-set's QP based on the stair-case model.
- Reject (accept) a k-set once its QP must be lower (larger) than the threshold.
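As a rough illustration of the stair-case idea, the sketch below tabulates one object's distance cdf at a list of partition end-points. It assumes the distance pdf is given as histogram bars (lo, hi, probability) with mass spread uniformly inside each bar; the function names and numbers are hypothetical, and how the QP bounds are derived from such tables is not shown here.

```python
def distance_cdf(hist, r):
    """Pr(distance <= r) for a histogram pdf given as (lo, hi, prob) bars."""
    total = 0.0
    for lo, hi, p in hist:
        if r >= hi:
            total += p                          # bar lies entirely below r
        elif r > lo:
            total += p * (r - lo) / (hi - lo)   # bar is partly below r
    return total

def staircase(hist, endpoints):
    """Distance cdf values at the partition end-points e1, e2, ..."""
    return [distance_cdf(hist, e) for e in endpoints]

# Hypothetical distance histogram and end-points.
hist = [(0.0, 2.0, 0.5), (2.0, 4.0, 0.5)]
steps = staircase(hist, [0.0, 1.0, 2.0, 3.0, 4.0])
```

Within a partition, the cdf of an object lies between its values at the two end-points, which is the kind of coarse information a verifier can turn into bounds.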

Agenda

1. Introduction
2. Problem Definition
3. Basic Solution
4. Efficient Solution
5. Results

Experiment Setup

Parameter             Value
Uncertain Object DB   Long Beach (53k) (http://www.census.gov/geo/www/tiger/)
Uncertainty pdf       Uniform (default); Gaussian (represented by histograms)
Threshold (T)         0.1
k                     6


1. k-bound Filtering


2. Performance of GVR


3. k-subset Generation


4. Verification and Refinement


5. Time Analysis


6. Gaussian Distribution

Conclusion

We proposed an efficient evaluation framework for the T-k-PNN query.

We proposed various techniques:
- k-bound to filter away unqualified objects
- PCS to reduce the number of k-subsets
- verification/refinement methods to avoid exact calculation

Future Work
- extend the techniques to other queries

Reference

[TDRP98] P. A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, “Querying the uncertain position of moving objects,” in Temporal Databases: Research and Practice, 1998.
[SSDBM99] D. Pfoser and C. Jensen, “Capturing the uncertainty of moving-objects representations,” in Proc. SSDBM, 1999.
[VLDB04a] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong, “Model-driven data acquisition in sensor networks,” in Proc. VLDB, 2004.
[ICDE06] C. Böhm, A. Pryakhin, and M. Schubert, “The gauss-tree: Efficient object identification in databases of probabilistic feature vectors,” in Proc. ICDE, 2006.
[ICDE07a] V. Ljosa and A. K. Singh, “APLA: Indexing arbitrary probability distributions,” in Proc. ICDE, 2007.
[SIGMOD03] R. Cheng, D. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” in Proc. ACM SIGMOD, 2003.
[ICDE07b] J. Chen and R. Cheng, “Efficient evaluation of imprecise location-dependent queries,” in Proc. ICDE, 2007.
[VLDB06a] M. Mokbel, C. Chow, and W. G. Aref, “The new casper: Query processing for location services without compromising privacy,” in Proc. VLDB, 2006.
[TKDE92] D. Barbara, H. Garcia-Molina, and D. Porter, “The management of probabilistic data,” TKDE, vol. 4, no. 5, 1992.
[VLDB04b] N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in Proc. VLDB, 2004.
[VLDB06b] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom, “Trio: A system for data, uncertainty, and lineage,” in Proc. VLDB, 2006.
[VLDB03] G. Iwerks, H. Samet, and K. Smith, “Continuous k-nearest neighbor queries for continuously moving points with updates,” in Proc. VLDB, 2003.
[ICDCS07] S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani, “Streaming algorithms for robust, real-time detection of ddos attacks,” in Proc. ICDCS, 2007.
[AKDDM96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
[VLDB04c] N. Koudas, B. Ooi, K. Tan, and R. Zhang, “Approximate NN queries on streams with guaranteed error/performance bounds,” in Proc. VLDB, 2004.
[VLDB08a] G. Beskales, M. Soliman, and I. Ilyas, “Efficient search for the top-k probable nearest neighbors in uncertain databases,” in Proc. VLDB, 2008.
[VLDB06c] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom, “ULDBs: Databases with uncertainty and lineage,” in Proc. VLDB, 2006.
[VLDB07a] L. Antova, C. Koch, and D. Olteanu, “Query language support for incomplete information in the maybms system,” in Proc. VLDB, 2007.
[SIGMOD08a] S. Singh et al., “Orion 2.0: Native support for uncertain data,” in Proc. ACM SIGMOD, 2008.
[ICDE08a] Singh et al., “Database support for pdf attributes,” in Proc. ICDE, 2008.
[TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, “Querying imprecise data in moving object environments,” IEEE TKDE, vol. 16, no. 9, Sept. 2004.
[DASFAA07] H. Kriegel, P. Kunath, and M. Renz, “Probabilistic nearest-neighbor query on uncertain objects,” in Proc. DASFAA, 2007.
[MUD08] Y. Qi, S. Singh, R. Shah, and S. Prabhakar, “Indexing probabilistic nearest-neighbor threshold queries,” in Proc. Workshop on Management of Uncertain Data, 2008.
[TKDE08] X. Lian and L. Chen, “Probabilistic group nearest neighbor queries in uncertain databases,” IEEE Trans. on Knowledge and Data Engineering, vol. 20, no. 6, 2008.
[ICDE08b] R. Cheng, J. Chen, M. Mokbel, and C. Chow, “Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data,” in Proc. ICDE, 2008.
[VLDB05] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in Proc. VLDB, 2005.
[VLDB07b] J. Pei, B. Jiang, X. Lin, and Y. Yuan, “Probabilistic skylines on uncertain data,” in Proc. VLDB, 2007.
[SIGMOD08b] X. Lian and L. Chen, “Monochromatic and bichromatic reverse skyline search over uncertain databases,” in Proc. SIGMOD, 2008.
[ICDE07c] M. Soliman, I. Ilyas, and K. Chang, “Top-k query processing in uncertain databases,” in Proc. ICDE, 2007.
[SIGMOD08c] M. Hua, J. Pei, W. Zhang, and X. Lin, “Ranking queries on uncertain data: A probabilistic threshold approach,” in Proc. SIGMOD, 2008.
[VLDB08b] V. Rastogi, D. Suciu, and E. Welbourne, “Access control over uncertain data,” in Proc. VLDB, 2008.
[VLDB08c] C. Koch and D. Olteanu, “Conditioning probabilistic databases,” in Proc. VLDB, 2008.
[VLDB08d] R. Cheng, J. Chen, and X. Xie, “Cleaning uncertain data with quality guarantees,” in Proc. VLDB, 2008.
[SIGMOD84] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proc. of the ACM SIGMOD Int’l. Conf., 1984.


Q & A

Thanks!