Fast Statistical Algorithms in Databases
Alexander Gray
Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
Jan 02, 2016
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Nikolaos Vasiloglou: Research scientist, PhD Elec Comp Engineering
3. Abhimanyu Aditya: MS, CS
4. Leif Poorman: MS, Physics
5. Dong Ryeol Lee: PhD student, CS + Math
6. Ryan Riegel: PhD student, CS + Math
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. James Waters: PhD student, Physics + CS
10. Hua Ouyang: PhD student, CS
11. Sooraj Bhat: PhD student, CS
12. Ravi Sastry: PhD student, CS
13. Long Tran: PhD student, CS
14. Wei Guan: PhD student, CS (co-supervised)
15. Nishant Mehta: PhD student, CS (co-supervised)
16. Wee Chin Wong: PhD student, ChemE (co-supervised)
17. Jaegul Choo: PhD student, CS (co-supervised)
18. Ailar Javadi: PhD student, ECE (co-supervised)
19. Shakir Sharfraz: MS student, CS
20. Subbanarasimhiah Harish: MS student, CS
Our challenge
• Allow statistical analyses (both simple and state-of-the-art) on terascale or petascale datasets
• Return answers with error guarantees
• Rigorous theoretical runtimes
• Ensure small constants (small actual runtimes, not just good asymptotic runtimes)
10 data analysis problems, and scalable tools we’d like for them
1. Querying (e.g. characterizing a region of space):
– spherical range-search O(N)
– orthogonal range-search O(N)
– k-nearest-neighbors O(N)
– all-k-nearest-neighbors O(N²)
2. Density estimation (e.g. comparing galaxy types):
– mixture of Gaussians
– kernel density estimation O(N²)
– manifold kernel density estimation O(N³) [Ozakin and Gray 2009, in submission]
– hyper-kernel density estimation O(N⁴) [Sastry and Gray 2009, in submission]
– L2 density tree [Ram and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
3. Regression (e.g. photometric redshifts):
– linear regression O(ND²)
– kernel regression O(N²)
– Gaussian process regression/kriging O(N³)
4. Classification (e.g. quasar detection, star-galaxy separation):
– k-nearest-neighbor classifier O(N²)
– nonparametric Bayes classifier O(N²)
– support vector machine (SVM) O(N³)
– isometric separation map O(N³) [Vasiloglou, Gray, and Anderson 2008]
– non-negative SVM O(N³) [Guan and Gray, in prep]
– false-positive-limiting SVM O(N³) [Sastry and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
– principal component analysis O(D³)
– non-negative matrix factorization
– kernel PCA (including LLE, IsoMap, et al.) O(N³)
– maximum variance unfolding O(N³)
– rank-based manifolds O(N³) [Ouyang and Gray 2008]
– isometric non-negative matrix factorization O(N³) [Vasiloglou, Gray, and Anderson 2008]
– density-preserving maps O(N³) [Ozakin and Gray 2009, in submission]
6. Outlier detection (e.g. new object types, data cleaning):
– by density estimation, by dimension reduction
– by robust Lp estimation [Ram, Riegel and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
7. Clustering (e.g. automatic Hubble sequence):
– by dimension reduction, by density estimation
– k-means
– mean-shift segmentation O(N²)
– hierarchical clustering (“friends-of-friends”) O(N³)
8. Time series analysis (e.g. asteroid tracking, variable objects):
– Kalman filter O(D²)
– hidden Markov model O(D²)
– trajectory tracking O(Nⁿ)
– functional independent component analysis [Mehta and Gray 2008]
– Markov matrix factorization [Tran, Wong, and Gray 2009, in submission]
10 data analysis problems, and scalable tools we’d like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
– LASSO regression
– L1 SVM
– Gaussian graphical model inference and structure search
– discrete graphical model inference and structure search
– mixed-integer SVM [Guan and Gray 2009, in submission]
– L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
– minimum spanning tree O(N³)
– n-point correlation O(Nⁿ)
– bipartite matching/Gaussian graphical model inference O(N³) [Waters and Gray, in prep]
Core methods of statistics / machine learning / mining
• Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
• Density estimation: kernel density estimation O(N²), mixture of Gaussians O(N)
• Regression: linear regression O(ND²), kernel regression O(N²), Gaussian process regression O(N³)
• Classification: nearest-neighbor classifier O(N²), nonparametric Bayes classifier O(N²), support vector machine
• Dimension reduction: principal component analysis O(D³), non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³)
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means O(N), hierarchical clustering O(N³), by dimension reduction
• Time series analysis: Kalman filter O(D³), hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation O(Nⁿ), bipartite matching O(N³)
4 main computational bottlenecks: generalized N-body, graphical models, linear algebra, optimization
GNPs (generalized N-body problems): a data structure is needed.
kd-trees: the most widely-used space-partitioning tree for general dimension
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
(others: ball-trees, cover-trees, etc.; a minimal build sketch follows)
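Below is a minimal C++ sketch of the standard kd-tree build used throughout: split on the widest dimension at the median and recurse until the leaves are small. All names and the leaf-size default are illustrative, not MLPACK’s actual API.

```cpp
// Minimal kd-tree build sketch (illustrative; not MLPACK's actual API).
// Each node owns a contiguous index range [begin, end) of `order`,
// split along the widest dimension at the median point.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

using Point = std::vector<double>;

struct KdNode {
  std::size_t begin, end;   // index range into `order`
  int split_dim = -1;       // -1 marks a leaf
  double split_val = 0.0;
  std::unique_ptr<KdNode> left, right;
};

std::unique_ptr<KdNode> Build(const std::vector<Point>& pts,
                              std::vector<std::size_t>& order,
                              std::size_t begin, std::size_t end,
                              std::size_t leaf_size = 20) {
  auto node = std::make_unique<KdNode>();
  node->begin = begin;
  node->end = end;
  if (end - begin <= leaf_size) return node;  // small enough: leaf

  // Pick the dimension with the largest spread.
  const int dim = static_cast<int>(pts[0].size());
  int best_d = 0;
  double best_spread = -1.0;
  for (int d = 0; d < dim; ++d) {
    double lo = pts[order[begin]][d], hi = lo;
    for (std::size_t i = begin; i < end; ++i) {
      lo = std::min(lo, pts[order[i]][d]);
      hi = std::max(hi, pts[order[i]][d]);
    }
    if (hi - lo > best_spread) { best_spread = hi - lo; best_d = d; }
  }

  // Partition the index range around the median in that dimension.
  const std::size_t mid = (begin + end) / 2;
  std::nth_element(order.begin() + begin, order.begin() + mid,
                   order.begin() + end,
                   [&](std::size_t a, std::size_t b) {
                     return pts[a][best_d] < pts[b][best_d];
                   });
  node->split_dim = best_d;
  node->split_val = pts[order[mid]][best_d];
  node->left = Build(pts, order, begin, mid, leaf_size);
  node->right = Build(pts, order, mid, end, leaf_size);
  return node;
}
```

To build, initialize order = {0, 1, …, N−1} and call Build(pts, order, 0, N); the median split keeps the tree balanced, giving O(N log N) construction.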
What algorithmic methods do we use?
• Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009]
• Hierarchical series expansions for kernel summations [2004, 2006, 2008]
• Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
• Parallel computing [1998, 2006, 2009]
Computational complexity using fast algorithms
• Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
• Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
• Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
• Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
• Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
• Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation O(N^(log n)), bipartite matching O(N) or O(1)
All-k-nearest-neighbors
• Naïve cost: O(N²)
• Fastest algorithm: generalized fast multipole method, aka multi-tree methods
– Technique: use two trees simultaneously (a structural sketch follows)
– [Gray and Moore 2001, NIPS; Ram, Lee, March, and Gray 2009, submitted; Riegel, Boyer and Gray, to be submitted]
– O(N), exact
• Applications: querying, first step of adaptive density estimation [Budavari et al., in prep]
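A structural C++ sketch of the two-tree recursion, under illustrative names and a simplified bound: a (query node, reference node) pair is pruned when the closest possible distance between their bounding boxes exceeds an upper bound on the worst nearest-neighbor distance of any query point below the query node. A real implementation visits the closer child pair first and maintains tighter bounds.

```cpp
// Dual-tree all-nearest-neighbor sketch (illustrative names and bounds).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Box { std::vector<double> lo, hi; };

// Smallest possible distance between any point of box a and any of box b.
double MinDist(const Box& a, const Box& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    s += gap * gap;
  }
  return std::sqrt(s);
}

double Dist(const std::vector<double>& x, const std::vector<double>& y) {
  double s = 0.0;
  for (std::size_t d = 0; d < x.size(); ++d) s += (x[d] - y[d]) * (x[d] - y[d]);
  return std::sqrt(s);
}

struct Node {
  Box box;
  std::vector<std::size_t> points;  // point indices (leaves only)
  Node *left = nullptr, *right = nullptr;
  // Upper bound on the worst current NN distance of any query point below.
  double bound = std::numeric_limits<double>::infinity();
};

// nn_dist[i]/nn_idx[i]: best neighbor found so far for query point i.
void AllNN(Node* q, Node* r, const std::vector<std::vector<double>>& data,
           std::vector<double>& nn_dist, std::vector<std::size_t>& nn_idx) {
  if (MinDist(q->box, r->box) > q->bound) return;  // prune: r cannot help q
  if (!q->left && !r->left) {                      // leaf-leaf base case
    for (std::size_t i : q->points) {
      for (std::size_t j : r->points) {
        if (i == j) continue;
        double d = Dist(data[i], data[j]);
        if (d < nn_dist[i]) { nn_dist[i] = d; nn_idx[i] = j; }
      }
    }
    q->bound = 0.0;
    for (std::size_t i : q->points) q->bound = std::max(q->bound, nn_dist[i]);
    return;
  }
  // Recurse on child pairs (a real implementation orders them by MinDist).
  std::vector<Node*> qs = q->left ? std::vector<Node*>{q->left, q->right}
                                  : std::vector<Node*>{q};
  std::vector<Node*> rs = r->left ? std::vector<Node*>{r->left, r->right}
                                  : std::vector<Node*>{r};
  for (Node* qc : qs)
    for (Node* rc : rs) AllNN(qc, rc, data, nn_dist, nn_idx);
  if (q->left) q->bound = std::max(q->left->bound, q->right->bound);
}
```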
Kernel density estimation
• Naïve cost: O(N²)
• Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
– Technique: Hermite operators, Monte Carlo (the Monte Carlo ingredient is sketched below)
– [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
– O(N), relative error guarantee
• Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
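The dual-tree fast Gauss transform is too involved for a slide, so the sketch below isolates just the Monte Carlo ingredient: estimate a Gaussian kernel sum from a uniform sample of reference points and scale up. The fixed sample size m and all names are illustrative; the actual method grows the sample adaptively until a CLT-style bound certifies the requested relative error.

```cpp
// Monte Carlo kernel-sum sketch (illustrative; one ingredient of the
// Monte Carlo multipole method, not the full algorithm).
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Gaussian kernel for squared distance d2 and bandwidth h.
double Kernel(double d2, double h) { return std::exp(-d2 / (2.0 * h * h)); }

// Estimate sum_j K(||q - x_j||) from a sample of m reference points.
double MCKernelSum(const std::vector<double>& q,
                   const std::vector<std::vector<double>>& refs,
                   double h, std::size_t m, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> pick(0, refs.size() - 1);
  double sum = 0.0;
  for (std::size_t s = 0; s < m; ++s) {
    const auto& x = refs[pick(rng)];
    double d2 = 0.0;
    for (std::size_t d = 0; d < q.size(); ++d)
      d2 += (q[d] - x[d]) * (q[d] - x[d]);
    sum += Kernel(d2, h);
  }
  // Scale the sample mean up to the full reference set; a real method also
  // tracks the sample variance to decide when the estimate is good enough.
  return static_cast<double>(refs.size()) * (sum / static_cast<double>(m));
}
```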
Nonparametric Bayes classification
• Naïve cost: O(N²)
• Fastest algorithm: dual-tree bounding with hybrid tree expansion
– Technique: bound each posterior (the pruning test is sketched below)
– [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
– O(N), exact
• Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]
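A minimal illustration of the “bound each posterior” idea (hypothetical types, not the published algorithm): maintain interval bounds on each class’s unnormalized posterior (prior times kernel sum), and stop refining as soon as the intervals separate.

```cpp
// Pruning test behind dual-tree NBC (illustrative only): once a lower
// bound on one class's unnormalized posterior exceeds an upper bound on
// the other's, the label is decided without computing exact kernel sums.
struct Bounds { double lo, hi; };  // bounds on prior * kernel sum for a class

// Returns +1 or -1 once the bounds separate, 0 if refinement is needed.
int TryClassify(const Bounds& a, const Bounds& b) {
  if (a.lo > b.hi) return +1;  // class A certainly wins
  if (b.lo > a.hi) return -1;  // class B certainly wins
  return 0;                    // still ambiguous: descend the trees further
}
```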
Principal component analysis
• Naïve cost: O(D³)
• Fastest algorithm: QUIC-SVD
– Technique: tree-based Monte Carlo using a cosine tree
– [Holmes, Gray, and Isbell 2009]
– O(1), relative error guarantee
• Applications: spectral decomposition
Friends-of-friends
• Naïve cost: O(N³)
• Fastest algorithm: dual-tree Borůvka algorithm
– Technique: maintain implicit friend sets (the classical Borůvka loop is sketched below)
– [March and Gray, to be submitted]
– O(N log N), exact
• Applications: cluster-finding, 2-sample testing
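For orientation, here is the classical Borůvka loop that the dual-tree algorithm accelerates: in each round, every component selects its cheapest outgoing edge. This sketch scans an explicit edge list; the dual-tree version instead finds each component’s nearest outside neighbor with kd-tree search, which is what removes the quadratic edge enumeration. Names are illustrative, and this is the textbook base algorithm, not the [March and Gray] code.

```cpp
// Classical Boruvka's MST algorithm over an explicit edge list.
#include <cstddef>
#include <vector>

struct Edge { std::size_t u, v; double w; };

struct UnionFind {
  std::vector<std::size_t> parent;
  explicit UnionFind(std::size_t n) : parent(n) {
    for (std::size_t i = 0; i < n; ++i) parent[i] = i;
  }
  std::size_t Find(std::size_t x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
    return x;
  }
  bool Union(std::size_t a, std::size_t b) {
    a = Find(a); b = Find(b);
    if (a == b) return false;
    parent[a] = b;
    return true;
  }
};

std::vector<Edge> Boruvka(std::size_t n, const std::vector<Edge>& edges) {
  UnionFind uf(n);
  std::vector<Edge> mst;
  bool merged = true;
  while (merged && mst.size() + 1 < n) {
    merged = false;
    // Cheapest outgoing edge per component (-1 = none found yet).
    std::vector<long> best(n, -1);
    for (std::size_t e = 0; e < edges.size(); ++e) {
      std::size_t a = uf.Find(edges[e].u), b = uf.Find(edges[e].v);
      if (a == b) continue;  // edge is internal to a component
      for (std::size_t c : {a, b})
        if (best[c] < 0 || edges[e].w < edges[best[c]].w)
          best[c] = static_cast<long>(e);
    }
    // Add each component's best edge; Union() rejects cycles safely.
    for (std::size_t c = 0; c < n; ++c)
      if (best[c] >= 0 && uf.Union(edges[best[c]].u, edges[best[c]].v)) {
        mst.push_back(edges[best[c]]);
        merged = true;
      }
  }
  return mst;
}
```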
n-point correlations
• Naïve cost: O(Nⁿ)
• Fastest algorithm: n-tree algorithms
– Technique: use n trees simultaneously (a naïve baseline is sketched below; the multi-tree recursion is illustrated at the end of this deck)
– [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray 2009, to be submitted]
– O(N^(log n)), exact
• Applications: 2-sample testing, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]
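For concreteness, the naïve O(N³) baseline for the 3-point case, with one fixed assignment of the scales (r1, r2, r3) to the triangle’s sides (a full estimator also handles the other side permutations; names illustrative):

```cpp
// Naive O(N^3) 3-point counter, the baseline the n-tree algorithms beat:
// count triples whose three pairwise distances fall under r1, r2, r3.
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

double Dist(const Point& a, const Point& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
  return std::sqrt(s);
}

long long Count3pt(const std::vector<Point>& x,
                   double r1, double r2, double r3) {
  long long count = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = i + 1; j < x.size(); ++j) {
      if (Dist(x[i], x[j]) >= r1) continue;  // prune early on the first side
      for (std::size_t k = j + 1; k < x.size(); ++k)
        if (Dist(x[j], x[k]) < r2 && Dist(x[i], x[k]) < r3) ++count;
    }
  return count;
}
```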
3-point runtime
[Figure: 3-point runtime scaling; biggest previous computation: 20K points]
• VIRGO simulation data, N = 75,000,000
• Naïve: 5×10⁹ sec (~150 years); multi-tree: 55 sec (exact)
• Scaling: n=2: O(N); n=3: O(N^(log 3)); n=4: O(N²)
Software
• MLPACK (C++): first scalable comprehensive ML library
• MLPACK Pro: very-large-scale data (parallel)
• MLPACK-db: fast data analytics in relational databases (SQL Server; on-disk)
– Thanks to Tamas Budavari
Fast algorithms in SQL Server: challenges
• Don’t copy the data: in-RAM kd-trees must become disk-based data structures
• Commercial database: no access to the internals (e.g. caching, memory management)
• Already optimized for different operations (SQL queries) and data structures (B+ trees)
• Big learning curve
Building a kd-tree in SQL Server
• Piggyback on the B+-tree and disk-layout layers: use SQL queries
• Hybrid approach: build the upper tree using a Morton z-curve (in SQL) and the lower tree in RAM; 3 passes total (the z-curve keying is sketched below)
• Clustered indices for fast sequential access of nodes and data
• Abstracted algorithms in managed C#
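The upper-tree keying relies on the standard Morton z-curve construction: interleave the bits of the quantized coordinates so that sorting by the key, which a B+-tree does natively, groups spatially nearby points. Below is the common 3-D bit-twiddling version in C++, shown to illustrate the keying idea; the MLPACK-db code operates on SQL types and higher dimensions and is not reproduced here.

```cpp
// Standard Morton z-curve key for 3-D points (illustrative).
#include <cstdint>

// Spread the low 21 bits of v so there are two zero bits between each bit.
uint64_t Spread3(uint64_t v) {
  v &= 0x1fffff;  // keep 21 bits (3 * 21 = 63 output bits)
  v = (v | (v << 32)) & 0x1f00000000ffffULL;
  v = (v | (v << 16)) & 0x1f0000ff0000ffULL;
  v = (v | (v << 8))  & 0x100f00f00f00f00fULL;
  v = (v | (v << 4))  & 0x10c30c30c30c30c3ULL;
  v = (v | (v << 2))  & 0x1249249249249249ULL;
  return v;
}

// Morton key for 3-D coordinates, each already quantized to 21 bits.
uint64_t Morton3(uint32_t x, uint32_t y, uint32_t z) {
  return Spread3(x) | (Spread3(y) << 1) | (Spread3(z) << 2);
}
```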
Current state
• So far: All-NN, KDE*, NBC*, 2pt*
• Performance for all-nearest-neighbors on 40M-point, 4-D SDSS color data:
– Naïve SQL Server: 50 yrs
– Batched SQL Server: 3 yrs
– Single-tree MLPACK-db: 14.5 hrs
– Dual-tree MLPACK-db: 227 min
– TPIE: 23 min; MLPACK Pro: 10 min
• We believe there is a 10x factor of fat, e.g. this currently uses only 100 MB of the available 2 GB of RAM
The end
• You can now use many of the state-of-the-art data analysis methods on large survey datasets
• We need more funding
• Talk to me… I have a lot of problems (ahem) but always want more!
Alexander Gray [email protected]
Ex: support vector machine
• Data: IJCNN1 [DP01a]; 2 classes; 49,990 training points; 91,701 testing points; 22 features
• SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec; complexity O(N²) to O(N³)
• SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec; complexity O(N/ε + 1/ε²)
• [Ouyang and Gray 2009, in submission]
A few new ones…
• Non-vector data (like time series, images, graphs, trees):
– Functional ICA [Mehta and Gray 2008]
– Generalized mean map kernel [Mehta and Gray, in submission]
• Visualizing high-dimensional data:
– Rank-based manifolds [Ouyang and Gray 2008]
– Density-preserving maps [Ozakin and Gray, in submission]
– Intrinsic dimension estimation [Ozakin and Gray, in prep]
• Accounting for measurement errors
• Probabilistic cross-match
2-point correlation
“How many pairs have distance < r?”

$$\sum_{i}^{N} \sum_{j>i}^{N} I\left(\|x_i - x_j\| < r\right)$$

Computed over a range of scales r, this characterizes an entire distribution: the 2-point correlation function.
The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
• Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics
• 2pcf definition: $$dP = \bar{n}^2\, dV_1\, dV_2\, \left[1 + \xi(r_{12})\right]$$
• 3pcf definition: $$dP = \bar{n}^3\, dV_1\, dV_2\, dV_3\, \left[1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13}) + \zeta(r_{12}, r_{23}, r_{13})\right]$$
3-point correlation
“How many triples have pairwise distances < r?”

$$\sum_{i}^{N} \sum_{j>i}^{N} \sum_{k>j}^{N} I(r_{ij} < r_1)\, I(r_{jk} < r_2)\, I(r_{ik} < r_3)$$

[Figure: a triangle of three points with side lengths r1, r2, r3]
Standard model: n>0 terms should be zero!
How can we count n-tuples efficiently?
“How many triples have pairwise distances < r ?”
Use n trees! [Gray & Moore, NIPS 2000]
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: tree nodes A, B, C; pairwise distance threshold r]
count{A,B,C} = ?
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: some pair of nodes lies entirely farther apart than r]
Exclusion: count{A,B,C} = 0!
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: every pair of nodes lies entirely within distance r]
Inclusion: count{A,B,C} = |A| × |B| × |C|
(a structural sketch of the full recursion follows)
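Putting the walkthrough together, a structural C++ sketch of the three-node count recursion with both prunes. This is illustrative: it omits the leaf-level brute force and the symmetry bookkeeping needed when A, B, and C range over the same dataset, both of which the real algorithm handles.

```cpp
// Three-tree count recursion sketch (illustrative; see caveats above).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct TNode {
  std::vector<double> lo, hi;  // bounding box
  std::size_t count;           // number of points below this node
  TNode *left = nullptr, *right = nullptr;
};

// Smallest and largest possible distances between points of two boxes.
double MinDist(const TNode& a, const TNode& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    double g = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    s += g * g;
  }
  return std::sqrt(s);
}
double MaxDist(const TNode& a, const TNode& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    double g = std::max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]);
    s += g * g;
  }
  return std::sqrt(s);
}

long long CountTriples(TNode* A, TNode* B, TNode* C, double r) {
  // Exclusion: some pair of boxes is entirely farther apart than r.
  if (MinDist(*A, *B) > r || MinDist(*B, *C) > r || MinDist(*A, *C) > r)
    return 0;
  // Inclusion: every pair of boxes lies entirely within distance r.
  if (MaxDist(*A, *B) < r && MaxDist(*B, *C) < r && MaxDist(*A, *C) < r)
    return static_cast<long long>(A->count) * B->count * C->count;
  // Otherwise split a non-leaf node and recurse on both halves.
  if (A->left)
    return CountTriples(A->left, B, C, r) + CountTriples(A->right, B, C, r);
  if (B->left)
    return CountTriples(A, B->left, C, r) + CountTriples(A, B->right, C, r);
  if (C->left)
    return CountTriples(A, B, C->left, r) + CountTriples(A, B, C->right, r);
  // All three are leaves but the boxes straddle r: a full implementation
  // brute-forces the remaining point triples here (omitted in this sketch).
  return 0;
}
```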