Fast Statistical Algorithms in Databases
Alexander Gray
Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
Jan 02, 2016
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Nikolaos Vasiloglou: Research scientist, PhD Elec Comp Engineering
3. Abhimanyu Aditya: MS, CS
4. Leif Poorman: MS, Physics
5. Dong Ryeol Lee: PhD student, CS + Math
6. Ryan Riegel: PhD student, CS + Math
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. James Waters: PhD student, Physics + CS
10. Hua Ouyang: PhD student, CS
11. Sooraj Bhat: PhD student, CS
12. Ravi Sastry: PhD student, CS
13. Long Tran: PhD student, CS
14. Wei Guan: PhD student, CS (co-supervised)
15. Nishant Mehta: PhD student, CS (co-supervised)
16. Wee Chin Wong: PhD student, ChemE (co-supervised)
17. Jaegul Choo: PhD student, CS (co-supervised)
18. Ailar Javadi: PhD student, ECE (co-supervised)
19. Shakir Sharfraz: MS student, CS
20. Subbanarasimhiah Harish: MS student, CS
Our challenge
• Allow statistical analyses (both simple and state-of-the-art) on terascale or petascale datasets
• Return answers with error guarantees
• Rigorous theoretical runtimes
• Ensure small constants (small actual runtimes, not just good asymptotic runtimes)
10 data analysis problems, and scalable tools we’d like for them
1. Querying (e.g. characterizing a region of space):
– spherical range-search O(N)
– orthogonal range-search O(N)
– k-nearest-neighbors O(N)
– all-k-nearest-neighbors O(N²)
2. Density estimation (e.g. comparing galaxy types):
– mixture of Gaussians
– kernel density estimation O(N²)
– manifold kernel density estimation O(N³) [Ozakin and Gray 2009, in submission]
– hyper-kernel density estimation O(N⁴) [Sastry and Gray 2009, in submission]
– L2 density tree [Ram and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
3. Regression (e.g. photometric redshifts):
– linear regression O(ND²)
– kernel regression O(N²)
– Gaussian process regression/kriging O(N³)
4. Classification (e.g. quasar detection, star-galaxy separation):
– k-nearest-neighbor classifier O(N²)
– nonparametric Bayes classifier O(N²)
– support vector machine (SVM) O(N³)
– isometric separation map O(N³) [Vasiloglou, Gray, and Anderson 2008]
– non-negative SVM O(N³) [Guan and Gray, in prep]
– false-positive-limiting SVM O(N³) [Sastry and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
– principal component analysis O(D³)
– non-negative matrix factorization
– kernel PCA (including LLE, IsoMap, et al.) O(N³)
– maximum variance unfolding O(N³)
– rank-based manifolds O(N³) [Ouyang and Gray 2008]
– isometric non-negative matrix factorization O(N³) [Vasiloglou, Gray, and Anderson 2008]
– density-preserving maps O(N³) [Ozakin and Gray 2009, in submission]
6. Outlier detection (e.g. new object types, data cleaning):
– by density estimation, by dimension reduction
– by robust Lp estimation [Ram, Riegel and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
7. Clustering (e.g. automatic Hubble sequence):
– by dimension reduction, by density estimation
– k-means
– mean-shift segmentation O(N²)
– hierarchical clustering (“friends-of-friends”) O(N³)
8. Time series analysis (e.g. asteroid tracking, variable objects):
– Kalman filter O(D²)
– hidden Markov model O(D²)
– trajectory tracking O(Nⁿ)
– functional independent component analysis [Mehta and Gray 2008]
– Markov matrix factorization [Tran, Wong, and Gray 2009, in submission]
10 data analysis problems, and scalable tools we’d like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
– LASSO regression
– L1 SVM
– Gaussian graphical model inference and structure search
– discrete graphical model inference and structure search
– mixed-integer SVM [Guan and Gray 2009, in submission]
– L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
– minimum spanning tree O(N³)
– n-point correlation O(Nⁿ)
– bipartite matching/Gaussian graphical model inference O(N³) [Waters and Gray, in prep]
Core methods of statistics / machine learning / mining
• Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
• Density estimation: kernel density estimation O(N²), mixture of Gaussians O(N)
• Regression: linear regression O(ND²), kernel regression O(N²), Gaussian process regression O(N³)
• Classification: nearest-neighbor classifier O(N²), nonparametric Bayes classifier O(N²), support vector machine
• Dimension reduction: principal component analysis O(D³), non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³)
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means O(N), hierarchical clustering O(N³), by dimension reduction
• Time series analysis: Kalman filter O(D³), hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation O(Nⁿ), bipartite matching O(N³)
4 main computational bottlenecks: generalized N-body, graphical models, linear algebra, optimization
GNPs (generalized N-body problems): a data structure is needed.
kd-trees: the most widely-used space-partitioning tree for general dimension
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
(others: ball-trees, cover-trees, etc.; a minimal build sketch follows)
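Below is a minimal C++ sketch of the standard kd-tree build used throughout: split on the widest dimension at the median and recurse until the leaves are small. All names and the leaf-size default are illustrative, not MLPACK’s actual API.

```cpp
// Minimal kd-tree build sketch (illustrative; not MLPACK's actual API).
// Each node owns a contiguous index range [begin, end) of `order`,
// split along the widest dimension at the median point.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

using Point = std::vector<double>;

struct KdNode {
  std::size_t begin, end;   // index range into `order`
  int split_dim = -1;       // -1 marks a leaf
  double split_val = 0.0;
  std::unique_ptr<KdNode> left, right;
};

std::unique_ptr<KdNode> Build(const std::vector<Point>& pts,
                              std::vector<std::size_t>& order,
                              std::size_t begin, std::size_t end,
                              std::size_t leaf_size = 20) {
  auto node = std::make_unique<KdNode>();
  node->begin = begin;
  node->end = end;
  if (end - begin <= leaf_size) return node;  // small enough: leaf

  // Pick the dimension with the largest spread.
  const int dim = static_cast<int>(pts[0].size());
  int best_d = 0;
  double best_spread = -1.0;
  for (int d = 0; d < dim; ++d) {
    double lo = pts[order[begin]][d], hi = lo;
    for (std::size_t i = begin; i < end; ++i) {
      lo = std::min(lo, pts[order[i]][d]);
      hi = std::max(hi, pts[order[i]][d]);
    }
    if (hi - lo > best_spread) { best_spread = hi - lo; best_d = d; }
  }

  // Partition the index range around the median in that dimension.
  const std::size_t mid = (begin + end) / 2;
  std::nth_element(order.begin() + begin, order.begin() + mid,
                   order.begin() + end,
                   [&](std::size_t a, std::size_t b) {
                     return pts[a][best_d] < pts[b][best_d];
                   });
  node->split_dim = best_d;
  node->split_val = pts[order[mid]][best_d];
  node->left = Build(pts, order, begin, mid, leaf_size);
  node->right = Build(pts, order, mid, end, leaf_size);
  return node;
}
```

To build, initialize order = {0, 1, …, N−1} and call Build(pts, order, 0, N); the median split keeps the tree balanced, giving O(N log N) construction.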
What algorithmic methods do we use?
• Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009]
• Hierarchical series expansions for kernel summations [2004, 2006, 2008]
• Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
• Parallel computing [1998, 2006, 2009]
Computational complexity using fast algorithms
• Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
• Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
• Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
• Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
• Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
• Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation O(N^(log n)), bipartite matching O(N) or O(1)
All-k-nearest-neighbors
• Naïve cost: O(N²)
• Fastest algorithm: generalized fast multipole method, aka multi-tree methods
– Technique: use two trees simultaneously (a structural sketch follows)
– [Gray and Moore 2001, NIPS; Ram, Lee, March, and Gray 2009, submitted; Riegel, Boyer and Gray, to be submitted]
– O(N), exact
• Applications: querying, first step of adaptive density estimation [Budavari et al., in prep]
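A structural C++ sketch of the two-tree recursion, under illustrative names and a simplified bound: a (query node, reference node) pair is pruned when the closest possible distance between their bounding boxes exceeds an upper bound on the worst nearest-neighbor distance of any query point below the query node. A real implementation visits the closer child pair first and maintains tighter bounds.

```cpp
// Dual-tree all-nearest-neighbor sketch (illustrative names and bounds).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Box { std::vector<double> lo, hi; };

// Smallest possible distance between any point of box a and any of box b.
double MinDist(const Box& a, const Box& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    s += gap * gap;
  }
  return std::sqrt(s);
}

double Dist(const std::vector<double>& x, const std::vector<double>& y) {
  double s = 0.0;
  for (std::size_t d = 0; d < x.size(); ++d) s += (x[d] - y[d]) * (x[d] - y[d]);
  return std::sqrt(s);
}

struct Node {
  Box box;
  std::vector<std::size_t> points;  // point indices (leaves only)
  Node *left = nullptr, *right = nullptr;
  // Upper bound on the worst current NN distance of any query point below.
  double bound = std::numeric_limits<double>::infinity();
};

// nn_dist[i]/nn_idx[i]: best neighbor found so far for query point i.
void AllNN(Node* q, Node* r, const std::vector<std::vector<double>>& data,
           std::vector<double>& nn_dist, std::vector<std::size_t>& nn_idx) {
  if (MinDist(q->box, r->box) > q->bound) return;  // prune: r cannot help q
  if (!q->left && !r->left) {                      // leaf-leaf base case
    for (std::size_t i : q->points) {
      for (std::size_t j : r->points) {
        if (i == j) continue;
        double d = Dist(data[i], data[j]);
        if (d < nn_dist[i]) { nn_dist[i] = d; nn_idx[i] = j; }
      }
    }
    q->bound = 0.0;
    for (std::size_t i : q->points) q->bound = std::max(q->bound, nn_dist[i]);
    return;
  }
  // Recurse on child pairs (a real implementation orders them by MinDist).
  std::vector<Node*> qs = q->left ? std::vector<Node*>{q->left, q->right}
                                  : std::vector<Node*>{q};
  std::vector<Node*> rs = r->left ? std::vector<Node*>{r->left, r->right}
                                  : std::vector<Node*>{r};
  for (Node* qc : qs)
    for (Node* rc : rs) AllNN(qc, rc, data, nn_dist, nn_idx);
  if (q->left) q->bound = std::max(q->left->bound, q->right->bound);
}
```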
Kernel density estimation
• Naïve cost: O(N²)
• Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
– Technique: Hermite operators, Monte Carlo (the Monte Carlo ingredient is sketched below)
– [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
– O(N), relative error guarantee
• Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
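The dual-tree fast Gauss transform is too involved for a slide, so the sketch below isolates just the Monte Carlo ingredient: estimate a Gaussian kernel sum from a uniform sample of reference points and scale up. The fixed sample size m and all names are illustrative; the actual method grows the sample adaptively until a CLT-style bound certifies the requested relative error.

```cpp
// Monte Carlo kernel-sum sketch (illustrative; one ingredient of the
// Monte Carlo multipole method, not the full algorithm).
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Gaussian kernel for squared distance d2 and bandwidth h.
double Kernel(double d2, double h) { return std::exp(-d2 / (2.0 * h * h)); }

// Estimate sum_j K(||q - x_j||) from a sample of m reference points.
double MCKernelSum(const std::vector<double>& q,
                   const std::vector<std::vector<double>>& refs,
                   double h, std::size_t m, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> pick(0, refs.size() - 1);
  double sum = 0.0;
  for (std::size_t s = 0; s < m; ++s) {
    const auto& x = refs[pick(rng)];
    double d2 = 0.0;
    for (std::size_t d = 0; d < q.size(); ++d)
      d2 += (q[d] - x[d]) * (q[d] - x[d]);
    sum += Kernel(d2, h);
  }
  // Scale the sample mean up to the full reference set; a real method also
  // tracks the sample variance to decide when the estimate is good enough.
  return static_cast<double>(refs.size()) * (sum / static_cast<double>(m));
}
```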
Nonparametric Bayes classification
• Naïve cost: O(N²)
• Fastest algorithm: dual-tree bounding with hybrid tree expansion
– Technique: bound each posterior (the pruning test is sketched below)
– [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
– O(N), exact
• Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]
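A minimal illustration of the “bound each posterior” idea (hypothetical types, not the published algorithm): maintain interval bounds on each class’s unnormalized posterior (prior times kernel sum), and stop refining as soon as the intervals separate.

```cpp
// Pruning test behind dual-tree NBC (illustrative only): once a lower
// bound on one class's unnormalized posterior exceeds an upper bound on
// the other's, the label is decided without computing exact kernel sums.
struct Bounds { double lo, hi; };  // bounds on prior * kernel sum for a class

// Returns +1 or -1 once the bounds separate, 0 if refinement is needed.
int TryClassify(const Bounds& a, const Bounds& b) {
  if (a.lo > b.hi) return +1;  // class A certainly wins
  if (b.lo > a.hi) return -1;  // class B certainly wins
  return 0;                    // still ambiguous: descend the trees further
}
```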
Principal component analysis
• Naïve cost: O(D³)
• Fastest algorithm: QUIC-SVD
– Technique: tree-based Monte Carlo using a cosine tree
– [Holmes, Gray, and Isbell 2009]
– O(1), relative error guarantee
• Applications: spectral decomposition
Friends-of-friends
• Naïve cost: O(N³)
• Fastest algorithm: dual-tree Borůvka algorithm
– Technique: maintain implicit friend sets (the classical Borůvka loop is sketched below)
– [March and Gray, to be submitted]
– O(N log N), exact
• Applications: cluster-finding, 2-sample testing
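For orientation, here is the classical Borůvka loop that the dual-tree algorithm accelerates: in each round, every component selects its cheapest outgoing edge. This sketch scans an explicit edge list; the dual-tree version instead finds each component’s nearest outside neighbor with kd-tree search, which is what removes the quadratic edge enumeration. Names are illustrative, and this is the textbook base algorithm, not the [March and Gray] code.

```cpp
// Classical Boruvka's MST algorithm over an explicit edge list.
#include <cstddef>
#include <vector>

struct Edge { std::size_t u, v; double w; };

struct UnionFind {
  std::vector<std::size_t> parent;
  explicit UnionFind(std::size_t n) : parent(n) {
    for (std::size_t i = 0; i < n; ++i) parent[i] = i;
  }
  std::size_t Find(std::size_t x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
    return x;
  }
  bool Union(std::size_t a, std::size_t b) {
    a = Find(a); b = Find(b);
    if (a == b) return false;
    parent[a] = b;
    return true;
  }
};

std::vector<Edge> Boruvka(std::size_t n, const std::vector<Edge>& edges) {
  UnionFind uf(n);
  std::vector<Edge> mst;
  bool merged = true;
  while (merged && mst.size() + 1 < n) {
    merged = false;
    // Cheapest outgoing edge per component (-1 = none found yet).
    std::vector<long> best(n, -1);
    for (std::size_t e = 0; e < edges.size(); ++e) {
      std::size_t a = uf.Find(edges[e].u), b = uf.Find(edges[e].v);
      if (a == b) continue;  // edge is internal to a component
      for (std::size_t c : {a, b})
        if (best[c] < 0 || edges[e].w < edges[best[c]].w)
          best[c] = static_cast<long>(e);
    }
    // Add each component's best edge; Union() rejects cycles safely.
    for (std::size_t c = 0; c < n; ++c)
      if (best[c] >= 0 && uf.Union(edges[best[c]].u, edges[best[c]].v)) {
        mst.push_back(edges[best[c]]);
        merged = true;
      }
  }
  return mst;
}
```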
n-point correlations
• Naïve cost: O(Nⁿ)
• Fastest algorithm: n-tree algorithms
– Technique: use n trees simultaneously (a naïve baseline is sketched below; the multi-tree recursion is illustrated at the end of this deck)
– [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray 2009, to be submitted]
– O(N^(log n)), exact
• Applications: 2-sample testing, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]
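For concreteness, the naïve O(N³) baseline for the 3-point case, with one fixed assignment of the scales (r1, r2, r3) to the triangle’s sides (a full estimator also handles the other side permutations; names illustrative):

```cpp
// Naive O(N^3) 3-point counter, the baseline the n-tree algorithms beat:
// count triples whose three pairwise distances fall under r1, r2, r3.
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

double Dist(const Point& a, const Point& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
  return std::sqrt(s);
}

long long Count3pt(const std::vector<Point>& x,
                   double r1, double r2, double r3) {
  long long count = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = i + 1; j < x.size(); ++j) {
      if (Dist(x[i], x[j]) >= r1) continue;  // prune early on the first side
      for (std::size_t k = j + 1; k < x.size(); ++k)
        if (Dist(x[j], x[k]) < r2 && Dist(x[i], x[k]) < r3) ++count;
    }
  return count;
}
```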
3-point runtime
[Figure: 3-point runtime scaling; biggest previous computation: 20K points]
• VIRGO simulation data, N = 75,000,000
• Naïve: 5×10⁹ sec (~150 years); multi-tree: 55 sec (exact)
• Scaling: n=2: O(N); n=3: O(N^(log 3)); n=4: O(N²)
Software
• MLPACK (C++): first scalable comprehensive ML library
• MLPACK Pro: very-large-scale data (parallel)
• MLPACK-db: fast data analytics in relational databases (SQL Server; on-disk)
– Thanks to Tamas Budavari
Fast algorithms in SQL Server: challenges
• Don’t copy the data: in-RAM kd-trees must become disk-based data structures
• Commercial database: no access to the internals (e.g. caching, memory management)
• Already optimized for different operations (SQL queries) and data structures (B+ trees)
• Big learning curve
Building a kd-tree in SQL Server
• Piggyback on the B+-tree and disk-layout layers: use SQL queries
• Hybrid approach: build the upper tree using a Morton z-curve (in SQL) and the lower tree in RAM; 3 passes total (the z-curve keying is sketched below)
• Clustered indices for fast sequential access of nodes and data
• Abstracted algorithms in managed C#
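The upper-tree keying relies on the standard Morton z-curve construction: interleave the bits of the quantized coordinates so that sorting by the key, which a B+-tree does natively, groups spatially nearby points. Below is the common 3-D bit-twiddling version in C++, shown to illustrate the keying idea; the MLPACK-db code operates on SQL types and higher dimensions and is not reproduced here.

```cpp
// Standard Morton z-curve key for 3-D points (illustrative).
#include <cstdint>

// Spread the low 21 bits of v so there are two zero bits between each bit.
uint64_t Spread3(uint64_t v) {
  v &= 0x1fffff;  // keep 21 bits (3 * 21 = 63 output bits)
  v = (v | (v << 32)) & 0x1f00000000ffffULL;
  v = (v | (v << 16)) & 0x1f0000ff0000ffULL;
  v = (v | (v << 8))  & 0x100f00f00f00f00fULL;
  v = (v | (v << 4))  & 0x10c30c30c30c30c3ULL;
  v = (v | (v << 2))  & 0x1249249249249249ULL;
  return v;
}

// Morton key for 3-D coordinates, each already quantized to 21 bits.
uint64_t Morton3(uint32_t x, uint32_t y, uint32_t z) {
  return Spread3(x) | (Spread3(y) << 1) | (Spread3(z) << 2);
}
```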
Current state
• So far: All-NN, KDE*, NBC*, 2pt*
• Performance for all-nearest-neighbors on 40M-point, 4-D SDSS color data:
– Naïve SQL Server: 50 yrs
– Batched SQL Server: 3 yrs
– Single-tree MLPACK-db: 14.5 hrs
– Dual-tree MLPACK-db: 227 min
– TPIE: 23 min; MLPACK Pro: 10 min
• We believe there is a 10x factor of fat, e.g. this currently uses only 100 MB of the available 2 GB of RAM
The end
• You can now use many of the state-of-the-art data analysis methods on large survey datasets
• We need more funding
• Talk to me… I have a lot of problems (ahem) but always want more!
Alexander Gray [email protected]
Ex: support vector machine
• Data: IJCNN1 [DP01a]; 2 classes; 49,990 training points; 91,701 testing points; 22 features
• SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec; complexity O(N²) to O(N³)
• SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec; complexity O(N/ε + 1/ε²)
• [Ouyang and Gray 2009, in submission]
A few new ones…
• Non-vector data (like time series, images, graphs, trees):
– Functional ICA [Mehta and Gray 2008]
– Generalized mean map kernel [Mehta and Gray, in submission]
• Visualizing high-dimensional data:
– Rank-based manifolds [Ouyang and Gray 2008]
– Density-preserving maps [Ozakin and Gray, in submission]
– Intrinsic dimension estimation [Ozakin and Gray, in prep]
• Accounting for measurement errors
• Probabilistic cross-match
2-point correlation
“How many pairs have distance < r?”

$$\sum_{i}^{N} \sum_{j>i}^{N} I\left(\|x_i - x_j\| < r\right)$$

Computed over a range of scales r, this characterizes an entire distribution: the 2-point correlation function.
The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
• Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics
• 2pcf definition: $$dP = \bar{n}^2\, dV_1\, dV_2\, \left[1 + \xi(r_{12})\right]$$
• 3pcf definition: $$dP = \bar{n}^3\, dV_1\, dV_2\, dV_3\, \left[1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13}) + \zeta(r_{12}, r_{23}, r_{13})\right]$$
3-point correlation
“How many triples have pairwise distances < r?”

$$\sum_{i}^{N} \sum_{j>i}^{N} \sum_{k>j}^{N} I(r_{ij} < r_1)\, I(r_{jk} < r_2)\, I(r_{ik} < r_3)$$

[Figure: a triangle of three points with side lengths r1, r2, r3]
Standard model: n>0 terms should be zero!
How can we count n-tuples efficiently?
“How many triples have pairwise distances < r ?”
Use n trees! [Gray & Moore, NIPS 2000]
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: tree nodes A, B, C; pairwise distance threshold r]
count{A,B,C} = ?
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: some pair of nodes lies entirely farther apart than r]
Exclusion: count{A,B,C} = 0!
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: every pair of nodes lies entirely within distance r]
Inclusion: count{A,B,C} = |A| × |B| × |C|
(a structural sketch of the full recursion follows)
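Putting the walkthrough together, a structural C++ sketch of the three-node count recursion with both prunes. This is illustrative: it omits the leaf-level brute force and the symmetry bookkeeping needed when A, B, and C range over the same dataset, both of which the real algorithm handles.

```cpp
// Three-tree count recursion sketch (illustrative; see caveats above).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct TNode {
  std::vector<double> lo, hi;  // bounding box
  std::size_t count;           // number of points below this node
  TNode *left = nullptr, *right = nullptr;
};

// Smallest and largest possible distances between points of two boxes.
double MinDist(const TNode& a, const TNode& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    double g = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
    s += g * g;
  }
  return std::sqrt(s);
}
double MaxDist(const TNode& a, const TNode& b) {
  double s = 0.0;
  for (std::size_t d = 0; d < a.lo.size(); ++d) {
    double g = std::max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]);
    s += g * g;
  }
  return std::sqrt(s);
}

long long CountTriples(TNode* A, TNode* B, TNode* C, double r) {
  // Exclusion: some pair of boxes is entirely farther apart than r.
  if (MinDist(*A, *B) > r || MinDist(*B, *C) > r || MinDist(*A, *C) > r)
    return 0;
  // Inclusion: every pair of boxes lies entirely within distance r.
  if (MaxDist(*A, *B) < r && MaxDist(*B, *C) < r && MaxDist(*A, *C) < r)
    return static_cast<long long>(A->count) * B->count * C->count;
  // Otherwise split a non-leaf node and recurse on both halves.
  if (A->left)
    return CountTriples(A->left, B, C, r) + CountTriples(A->right, B, C, r);
  if (B->left)
    return CountTriples(A, B->left, C, r) + CountTriples(A, B->right, C, r);
  if (C->left)
    return CountTriples(A, B, C->left, r) + CountTriples(A, B, C->right, r);
  // All three are leaves but the boxes straddle r: a full implementation
  // brute-forces the remaining point triples here (omitted in this sketch).
  return 0;
}
```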