How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery
Alexander Gray, Georgia Institute of Technology, College of Computing
Joint work with Gordon Richards (Princeton), Robert Nichol (Portsmouth ICG), Robert Brunner (UIUC/NCSA), Andrew Moore (CMU)
What I do
Often the most general and powerful statistical (or “machine learning”) methods are computationally infeasible.
I design machine learning methods and fast algorithms to make such statistical methods possible on massive datasets (without sacrificing accuracy).
Quasar detection
• Science motivation: use quasars to trace the distant/old mass in the universe
• Thus we want lots of sky: SDSS DR1, 2099 square degrees, to g = 21
• Biggest quasar catalog to date: tens of thousands
• Should be ~1.6M z<3 quasars to g=21
Classification
• Traditional approach: look at 2-d color-color plot (UVX method)
  – doesn’t use all available information
  – not particularly accurate (~60% for relatively bright magnitudes)
• Statistical approach: Pose as classification.
  1. Training: Train a classifier on a large set of known stars and quasars (‘training set’)
  2. Prediction: The classifier will label an unknown set of objects (‘test set’)
Which classifier?
1. Statistical question: Must handle arbitrary nonlinear decision boundaries, noise/overlap
2. Computational question: We have 16,713 quasars from [Schneider et al. 2003] (0.08 < z < 5.4) and 478,144 stars (semi-cleaned sky sample) – way too big for many classifiers
3. Scientific question: We must be able to understand what it’s doing and why, and inject scientific knowledge
Which classifier?
• Popular answers:
  – logistic regression: fast but linear only
  – naïve Bayes classifier: fast but quadratic only
  – decision tree: fast but not the most accurate
  – support vector machine: accurate but O(N³)
  – boosting: accurate but requires thousands of classifiers
  – neural net: reasonable compromise but awkward/human-intensive to train
• The good nonparametric methods are also black boxes – hard/impossible to interpret
Main points of this talk
1. nonparametric Bayes classifier
2. can be made fast (algorithm design)
3. accurate and tractable science
Optimal decision theory
[Figure: star density and quasar density curves f(x) vs. x, with the optimal decision boundary at their crossing]
Bayes’ rule, for classification
For classes $C_1$ (quasar) and $C_2$ (star), with priors $P(C_1), P(C_2)$ and estimated class-conditional densities $\hat f(x_q \mid C_1), \hat f(x_q \mid C_2)$:

$$P(C_1 \mid x_q) = \frac{P(C_1)\,\hat f(x_q \mid C_1)}{P(C_1)\,\hat f(x_q \mid C_1) + P(C_2)\,\hat f(x_q \mid C_2)}$$
So how do you estimate an arbitrary density?
Kernel Density Estimation (KDE)
$$\hat f(x_q) = \frac{1}{N}\sum_{r=1}^{N} K_h(x_q - x_r)$$

for example (Gaussian kernel):

$$\hat f(x_q) = \frac{1}{N\,C(h)}\sum_{r=1}^{N} \exp\!\left\{-\frac{\|x_q - x_r\|^2}{2h^2}\right\}$$
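The Gaussian-kernel estimator above translates directly into a few lines of NumPy. This is a minimal sketch (the function name `kde` and the bandwidth value are illustrative, not from the talk); note it costs O(N) per query point, which is precisely the cost the fast algorithms in this talk are designed to beat.

```python
import numpy as np

def kde(x_query, x_train, h):
    """Gaussian KDE at one query point:
    f_hat(x_q) = 1/(N C(h)) * sum_r exp{-||x_q - x_r||^2 / 2h^2}."""
    n, d = x_train.shape
    sq = np.sum((x_train - x_query) ** 2, axis=1)  # squared distances to all N samples
    c_h = (2.0 * np.pi * h ** 2) ** (d / 2)        # Gaussian normalizer C(h)
    return np.exp(-sq / (2.0 * h ** 2)).sum() / (n * c_h)
```

With samples drawn from a standard normal, `kde(np.array([0.0]), x, 0.2)` converges toward the true density $1/\sqrt{2\pi} \approx 0.399$ at the origin, illustrating the consistency property claimed on the next slide.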
Kernel Density Estimation (KDE)
$$\hat f(x_q) = \frac{1}{N}\sum_{r=1}^{N} K_h(x_q - x_r)$$
• There is a principled way to choose the optimal smoothing parameter h
• Guaranteed to converge to the true underlying density (consistency)
• Nonparametric – distribution need not be known
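One principled, automatable way to choose the smoothing parameter h is leave-one-out likelihood cross-validation: pick the h under which each training point is best explained by the other N−1 points. A minimal sketch (function names and candidate grid are hypothetical; the naive O(N²) pairwise computation is only feasible for small N, which again is where fast algorithms come in):

```python
import numpy as np

def loo_log_likelihood(x_train, h):
    """Leave-one-out log-likelihood of a Gaussian KDE at bandwidth h."""
    n, d = x_train.shape
    sq = np.sum((x_train[:, None, :] - x_train[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2)
    np.fill_diagonal(k, 0.0)             # exclude each point's self-contribution
    f_loo = k.sum(axis=1) / (n - 1)      # density at x_i estimated from the other n-1 points
    return float(np.sum(np.log(f_loo + 1e-300)))

def choose_bandwidth(x_train, candidates):
    """Pick the candidate h maximizing the held-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(x_train, h))
```

On standard-normal data this rejects both a severely undersmoothing h and a severely oversmoothing one in favor of a moderate value.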
Nonparametric Bayes Classifier (NBC) [1951]
$$P(C_1 \mid x_q) = \frac{P(C_1)\,\hat f(x_q \mid C_1)}{P(C_1)\,\hat f(x_q \mid C_1) + P(C_2)\,\hat f(x_q \mid C_2)}$$
• Nonparametric – distribution can be arbitrary
• This is Bayes-optimal, given the right densities
• Very clear interpretation
• Parameter choices are easy to understand, automatable
• There’s a way to enter prior information
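Putting the pieces together, the NBC is just per-class KDEs plugged into Bayes’ rule. A minimal sketch under illustrative assumptions (function names, bandwidths, and the prior argument are mine, not the talk's; in practice each class's h would be chosen by cross-validation, and prior information enters through `p_quasar`):

```python
import numpy as np

def kde_eval(x_queries, x_train, h):
    """Gaussian KDE f_hat evaluated at each row of x_queries."""
    n, d = x_train.shape
    sq = np.sum((x_queries[:, None, :] - x_train[None, :, :]) ** 2, axis=-1)
    c_h = (2.0 * np.pi * h ** 2) ** (d / 2)
    return np.exp(-sq / (2.0 * h ** 2)).sum(axis=1) / (n * c_h)

def quasar_posterior(x_queries, quasars, stars, h_q, h_s, p_quasar):
    """Bayes' rule with C1 = quasar, C2 = star:
    P(C1|x) = P(C1) f_hat(x|C1) / [P(C1) f_hat(x|C1) + P(C2) f_hat(x|C2)]."""
    num = p_quasar * kde_eval(x_queries, quasars, h_q)
    den = num + (1.0 - p_quasar) * kde_eval(x_queries, stars, h_s)
    return num / den
```

Thresholding the posterior at 0.5 gives the Bayes-optimal label when the density estimates are right; the threshold can be moved to trade completeness against contamination.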