Nearly-Tight Sample Complexity Bounds for Learning Mixtures of Gaussians

Hassan Ashtiani (1,2), Shai Ben-David (2), Christopher Liaw (3), Nicholas J. A. Harvey (3), Abbas Mehrabian (4), Yaniv Plan (3)
(1) McMaster University, (2) University of Waterloo, (3) University of British Columbia, (4) McGill University

Our results

Theorem. The sample complexity of learning mixtures of k Gaussians in $\mathbb{R}^d$ up to total variation distance $\varepsilon$ is (here $\tilde\Theta(\cdot)$ suppresses $\mathrm{polylog}(kd/\varepsilon)$ factors)
• $\tilde\Theta(kd^2/\varepsilon^2)$ for general Gaussians,
• $\tilde\Theta(kd/\varepsilon^2)$ for axis-aligned Gaussians.
Correspondingly, given $n$ samples from the true distribution, the minimax risk is $\tilde O(\sqrt{kd^2/n})$ and $\tilde O(\sqrt{kd/n})$, respectively.

PAC Learning of Distributions

• Given i.i.d. samples from an unknown target distribution $D$, output $\hat D$ such that
  $d_{\mathrm{TV}}(D, \hat D) = \sup_E \,|\Pr_D[E] - \Pr_{\hat D}[E]| = \tfrac{1}{2}\|f_D - f_{\hat D}\|_1 \le \varepsilon$.
• $\mathcal F$: an arbitrary class of distributions (e.g., Gaussians). $k\text{-mix}(\mathcal F)$: the class of $k$-mixtures of $\mathcal F$, i.e.,
  $k\text{-mix}(\mathcal F) := \{\sum_{i\in[k]} w_i D_i : w_i \ge 0,\ \sum_i w_i = 1,\ D_i \in \mathcal F\}$.
• The sample complexity of $\mathcal F$ is the minimum number $m_{\mathcal F}(\varepsilon)$ such that there is an algorithm that, given $m_{\mathcal F}(\varepsilon)$ samples from $D$, outputs $\hat D$ with $d_{\mathrm{TV}}(D, \hat D) \le \varepsilon$.
• PAC learning of distributions is not equivalent to parameter estimation, where the goal is to recover the parameters of the distribution.

Compression Framework

We develop a novel compression framework that uses few samples to build a representative family of distributions.
1. The encoder is given the true distribution $D \in \mathcal F$ and draws $m(\varepsilon)$ points from $D$.
2. The encoder sends $t(\varepsilon)$ of these points and/or "helper" bits to the decoder.
3. The decoder outputs $\hat D \in \mathcal F$ such that $d_{\mathrm{TV}}(D, \hat D) \le \varepsilon$ with probability 2/3.
If this is possible, we say $\mathcal F$ is $(m(\varepsilon), t(\varepsilon))$-compressible.

Compression Theorem

Compression Theorem [ABHLMP '18]. If $\mathcal F$ is $(m(\varepsilon), t(\varepsilon))$-compressible, then the sample complexity of learning $\mathcal F$ is $\tilde O\big(m(\varepsilon) + t(\varepsilon)/\varepsilon^2\big)$.

Compression of Mixtures

Lemma. If $\mathcal F$ is $(m(\varepsilon), t(\varepsilon))$-compressible, then $k\text{-mix}(\mathcal F)$ is $\big(k\,m(\varepsilon/k)/\varepsilon,\ k\,t(\varepsilon/k)\big)$-compressible.
Compression Theorem for Mixtures. If $\mathcal F$ is $(m(\varepsilon), t(\varepsilon))$-compressible, then the sample complexity of learning $k\text{-mix}(\mathcal F)$ is $\tilde O\big(k\,m(\varepsilon/k)/\varepsilon + k\,t(\varepsilon/k)/\varepsilon^2\big)$.

Example: Gaussians in R

Claim. Gaussians in $\mathbb R$ are $(1/\varepsilon, 2)$-compressible.
1. The true distribution is $N(\mu, \sigma^2)$; the encoder draws $1/\varepsilon$ points from $N(\mu, \sigma^2)$.
2. With high probability, there exist samples $X_i \approx \mu + \sigma$ and $X_j \approx \mu - \sigma$.
3. The encoder sends $X_i, X_j$; the decoder recovers $\mu$ and $\sigma$ approximately. (A code sketch of this scheme appears after the upper-bound proof below.)

Outline of the Algorithm

Assume: (i) $\mathcal F$ is $(m(\varepsilon), t(\varepsilon))$-compressible; (ii) the true distribution $D \in \mathcal F$.
Input: error parameter $\varepsilon > 0$.
1. Draw $m(\varepsilon)$ i.i.d. samples from $D$.
2. The encoder has at most $m(\varepsilon)^{t(\varepsilon)}\, 2^{t(\varepsilon)}$ possible messages, so enumerate all $M = m(\varepsilon)^{t(\varepsilon)}\, 2^{t(\varepsilon)}$ of the decoder's outputs, $D_1, \dots, D_M$. By assumption, $d_{\mathrm{TV}}(D_i, D) \le \varepsilon$ for some $i$.
3. Use the tournament algorithm of [DL '01] to find the best distribution among $D_1, \dots, D_M$; $O(\log(M)/\varepsilon^2)$ samples suffice for this step.
The total sample complexity is $m(\varepsilon) + O(\log(M)/\varepsilon^2) = \tilde O\big(m(\varepsilon) + t(\varepsilon)/\varepsilon^2\big)$.

Proof of Upper Bound

Lemma. Gaussians in $\mathbb R^d$ are $\big(O(d), \tilde O(d^2)\big)$-compressible.
Sketch of lemma. Suppose the true Gaussian is $N(\mu, \Sigma)$.
• The encoder draws $O(d)$ points from $N(\mu, \Sigma)$.
• These points capture the rough shape of the ellipsoid induced by $\mu, \Sigma$; the encoder sends the points and $\tilde O(d^2)$ helper bits, and the decoder approximates the ellipsoid.
• The decoder outputs $N(\hat\mu, \hat\Sigma)$.
Proof of upper bound. Combine the lemma with the compression theorem for mixtures.
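To make the one-dimensional compression scheme above concrete, here is a minimal Python sketch, assuming the two transmitted points are chosen as the samples nearest the empirical 16th and 84th percentiles (which for a Gaussian sit near $\mu - \sigma$ and $\mu + \sigma$). The function names and this particular selection rule are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def encode(samples):
    """Encoder: pick two actual samples that, with high probability, lie
    near mu - sigma and mu + sigma (the ~16th and ~84th percentiles of a
    Gaussian). These are the t(eps) = 2 transmitted points."""
    s = np.sort(samples)
    n = len(s)
    x_lo = s[int(0.16 * n)]  # ~ mu - sigma
    x_hi = s[int(0.84 * n)]  # ~ mu + sigma
    return x_lo, x_hi

def decode(x_lo, x_hi):
    """Decoder: recover (mu, sigma) approximately from the two points."""
    mu_hat = (x_lo + x_hi) / 2.0
    sigma_hat = (x_hi - x_lo) / 2.0
    return mu_hat, sigma_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu, sigma, eps = 3.0, 2.0, 0.01
    samples = rng.normal(mu, sigma, size=round(1 / eps))  # m(eps) = 1/eps draws
    mu_hat, sigma_hat = decode(*encode(samples))
    print(mu_hat, sigma_hat)  # approximately (3.0, 2.0)
```

The point of the sketch is that the decoder sees only the two transmitted samples, yet can output a Gaussian within small total variation distance of the truth, which is exactly what $(1/\varepsilon, 2)$-compressibility requires.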
Lower Bound Technique

Theorem (Fano's Inequality). If $D_1, \dots, D_r$ are distributions such that $d_{\mathrm{TV}}(D_i, D_j) \ge \varepsilon$ and $\mathrm{KL}(D_i, D_j) \le \varepsilon^2$ for all $i \ne j$, then the sample complexity is $\Omega(\log(r)/\varepsilon^2)$.
• Use the probabilistic method to find $2^{\Omega(d^2)}$ Gaussian distributions satisfying the hypothesis of Fano's inequality.
• Repeat the following procedure $2^{\Omega(d^2)}$ times:
  1. Start with the identity covariance matrix.
  2. Choose a random subspace $S_a$ of dimension $d/10$ and perturb the eigenvalues by $\varepsilon/\sqrt{d}$ along $S_a$. Let $\Sigma_a$ be the corresponding covariance matrix and set $D_a = N(0, \Sigma_a)$.
Claim. If $a \ne b$, then $\mathrm{KL}(D_a, D_b) \le \varepsilon^2$ and $d_{\mathrm{TV}}(D_a, D_b) \ge \varepsilon$ with probability $1 - \exp(-\Omega(d^2))$.
The construction can be lifted to obtain $2^{\Omega(kd^2)}$ $k$-mixtures of $d$-dimensional Gaussians satisfying the hypothesis of Fano's inequality.
Remark. The lower bound for axis-aligned Gaussians was proved by [SOAJ '14].

References

[DL '01] Devroye, L., and Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer Science & Business Media.
[SOAJ '14] Suresh, A. T., Orlitsky, A., Acharya, J., and Jafarpour, A. (2014). Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems (pp. 1395-1403).
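For concreteness, here is a hedged Python sketch of one step of the lower-bound construction described above. Drawing the subspace via the QR decomposition of a Gaussian matrix, and perturbing all eigenvalues along the subspace in the same (positive) direction, are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def perturbed_covariance(d, eps, rng):
    """One draw of the construction: identity covariance with eigenvalues
    along a random (d/10)-dimensional subspace S_a perturbed by eps/sqrt(d)."""
    k = max(1, d // 10)  # dimension of the random subspace S_a
    # Orthonormal basis of a uniformly random k-dimensional subspace,
    # obtained from the QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    # Sigma_a = I + (eps / sqrt(d)) * (projection onto S_a):
    # eigenvalues along S_a become 1 + eps/sqrt(d), the rest stay 1.
    return np.eye(d) + (eps / np.sqrt(d)) * (q @ q.T)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, eps = 50, 0.1
    sigma_a = perturbed_covariance(d, eps, rng)
    sigma_b = perturbed_covariance(d, eps, rng)
    # D_a = N(0, sigma_a) and D_b = N(0, sigma_b); repeating this 2^{Omega(d^2)}
    # times yields the family of Gaussians fed into Fano's inequality.
```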