A New Scoring Function for Bayesian Network Structure Learning Extended to Arbitrary Discrete Variables

Rachel Hodos (1), David Sontag (2)
1. Computational Biology Program, NYU   2. Computer Science Dept., NYU

Learning Task: Given n observations of m variables, learn the Bayesian network structure that generated the data.

1. Introduction

Focus of this work: devise a score that makes structure learning easier with more data, and that works for discrete variables with any number of states.

Input: data D, a matrix of n discrete observations of the variables, e.g.

    X1  X2  X3  X4  X5
     0   2   1   0   2
     1   1   0   0   0
     1   0   1   1   2

Output: a directed graph G over X1, ..., X5.

2. Bayesian Networks (BNs) and Structure Learning

• A BN is a graph G that represents a probability distribution, with one node per variable and one conditional probability distribution (CPD) $P(X_i \mid \mathrm{Pa}(X_i))$ per node:

$$P(X_1, X_2, \ldots, X_m) = \prod_{i=1}^{m} P(X_i \mid \mathrm{Pa}(X_i))$$

A simple example, with variables Rain (R), Sprinkler (S), and Grass Wet (G):

  P(R):
    R=T   R=F
    0.3   0.7

  P(S | R):
           S=T    S=F
    R=F    0.2    0.8
    R=T    0.01   0.99

  P(G | S, R):
    S  R    G=T    G=F
    T  T    0.99   0.01
    T  F    0.9    0.1
    F  T    0.8    0.2
    F  F    0      1

For example, P(R=T, S=F, G=T) = P(R=T) P(S=F | R=T) P(G=T | S=F, R=T) = 0.3 × 0.99 × 0.8 ≈ 0.238.

• Structure learning as discrete optimization: find $G^* = \mathrm{argmax}_G\, S(G \mid D)$ for a generic score function S.
• But what score to use? Maximum likelihood estimation, S = P(D | G), gives the complete graph, which is useless!
• Hence we need to encourage sparsity.

3. A New Score: SparsityBoost

• PROBLEM: Existing complexity penalties are data-agnostic, which makes the score more difficult to optimize as the amount of data grows.
• IDEA: Add a data-dependent term that boosts sparsity.
• HOW: Search for evidence that an edge should not be present, and boost the score of any graph that does not contain that edge.

The resulting score, extending SparsityBoost [2], is

$$S_{\mathrm{BIC+SB}}(G \mid D) \;=\; \underbrace{\log P(D \mid G) \;-\; \frac{\log n}{2}\,|G|}_{\text{BIC}} \;+\; \underbrace{\sum_{E_{ij} \notin G}\; \max_{C \in \Omega_{ij}}\; \min_{c \in \mathrm{val}(C)} \mathrm{sb}(X_i, X_j \mid D, C = c)}_{\text{SparsityBoost}}$$

• The first two terms are the Bayesian Information Criterion (BIC); $-\frac{\log n}{2}|G|$ is the data-agnostic complexity penalty, where |G| is the number of parameters of G.
• $\mathrm{sb}(X_i, X_j \mid D, C = c)$ is large for strong evidence of independence, small otherwise (a Monte Carlo sketch of this term appears after the references).
• $\Omega_{ij}$ is a set of conditioning sets, e.g. all subsets of size ≤ 2 excluding $X_i$ and $X_j$.
• We want the strongest evidence for each absent edge, so we take the max over conditioning sets; independence should hold for all values of the conditioning set, hence the min over $c$.

4. Bayesian Independence Testing

$\mathrm{sb}(X_i, X_j \mid D, C = c)$ answers: "Conditioning on C = c, how strongly do the data show that $X_i$ is independent of $X_j$?"

• Use conditional mutual information (MI):

$$\mathrm{MI}(P(X_i, X_j \mid c)) \;=\; \sum_{x_i, x_j} P(x_i, x_j \mid c)\, \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}$$

• Assuming a uniform prior over joint distributions, derive the posterior p(MI | D) [1].

[Figure: prior over I (n = 0) and posteriors p(I | D) for n = 400, 800, 1200, where I denotes mutual information; one curve for data D drawn from an independent distribution and one for D from a dependent distribution. As n increases, the posteriors become increasingly peaked.]
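To make the MI computation above concrete, here is a minimal Python sketch (not the authors' code) that evaluates the conditional mutual information from a table of counts for one value C = c; the count table is purely illustrative.

```python
import numpy as np

def conditional_mi(joint_counts):
    """MI(P(X_i, X_j | c)): plug the normalized counts into the formula above."""
    p = joint_counts / joint_counts.sum()   # P(x_i, x_j | c)
    pi = p.sum(axis=1, keepdims=True)       # marginal P(x_i | c)
    pj = p.sum(axis=0, keepdims=True)       # marginal P(x_j | c)
    mask = p > 0                            # convention: 0 log 0 = 0
    return float((p[mask] * np.log(p[mask] / (pi * pj)[mask])).sum())

# Example: counts for a 3-state X_i and a 2-state X_j (arbitrary discrete variables)
counts = np.array([[30, 10],
                   [ 5, 25],
                   [10, 20]])
print(conditional_mi(counts))  # near 0 for independent data, larger otherwise
```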
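The posterior p(MI | D) itself can be approximated by sampling: under a uniform prior, the posterior over the joint distribution of (X_i, X_j) given C = c is Dirichlet(counts + 1), so one can draw joint distributions and record the MI of each draw. Hutter and Zaffalon [1] derive this distribution analytically; the Monte Carlo sketch below (assuming `conditional_mi` from the previous sketch) is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_posterior_samples(joint_counts, num_samples=10000):
    """Approximate p(MI | D): sample joints from the Dirichlet(counts + 1)
    posterior induced by a uniform prior, and record the MI of each draw."""
    alpha = joint_counts.flatten() + 1.0
    draws = rng.dirichlet(alpha, size=num_samples)
    return np.array([conditional_mi(q.reshape(joint_counts.shape)) for q in draws])

# Mirroring the figure: more data from an independent distribution pushes the
# posterior mass toward MI = 0.
for n in (400, 1200):
    counts = np.full((2, 2), n // 4)   # perfectly independent counts
    print(n, mi_posterior_samples(counts).mean())
```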
The boost term is the negative log of the posterior tail mass above a threshold η:

$$\mathrm{sb}(X_i, X_j \mid D, C = c) \;=\; -\log \int_{\eta}^{\infty} p\big(\mathrm{MI}(P(X_i, X_j \mid C = c)) \mid D\big)\, d\mathrm{MI}$$

5. Results on Synthetic Data

• Start with a known network structure (the ALARM network [3]).
• Generate synthetic data from it (only binary data shown).
• Find the globally optimal structure with respect to each score (e.g., via the cutting-plane method of [4]).
• Each point shown is the average of ten independent experiments.
• Both accuracy and runtime improved; sketches of the boost term and of the error metric follow the references.

[Figure, Error: Structural Hamming Distance vs. number of samples (0–4000) for BIC, BIC+SB1, and BIC+SB2. Figure, Runtime: average runtime in seconds vs. number of samples (0–8000) for the same three scores.]

6. References

[1] Hutter, M., and Zaffalon, M. Distribution of Mutual Information. Computational Statistics and Data Analysis, 48(3):633–657, March 2005.
[2] Brenner, E., and Sontag, D. SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), July 2013.
[3] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., and Cooper, G. F. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247–256. Springer-Verlag, 1989.
[4] Cussens, J. Bayesian Network Learning with Cutting Planes. In F. G. Cozman and A. Pfeffer, editors, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 153–160. AUAI Press, Barcelona, 2011.
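For completeness, here is a Monte Carlo sketch (again not the authors' implementation) of the boost term sb defined in Section 4: it estimates the tail probability P(MI ≥ η | D) from Dirichlet posterior samples and returns its negative log. It reuses `mi_posterior_samples` from the Section 4 sketch; the 1/num_samples floor is a hypothetical guard against log(0), and the example counts are illustrative.

```python
import numpy as np

def sb(joint_counts, eta, num_samples=10000):
    """sb(X_i, X_j | D, C=c) = -log P(MI >= eta | D), estimated by Monte Carlo."""
    mis = mi_posterior_samples(joint_counts, num_samples)  # Section 4 sketch
    tail = max(float(np.mean(mis >= eta)), 1.0 / num_samples)  # avoid log(0)
    return -np.log(tail)

# Strong evidence of independence yields a large boost; clear dependence yields ~0.
print(sb(np.array([[250, 250], [250, 250]]), eta=0.01))  # large
print(sb(np.array([[450,  50], [ 50, 450]]), eta=0.01))  # ~0
```

In the full score of Section 3, this value would be minimized over the values c of each conditioning set and maximized over the conditioning sets in Ω_ij, for every absent edge E_ij.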
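Finally, a minimal sketch of the error metric reported in Section 5. Definitions of the Structural Hamming Distance vary slightly across papers; the variant below counts missing, extra, and reversed directed edges, charging each reversal once.

```python
def shd(true_edges, learned_edges):
    """Structural Hamming Distance between two DAGs given as sets of (u, v) edges."""
    t, l = set(true_edges), set(learned_edges)
    reversed_edges = {(a, b) for (a, b) in t if (b, a) in l}
    missing = {e for e in t - l if (e[1], e[0]) not in l}   # absent entirely
    extra   = {e for e in l - t if (e[1], e[0]) not in t}   # spurious entirely
    return len(reversed_edges) + len(missing) + len(extra)

# Example: one reversed edge plus one missing edge -> SHD = 2
print(shd({(1, 2), (2, 3)}, {(2, 1)}))
```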