J. Statist. Comput. Simul., 2000, Vol. 00, pp. 1–22
© 2000 OPA (Overseas Publishers Association) N.V.
Reprints available directly from the publisher.
Published by license under the Gordon and Breach Science Publishers imprint.
Photocopying permitted by license only.
Printed in Malaysia.
PROBABILITY MODEL SELECTION USING INFORMATION-THEORETIC
OPTIMIZATION CRITERION
BON K. SY*
Queens College/CUNY, Department of Computer Science, Flushing, NY 11367
(Received 10 September 1999; in final form 22 September 2000)
*Tel.: 718-997-3566; Fax: 718-997-3513; e-mail: [email protected]
Probability models with discrete random variables are often used for probabilistic inference and decision support. A fundamental issue lies in the choice and the validity of the probability model. An information-theoretic approach for probability model selection is discussed. It will be shown that the problem of probability model selection can be formulated as an optimization problem with linear (in)equality constraints and a non-linear objective function. An algorithm for model discovery/selection based on a primal–dual formulation similar to that of the interior point method is presented. The implementation of the algorithm for solving an algebraic system of linear constraints is based on singular value decomposition and the numerical method proposed by Kuenzi, Tzschach, and Zehnder. A preliminary comparative evaluation is also discussed.
Keywords: Probabilistic inference; Model selection; Information theory; Optimization
1. INTRODUCTION
In statistics, model selection based on information-theoretic criteria
dates back to the early 70s, when the Akaike Information Criterion
(AIC) was introduced (Akaike, 1973). Since then, various information
criteria have been introduced for statistical analysis. For example,
the Schwarz information criterion (SIC) (Schwarz, 1978) was introduced to
take into account the maximum likelihood estimate of the model, the
number of free parameters in the model, and the sample size. SIC has
In Section 5 the prototype implementation of the algorithm as an
ActiveX application was discussed. In this section the focus will be on
a preliminary evaluation of the ActiveX application. The evaluation
was conducted on an Intel Pentium 133 MHz laptop with 32 MB of RAM
TABLE II  A local optimal probability model of musk

P0–P7      0.03698   0.0565    0.002036  0.0008    0.0202    0.0115    0.005729  0.0029
P8–P15     0.003083  0.00197   0.003083  0.003083  0.006776  0.006776  0.006776  0.006776
P16–P23    0.006269  0.006269  0.006269  0.006269  0.009963  0.009963  0.009963  0.009963
P24–P31    0.007317  0.007317  0.007317  0.007317  0.01101   0.01101   0.0003    0.01101
P32–P39    0.00697   0.00318   0.004879  0.00136   0.00788   0.0026    0.008572  0.008572
P40–P47    0.005927  0.0035    0.005927  0.005927  0.00962   0.00962   0.00962   0.00962
P48–P55    0.009113  0.009113  0.009113  0.009113  0.012806  0.012806  0.012806  0.012806
P56–P63    0.01016   0.01016   0.01016   0.01016   0.013854  0.013854  0.013854  0.013854
P64–P71    0.000497  0.000497  0.000497  0.000497  0.00419   0.00419   0.00419   0.00419
P72–P79    0.001545  0.001545  0.001545  0.001545  0.005238  0.005238  0.005238  0.005238
P80–P87    0.004731  0.004731  0.004731  0.004731  0.008424  0.008424  0.008424  0.008424
P88–P95    0.005779  0.005779  0.005779  0.005779  0.009472  0.009472  0.009472  0.009472
P96–P103   0.003341  0.003341  0.003341  0.003341  0.007034  0.007034  0.007034  0.007034
P104–P111  0.004388  0.004388  0.004388  0.004388  0.008081  0.008081  0.008081  0.008081
P112–P119  0.007575  0.007575  0.007575  0.007575  0.011268  0.011268  0.011268  0.011268
P120–P127  0.008622  0.008622  0.008622  0.008622  0.012315  0.012315  0.012315  0.012315
and a hard disk with 420 MB of working space. The laptop was
equipped with an Internet Explorer 4.0 web browser with ActiveX
enabled. In addition, the laptop also had S-PLUS 4.5 installed, together
with NUOPT, a commercial add-on tool for numerical optimization.
NUOPT was used for the comparative evaluation.
A total of 17 test cases, indexed as C1–C17 and listed in Table III
in the next section, were derived from three sources for a
comparative evaluation. The first source is the Hock and Schittkowski
problem set (Hock, 1980), a test set also used by NUOPT for
its benchmark testing. The second source is a set of test cases that
originated in real-world problems. The third source is a set of
randomly generated test cases. All 17 test cases, listed as ``nexp1.dat'',
``nexp2.dat'', . . . , ``nexp17.dat'', are accessible via item 8 of [www
http://bonnet3.cs.qc.edu/jscs9902.html].
Seven test cases (C1–C7) are derived from the first source,
abbreviated as STC(Ci) (the ith problem in the set of Standard Test
Cases of the first source). Four test cases originated from real-world
problems in different disciplines: analytical chemistry, medical
diagnosis, sociology, and aviation. The remaining six test cases are
randomly generated and abbreviated as RTCi (the ith Randomly
generated Test Case).
The Hock and Schittkowski problem set comprises a variety of
optimization test cases classified by means of four attributes. The first
attribute is the type of objective function, such as linear, quadratic, or
general objective functions. The second attribute is the type of
constraint, such as linear equality constraints or upper and lower
bound constraints. The third is whether the problems are
regular or irregular. The fourth is the nature of the solution; i.e.,
whether the exact solution is known (so-called `theoretical' problems)
or not known (so-called `practical' problems).
In the Hock and Schittkowski problem set, only those test cases
with linear (in)equality constraints are applicable to the comparative
evaluation. Unfortunately those test cases need two pre-processing
steps, namely normalization and normality. These two pre-processing
steps are necessary because the variables in the original problems are
not necessarily bounded between 0 and 1, an implicit assumption for
terms in a probability model selection problem. Furthermore, all terms
TABLE III  Comparative evaluation results

Case  Source of test case/     # of   # of non-trivial  NUOPT: entropy    Prototype: entropy  Entropy upper bound  With initial
      application domain       terms  constraints       of optimal model  of optimal model    estimate             guess
C1    STC (P55)                6      3                 2.5475            2.55465             3.3 (2.58)           No
C2    STC (P21)                6      3                 0.971             0.971               *1.306 (2.58)        No
C3a   STC (P76)                4      4                 1.9839            0.9544              7.07 (2)             No
C3b   STC (P76)                                                           1.9855              –                    Yes
C4    STC (P86)                5      8                 –                 –                   –                    No
C5    STC (P110)               10     21                –                 –                   –                    No
C6    STC (P112), chemical     10     4                 3.2457            3.2442              3.966 (3.322)        No
      equilibrium
C7a   STC (P119)               16     9                 3.498             2.7889              – (4)                No
C7b   STC (P119)                                                          3.4986              – (4)                Yes
C8    RTC1                     4      3                 1.9988            1.991               2.9546 (2)           No
C9    Census Bureau/           12     10                2.8658            2.8656              – (3.5849)           No
      sociology study
C10   Chemical analysis        128    20                6.6935            6.6792              23.633 (7)           No
      (Ex. in Section 6)
C11   RTC2                     9      4                 2.9936            2.9936              4.247 (3.167)        No
C12a  RTC3                     4      3                 1.328             0.85545             *1.9687 (2)          No
C12b  RTC3                                                                1.328               6.242 (2)            Yes
C13   RTC4                     4      3                 2                 1.889               3.3589 (2)           No
C14a  RTC5                     4      3                 1.72355           0.971               – (2)                No
C14b  RTC5                                                                1.72355             5.742 (2)            Yes
C15a  RTC6                     4      3                 1.96289           0.996               – (2)                No
C15b  RTC6                                                                1.96289             6.09755 (2)          Yes
C16   Medical diagnosis        256    24                2.8658            3.37018             8.726 (8)            No
C17   Single-engine pilot      2187   10                10.13323          10.13357            *11.0406 (11.0947)   No
      training model

(An asterisk * marks the cases where the estimated upper bound is better than the theoretical bound, shown in parentheses; a dash – indicates no value.)
must sum to unity in order to satisfy the normality property,
which is an axiom of probability theory.
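The two pre-processing steps can be sketched as follows; this is a minimal illustration of the idea, not the paper's actual code, and the helper name and sample values are hypothetical:

```python
import numpy as np

def preprocess(x, lower, upper):
    """Hypothetical sketch of the two pre-processing steps described above:
    rescale each variable into [0, 1] (normalization), then divide by the
    total so that all terms sum to unity (normality)."""
    x = np.asarray(x, dtype=float)
    scaled = (x - lower) / (upper - lower)   # normalization: bound into [0, 1]
    return scaled / scaled.sum()             # normality: terms sum to 1

p = preprocess([3.0, 7.0, 5.0], lower=1.0, upper=9.0)
print(p.sum())  # 1.0 — the transformed variables form a valid probability vector
```

After this transformation, each variable can be interpreted as a joint probability term, which is what makes a Hock and Schittkowski test case usable as a model selection problem.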
The second source consists of four test cases. These four cases (C9,
C10, C16 and C17) originated from real-world problems. The first
case, C9, is from census data analysis for studying social patterns. The
second case, C10, is from analytical chemistry, for classifying whether a
molecule is musk-like. The third case, C16, is from medical diagnosis.
The last one is from aviation, illustrating a simple model of
aerodynamics for single-engine pilot training.
In addition to the seven ``benchmark'' test cases and the four test
cases from real-world problems, six additional test cases (C8, C11–
C15) are included for the comparative evaluation. These six cases,
indexed by RTCi (the ith randomly generated test case), are generated
by a reverse engineering approach that guarantees knowledge of
a solution. Note that the seven test cases from the Hock and
Schittkowski problem set are not guaranteed to have a solution after the
introduction of the normality constraint (i.e., all variables add up to
one). For the test cases originating from real-world
problems, there is again no guarantee of the existence of solution(s).
Consequently, the inclusion of these six cases constitutes yet
another test source that is important for the comparative evaluation.
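The reverse engineering idea can be sketched as follows; the generation details below are an assumption for illustration, not the paper's actual generator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of "reverse engineering" a solvable test case: start
# from a known feasible probability vector, then derive linear constraints
# that it satisfies by construction, so the generated case is guaranteed
# to have at least one solution.
n_terms, n_constraints = 4, 3
p_true = rng.random(n_terms)
p_true /= p_true.sum()                 # a valid probability model (sums to 1)

A = rng.random((n_constraints, n_terms))
b = A @ p_true                         # right-hand sides computed FROM p_true

# p_true is feasible for the generated system A p = b, sum(p) = 1, p >= 0
assert np.allclose(A @ p_true, b)
assert abs(p_true.sum() - 1.0) < 1e-12
```

Because the constraints are derived from an existing probability vector rather than posed independently, feasibility is guaranteed by construction.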
8. PRELIMINARY COMPARATIVE EVALUATION
The results of the comparative evaluation are summarized in Table III.
The first column in the table is the case index of a test case. The second
column indicates the source of the test case. The third column is
the number of joint probability terms in a model selection problem.
The fourth column is the number of non-trivial constraints. In general,
the degree of difficulty in solving a model selection problem is
proportional to the number of joint probability terms in a model and
the number of constraints.
The fifth and the sixth columns are the expected Shannon entropy of
the optimal model identified by the commercial tool NUOPT and the
ActiveX application respectively. Recall that the objective is to find a
model that is least biased, and thus of maximal entropy, with respect to
unknown information while preserving the known information
stipulated as constraints. Hence, a model with a greater entropy value
is a better model than one with a smaller entropy value.
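This maximum-entropy objective can be illustrated on a toy problem with four terms and one constraint besides normality; the constraint values are invented for illustration, and the check below uses plain enumeration rather than the paper's primal–dual/SVD algorithm:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the 0*log(0) = 0 convention."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy constraint besides normality: p0 + p1 = 0.3 (hence p2 + p3 = 0.7).
# The least biased (maximum entropy) model spreads each group's mass
# uniformly within the group:
p_star = np.array([0.15, 0.15, 0.35, 0.35])

# Any other feasible model satisfying the same constraints has lower entropy:
rng = np.random.default_rng(1)
for _ in range(1000):
    a = rng.uniform(0, 0.3)            # split 0.3 between p0 and p1
    b = rng.uniform(0, 0.7)            # split 0.7 between p2 and p3
    q = np.array([a, 0.3 - a, b, 0.7 - b])
    assert entropy(q) <= entropy(p_star) + 1e-12
print(entropy(p_star))                 # about 1.8813 bits
```

A model with greater entropy commits to less beyond what the constraints stipulate, which is exactly why the fifth and sixth columns are compared by magnitude.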
The seventh column reports the upper bound of the entropy of an
optimal model. Two estimated maximum entropies are reported. The
first estimate is derived based on the method discussed earlier (Steps 6
and 7). The second estimate (in parentheses) is the theoretical upper
bound of the entropy of a model, log2 n, where n is the
number of probability terms (3rd column) in the model. Further details
about the theoretical upper bound can be found elsewhere
(Shannon, 1972).
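The theoretical bound log2 n is attained exactly by the uniform model, which a short check makes concrete (the skewed comparison vector is an invented example):

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# The theoretical upper bound log2(n) is attained by the uniform model
# over n terms; any other distribution has strictly lower entropy.
n = 128                                # e.g., case C10 has 128 terms
uniform = np.full(n, 1.0 / n)
print(shannon_entropy(uniform))        # 7.0 = log2(128), the value in parentheses for C10

skewed = np.array([0.9] + [0.1 / (n - 1)] * (n - 1))
assert shannon_entropy(skewed) < np.log2(n)
```

This is why the parenthesized estimate ignores the constraints entirely: it depends only on the number of terms in the model.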
The last column indicates whether an initial guess is provided for the
prototype software to solve a test case. The prototype implementation
allows a user to provide an initial guess before the algorithm is applied
to solve a test case (e.g., C3b, C7b, C12b, C14b, and C15b). There
could be cases where other tools reach a local optimal solution
that can be further improved. This feature provides flexibility to
further improve such a local optimal solution.
9. DISCUSSION OF COMPARATIVE EVALUATION
As shown in Table III, both our prototype implementation and the
commercial tool NUOPT solved 15 out of the 17 cases. Further
investigation reveals that the remaining two test cases have no
solution. For these 15 cases, the two systems reached optimal
solutions of similar quality in most of the cases. In one
case (C16) the ActiveX application reached a solution significantly
better than NUOPT, while NUOPT reached a significantly better
solution in four cases (C3, C12, C14, C15). It is interesting to note that
the ActiveX application actually improved the optimal solution of
NUOPT in one of these four cases (C3) when the ActiveX application
used the optimal solution of NUOPT as an initial guess in an attempt
to further improve the solutions of these problems.
Referring to the seventh column, the result of estimating the upper
bound entropy value of the global optimal model using the proposed
dual formulation approach is less than satisfactory. In only three
(marked with *) of the 15 solvable test cases does the proposed dual
formulation approach yield a better upper bound in comparison to
the theoretical upper bound, which does not consider the constraints of a
test case. Furthermore, in only one of the three cases is the estimated
upper bound derived by the dual formulation approach significantly
better than the theoretical upper bound. This suggests that, on our
test cases, the utility of the dual formulation for estimating an upper
bound is limited.
It should also be noted that the proposed dual formulation fails to
produce an upper bound in three of the 15 solvable cases (C7, C14,
and C15). This is because transposing the original constraint set may
turn slack variables in the primal formulation into variables in the
dual formulation that must be non-negative, yet SVD cannot
guarantee solutions in which those variables are non-negative. When
the solution derived using SVD assigns negative values to the slack
variables, the dual formulation fails to produce an estimate of the
upper bound, which occurred three times in the 15 solvable test cases.
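The failure mode is easy to reproduce: the SVD pseudo-inverse returns the minimum-norm solution of a linear system, which carries no sign guarantee. The system below is an invented example, not one of the paper's constraint sets:

```python
import numpy as np

# An underdetermined system solved via the SVD-based pseudo-inverse yields
# the minimum-norm solution, which has no sign guarantee: nothing prevents
# it from assigning negative values to variables (e.g., slack variables)
# that the dual formulation requires to be non-negative.
A = np.array([[1.0, -2.0,  1.0],
              [2.0,  1.0, -3.0]])
b = np.array([1.0, 0.0])

x = np.linalg.pinv(A) @ b              # SVD-based minimum-norm solution
assert np.allclose(A @ x, b)           # the constraints are satisfied...
print(x)                               # ...but the second component is negative
assert (x < 0).any()                   # so a non-negativity requirement fails
```

Whenever the returned solution violates non-negativity on a slack variable, the dual formulation has no valid point from which to compute an upper bound estimate.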
In the comparative evaluation we chose not to report a
quantitative comparison of run-time performance, for two reasons.
First, our prototype implementation allows a user to control the
number of iterations indirectly through a parameter that defines the
size of the incremental step in the search direction of SVD, similar to that
of the interior point method. The current setting is 100 steps in the
interval of possible bounds in the linear search direction of SVD.
When the number of steps is reduced, the speed of reaching a local
optimal solution increases. In other words, in the ActiveX application
one can trade the quality of the local optimal solution for speed.
Furthermore, if one provides a ``good'' initial guess, one may be able
to afford a large incremental step, which improves speed, without
much compromise in the quality of the solution. Therefore, a direct
comparative evaluation of run-time performance would not be
appropriate.
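The speed/quality trade-off can be sketched with a simple grid line search; this is an illustrative stand-in for the prototype's search along the SVD direction, and the function and objective below are hypothetical:

```python
import numpy as np

def line_search(f, x, direction, lo, hi, n_steps):
    """Hypothetical sketch of the tunable search described above: evaluate
    n_steps points in [lo, hi] along `direction` and keep the best one.
    More steps give a finer (better) step at a higher cost per iteration."""
    steps = np.linspace(lo, hi, n_steps)
    candidates = [x + s * direction for s in steps]
    values = [f(c) for c in candidates]
    return candidates[int(np.argmin(values))]

f = lambda x: np.sum((x - 0.37) ** 2)          # toy objective with optimum at 0.37
x0, d = np.zeros(3), np.ones(3)
coarse = line_search(f, x0, d, 0.0, 1.0, 10)   # fewer steps: fast but coarse
fine = line_search(f, x0, d, 0.0, 1.0, 100)    # 100 steps, as in the prototype's setting
assert f(fine) <= f(coarse)                    # more steps, a better (or equal) step
```

Since a good initial guess shrinks the interval that matters, it lets a coarse grid perform nearly as well as a fine one, which is why run-time comparisons at a fixed step count would be misleading.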
The second reason not to conduct a direct comparative evaluation of
run-time is the need to re-formulate a test case using SIMPLE
(System for Interactive Modeling and Programming Language
Environment) before NUOPT can ``understand'' the problem and
hence solve it. Since NUOPT optimizes its run-time performance by
dividing the workload of solving a problem into two steps, and only
reports the elapsed time of the second step, it is not possible to
establish an objective ground for a comparative evaluation of run-time.
Nevertheless, the ActiveX application solves all the test cases
quite efficiently. As is typical of any ActiveX deployment, a one-time
download of the ActiveX application from the Internet is required. The
download takes about five minutes using a 33.6 kbps modem via an
ActiveX-enabled IE4 web browser. Afterwards, almost all the test
cases can be solved instantly, except the last case (C17), in our
computing environment: a Pentium 133 MHz laptop with 32 MB of
RAM and 420 MB of hard disk.
10. CONCLUSION
An algorithm for probability model selection has been presented. It was
shown that probability model selection can be formulated as an optimization
problem with linear (in)equality constraints and a non-linear objective
function. The proposed algorithm adopts an approach similar to the
primal–dual formulation of the interior point method. The
theoretical development of the algorithm has led to a property that
can be interpreted semantically as the weight of evidence in
information theory. Our prototype implementation of the algorithm
is web deployable and can be accessed via an ActiveX-enabled
browser. A preliminary comparative evaluation was made using a beta
version of the NUOPT for S-PLUS commercial package.
Because of the nature of the problem and the use of browser
technology, the comparative test cases are relatively small
problems, but with non-trivial complexity due to high interactions
(thus dependency) among the model parameters. In the comparative
evaluation, it is noted that both the ActiveX implementation and
NUOPT can solve most of the model selection problems. An
interesting result is that on the problems that both our algorithm
and NUOPT can solve, the optimality of the models identified by the
ActiveX application and NUOPT is comparable.
There are still many interesting issues to explore in probability
model selection. For example, any probability model
selection problem has an inherent exponential complexity with respect
to the number of random variables. One approach to this
issue is to reduce the search space through parameter tuning (e.g.,
granularization) or transformation (e.g., mapping probability space to
log probability space) when probabilistic independence properties exist
among the variables. Another interesting issue is the convergence and
solvability of the optimization. There are probability constraint sets
with a degree of freedom that, in theory, corresponds to a
permissible search space, yet the proposed algorithm and existing
commercial packages may not solve them well. The relationship
between the theoretical convergence rate and the solvability of a
practical implementation is another interesting issue to explore. These
issues will be the focus of our future study.
Acknowledgements
The author is grateful to the Associate Editor, Dr. Morgan Wang, and
an anonymous reviewer for their comments, which helped to improve the
manuscript. Professor David Locke of the Chemistry Department at
Queens College provided technical proofreading and comments on the
``musk'' illustration. Ms. XiuYi Huang, under the partial support of a
grant from the PSC-CUNY Research Award, designed and implemented
the web page that provides convenient entry points to the various
resources mentioned in this paper. The beta version of NUOPT used
in this paper was made available through serving as a beta test site for
Mathsoft Inc. Preparation of the manuscript and web hosting resources
were supported in part by NSF DUE grant #97-51135.
References
Akaike, H. (1973) ``Information Theory and an Extension of the Maximum Likelihood Principle'', In: Proceedings of the 2nd International Symposium of Information Theory, Eds. Petrov, B. N. and Csaki, E., Budapest: Akademiai Kiado, pp. 267–281.
Borgwardt, K. H. (1987) The Simplex Method: A Probabilistic Analysis, Springer-Verlag, Berlin.
Chen, J. and Gupta, A. K. (1997) ``Testing and Locating Variance Change Points with Application to Stock Prices'', Journal of the American Statistical Association, 92(438), 739–747.
Chen, J. H. and Gupta, A. K. (1998) ``Information Criterion and Change Point Problem for Regular Models'', Technical Report No. 98-05, Department of Mathematics and Statistics, Bowling Green State University.
Good, I. J. (1960) ``Weight of Evidence, Correlation, Explanatory Power, Information, and the Utility of Experiments'', Journal of the Royal Statistical Society, Series B, 22, 319–331.
Gupta, A. K. and Chen, J. (1996) ``Detecting Changes of Mean in Multidimensional Normal Sequences with Applications to Literature and Geology'', Computational Statistics, 11, 211–221.
Hock, W. and Schittkowski, K. (1980) Test Examples for Nonlinear Programming Codes, Lecture Notes in Economics and Mathematical Systems 187, Eds. Beckmann, M. and Kunzi, H. P., Springer-Verlag, Berlin, Heidelberg, New York.
Johnson, G. D. (1999) ``Quantitative Characterization of Watershed-delineated Landscape Patterns in Pennsylvania: An Evaluation of Conditional Entropy Profiles'' (Abstract), Ninth Lukacs Symposium: Frontiers of Environmental and Ecological Statistics for the 21st Century, Bowling Green State University, Bowling Green, Ohio.
Karmarkar, N. (1984) ``A New Polynomial-time Algorithm for Linear Programming'', Combinatorica, 4(4), 373–395.
Kuenzi, H. P., Tzschach, H. G. and Zehnder, C. A. (1971) Numerical Methods of Mathematical Optimization, Academic Press, New York.
Martin, D. (1998) Seminar in ``Financial Topics in S-PLUS'', Mathsoft Inc., Washington D.C.
Murphy, P. M. and Aha, D. W. (1994) UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine (second musk data set), http://www.ics.uci.edu/~mlearn/MLRepository.html
Schwarz, G. (1978) ``Estimating the Dimension of a Model'', The Annals of Statistics, 6, 461–464.
Shannon, C. E. and Weaver, W. (1972) The Mathematical Theory of Communication, University of Illinois Press, Urbana.
The NUOPT for S-PLUS Manual (1998) Mathematical Systems, Inc.
Sy, B. K. (1999) ``Pattern-based Inference Approach for Data Mining'', Proceedings of the 18th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), New York.
Wright, S. (1997) Primal–Dual Interior Point Methods, SIAM, ISBN 0-89871-382-X.
[www http://bonnet3.cs.qc.edu/jscs9902.html]