J. Statist. Comput. Simul., 2000, Vol. 00, pp. 1-22. Reprints available directly from the publisher. Photocopying permitted by license only. (c) 2000 OPA (Overseas Publishers Association) N.V. Published by license under the Gordon and Breach Science Publishers imprint. Printed in Malaysia.

PROBABILITY MODEL SELECTION USING INFORMATION-THEORETIC OPTIMIZATION CRITERION

BON K. SY*
Queens College/CUNY, Department of Computer Science, Flushing, NY 11367

(Received 10 September 1999; In final form 22 September 2000)

Probability models with discrete random variables are often used for probabilistic inference and decision support. A fundamental issue lies in the choice and the validity of the probability model. An information-theoretic approach for probability model selection is discussed. It will be shown that the problem of probability model selection can be formulated as an optimization problem with linear (in)equality constraints and a non-linear objective function. An algorithm for model discovery/selection based on a primal-dual formulation similar to that of the interior point method is presented. The implementation of the algorithm for solving an algebraic system of linear constraints is based on singular value decomposition and the numerical method proposed by Kuenzi, Tzschach, and Zehnder. A preliminary comparative evaluation is also discussed.

Keywords: Probabilistic inference; Model selection; Information theory; Optimization

1. INTRODUCTION

In statistics, model selection based on information-theoretic criteria dates back to the early 70s, when the Akaike Information Criterion (AIC) was introduced (Akaike, 1973). Since then, various information criteria have been introduced for statistical analysis. For example, the Schwarz information criterion (SIC) (Schwarz, 1978) was introduced to take into account the maximum likelihood estimate of the model, the number of free parameters in the model, and the sample size.
*Tel.: 718-997-3566, Fax: 718-997-3513, e-mail: [email protected]

SIC has been further studied by Chen and Gupta (Chen, 1997) (Gupta, 1996) for testing and locating change points in the mean and variance of multivariate statistical models with independent random variables. Chen (Chen, 1998) further elaborated SIC for the change point problem in regular models. Potential applications of information criteria for model selection in fields such as environmental statistics and financial statistics (Johnson, 1999) (Martin, 1998) are also discussed elsewhere. To date, studies of information criteria for model selection have focused on statistical models with continuous random variables, and in many cases, under the assumption of iid (independent and identically distributed) samples. In this work, the focus is rather different. Our focus is on probability models with discrete random variables. While the application of the statistical models discussed elsewhere is mainly statistical inference based on statistical hypothesis tests, the application of the probability models here is probabilistic inference. The context of probabilistic inference could range from probability assessment of an event outcome to identifying the most probable events, or from testing independence among random variables to identifying event patterns of significant event association. In decision science, the utility of a decision support model may be evaluated based on the amount of biased information. Let's assume we have a set of simple financial decision models. Each model manifests an oversimplified relationship among strategy, risk, and return as three interrelated discrete binary-valued random variables. The purpose of these models is to assist an investor in choosing the type of an investment portfolio based on an individual's investment objective; e.g., a decision could be whether one should construct a portfolio in which resource allocation is diversified.
Let's assume one's investment objective is to have a moderate return with relatively low risk. If a model returns an equal preference on strategies to, or not to, diversify, it may not be too useful in assisting an investor to make a decision. On the other hand, a model that is biased towards one strategy over the other may be more informative in assisting one to make a decision, even if the decision does not have to be the correct one. For example, a model may choose to bias towards a strategy based on a probability assessment of strategy conditioned on risk and return.

In information theory, the amount of biased probability information can be measured by means of expected Shannon entropy (Shannon, 1972), defined as -Σ_i P_i Log P_i. Let (Ω, J, P) be a given probability model, where Ω is the sample space, J is a σ-field of sets each of which is a subset of Ω, and P(E) is the probability of an event E ∈ J. Let's also define a linear (in)equality constraint on a probability model as a linear combination of the joint probabilities P_i in a model. The model selection problem discussed in this paper can be formally formulated as below: Let M = {M_i = (Ω, J, P) : i = 1, 2, ...} be a set of probability models where all models share an identical set of primitive events defined as the supremum (the least upper bound) taken over all partitions of Ω. Let C = {C_i : i = 1, 2, ...} be a set of linear (in)equality constraints defined over the joint probability of primitive events. Within the space of all probability models bounded by C, the problem of probability model selection is to find the model that maximizes expected Shannon entropy. It can be shown that the problem of model selection just described is actually an optimization problem with linear order constraints defined over the joint probability terms of a model, and a non-linear objective function (defined by -Σ_i P_i Log P_i).
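The role of expected Shannon entropy as a measure of bias can be sketched in a few lines of Python (the function name is ours, not from the paper): a uniform model carries no biased information and attains maximal entropy, while a model biased toward one outcome has lower entropy.

```python
import math

def expected_shannon_entropy(probs, base=2.0):
    """Expected Shannon entropy -sum_i P_i log P_i of a discrete model.

    Zero-probability terms contribute nothing (lim p->0 of p log p = 0).
    """
    if abs(sum(probs) - 1.0) > 1e-9 or any(p < 0 for p in probs):
        raise ValueError("probs must form a probability distribution")
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# An unbiased (uniform) model over four joint events attains the maximum
# entropy log2(4) = 2 bits; a model biased toward one outcome carries
# more "biased information" and therefore has lower entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
biased = [0.70, 0.10, 0.10, 0.10]
print(expected_shannon_entropy(uniform))  # 2.0 bits
print(expected_shannon_entropy(biased))   # less than 2.0 bits
```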
It is important to note an interesting property of the model selection problem just described:

Property 1: Principle of Minimum Information Criterion. An optimal probability model is one that minimizes bias, in terms of expected entropy, in probability assessments that depend on unknown information, while it preserves known biased probability information specified as constraints in C.

2. OPTIMIZATION

In the operations research community, techniques for solving various optimization problems have been discussed extensively. The Simplex and Karmarkar algorithms (Borgwardt, 1987) (Karmarkar, 1984) are two methods that are constantly being used, and are robust for solving many linear optimization problems. Wright (Wright, 1997) has written an excellent textbook on the primal-dual formulation of the interior point method with different variants of search methods for solving non-linear optimization problems. It was discussed in Wright's book that the primal-dual interior point method is robust in searching for optimal solutions to problems that satisfy the KKT conditions with a second order objective function. At first glance, it seems that existing optimization techniques can be readily applied to solve the probability model selection problem. Unfortunately, there are subtle difficulties that make probability model selection a more challenging optimization problem. First of all, each model parameter in the optimization problem is a joint probability term bounded between 0 and 1. This essentially limits the polytope of the solution space to be much smaller in comparison to a non-probability based optimization problem with an identical set of non-trivial constraints (i.e., those constraints other than 1 ≥ P_i ≥ 0). In addition, the choice of robust optimization methodologies is relatively limited due to the nature of the non-linear log objective function.
The primal-dual interior point method is one of the few promising techniques for the probability model selection problem. Unfortunately, the primal-dual interior point method requires the existence of an initial solution, and an iterative process of solving an algebraic system for estimating incrementally revised errors between a current suboptimal solution and the estimated global optimal solution. This raises two problems. First, the primal-dual formulation requires a natural augmentation of the size of the algebraic system to be solved, even if the augmented matrix happens to be a sparse matrix. Since the polytope of the solution space is "shrunk" by the trivial constraints 1 ≥ P_i ≥ 0, solving the augmented algebraic system in successive iterations to estimate incrementally revised errors is not always possible. Another even more fundamental problem is that the convergence of the iterations in the primal-dual interior point method relies on the KKT conditions. Such conditions may not even exist in many practical model selection problems. As a result, an optimization algorithm taking a hybrid approach is developed. It follows the spirit of the primal-dual interior point method, but deviates from the traditional approach in the search towards an optimal solution.

3. OPTIMIZATION ALGORITHM FOR PROBABILITY MODEL SELECTION

The basic idea of the proposed optimization algorithm for probability model selection problems consists of eight steps:

Step 1 Construct the primal formulation of the algebraic system of equations defined by the constraint set in the form Ax = b; i.e., each constraint in C, with a proper slack variable introduced when necessary, accounts for a row in matrix A.

Step 2 Obtain a feasible solution for the primal formulation using the numerical method proposed by Kuenzi, Tzschach and Zehnder (Kuenzi, 1971). Obtain another feasible solution by applying the Singular Value Decomposition (SVD) algorithm.
Compare the two solutions and choose the better one (in terms of expected Shannon entropy) as the initial solution x.

Step 3 Identify the column vectors {V_i : i = 1, 2, ...} from the by-product of SVD that correspond to the zero entries in the diagonal matrix of the SVD of A.

Step 4 Obtain multiple alternative solutions y by constructing linear combinations of the initial solution x with the V_i; i.e., Ay = b where y = x + Σ_i a_i V_i for some constants a_i.

Step 5 Identify the local optimal model x = [P_1, ..., P_n]^T for which -Σ_i P_i log P_i is the largest among all solution models found. In other words, the local optimal solution minimizes c^T x, where c = [log P_1, ..., log P_n]^T.

Step 6 Construct the dual formulation A^T λ = c, where c = [log P_1, ..., log P_n]^T, and solve for λ using SVD subject to maximizing b^T λ.

Step 7 Compare the estimated value of the objective function b^T λ (due to the global optimal model) with the value of the objective function c^T x (due to the local optimal model).

Step 8 Solve the optimization problem with one constraint, x^T Log x' = b^T λ, subject to Min |1 - Σ_i P'_i|, where Log x' = [Log P'_1, ..., Log P'_n]^T and x is the optimal solution vector obtained in Step 4. If x' satisfies all axioms of probability theory, the optimal probability model to be selected is x'. Otherwise, the optimal probability model is x.

4. DISCUSSION OF THE ALGORITHM

@Step 1 A typical scenario in probability model selection is expert testimony or valuable information obtained from data analysis expressed in terms of probability constraints.
For example, consider the following case where one is interested in an optimal probability model with two binary-valued random variables, say {X1: 0, 1} and {X2: 0, 1}, and P0 = Pr(X1:0, X2:0), P1 = Pr(X1:0, X2:1), P2 = Pr(X1:1, X2:0), P3 = Pr(X1:1, X2:1).

Expert testimony:

P(x1:0) ≤ 0.65 ⇔ P0 + P1 + S = 0.65 for some S ≥ 0
P(x2:0) = 0.52 ⇔ P0 + P2 = 0.52
Σ_i P_i = 1.0 ⇔ P0 + P1 + P2 + P3 = 1.0

Primal formulation:

    [ 1 1 0 0 1 ] [ P0 ]   [ 0.65 ]
    [ 1 0 1 0 0 ] [ P1 ] = [ 0.52 ] ,  Ax = b
    [ 1 1 1 1 0 ] [ P2 ]   [ 1.00 ]
                  [ P3 ]
                  [ S  ]

In general, a probability model with n probability terms, v inequality constraints, and w equality constraints will result in a constraint matrix A of size (v + w) by (n + v). In the example just shown, n = 4, v = 1, and w = 2.

@Steps 2 and 3 The basic idea of the Kuenzi, Tzschach, and Zehnder approach for solving an algebraic system of linear constraints is to reformulate the constraint set by introducing (v + w) variables, one for each constraint. Using the previous example,

Z0 = 0.65 - P0 - P1 - S
Z1 = 0.52 - P0 - P2
Z2 = 1.0 - P0 - P1 - P2 - P3

with Z0 ≥ 0, Z1 ≥ 0, Z2 ≥ 0. The above (in)equalities can be thought of as the constraint set of yet another optimization problem with a cost function Min[Z0 + Z1 + Z2]. Note that a feasible solution of this new optimization problem is a vector of seven parameters [Z0 Z1 Z2 P0 P1 P2 P3]. If the global minimum can be achieved in this new optimization problem, it is equivalent to Z0 = Z1 = Z2 = 0, which in turn gives a feasible solution for the original problem. That is, the P_i's in the global optimal solution [0 0 0 P0 P1 P2 P3] constitute a feasible solution of the original problem. In addition to the Kuenzi, Tzschach, and Zehnder approach for solving the algebraic system of linear constraints, the Singular Value Decomposition (SVD) algorithm is also applied to obtain another feasible solution. The basic concept of the SVD of A is to express A in the form A = UDV^T, where A is a (v + w) by (n + v) matrix.
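Before turning to the SVD details, the Kuenzi, Tzschach and Zehnder reformulation above can be sketched in pure Python. The Z variables act as constraint residuals: a candidate is feasible for the original system exactly when every Z vanishes. The candidate values below are the local optimal model quoted later in the text, with the slack S = 0.65 - P0 - P1 = 0.1125 derived by us from the first constraint.

```python
# Phase-1 style reformulation for the example: one Z per constraint,
# with the feasibility test Z0 = Z1 = Z2 = 0. (An LP solver would
# minimize Z0 + Z1 + Z2 over non-negative Z to find such a point.)

def residuals(P0, P1, P2, P3, S):
    Z0 = 0.65 - P0 - P1 - S           # P(x1:0) <= 0.65, slack S >= 0
    Z1 = 0.52 - P0 - P2               # P(x2:0) = 0.52
    Z2 = 1.00 - P0 - P1 - P2 - P3     # normality constraint
    return Z0, Z1, Z2

candidate = dict(P0=0.2975, P1=0.24, P2=0.2225, P3=0.24, S=0.1125)
Z = residuals(**candidate)
print(Z)  # all (numerically) zero: the candidate is feasible
```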
U is a (v + w) by (n + v) orthonormal matrix satisfying U^T U = I, where I is an identity matrix. D is a diagonal matrix of size (n + v) by (n + v). V transpose (V^T) is a (n + v) by (n + v) orthonormal matrix satisfying VV^T = I. It can be shown that a solution to Ax = b is simply x = V D̃^-1 U^T b. Note that D̃^-1 can be easily constructed from D by taking the reciprocal of the non-zero diagonal entries of D while replicating the diagonal entry from D if an entry in D is zero.

@Step 4 It can also be shown that whenever there is a zero diagonal entry d_i,i = 0 in D of the SVD, a linear combination of a solution vector x with the corresponding ith column vector of V is also a solution to Ax = b. This is due to the fact that such a column vector of V is actually mapped to the null space through the transformation matrix A; i.e., AV_i = 0. This enables a search for the optimal probability model along the direction of the linear combination of the initial solution vector and the column vectors of V whose corresponding diagonal entries in D equal zero. A local optimal solution of the example discussed in Step 1 that minimizes Σ_i P_i log P_i (or maximizes -Σ_i P_i log P_i) is shown below:

x = [P0 P1 P2 P3]^T = [0.2975 0.24 0.2225 0.24]^T with Σ_i P_i log_e P_i = -1.380067

At first glance, one may wonder why two different approaches (instead of just SVD) are used to obtain an initial feasible solution, since SVD generates an initial solution as well as the necessary information for deriving multiple solutions in a search for an optimal solution. There are two reasons. First, the initial feasible solution defines the region of the search space the search path traverses. Therefore, using two different approaches to obtain an initial feasible solution improves the chance of searching in the space where the global optimal solution resides.
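Steps 2-4 on the same example can be sketched with NumPy. Variable names are ours, and the coarse grid search over the null-space coefficients stands in for the paper's search procedure; it is only meant to illustrate that every point x0 + Σ_i a_i V_i solves Ay = b, so the entropy search reduces to a search over the coefficients a_i.

```python
import numpy as np

# Example system from Step 1: columns are [P0 P1 P2 P3 S].
A = np.array([[1., 1., 0., 0., 1.],   # P0 + P1 + S = 0.65
              [1., 0., 1., 0., 0.],   # P0 + P2 = 0.52
              [1., 1., 1., 1., 0.]])  # P0 + P1 + P2 + P3 = 1.00
b = np.array([0.65, 0.52, 1.00])

# Step 2: SVD of A and the solution x0 = V D~^-1 U^T b, where zero
# singular values are left at zero (the pseudo-inverse construction).
U, d, Vt = np.linalg.svd(A)
d_full = np.zeros(A.shape[1]); d_full[:d.size] = d
inv = np.array([1.0 / s if s > 1e-10 else 0.0 for s in d_full])
utb = np.zeros(A.shape[1]); utb[:A.shape[0]] = U.T @ b
x0 = Vt.T @ (inv * utb)

# Step 3: columns of V whose singular value is zero satisfy A v = 0.
null_vecs = [Vt.T[:, i] for i in range(A.shape[1]) if d_full[i] <= 1e-10]

def entropy(p):
    """Expected Shannon entropy (nats) of [P0..P3]; -inf if infeasible."""
    q = p[:4]
    return -np.sum(q * np.log(q)) if np.all(q > 0) and p[4] >= 0 else -np.inf

# Step 4: every y = x0 + a1*V1 + a2*V2 also solves Ay = b, so search the
# null-space coefficients for the highest-entropy feasible model.
grid = np.linspace(-1.0, 1.0, 201)
best = max((x0 + a1 * null_vecs[0] + a2 * null_vecs[1]
            for a1 in grid for a2 in grid), key=entropy)
print(best[:4], entropy(best))  # a feasible model with entropy near 1.385 nats
```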
Second, although SVD is a robust and efficient algorithm for solving linear algebraic systems, the trivial non-negativity constraints (P_i ≥ 0) are difficult to include in the formulation required for applying SVD. As a consequence, a solution obtained from applying SVD, albeit satisfying all non-trivial constraints, may not satisfy the trivial constraints. Recall from the previous discussion that the mechanism for generating multiple solutions is based on Ay = b where y = x + Σ_i a_i V_i for some constants a_i. It is also now known that SVD may fail to generate a feasible solution that satisfies both trivial and non-trivial constraints. When this happens, however, one can still apply the same mechanism for generating multiple solutions using the feasible solution obtained from the numerical method of Kuenzi, Tzschach, and Zehnder.

@Steps 5 and 6 The local optimal solution is found in the previous step through a linear search along the vectors that are mapped to the null space in the SVD process. Our approach to avoid getting trapped in a local plateau is to conduct an optimization in the log space that corresponds to the dual part of the model selection problem formulation. Specifically, the constant vector for the algebraic system of the dual part can be constructed using the local optimal solution obtained in the previous step; i.e.,

    [ 1 1 1 1 0 0 0 ] [ X0 ]   [ log 0.2975 ]
    [ 1 0 1 0 1 0 0 ] [ X1 ]   [ log 0.24   ]
    [ 0 1 1 0 0 1 0 ] [ X2 ] = [ log 0.2225 ] ,  A^T λ = c
    [ 0 0 1 0 0 0 1 ] [ S0 ]   [ log 0.24   ]
                      [ S1 ]
                      [ S2 ]
                      [ S3 ]

subject to maximizing b^T λ, where

c^T = [log 0.2975  log 0.24  log 0.2225  log 0.24]
b^T = [0.65  0.52  1.0]

Note that the column corresponding to the slack variable is dropped in the dual formulation since it does not contribute useful information to estimating the optimal bound of the solution.
In addition, the optimization problem defined in this dual part consists only of linear order constraints and a linear objective function. However, there are subtle issues involved in solving the dual part. It is not sufficient just to apply SVD to solve A^T λ = c, because a legitimate solution requires non-negative values of S_i (i = 0, 1, 2, 3) in the solution vector. In the above example, although there are four equations, there are only three variables that can span the entire range of real numbers. The remaining four slack variables can only span the non-negative range of real numbers. It is not guaranteed that there will always be a solution for the dual part even if there is a local optimal solution. For the above example the local optimal solution is listed below:

λ^T = [0.331614  0.108837  -2.320123] with maximal b^T λ = -2.047979

@Steps 7 and 8 In the previous step, the optimal value of the objective function b^T λ is an estimate of the optimality of the solution obtained in the primal part. When c^T x = b^T λ, x is the optimal probability model with respect to maximum expected entropy. It is often the case that c^T x > b^T λ when there is a stable solution for the dual part. This can be proved easily with a few steps of derivation similar to that of the standard primal-dual formulation for optimization described in (Wright, 1997). In this case, we can formulate yet another optimization problem to conduct a search for the optimal solution. In particular, the optimization problem has only one constraint, defined as x^T Log x' = b^T λ, with an objective function defined as Min |1 - Σ_i P'_i|, where Log x' = [Log P'_1, ..., Log P'_n]^T and x is the optimal solution vector obtained in the primal part. Note that λ is related to the log probability terms; thus the solution Log x' represents a log probability model. The concept behind x^T Log x' = b^T λ is to try to get a probability model that has a weighted information measure equal to the estimated global value b^T λ.
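The numbers quoted above can be checked directly in pure Python. The slack computation follows the dual system A^T λ + S = c with S ≥ 0; the check confirms dual feasibility and that b^T λ bounds c^T x from below (weak duality):

```python
import math

# Primal local optimum and dual solution quoted in the text.
x = [0.2975, 0.24, 0.2225, 0.24]        # [P0 P1 P2 P3]
lam = [0.331614, 0.108837, -2.320123]   # dual variables [X0 X1 X2]
b = [0.65, 0.52, 1.0]
c = [math.log(p) for p in x]            # c = [log P0 ... log P3]

# Rows of A^T for [P0 P1 P2 P3] (slack column of the primal dropped).
At = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]

# Dual feasibility: slacks S_i = c_i - (A^T lam)_i must be non-negative.
S = [ci - sum(a * l for a, l in zip(row, lam)) for row, ci in zip(At, c)]

ctx = sum(ci * xi for ci, xi in zip(c, x))    # = sum_i P_i log P_i
btl = sum(bi * li for bi, li in zip(b, lam))  # dual objective b^T lam
print(S)          # all >= 0
print(ctx, btl)   # ctx > btl: the dual objective bounds c^T x from below
```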
This is an interesting property:

Property 2 The constraint x^T Log x' = b^T λ defines a similarity measure identical to the weight of evidence (Good, 1960) in comparing two models.

To understand Property 2, consider the case c^T x = b^T λ. Then x^T Log x' = b^T λ becomes x^T Log x' = x^T c, or x^T (c - Log x') = 0. x^T (c - Log x') = 0 is equivalent to Σ_i P_i log(P_i / P'_i) = 0, which has a semantic interpretation in information theory that two models are identical based on the weight of evidence measurement function. It is worth noting that the optimization in Step 8 is a search in log space, but NOT necessarily log probability space, since the boundary of the probability space is defined in the objective function rather than in the constraint set. As a consequence, a solution from Step 8 does not necessarily correspond to a legitimate candidate for probability model selection.

5. PROTOTYPE IMPLEMENTATION DETAILS

The algorithm discussed in the previous section has been implemented in Borland C++ Builder 3.0 and wrapped as an ActiveX control application component. The ActiveX application component can be accessed and executed directly from an ActiveX enabled web browser. At the present time, ActiveX technology is only supported in Microsoft environments such as Windows 95, Windows 98, and NT, and Internet Explorer is the only ActiveX enabled web browser. The current implementation has been tested on all three Microsoft environments (Windows 95, 98 and NT) for web deployment. The web access URL for the ActiveX application can be found in item 5 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. The format of the data file that specifies an optimization problem can be found in item 6.1 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. The data file for the primal part of the example used in the previous sections can be found in item 6.2 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. This data file is readily usable as an
input file for the ActiveX application. In the step for the primal part, all discovered probability models are stored in the file "pro_src.dat", where each row corresponds to a probability model [P_1, ..., P_n]^T. The optimal model is stored in the file "optimal.dat", and the information content log P_i of each event is stored in "entropy.dat". The data file for the dual part of the example, referred to as "PDStep2.dat", can be accessed through item 6.3 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. The data file for Step 8, referred to as "PDStep3.dat", can be found in item 6.4 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. These two data files, "PDStep2.dat" and "PDStep3.dat", are readily usable as input files for the ActiveX application as well. In the implementation of the algorithm, the application also has a feature to support probabilistic inference using multiple models. The limitation is that each query must be expressed as a linear combination of the joint probability terms. A probability interval will be estimated for each query. A sample query file can be found in item 6.5 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. This query file is for the example in the previous section, and can be used as input for the "probabilistic inference" option of the ActiveX application. Further details about accessing the software implementation can be obtained from the author.

6. A PRACTICAL EXAMPLE USING A REAL WORLD PROBLEM

Synthetic molecules may be classified as musk-like or not musk-like. A molecule is classified as musk-like if it has certain chemical binding properties. The chemical binding properties of a molecule depend on its spatial conformation. The spatial conformation of a molecule can be represented by distance measurements between the center of the molecule and its surface along certain rays. These distance measurements can be characterized by 165 continuous-valued attributes (Murphy, 1994).
A common task in "musk" analysis is to determine whether a given molecule has a spatial conformation that falls into the musk-like category. Our recent study shows that it is possible to use only six discretized variables (together with an additional flag) to accomplish the task satisfactorily (with a performance index ranging from 80% to 91%, with an average of 88%). Prior to the model selection process, there is a process of pattern analysis for selecting the seven variables out of the 165 attributes and for discretizing the selected variables. Details of the pattern analysis are beyond the scope of this paper and can be found in Sy (Sy, 1999). Based on the "musk" data set available elsewhere (Murphy, 1994) with 6598 records of 165 attributes, six variables are identified and discretized into binary-valued variables according to their mean values. These six variables, referred to as V1 to V6, are from columns 38, 126, 128, 134, 137, and 165 in the data file mentioned elsewhere (Murphy, 1994). Each of these six random variables takes on two possible values {0, 1}. V7 is introduced to represent a flag. V7:0 indicates that an identified pattern is part of a spatial conformation that falls into the musk category, while V7:1 indicates otherwise. Table I lists 14 patterns of variable instantiation identified during the process of pattern analysis and their corresponding probabilities.

Remark The index i of P_i in Table I corresponds to the integer value whose binary representation is the instantiation of the variables (V1 V2 V3 V4 V5 V6 V7).
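The indexing convention of the remark can be sketched in a few lines (the function name is ours):

```python
def pattern_index(v1, v2, v3, v4, v5, v6, v7):
    """Index i of P_i: the integer whose 7-bit binary representation
    is the instantiation (V1 V2 V3 V4 V5 V6 V7), V1 most significant."""
    i = 0
    for bit in (v1, v2, v3, v4, v5, v6, v7):
        i = (i << 1) | bit
    return i

# e.g., the pattern V1..V7 = (0,1,0,1,0,0,1) corresponds to P41 (0101001
# in binary), and (0,0,1,1,1,1,0) to P30 (0011110 in binary).
print(pattern_index(0, 1, 0, 1, 0, 0, 1))  # 41
print(pattern_index(0, 0, 1, 1, 1, 1, 0))  # 30
```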
TABLE I Illustration of event patterns as constraints for probability model selection

V1 V2 V3 V4 V5 V6 V7   Pr(V1,V2,V3,V4,V5,V6,V7)
0  0  0  0  0  0  0    P0  = 0.03698
0  0  0  0  0  0  1    P1  = 0.0565
0  0  0  0  0  1  1    P3  = 0.0008
0  0  0  0  1  0  0    P4  = 0.0202
0  0  0  0  1  0  1    P5  = 0.0155
0  0  0  0  1  1  1    P7  = 0.0029
0  0  0  1  0  0  1    P9  = 0.00197
0  0  1  1  1  1  0    P30 = 0.0003
0  1  0  0  0  0  0    P32 = 0.00697
0  1  0  0  0  0  1    P33 = 0.00318
0  1  0  0  0  1  1    P35 = 0.00136
0  1  0  0  1  0  0    P36 = 0.00788
0  1  0  0  1  0  1    P37 = 0.0026
0  1  0  1  0  0  1    P41 = 0.0035

A pattern of variable instantiation that is statistically significant may appear as part of a spatial conformation that exists in both the musk and the non-musk categories; for example, the first two rows in the table above. As a result, the spatial conformation of a molecule may be modeled using the probability and statistical information embedded in the data to reveal the structural characteristics. One approach to representing the spatial conformation of a molecule is to develop a probability model that captures the probability information shown above, as well as the probability information shown below, to preserve significant statistical information existing in the data:

P(V1:0) = 0.59      P(V4:0) = 0.5215
P(V2:0) = 0.462     P(V5:0) = 0.42255
P(V3:0) = 0.416

Note that a probability model of musk is defined by a joint probability distribution of 128 terms; i.e., P0 . . . P127. In this example we have 20 constraints C0 . . .
C19; namely:

C0 : P0 = 0.03698        C7 : P30 = 0.0003
C1 : P1 = 0.0565         C8 : P32 = 0.00697
C2 : P3 = 0.0008         C9 : P33 = 0.00318
C3 : P4 = 0.0202         C10 : P35 = 0.00136
C4 : P5 = 0.0155         C11 : P36 = 0.00788
C5 : P7 = 0.0029         C12 : P37 = 0.0026
C6 : P9 = 0.00197        C13 : P41 = 0.0035

C14 : P(V1:0) = Σ_{V2,V3,V4,V5,V6,V7} P(V1:0, V2, V3, V4, V5, V6, V7) = 0.59
C15 : P(V2:0) = Σ_{V1,V3,V4,V5,V6,V7} P(V1, V2:0, V3, V4, V5, V6, V7) = 0.462
C16 : P(V3:0) = Σ_{V1,V2,V4,V5,V6,V7} P(V1, V2, V3:0, V4, V5, V6, V7) = 0.416
C17 : P(V4:0) = Σ_{V1,V2,V3,V5,V6,V7} P(V1, V2, V3, V4:0, V5, V6, V7) = 0.5215
C18 : P(V5:0) = Σ_{V1,V2,V3,V4,V6,V7} P(V1, V2, V3, V4, V5:0, V6, V7) = 0.42255
C19 : Σ_{V1,V2,V3,V4,V5,V6,V7} P(V1, V2, V3, V4, V5, V6, V7) = 1.0

The optimal model identified by applying the algorithm discussed in this paper is shown in Table II. Its expected Shannon entropy is -Σ_i P_i Log2 P_i = 6.6792 bits.

TABLE II A local optimal probability model of musk (each row lists eight consecutive joint probability terms)

P0-P7:     0.03698  0.0565   0.002036 0.0008   0.0202   0.0155   0.005729 0.0029
P8-P15:    0.003083 0.00197  0.003083 0.003083 0.006776 0.006776 0.006776 0.006776
P16-P23:   0.006269 0.006269 0.006269 0.006269 0.009963 0.009963 0.009963 0.009963
P24-P31:   0.007317 0.007317 0.007317 0.007317 0.01101  0.01101  0.0003   0.01101
P32-P39:   0.00697  0.00318  0.004879 0.00136  0.00788  0.0026   0.008572 0.008572
P40-P47:   0.005927 0.0035   0.005927 0.005927 0.00962  0.00962  0.00962  0.00962
P48-P55:   0.009113 0.009113 0.009113 0.009113 0.012806 0.012806 0.012806 0.012806
P56-P63:   0.01016  0.01016  0.01016  0.01016  0.013854 0.013854 0.013854 0.013854
P64-P71:   0.000497 0.000497 0.000497 0.000497 0.00419  0.00419  0.00419  0.00419
P72-P79:   0.001545 0.001545 0.001545 0.001545 0.005238 0.005238 0.005238 0.005238
P80-P87:   0.004731 0.004731 0.004731 0.004731 0.008424 0.008424 0.008424 0.008424
P88-P95:   0.005779 0.005779 0.005779 0.005779 0.009472 0.009472 0.009472 0.009472
P96-P103:  0.003341 0.003341 0.003341 0.003341 0.007034 0.007034 0.007034 0.007034
P104-P111: 0.004388 0.004388 0.004388 0.004388 0.008081 0.008081 0.008081 0.008081
P112-P119: 0.007575 0.007575 0.007575 0.007575 0.011268 0.011268 0.011268 0.011268
P120-P127: 0.008622 0.008622 0.008622 0.008622 0.012315 0.012315 0.012315 0.012315

7. EVALUATION PROTOCOL DESIGN

In Section 5 the prototype implementation of the algorithm as an ActiveX application was discussed. In this section the focus is on a preliminary evaluation of the ActiveX application. The evaluation was conducted on an Intel Pentium 133 MHz laptop with 32 M RAM and a hard disk with 420 M bytes of working space. The laptop was equipped with an Internet Explorer 4.0 web browser with ActiveX enabled. In addition, the laptop also had S-PLUS 4.5 installed, together with NUOPT, an add-on commercial tool for numerical optimization. The commercial optimizer NUOPT was used for the comparative evaluation. A total of 17 test cases, indexed as C1-C17 in Table III in the next section, are derived from three sources for a comparative evaluation. The first source is the Hock and Schittkowski problem set (Hock, 1980), which is a test set also used by NUOPT for its benchmark testing. The second source is a set of test cases which originated in real world problems. The third source is a set of randomly generated test cases. All 17 test cases, listed as "nexp1.dat", "nexp2.dat", . . . , "nexp17.dat", are accessible via item 8 of [www http://bonnet3.cs.qc.edu/jscs9902.html]. Seven test cases (C1-C7) are derived from the first source, abbreviated as STC (Ci) (the ith Problem in the set of Standard Test Cases of the first source). Four test cases originated from real world problems in different disciplines: analytical chemistry, medical diagnosis, sociology, and aviation. The remaining six test cases are randomly generated and abbreviated as RTCi (the ith Randomly generated Test Case). The Hock and Schittkowski problem set is comprised of all kinds of optimization test cases classified by means of four attributes.
The first attribute is the type of objective function, such as linear, quadratic, or general objective functions. The second attribute is the type of constraint, such as linear equality constraints, upper and lower bound constraints, etc. The third is the type of the problem, whether regular or irregular. The fourth is the nature of the solution; i.e., whether the exact solution is known (so-called 'theoretical' problems) or not known (so-called 'practical' problems). In the Hock and Schittkowski problem set, only those test cases with linear (in)equality constraints are applicable to the comparative evaluation. Unfortunately those test cases need two pre-processing steps, namely normalization and normality. These two pre-processing steps are necessary because the variables in the original problems are not necessarily bounded between 0 and 1, an implicit assumption for terms in a probability model selection problem. Furthermore, all terms must add up to unity in order to satisfy the normality property, which is an axiom of probability theory.

[TABLE III Comparative evaluation results. Columns: case index (C1-C17, with a/b variants when an initial guess is supplied); source of test case/application domain (STC (Pxx) for Hock and Schittkowski standard test cases, RTC1-RTC6 for randomly generated cases, plus census bureau/sociology study, chemical analysis (the example in Section 6), chemical equilibrium, medical diagnosis, and single-engine pilot training models); number of terms; number of non-trivial constraints; NUOPT entropy of the optimal model; prototype entropy of the optimal model; entropy upper bound estimate (theoretical bound Log2 n in parentheses); and whether an initial guess was provided.]

The second source consists of four test cases. These four cases (C9, C10, C16 and C17) originated from real world problems. The first case, C9, is from census data analysis for studying social patterns. The second case, C10, is from analytical chemistry for classifying whether a molecule is musk-like. The third case, C16, is from medical diagnosis. The last one is from aviation, illustrating a simple model of aerodynamics for single-engine pilot training. In addition to the seven "benchmark" test cases and the four test cases from real world problems, six additional test cases (C8, C11-C15) are included for the comparative evaluation. These six cases, indexed by RTCi (the ith randomly generated test case), are generated based on a reverse engineering approach that guarantees knowledge of a solution.
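One way to realize such a reverse engineering scheme (our sketch, not necessarily the paper's generator) is to sample a hidden distribution first and then emit constraints that the sample satisfies by construction, so the generated test case is guaranteed to have at least one solution:

```python
import random

def make_test_case(n_terms=8, n_constraints=3, seed=0):
    """Generate a solvable model selection test case by reverse engineering:
    draw a hidden distribution p, then derive linear equality constraints
    (random subset sums of p) that p satisfies by construction."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(n_terms)]
    total = sum(raw)
    p = [r / total for r in raw]              # hidden feasible solution
    constraints = []
    for _ in range(n_constraints):
        subset = [i for i in range(n_terms) if rng.random() < 0.5]
        constraints.append((subset, sum(p[i] for i in subset)))
    constraints.append((list(range(n_terms)), 1.0))  # normality constraint
    return p, constraints

p, cs = make_test_case()
# The hidden distribution satisfies every generated constraint exactly,
# so a solver handed only `cs` is guaranteed a non-empty feasible set.
ok = all(abs(sum(p[i] for i in subset) - rhs) < 1e-9 for subset, rhs in cs)
print(ok)  # True
```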
Note that the seven test cases from the Hock and Schittkowski problem set are not guaranteed to have a solution after the introduction of the normality constraint (i.e., all variables add up to one). Regarding the test cases originating from real-world problems, there is again no guarantee of the existence of solution(s). As a consequence, the inclusion of these six cases constitutes yet another test source that is important for the comparative evaluation.

8. PRELIMINARY COMPARATIVE EVALUATION

The results of the comparative evaluation are summarized in Table III. The first column in the table is the case index of a test case. The second column indicates the source of the test case. The third column is the number of joint probability terms in a model selection problem. The fourth column is the number of non-trivial constraints. In general, the degree of difficulty in solving a model selection problem is proportional to the number of joint probability terms in a model and the number of constraints. The fifth and the sixth columns are the expected Shannon entropy of the optimal model identified by the commercial tool NUOPT and by the ActiveX application, respectively. Recall that the objective is to find a model that is least biased, thus of maximal entropy, with respect to unknown information while preserving the known information stipulated as constraints. Hence, a model with a greater entropy value is a better model in comparison to one with a smaller entropy value. The seventh column reports the upper bound of the entropy of an optimal model. Two estimated maximum entropies are reported. The first estimate is derived based on the method discussed earlier (Steps 6 and 7). The second estimate (in parentheses) is the theoretical upper bound of the entropy of a model, Log2 n, where n is the number of probability terms (third column) in the model.
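For a concrete reading of columns five through seven, the expected Shannon entropy of a model and the theoretical Log2 n bound can be computed directly. The joint distribution below is made up for illustration.

```python
import math

# Expected Shannon entropy (columns five and six) and the theoretical
# upper bound Log2 n (seventh column, in parentheses) for a made-up
# model with n = 4 joint probability terms.

def shannon_entropy(p):
    """Expected Shannon entropy, in bits, of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]    # illustrative joint distribution
h = shannon_entropy(p)           # 1.75 bits
bound = math.log2(len(p))        # 2.0 bits, attained only by the uniform model

print(h, bound)  # 1.75 2.0
```

The gap between h and the Log2 n bound reflects how much the known constraints bias the model away from the uniform (maximally uncertain) distribution.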
Further details about the theoretical upper bound can be found elsewhere (Shannon, 1972). The last column indicates whether an initial guess is provided for the prototype software to solve a test case. The prototype implementation allows a user to provide an initial guess before the algorithm is applied to solve a test case (e.g., C3b, C7b, C12b, C14b, and C15b). There could be cases where other tools reach a local optimal solution that can be further improved; this feature provides flexibility to improve such a solution further.

9. DISCUSSION OF COMPARATIVE EVALUATION

As shown in Table III, both our prototype implementation and the commercial tool NUOPT solved 15 out of the 17 cases. Further investigation reveals that the remaining two test cases have no solution. For these 15 cases, both systems reach optimal solutions similar to each other in most of the cases. In one case (C16) the ActiveX application reached a significantly better solution than NUOPT, while NUOPT reached a significantly better solution in four cases (C3, C12, C14, C15). It is interesting to note that the ActiveX application actually improves on the optimal solution of NUOPT in one of these four cases (C3) when it uses the optimal solution of NUOPT as an initial guess in an attempt to further improve the solution. Referring to the seventh column, the result of estimating the upper bound entropy value of the global optimal model using the proposed dual formulation approach is less than satisfactory. In only three of the 15 solvable test cases does the proposed dual formulation approach yield a better upper bound than the theoretical upper bound, which does not consider the constraints of a test case.
Furthermore, in only one of the three cases is the estimated upper bound derived by the dual formulation approach significantly better than the theoretical upper bound. This suggests that the utility of the dual formulation for estimating an upper bound is limited, according to our test cases. It should also be noted that the proposed dual formulation fails to produce an upper bound in three of the 15 solvable cases (C7, C14, and C15). This is due to the fact that transposing the original constraint set turns the slack variables of the primal formulation into variables of the dual formulation that must be non-negative, yet the SVD cannot guarantee solutions in which those variables are non-negative. When the solution derived using SVD assigns negative values to the slack variables, the dual formulation fails to produce an estimate of the upper bound, which occurred three times in the 15 solvable test cases. In the comparative evaluation we chose not to report a quantitative comparison of run-time performance, for two reasons. First, our prototype implementation allows a user to control the number of iterations indirectly through a parameter that defines the size of the incremental step in the search direction of SVD, similar to that of the interior point method. The current setting is 100 steps in the interval of possible bounds in the linear search direction of SVD. When the number of steps is reduced, the speed of reaching a local optimal solution increases. In other words, one can trade the quality of the local optimal solution for speed in the ActiveX application. Furthermore, if one provides a ``good'' initial guess, one may be able to afford a large incremental step, which improves speed without much compromise on the quality of the solution. Therefore, a direct comparative evaluation of run-time performance would not be appropriate.
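The failure mode described above can be reproduced in a few lines: the minimum-norm solution of an underdetermined constraint system, as computed via SVD, satisfies the constraints but may assign a negative value to a component that plays the role of a non-negative slack variable. The small system below is invented for illustration.

```python
import numpy as np

# Illustrative reproduction of the SVD slack-variable failure mode:
# the minimum-norm (pseudoinverse, i.e., SVD-based) solution of an
# underdetermined system A z = b satisfies the constraints, yet may
# assign a negative value to a slack-variable component.
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [2.0, -1.0, 0.0, 1.0]])   # last two columns: slack variables
b = np.array([1.0, -1.0])

z = np.linalg.pinv(A) @ b      # minimum-norm solution computed via SVD
print(np.allclose(A @ z, b))   # True: the constraints are satisfied
print(z[2], z[3])              # the second slack variable is negative
```

Here z[3] = -4/17 < 0, so a dual formulation requiring non-negative slack variables would fail on this solution even though the algebraic system is solved exactly.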
The second reason not to make a direct comparative evaluation of run-time is the need to re-formulate a test case using SIMPLE (System for Interactive Modeling and Programming Language Environment) before NUOPT can ``understand'' the problem, and hence solve it. Since NUOPT optimizes its run-time performance by dividing the workload of solving a problem into two steps, and reports the elapsed time of only the second step, it is not possible to establish an objective ground for a comparative evaluation of run-time. Nevertheless, the ActiveX application solves all the test cases quite efficiently. As is typical of any ActiveX deployment, a one-time download of the ActiveX application from the Internet is required. It takes about five minutes to download using a 33.6 kbps modem via an ActiveX-enabled IE4 web browser. Afterwards, almost all the test cases can be solved instantly, except the last case (C17), in our computing environment, a Pentium 133 MHz laptop with 32 MB RAM and a 420 MB hard disk.

10. CONCLUSION

An algorithm for probability model selection is presented. It is found that probability model selection can be formulated as an optimization problem with linear (in)equality constraints and a non-linear objective function. The proposed algorithm adopts an approach similar to the primal-dual formulation of the interior point method. The theoretical development of the algorithm has led to a property that can be interpreted semantically as the weight of evidence in information theory. Our prototype implementation of the algorithm is web-deployable and can be accessed via an ActiveX-enabled browser. A preliminary comparative evaluation is made using a beta version of the NUOPT for S-PLUS commercial package.
Because of the nature of the problem and the use of browser technology, the comparative test cases are conducted on relatively small problems, but with non-trivial complexity due to high interactions (thus dependency) among the model parameters. In the comparative evaluation, it is noted that both the ActiveX implementation and NUOPT can solve most of the model selection problems. An interesting result is that on those problems that both our algorithm and NUOPT can solve, the optimality of the models identified by the ActiveX application and by NUOPT is comparable. There are still many interesting issues to explore for probability model selection problems. For example, any probability model selection problem has an inherent exponential complexity with respect to the number of random variables. One avenue for addressing this issue is to reduce the search space through parameter tuning (e.g., granularization) or transformation (e.g., mapping probability space to log probability space) if probability independence properties exist among the variables. Another interesting issue is the convergence and solvability of the optimization. There are probability constraint sets whose degrees of freedom correspond, in theory, to a permissible search space, yet the proposed algorithm and the existing commercial package may not solve them well. The relationship between the theoretical convergence rate and the solvability of a practical implementation is another interesting issue to explore. These issues will be the focus of our future study.

Acknowledgements

The author is grateful to the Associate Editor Dr. Morgan Wang and an anonymous reviewer for their comments, which helped to improve the manuscript. Professor David Locke of the Chemistry Department in Queens College provided technical proofreading and comments on the ``musk'' illustration. Ms.
XiuYi Huang, under the partial support of a grant from the PSC-CUNY Research Award, designed and implemented the web page that provides convenient entry points to the various resources mentioned in this paper. The NUOPT beta version used in this paper was made available through a beta tester site arrangement with MathSoft Inc. Preparation of the manuscript and the web hosting resources are supported in part by NSF DUE grant #97-51135.

References

Akaike, H. (1973) ``Information Theory and an Extension of the Maximum Likelihood Principle'', In: Proceedings of the 2nd International Symposium on Information Theory, Eds. Petrov, B. N. and Csaki, F., Budapest: Akademiai Kiado, pp. 267-281.
Borgwardt, K. H. (1987) The Simplex Method: A Probabilistic Analysis, Springer-Verlag, Berlin.
Chen, J. and Gupta, A. K. (1997) ``Testing and Locating Variance Change Points with Application to Stock Prices'', Journal of the American Statistical Association, 92(438), 739-747.
Chen, J. H. and Gupta, A. K. (1998) ``Information Criterion and Change Point Problem for Regular Models'', Technical Report No. 98-05, Department of Mathematics and Statistics, Bowling Green State University.
Good, I. J. (1960) ``Weight of Evidence, Correlation, Explanatory Power, Information, and the Utility of Experiments'', Journal of the Royal Statistical Society, Series B, 22, 319-331.
Gupta, A. K. and Chen, J. (1996) ``Detecting Changes of Mean in Multidimensional Normal Sequences with Applications to Literature and Geology'', Computational Statistics, 11, 211-221.
Hock, W. and Schittkowski, K. (1980) Test Examples for Nonlinear Programming Codes, Lecture Notes in Economics and Mathematical Systems 187, Eds. Beckmann, M. and Kunzi, H. P., Springer-Verlag, Berlin.
Johnson, G. D. (1999) ``Quantitative Characterization of Watershed-delineated Landscape Patterns in Pennsylvania: An Evaluation of Conditional Entropy Profiles'' (Abstract), Ninth Lukacs Symposium: Frontiers of Environmental and Ecological Statistics for the 21st Century, Bowling Green State University, Bowling Green, Ohio, April 1999.
Karmarkar, N. (1984) ``A New Polynomial-time Algorithm for Linear Programming'', Combinatorica, 4(4), 373-395.
Kuenzi, H. P., Tzschach, H. G. and Zehnder, C. A. (1971) Numerical Methods of Mathematical Optimization, Academic Press, New York.
Martin, D. (1998) Seminar in ``Financial Topics in S-PLUS'', MathSoft Inc., Washington D.C., October 1998.
Murphy, P. M. and Aha, D. W. (1994) UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine (second musk data set). http://www.ics.uci.edu/~mlearn/MLRepository.html
Schwarz, G. (1978) ``Estimating the Dimension of a Model'', The Annals of Statistics, 6, 461-464.
Shannon, C. E. and Weaver, W. (1972) The Mathematical Theory of Communication, University of Illinois Press, Urbana.
The NUOPT for S-PLUS Manual (1998) Mathematical Systems, Inc., October 1998.
Sy, B. K. (1999) ``Pattern-based Inference Approach for Data Mining'', Proceedings of the 18th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), New York, June 1999.
Wright, S. (1997) Primal-Dual Interior Point Methods, SIAM, ISBN 0-89871-382-X.

[www: http://bonnet3.cs.qc.edu/jscs9902.html]