Strategy optimization for controlled Markov
process with descriptive complexity constraint
JIA QingShan & ZHAO QianChuan†
Center for Intelligent and Networked Systems, Department of Automation, TNLIST, Tsinghua University, Beijing 100084, China
Due to various advantages in storage and implementation, simple strategies are usually preferred to complex strategies when their performances are close. Strategy optimization for controlled Markov processes with a descriptive complexity constraint provides a general framework for many such problems. In this paper, we first show by examples that the descriptive complexity and the performance of a strategy can be independent, and use the F-matrix in the No-Free-Lunch Theorem to show the risk that approximating complex strategies may lead to simple strategies that are unboundedly worse in cardinal performance than the original complex strategies. We then develop a method that handles the descriptive complexity constraint directly: it describes simple strategies exactly and approximates only complex strategies during the optimization. The ordinal performance difference between the resulting strategies of this selective approximation method and the global optimum is quantified. Numerical examples on an engine maintenance problem show how this method improves the solution quality. We hope this work sheds some light on solving general strategy optimization for controlled Markov processes with descriptive complexity constraints.
strategy optimization, controlled Markov process, descriptive complexity
1 Introduction
Strategy optimization for controlled Markov
process[1,2] provides a general framework for many
control, decision making, and optimization problems in real life. Besides the well-known difficulty
of large state space and large action space, the pervasive application of digital computers introduces
the constraint of limited memory space when using computer-based optimization or implementing
a strategy in computers in practice. Strategies that
can be stored in the given memory space and run in
reasonable time are called simple strategies; all others are called complex strategies. Simple strategies are easy to store and implement, contain fewer parameters, thus require less historical data and shorter training time, and usually generalize better in unknown environments. Due to these
various advantages, simple strategies are usually
Received January 27, 2009; accepted August 7, 2009
doi: 10.1007/s11432-009-0192-8
† Corresponding author (email: [email protected])
Supported by the National Natural Science Foundation of China (Grant Nos. 60274011, 60574067, 60704008, 60736027, 60721003, 90924001),
the New Century Excellent Talents in University (Grant No. NCET-04-0094), the Specialized Research Fund for the Doctoral Program of Higher
Education (Grant No. 20070003110), and the Programme of Introducing Talents of Discipline to Universities (the National 111 International
Collaboration Projects) (Grant No. B06002)
Citation: Jia Q S, Zhao Q C. Strategy optimization for controlled Markov process with descriptive complexity constraint. Sci China Ser
F-Inf Sci, 2009, 52(11): 1993–2005, doi: 10.1007/s11432-009-0192-8
preferred to complex strategies when their performances are close. These preferences can be modeled as descriptive complexity constraints[3]. So it
is of great practical interest to study strategy optimization for controlled Markov processes with descriptive complexity constraint.
The concept of Kolmogorov complexity (KC)[3]
quantifies the descriptive complexity of a strategy
mathematically. However, KC is in general incomputable (Theorem 2.3.2 in ref. [3]). Also note that
the descriptive complexity of a strategy is related to the actions at all states. This makes the actions at different states correlated. So the traditional policy iteration and value iteration cannot
be applied directly.
The preference for simple strategies is usually handled in two ways in practice. First, practitioners usually focus on strategies with specific structures or properties, say threshold-type strategies or parameterized strategies, and then solve a
parameter optimization problem to find a strategy. However, simple strategies do not necessarily
have good performance. Thus it is usually difficult
to quantify the performance difference among the
strategy obtained in the above way, the optimal
simple strategy, and the optimal strategy (not necessarily simple). Also the given memory space may
not be fully utilized.
A second way is to solve strategy optimization for controlled Markov processes without the
descriptive complexity constraint. If the resulting strategy is complex, we then approximate the
strategy by simple ones. However, as we will show
in section 3.1, there is no guarantee that the simple
strategy obtained in this way still has good performance. In fact, this simple strategy can be arbitrarily worse than the complex strategy in cardinal performance, which will be shown by examples
later.
In this paper, we first present the formulation of
the strategy optimization problem for controlled
Markov process with descriptive complexity constraint. Please note that the general dynamic programming problem can be handled in the same
way. Then we show by examples that the descriptive complexity constraint and the performance of
a strategy could be independent, and use the F-matrix in the No-Free-Lunch Theorem to show the
risk that approximating complex strategies may
lead to simple strategies that are unboundedly
worse in cardinal performance than the original optimal (complex) strategies. We then provide an
upper bound of the ordinal performance difference
between simple strategies and optimal strategies.
After that, we develop a method that handles the
descriptive complexity constraint directly. This
method uses support vector machine (SVM) to describe strategies in a succinct way, describes simple strategies exactly, and only approximates complex strategies during the optimization. The ordinal performance difference between the solution of
this selective approximation method and the global
optimum is quantified. We then use numerical examples on an engine maintenance problem to show
how this selective approximation method improves
the solution quality.
The rest of this paper is organized as follows.
The problem formulation is provided in section 2.
The bounds of performance difference between simple and complex strategies are provided in section
3. The SVM-based selective approximation is introduced in section 4. The numerical results on a
generic strategy optimization problem and an engine maintenance problem are shown in section 5.
We briefly conclude in section 6.
2 Problem formulation
We present the mathematical problem formulation
of strategy optimization for controlled Markov processes with descriptive complexity constraint in
this section. Let S be the state space and A(s)
be the action space at state s ∈ S. To simplify discussion, we consider discrete and finite S and A. A
strategy γ is a mapping from the state space to the
action space. Let Γ be the set of strategies. When
the system state is at s ∈ S at time t, and an action
a ∈ A(s) is taken, a cost c(s, a) is incurred, which
is assumed to take only finite values, i.e., there exists M > 0 such that |c(s, a)| < M for all s ∈ S
and a ∈ A. After the action is taken, the system
state transits to another state s′ with probability Pr{s_{t+1} = s′ | s_t = s, a_t = a}. The following criteria are usually considered in a controlled Markov process. The target cost is

    J(\gamma) = E\left[ \sum_{t=0}^{\min\{T \,:\, s_T \in S_T\}} c(s_t, \gamma(s_t)) \right],    (1)

which is the expected cost of the Markov process until the state reaches some target set S_T; the total cost is

    J(\gamma) = E\left[ \sum_{t=0}^{T-1} c(s_t, \gamma(s_t)) \right];    (2)

the discounted cost over an infinite horizon is

    J(\gamma) = \lim_{T \to \infty} E\left[ \sum_{t=0}^{T-1} \alpha^t c(s_t, \gamma(s_t)) \right],    (3)

where α is the discount factor, 0 < α < 1; and the average cost is

    J(\gamma) = \lim_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=0}^{T-1} c(s_t, \gamma(s_t)) \right].    (4)
Strategy optimization for controlled Markov process is to find a strategy γ ∈ Γ that minimizes one
of the above J(γ)’s.
The descriptive complexity constraint is modeled as

    C(\gamma \mid U) \le C_0,    (5)

where U is a given description mechanism (e.g., threshold-type strategies, neural networks, or the SVM of section 4, denoted by U_SVM), C(γ | U) is the shortest length of a program that implements strategy γ using description mechanism U, and C_0 represents the upper limit of descriptive complexity of the strategy that one can tolerate (e.g., the size of the given memory space) and is called the descriptive capacity. Note that C(γ | U) is also called the conditional KC. Thus, the problem we want to solve is

    \min_{\gamma \in \Gamma} J(\gamma) \quad \text{s.t.} \quad C(\gamma \mid U) \le C_0.    (6)

The meaning is to find, among all the strategies that can be stored in the given memory space, a strategy that minimizes the cost of the controlled Markov process.
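The criteria above are easy to evaluate numerically. As a minimal sketch (not from the paper), the following estimates the average cost (4) of a strategy by long-run simulation; the three-state chain, its costs, and its transition probabilities are made-up illustration data.

```python
import numpy as np

def average_cost(P, c, gamma_map, T=200000, seed=0):
    """Estimate the average cost (4) of strategy gamma_map by simulation.
    P[a][s, s'] : transition probability to s' from s under action a
    c[s, a]     : one-step cost
    gamma_map[s]: action taken at state s (a strategy is a map S -> A)
    """
    rng = np.random.default_rng(seed)
    n = c.shape[0]
    s, total = 0, 0.0
    for _ in range(T):
        a = gamma_map[s]
        total += c[s, a]
        s = rng.choice(n, p=P[a][s])
    return total / T

# Toy problem: 3 states, 2 actions, uniform transitions (so the average
# cost reduces to the mean of c(s, gamma(s)) over the states).
n, m = 3, 2
P = [np.full((n, n), 1.0 / n) for _ in range(m)]
c = np.array([[1.0, 0.0], [2.0, 5.0], [0.0, 3.0]])
gamma_map = np.array([1, 0, 0])  # action 1 in state 0, action 0 elsewhere
print(average_cost(P, c, gamma_map))  # close to (0 + 2 + 0)/3
```

Because the transitions here are uniform regardless of the action, the estimate should approach (1/|S|) Σ_s c(s, γ(s)) = 2/3, matching the closed form used in section 3.1.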
3 Bound of performance difference between simple and complex strategies
To solve the problem in (6), one may want to first
solve the problem without the constraint. If the
solution strategy is complex, one may approximate
this strategy by a simple one that satisfies the descriptive complexity constraint (5); and hope this
simple strategy also has good performance. However, as we will see in section 3.1, the simple strategy obtained in the above way may have poor performance. In fact, the performance and descriptive
complexity of a strategy are independent. In section 3.2, we provide an upper bound of the ordinal performance difference between the best simple
strategy and the global optimum.
3.1 Independence between performance
and descriptive complexity
We first use an example to show that the performance and descriptive complexity of a strategy are
independent. Consider a Markov chain with |S|
states. At each state there are |A| actions. The
choice of the action only affects the one-step cost
c(s, a), and does not affect the transition probability. We have Pr{s_{t+1} = s′ | s_t = s, a_t = a} = 1/|S| for all s, a, and s′. We consider the average cost criterion. The average cost of a strategy γ then is

    J(\gamma) = \lim_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=0}^{T-1} c(s_t, \gamma(s_t)) \right] = \frac{1}{|S|} \sum_{s \in S} c(s, \gamma(s)).

To describe a strategy in a digital computer, we assign
an index to each action. Then a strategy γ can
be regarded as a program, which outputs the index of γ(st ) when state st is inputted. As aforementioned the descriptive complexity of γ is defined as the length of the shortest program that
implements γ. So C(γ) is independent of c(s, γ(s)).
Thus the performance and descriptive complexity
of the strategy are independent. We use P =
{c(s, a), s ∈ S, a ∈ A} to denote a problem setting.
When we change c(s, a) in the problem setting, the
average cost of a strategy γ is changed but without
affecting the descriptive complexity of γ. This implies that the performance difference between the
simple and complex strategies could be arbitrarily
large. Mathematically speaking, we have the following.
Proposition 1. For all γ1, γ2 ∈ Γ with γ1 ≠ γ2, and for all M > 0, there exists a problem setting P s.t. J^P(γ1) − J^P(γ2) ≥ M, where J^P(γ) is the average cost of strategy γ under problem setting P.
Proof. We prove this proposition by constructing a problem setting P s.t. J^P(γ1) − J^P(γ2) = M. First, we specify c(s, γ2(s)) = 0 for all s ∈ S. Since γ1 ≠ γ2, there must exist at least one state s0 s.t. γ1(s0) ≠ γ2(s0). Let c(s0, γ1(s0)) = M|S|, and c(s, γ1(s)) = 0 for all s ∈ {s | γ1(s) ≠ γ2(s), s ≠ s0}. Then we have

    J^P(\gamma_1) - J^P(\gamma_2) = J^P(\gamma_1) = \frac{1}{|S|} \sum_{s \in S} c(s, \gamma_1(s)) = \frac{M|S|}{|S|} = M.    (7)

So P satisfies the requirement.
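The construction in the proof is easy to check numerically. The sketch below is illustrative only: the two strategies, the binary action set, and the value of M are arbitrary choices, and the uniform-transition chain of this subsection is assumed so that the average cost reduces to a mean over states.

```python
import numpy as np

def avg_cost(c, gamma):
    # Under uniform transitions, the average cost (4) reduces to the mean
    # one-step cost over states: J(gamma) = (1/|S|) sum_s c(s, gamma(s)).
    return np.mean([c[s, gamma[s]] for s in range(len(gamma))])

def build_setting(gamma1, gamma2, M):
    """Construct the problem setting P from the proof of Proposition 1."""
    S, A = len(gamma1), 2
    c = np.zeros((S, A))
    diff = [s for s in range(S) if gamma1[s] != gamma2[s]]
    s0 = diff[0]
    c[s0, gamma1[s0]] = M * S   # the single expensive (state, action) pair
    return c                    # all other costs stay zero

gamma1 = [0, 1, 0, 1]
gamma2 = [0, 0, 0, 1]   # differs from gamma1 only at state 1
M = 7.0
c = build_setting(gamma1, gamma2, M)
print(avg_cost(c, gamma1) - avg_cost(c, gamma2))  # exactly M = 7.0
```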
An implication of Proposition 1 is that when we
approximate a complex strategy by a simple strategy, the performance of the simple strategy could
be much worse than that of the complex strategy. In the following, we use the Fundamental
matrix (F-matrix for short) in the No-Free-Lunch
Theorem[4] to further explain this.
An n-row F-matrix is defined as a matrix that 1) contains n rows and 2^n columns; 2) each entry is either 0 or 1; and 3) no two columns are the same.
Each column of the n-row F-matrix represents a
Boolean function f (x) with x taking n values. An
example of a 3-row F-matrix is shown in Figure 1.
          f1  f2  f3  f4  f5  f6  f7  f8
    x1     0   0   0   0   1   1   1   1
    x2     0   0   1   1   0   0   1   1
    x3     0   1   0   1   0   1   0   1

Figure 1  A 3-row Fundamental matrix.
One important property of the F-matrix is that
each row contains equal number of 0’s and 1’s. The
F-matrix is used to prove the No-Free-Lunch Theorem in ref. [4]. We use this concept to show why
approximating complex strategies may lead to simple strategies with bad performances.
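Such a matrix is straightforward to generate programmatically; a minimal sketch follows (the lexicographic column ordering is our arbitrary choice), together with a check of the equal-0's-and-1's property of each row.

```python
import numpy as np
from itertools import product

def f_matrix(n):
    """Build the n-row Fundamental matrix: its 2**n columns enumerate
    every distinct {0,1} column of height n (here in lexicographic order)."""
    cols = list(product([0, 1], repeat=n))
    return np.array(cols).T          # shape (n, 2**n)

F = f_matrix(3)
print(F.shape)                       # (3, 8), as in Figure 1
# Key property used in the No-Free-Lunch argument:
# every row contains equally many 0's and 1's.
print(bool((F.sum(axis=1) == 2**3 // 2).all()))  # True
```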
Let each row of the F-matrix represent a strategy γ and each column represent a problem setting P. We say a strategy has satisfying performance if J(γ) ≤ J0, where J0 is a constant defined by the user. If strategy γi has satisfying performance in problem setting Pj, we let the (i, j)th entry of the F-matrix be 1; otherwise, 0. Following this formulation, all the problem settings can be classified into 2^n types (where n = |Γ| is the number of strategies of interest). Proposition 1 shows
that each type contains at least one problem setting. When the descriptive complexity constraint
is given, some rows represent simple strategies and
the remaining rows represent complex strategies. When
the value of C0 changes, the number of simple
strategies and complex strategies may change, respectively. However, we can always find a problem
setting in which all complex strategies have satisfying performance and all simple strategies have
nonsatisfying performance, and vice versa. It is then clear that if we do not know which type of problem we are dealing with, the performance difference between the simple and complex strategies could be arbitrarily large in cardinal values.
In this subsection we have shown that if the performance of a simple strategy is not considered
when approximating a complex strategy, this could
lead to a performance degradation that is arbitrarily large. This shows the importance of considering
the performance of simple strategies during the approximation.
3.2 Ordinal performance bounds for simple strategies
In section 3.1, we see the difficulty to quantify the
performance difference between simple and complex strategies in cardinal values. In this subsection, we take another viewpoint and focus on the
ordinal performance of simple strategies, i.e., if we
rank all the simple and complex strategies according to a performance criterion J from small to
large, what is the rank of the best simple strategy? Let N_simple be the number of simple strategies. Then the best simple strategy is ranked at least top-(|Γ| − N_simple + 1), where |Γ| is the total number of
strategies including both simple and complex ones.
This upper bound is conservative since it assumes
all the simple strategies have worse performance
than the complex strategies. Although this rank
does not tell us the performance difference between
the best simple strategy and the global optimum
in cardinal values, it provides us some information
on the global goodness of such strategies.
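This bound is trivial to compute. As a preview using numbers that arise in Example 1 of section 5, where |Γ| = 2^18 strategies exist in total and 115140 of them are simple at descriptive capacity C0 = 7:

```python
def worst_case_rank(num_strategies, num_simple):
    """Upper bound on the rank of the best simple strategy: even if every
    simple strategy were worse than every complex one, the best simple
    strategy would still be top-(|Gamma| - N_simple + 1)."""
    return num_strategies - num_simple + 1

# Numbers from Example 1 in section 5: 2**18 strategies in total,
# 115140 of them describable within 7 support vectors.
print(worst_case_rank(2**18, 115140))  # 147005
```

The value 147005 is exactly the ordinal performance reported for the SVM description method at C0 = 7 in Table 2, where case 1 realizes this worst case.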
In this section, we have seen that approximating
complex strategies may not produce simple strategies with satisfying performance. Searching within
strategies with specific structures means to search
within a subset of all the simple strategies, and
thus may not obtain a good simple strategy. These
two existing methods fail to solve the strategy optimization for controlled Markov process with descriptive complexity constraint. We thus need to
handle the descriptive complexity constraint directly. This means to identify which strategies are
simple and which are complex.
4 SVM-based selective approximation for
strategies
Given a description mechanism, only some strategies can be described exactly in the limited memory
space. To fully utilize the given memory space, we
should calculate the minimal memory space to describe a strategy using the given description mechanism (i.e., the conditional KC). For those simple strategies (i.e., strategies that can be described
within the given memory space), we store them exactly. For other strategies (called complex strategies) we can only store them approximately. This
will be called selective approximation, and is discussed in detail in section 4.2. In this way, we
can explore more strategies than other methods,
and thus improve the solution quality. As a first
attempt in this research direction, we show how to
implement this selective approximation method for
a specific description mechanism, which is based on
the support vector machine (SVM)[5] . For other
description mechanisms, for example, neural networks used in neuro-dynamic programming[6] , the
implementation procedure provided in this paper
supplies a guide.
We note that the task of describing a strategy
is essentially a classification problem with suitably chosen features. There are many tools for classification. We adopt SVM, which has the following specific advantages.
1. There is a one-to-one correspondence between
the number of support vectors and the conditional
KC of a strategy. So we can distinguish simple and
complex strategies.
2. It is easy to determine how many bits are
required to store one support vector. So giving
memory space in unit of bits, we can directly tell
how many support vectors can be stored. In this way, we can answer how many strategies
can be described exactly and what these strategies
are.
3. The calculation of support vectors and parameter values is a quadratic optimization problem
(details follow). There are standard algorithms to
solve this problem. There are not many parameters
to tune by heuristics.
4. SVM can be regarded as a generalized threshold type description technique in the following
sense. Threshold type of strategy approximates
the original strategy well only when that strategy is
“piecewise constant” (i.e., the actions corresponding to several neighboring states are the same) and
the number of break points is no greater than the
number of thresholds. SVM uses a similar technique (details follow) but permits nonlinear transformations of (state, action) pairs. Furthermore, this nonlinear transformation is done implicitly. This enables us to describe more strategies than the threshold type.
5. For practical application, the state space may
be extremely large so that it is computationally infeasible to explicitly list the entire (state, action)
pairs of a strategy. The good generalization ability
of SVM is well known. This supplies the possibility to encode (approximately) the strategy based
on some instead of all the (state, action) pairs.
4.1 SVM-based description mechanism
Consider a two-class classification problem. Let
{(xi , yi )}, i = 1, 2, . . . , N be the set of training samples, where xi is a row vector indicating the input
pattern of the sample, and yi is a scalar taking values from {−1, +1}, −1 for the first class and +1
for the second class. We want to find a hyperplane that can classify the training samples correctly and has good generalization. According to
the structural risk minimization principle, this optimal hyper-plane should maximize the margin to
both classes of training samples. We use Figure 2
to illustrate this.
In Figure 2, black points represent training samples from class 1 and crosses represent samples
from class 2. The solid line represents the optimal
hyper-plane. The dashed lines are parallel to the
Figure 2  The optimal hyper-plane should separate the training samples with the maximal margin.
solid line. And there are no training samples between the dashed and solid lines. The margin is
the distance between the dashed and solid lines.
The two margins are equal. To maximize the two
margins, each dashed line should cross at least one
training sample in one class, which is marked by
circle. These training samples are called support
vectors. A support vector machine is a decision
function based on the sign of

    f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x_i, x) + b,    (8)

where m (m ≤ N) is the number of support vectors (those whose multiplier αi is non-zero), K(xi, x) is the kernel function mapping the input vectors into a feature space, and b is the bias term[7]. By letting the hyper-plane pass through the origin, we can let b = 0. The coefficients αi are obtained by solving the following dual form of the quadratic optimization problem[7,8]: maximize

    w(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (9)

subject to the constraints

    \alpha_i \ge 0, \quad i = 1, 2, \ldots, N.    (10)

This particular SVM optimization problem formulation is called the hard margin formulation without threshold (b = 0), where no training errors are allowed. So basically SVM is a kernel-function-based function approximation technique. The idea is: when the training samples are not linearly separable, SVM uses a kernel function to map the training samples into a high-dimensional space in which they are linearly separable. One advantage of SVM is that this mapping is done implicitly by the kernel function. We use a simple example to show how SVM can describe a strategy succinctly.

Suppose the state space is directly observable and there are 4 states in total, denoted by 1 through 4. There are two actions to choose for each state, represented by 0 and 1. Then, using a lookup table, a strategy can be represented by a 4-bit sequence. For example, “0011” represents the strategy that takes action 0 for states 1 and 2, and action 1 for states 3 and 4. Using SVM, we can find two support vectors, action 0 at state 2 and action 1 at state 3, and an optimal hyper-plane, as shown in Figure 3.
Figure 3  Use SVM to describe strategy “0011”.
In Figure 3, the horizontal axis represents the
input pattern (state), and the vertical axis represents the label (action). We use circles to mark
the support vectors. The solid line represents the
optimal hyper-plane found by SVM. Using the two
support vectors, we can recover strategy “0011”.
Note that if we use a lookup table to represent this
strategy, we need to store four (state, action) pairs
(e.g., four support vectors). Using SVM, we need
to store only two support vectors. SVM thus gives a more succinct way to represent strategy “0011”.
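This 4-state example can be reproduced with a few lines of code. The sketch below is not the paper's toolbox computation: the degree-2 polynomial kernel and the projected-gradient solver for the dual (9)-(10) are our own assumptions, intended only to show that a kernel SVM with b = 0 can recover the labels of strategy “0011”.

```python
import numpy as np

# States 1..4 as inputs; strategy "0011" as labels (-1 for action 0, +1 for action 1).
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = (1.0 + np.outer(X, X)) ** 2        # polynomial kernel (degree 2 is an assumed choice)
Q = np.outer(y, y) * K                 # Q_ij = y_i y_j K(x_i, x_j)

# Maximize w(alpha) = sum(alpha) - 0.5 * alpha @ Q @ alpha  s.t.  alpha >= 0,
# i.e., the hard-margin dual (9)-(10) with b = 0, by projected gradient ascent.
alpha = np.zeros(4)
lr = 1.0 / np.linalg.eigvalsh(Q).max()
for _ in range(300_000):
    alpha = np.maximum(0.0, alpha + lr * (1.0 - Q @ alpha))

f = (alpha * y) @ K                    # decision values f(x_i) at the four states
print(np.sign(f))                      # recovers the labels of "0011"
print(int((alpha > 1e-6).sum()))       # number of (approximately) active support vectors
```

At the dual optimum every margin satisfies y_i f(x_i) ≥ 1, so the sign of f reproduces the strategy; only the non-zero αi (the support vectors) need to be stored.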
When the kernel function is fixed and all the
strategies can be recovered using SVM (i.e., no
training error on all strategies), we use the minimal
number of support vectors to evaluate C(γ | USVM )
for strategy γ.
Note that although the optimal hyper-plane is
unique1) (see p. 402 in ref. [5]), the related optimal number of support vectors may not be unique
(see p. 406 in ref. [5]). So when we find several support vectors that can be used to recover a
strategy, this number is an upper bound estimate
of C(γ | USVM ) of that strategy. As we will see in
section 5, using this upper bound estimate we can
already discover interesting effects of descriptive capacity (i.e., the given memory space) and improve
solution quality. To simplify the discussion, we do
not distinguish the upper bound estimate and the
true C(γ | USVM ) in the rest of this paper.
4.2 Selective approximation
To be specific, we propose the following two-step
technique (one example of the selective approximation) to deal with the constraint of limited memory
space.
The two-step technique
Step 1. We use SVM to describe strategies succinctly, and use the number of support vectors to
measure the conditional KC of each strategy.
Step 2. By comparing the conditional KC of
each strategy and the given descriptive capacity,
we know what strategies can be described exactly (simple strategies) and what cannot (complex
strategies). Then for simple strategies, we describe
exactly; for complex strategies, we use approximation techniques. (An approximation technique
represents a group of strategies (instead of one
strategy) together. The strategies within the same
group cannot be distinguished from each other.
One example of the approximation techniques is
discussed in section 5.) In this way, we can utilize the descriptive capacity more fully, and find a better solution.
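The two steps above can be sketched in code for 4-state binary strategies. This is only a stand-in implementation: the complexity measure used here (number of action switches plus one) replaces the SVM support-vector count of Step 1, and grouping complex strategies by a shared prefix replaces the lookup-table approximation used in section 5.

```python
from itertools import product

def complexity(strategy):
    """Stand-in for C(gamma | U_SVM): number of switches in the action
    sequence plus one, a crude proxy for the support-vector count;
    the paper measures this with an actual SVM fit."""
    switches = sum(a != b for a, b in zip(strategy, strategy[1:]))
    return switches + 1

def two_step(strategies, C0, n_fixed):
    """Step 1: measure the conditional KC of each strategy.
    Step 2: describe simple strategies exactly; lump complex ones into
    groups that share their first n_fixed symbols."""
    simple, groups = [], {}
    for s in strategies:
        if complexity(s) <= C0:
            simple.append(s)                              # described exactly
        else:
            groups.setdefault(s[:n_fixed], []).append(s)  # approximated
    return simple, groups

strategies = ["".join(bits) for bits in product("01", repeat=4)]
simple, groups = two_step(strategies, C0=2, n_fixed=3)
print(len(simple), len(groups))  # 8 simple strategies, 6 groups of complex ones
```

Every simple strategy is individually distinguishable, while complex strategies can only be selected group by group, mirroring the discussion of section 5.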
5 Numerical results
In this section, we first consider a generic strategy
optimization problem, in which all strategies represented by binary sequences are feasible. Then we
consider an engine maintenance problem, in which
only the strategies represented by some Boolean
sequences are feasible.
5.1 Example 1: a generic strategy
optimization problem
In this subsection, we first show how descriptive
capacity (i.e., C0 ) affects the number of strategies
that can be explored. Then we consider the case
with insufficient descriptive capacity and show that
selective approximation performs better than total approximation and no approximation (defined
later). After that we show how to give an upper
bound estimate for the ordinal difference between
the strategy that is found by the proposed method
and the global optimum (not necessarily describable in the given memory space).
Encoding is an important factor that affects the
performance of a description mechanism. For example, assume there are 4 states in total, and we
consider only binary actions at each state. If the
optimal strategy is “0101” in one encoding, then
we can exchange the index of action “0” and “1”
in states 2 and 4. This will give us a new encoding, in which the optimal strategy is “0000”. This
means if we exchange the indexes of some states
and the indexes of some actions, then we can find
one encoding method, in which the optimal strategy can be easily described. So before we study the
performance of a description mechanism, we must
first fix the encoding. In all the following discussion, we assume the encoding is fixed.
Consider the following strategy optimization
problem. There are 9 states and 4 actions for each
state (represented by “00”, “01”, “10”, and “11”).
In a lookup table, we can use an 18-bit binary sequence (i.e., 9 support vectors) to represent a strategy.
Within this section, we assume that
Assumption 1. We can exactly evaluate a
strategy, i.e., when strategy γ is specified, we know
J(γ) exactly.
If only the noisy performance evaluation is available, the truly optimal strategy that has been explored may not be selected as the result. Then the
following analysis shows the best that each method
1) We assume there is an optimal hyper-plane that classifies all training samples correctly (this is called the separable case). If there
is no such optimal hyper-plane (this is called the non-separable case), we can introduce punishment for each training error, and solve a
similar quadratic optimization problem to find the support vectors and corresponding coefficients (see pp. 408-412 in ref. [5]).
can do.
Effect of descriptive capacity on exploration. There are 2^18 = 262144 strategies in total. We use SVM to calculate the conditional KC
of each strategy. Many free SVM toolboxes are available on-line; we use the OSU SVM toolbox in Matlab[9], with the polynomial kernel function and the default parameter settings of that software. To describe a strategy
exactly, we need at most 9 support vectors and at
least 0 support vector2) . We increase the descriptive capacity (C0 ) from 1 to 9, where C0 is in the
unit of the number of support vectors that can be
stored. For each C0 , we count how many strategies
can be described exactly (denoted as |ΓC0 |) and
show in Table 1.
Table 1  Exploration under different descriptive capacities in Example 1

    C0    |ΓC0|        C0    |ΓC0|
    1     10           6     48130
    2     230          7     115140
    3     1290         8     210382
    4     4410         9     262144
    5     16290
From Table 1 we see that most strategies have a high conditional KC (greater than 5). This
means if the descriptive capacity is small (e.g., 3),
we can explore only a small portion of the strategies. When the descriptive capacity is given, we
can immediately tell how many strategies can be
described exactly and what these strategies are.
For example, when the descriptive capacity is 7,
from Table 1, we can explore only 115140 strategies. Then by calculating the conditional KC of
each strategy and comparing with this descriptive
capacity, we know what these 115140 strategies are.
We show how this affects the solution quality in the
next subsection.
Compare selective approximation, total
approximation, and no approximation. Selective approximation, total approximation, and no
approximation are three different types of description techniques. We define the following techniques
as representations of each type.
1. SVM description (no approximation). We
use a succinct description of each strategy. When
the descriptive capacity is insufficient, we only explore those strategies that can be described exactly.
We use the ordinal index (1 for the best and 2^18 for
the worst) of the strategy to evaluate the solution
quality.
2. Lookup table description (total approximation). In a lookup table, we need 18 bits (i.e., 9
support vectors) to describe one strategy exactly.
Each 2 bits represent one action for a state. In this
sense a lookup table uses 9 support vectors to recover each strategy. When the descriptive capacity
is insufficient, we cannot specify one strategy, but
a group of strategies. For example, if the descriptive capacity allows storing only 7 support vectors, then we cannot describe strategy “000000000” but only “0000000XX”, where each X is any of the 4 actions, chosen randomly with equal probability. In other words, we can distinguish only 4^7 groups of
strategies. We use the average ordinal index of
strategies within the same group to represent the
performance of that group. In this case to solve the
strategy optimization problem, we select the best
group as the solution, and regard the corresponding average ordinal index as the solution quality.
A lookup table is an approximation technique in
the following sense. When we cannot use a lookup
table to describe one strategy exactly, we only distinguish among different groups of strategies. This
means we use a simpler representation (e.g., 7 support vectors) to approximate a group of strategies
(e.g., each 7-support-vector representation covers 16 different strategies).
3. Two-step technique (selective approximation). The two-step technique is a combination
of the SVM description and the lookup table description. When the descriptive capacity is insufficient (e.g., allow to store at most 7 support vectors), we separate all strategies into two classes:
simple and complex. For simple strategies, we use
SVM to describe exactly. For complex strategies,
we use the above lookup-table-based technique to
describe approximately, i.e., within each of the 4^7
2) When the actions at all states are identical, we only need to record one action and need no support vector. In all other cases, we
need at least two support vectors to decide a hyper-plane.
groups of strategies, some may be simple and can
be described exactly using the SVM description
method. After excluding these simple strategies
from a group, we cannot distinguish the remaining strategies within the same group. In this way, we can distinguish each simple strategy and different groups
of complex strategies. (Note that we still cannot distinguish the complex strategies within the
same group.) Then we compare and select the best
(group of) strategy(-ies) as the solution. Similarly,
the (average) ordinal index is used to evaluate the
solution quality of a group of complex strategies.
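A minimal sketch of this selection step, assuming a toy complexity measure (one "support vector" per change of action along the state ordering) and hypothetical lexicographic ordinal indices in place of the SVM-based conditional KC and the true performance ranks:

```python
from itertools import product

C0 = 7  # descriptive capacity, in number of support vectors (toy setting)

def complexity(s):
    # Hypothetical surrogate for conditional KC: one support vector per
    # position where the action differs from its predecessor.
    return 1 + sum(a != b for a, b in zip(s, s[1:]))

strategies = ["".join(s) for s in product("0123", repeat=9)]
ordinal = {s: i + 1 for i, s in enumerate(strategies)}  # toy ordinal performance

candidates = []  # (ordinal index, description)
groups = {}      # complex strategies, grouped by the 7 describable entries
for s in strategies:
    if complexity(s) <= C0:
        candidates.append((ordinal[s], s))            # simple: described exactly
    else:
        groups.setdefault(s[:7], []).append(ordinal[s])
for prefix, idx in groups.items():
    candidates.append((sum(idx) / len(idx), prefix + "XX"))  # group average

best = min(candidates)
print(best)  # "000000000" has complexity 1 here, so it is found exactly
```

The point of the sketch is the bookkeeping: simple strategies compete individually, while each group of complex strategies competes through its average ordinal index.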
The cost function could be any criterion mentioned in section 2. When comparing the different description methods, we consider the following two extreme cases:
1) A strategy with higher conditional KC also has better performance (i.e., a smaller value of the performance function).
2) A strategy with smaller conditional KC has better performance.
We compare the three description methods in both cases and show the numerical results in Tables 2 and 3. In Table 2, each number in the SVM, lookup table, and two-step columns is the (average) ordinal performance of the solution under the corresponding descriptive capacity. For example, when the descriptive capacity is 7, if we use the SVM description method, the best strategy we find is top-147005 among all 262144 strategies. If we use the lookup table description method, the best group of strategies we find is top-9884.6. If we use the two-step technique, we find the top-4703 strategy on average.
Table 2  Comparison of SVM, lookup table, and two-step technique in case 1

C0    SVM      Lookup table    Two-step
1     262135   99794.2         99789.3
2     261915   76159.8         76125.7
3     260855   48561.2         48561.2
4     257735   37346.2         37346.2
5     245855   28344.2         28344.2
6     214015   22523.1         10354.0
7     147005   9884.6          4703.0
8     51763    18.5            1.5
9     1        1.0             1.0
In case 2, the best strategy is the simplest to describe. The SVM and the two-step technique can find this best strategy under all descriptive capacities. Under Assumption 1, the SVM and the two-step technique always select this best strategy as the final solution.
Table 3  The solution quality of lookup table in case 2

C0    Lookup table
1     99965.7
2     72185.6
3     31658.3
4     13143.1
5     3735.4
6     1456.8
7     137.3
8     9.3
9     1.0
In Table 2, the lookup table method outperforms the SVM method. On the contrary, the SVM method outperforms the lookup table method in Table 3. In both tables, the two-step technique is always the best under all descriptive capacities. This is reasonable, because the SVM description and the lookup table description are special cases of the two-step technique, and enlarging the search region does not hurt the solution of an optimization problem.
We have used numerical examples to show how measuring conditional KC helps to fully utilize the descriptive capacity, and how this improves the solution quality. In the following, we show how to give a reasonable estimate of the ordinal difference between the strategy that the two-step technique finds and the global optimal strategy (not necessarily describable in the given memory space). Since we have no a priori knowledge about the performance of a strategy before evaluating it, we can only give a conservative estimate: the strategy we found is the best among all the strategies that we evaluated, and we can say nothing more than this.
Following this idea, let N_exact denote the number of strategies that can be described exactly, and N_approximate denote the number of groups of strategies that can only be described approximately (under that descriptive capacity, we can only distinguish among different groups but not among strategies within a group). When we use the two-step technique, the solution we find is at least top-(|Γ| − N_exact − N_approximate + 1). For Example 1, we show the corresponding values in Table 4.
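The conservative guarantee is a one-line computation. The example call below uses the C0 = 9 case of Example 1, where all 262144 strategies are describable exactly, so the bound is top-1 as in Table 4:

```python
def ordinal_bound(total, n_exact, n_approximate):
    # The two-step technique evaluates n_exact individual strategies and
    # n_approximate groups, so its solution is guaranteed to be at least
    # top-(total - n_exact - n_approximate + 1) among all `total` strategies.
    return total - n_exact - n_approximate + 1

# C0 = 9 in Example 1: every strategy is describable exactly, no groups remain.
print(ordinal_bound(262144, 262144, 0))  # 1
```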
In Table 4, the P column shows the upper bound estimate of the ordinal performance (P) of the strategy found by the two-step technique under different descriptive capacities. For example, when the descriptive capacity is 8, the strategy found by the two-step technique is at least top-28992. The true ordinal performance is top-1.5 in case 1 and top-1 in case 2 (please refer to Table 2).
Table 4  The upper bound estimate of ordinal performance of the solution found by the two-step technique in Example 1

C0    P
1     262131
2     261899
3     260791
4     257479
5     244831
6     209967
7     133093
8     28992
9     1
By comparing the two-step results in Table 2 with the estimates in Table 4, we find that the estimates in Table 4 are conservative. If we can incorporate problem information, we can possibly give a tighter upper bound estimate. However, this is beyond the scope of this paper.
5.2 Example 2: an engine maintenance problem
Due to its significance in many industrial and military areas, the optimization of maintenance problems has been extensively studied in the past several decades[10−13], and more recently in refs. [14, 15]. Engine maintenance strategy optimization is one such problem, which seeks the maintenance strategy with the minimal cost over a contract duration or on average.
There are many components in an engine, with different new lifetimes and prices. When the engine is working, the remaining lifetimes of the components decrease. When a component expires (i.e., its remaining lifetime is zero) or an emergent failure happens (e.g., the engine gets stuck), the engine stops working and is then sent to the workshop, which causes a shop visit. During the shop visit, the engine is disassembled into components. A maintenance strategy determines which components are to be replaced by new ones. After the replacement, the engine is assembled and shipped back to work. The next expiration of a component or the next emergent failure causes the next shop visit. The cost of a shop visit consists of
two parts: the shipment cost of the engine and
the prices of the replaced new components. Many
engines are expensive. So the manufacturer usually signs a contract with the customer to cover
the maintenance cost of the engine for a couple of
years. Then the manufacturer wants to find the
best maintenance strategy that has the minimal
cost during the contract duration or on the average.
The difficulty of this engine maintenance strategy
optimization problem is well known[16,17] . Besides
the difficulty of large state space and large action
space, the constraint of limited memory space is
also usually met in practice, and is the focus of
this paper.
Suppose there are n components; the new lifetime of each component is d days; and we consider only stationary strategies. Then the state is the remaining lifetime of each component during a shop visit, which is a 1-by-n row vector. The size of the state space is |S| = d^n. The action is a 1-by-n Boolean vector, where "1" means to replace the component and "0" means not to. The size of the action space is |A| = 2^n. To store such a strategy in a lookup table, we need to store d^n (state, action) pairs. It is trivial to show that each such pair needs n⌈log2 d⌉ + n bits, where ⌈·⌉ is the ceiling function. Thus to store all the pairs, we need (n⌈log2 d⌉ + n)d^n bits. When d = 100 and n = 10, this value is 9.625 × 10^11 gigabytes (GB), which exceeds the memory space of any computer today, so far as we know. So it is important to answer the following question: how can we fully utilize the given memory space?
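This storage count is easy to check numerically; the function below is a direct transcription of the formula above (the exact gigabyte figure depends on the rounding convention, but the order of magnitude agrees with the value quoted in the text):

```python
import math

def lookup_table_bits(n, d):
    # Each of the d**n (state, action) pairs needs n*ceil(log2 d) bits for the
    # state vector plus n bits for the Boolean action vector.
    bits_per_pair = n * math.ceil(math.log2(d)) + n
    return bits_per_pair * d**n

bits = lookup_table_bits(10, 100)
print(bits)              # 8e21 bits in total
print(bits / 8 / 2**30)  # on the order of 10**12 GB
```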
Since in an engine maintenance problem the size of the strategy space (2^(n d^n)) increases faster than exponentially in n, for proof of concept, we consider
the following simple case. There are 2 components
and the new lifetime of each component is 3 days,
i.e., n = 2 and d = 3. There are 9 states and
4 actions (represented by “00”, “01”, “10”, and
“11”) in total. In the lookup table, we can use an
18-bit binary sequence (i.e., 9 support vectors) to
represent a strategy. Suppose at each time unit the
engine has probability 0.1 to fail. The failure of the
engine does not affect the remaining lifetime of the
components. For example, we consider an aircraft
engine. A bird may hit the engine and we need to
send the engine to the workshop to clean the mess.
When some component expires or the engine fails,
a shop visit happens and a maintenance strategy
determines which component to replace. Each shop
visit causes a shipping cost of 1. Components 1 and
2 have a price of 2 and 3, respectively. We want
to minimize the daily average maintenance cost,
and use a simulation of 1000 days to estimate this
average cost of a strategy.
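The evaluation just described can be sketched as a short Monte Carlo routine. The timing conventions (lifetimes decrease first, then an expiration or emergent failure triggers a shop visit) are assumptions of this sketch, and `strategy` is any map from states to feasible replacement actions:

```python
import random

PRICES = (2, 3)   # prices of components 1 and 2
SHIP_COST = 1     # shipping cost per shop visit
D = 3             # new lifetime of each component, in days
FAIL_P = 0.1      # daily probability of an emergent failure

def average_daily_cost(strategy, days=1000, seed=0):
    """Estimate the daily average maintenance cost of a strategy by simulation.
    `strategy` maps a state (r1, r2) of remaining lifetimes to a replacement
    action (a1, a2); feasibility requires replacing every expired component."""
    rng = random.Random(seed)
    life = [D, D]
    total = 0.0
    for _ in range(days):
        life = [r - 1 for r in life]
        if 0 in life or rng.random() < FAIL_P:   # expiration or emergent failure
            action = strategy(tuple(life))
            total += SHIP_COST
            for i, a in enumerate(action):
                if a:                            # replace component i with a new one
                    total += PRICES[i]
                    life[i] = D
    return total / days

# A simple feasible strategy: replace both components at every shop visit.
print(average_daily_cost(lambda state: (1, 1)))
```

Seeding the generator makes the estimate reproducible; a daily cost can never exceed the worst case of one fully priced shop visit per day (1 + 2 + 3 = 6).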
Different from the generic strategy optimization problem considered in section 5.1, in the engine maintenance problem not all 18-bit binary sequences
represent feasible strategies. For example, when a
component expires we have to replace this component by a new one. This means, if only component
1 expires, the feasible actions are “10” and “11”;
if only component 2 expires, the feasible actions
are “01” and “11”; if both components expire, the
feasible action is "11"; and only when no component expires are all four actions feasible. Following this analysis, we can see that among all 2^18 = 262144 18-bit binary sequences, only 4096 represent feasible strategies. We only consider these feasible strategies in this subsection. We now examine the effect of descriptive capacity on exploration.
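The count of 4096 feasible strategies can be verified by multiplying the number of feasible actions over the 9 states:

```python
from itertools import product

def feasible_actions(r1, r2):
    # An expired component (remaining lifetime 0) must be replaced.
    return [(a1, a2) for a1, a2 in product((0, 1), repeat=2)
            if (r1 > 0 or a1 == 1) and (r2 > 0 or a2 == 1)]

count = 1
for r1, r2 in product(range(3), repeat=2):  # 9 states, since d = 3
    count *= len(feasible_actions(r1, r2))
print(count)  # 4096
```

The factors are 4 for each of the 4 states with no expired component, 2 for each of the 4 states with exactly one expired component, and 1 for the state where both are expired: 4^4 × 2^4 × 1 = 4096.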
Effect of descriptive capacity on exploration. We use SVM to calculate the conditional KC of each feasible strategy. To describe a strategy exactly, we need at most 9 support vectors and as few as 0 support vectors. We increase the descriptive capacity (C0) from 1 to 9, where C0 is in units of the number of support vectors that can be stored. For each C0, we count how many strategies can be described exactly within C0 (denoted as |ΓC0|) and show the results in Table 5.
Table 5  Exploration under different descriptive capacities in Example 2

C0    |ΓC0|
1     1
2     11
3     50
4     117
5     312
6     637
7     1149
8     1847
9     4096
Similar to Table 1, we see in Table 5 that most strategies have high conditional KC (higher than 3). This means that if we focus on simple strategies, we will explore only a small portion of all the strategies. We show how this affects the solution quality in the following.
Compare selective approximation, total
approximation, and no approximation. Since
only strategies represented by some 18-bit binary
sequences are feasible, we clarify the three techniques as follows.
1. SVM description (no approximation). When
the descriptive capacity is insufficient, we only explore the feasible strategies that can be described
exactly.
2. Lookup table description (total approximation). When the descriptive capacity is insufficient,
we cannot specify one deterministic strategy, but a
randomized strategy that picks from a group of deterministic strategies with equal probability. Note
that each such deterministic strategy is feasible.
Also note that the maintenance cost of this randomized strategy may not be the average of the
costs of the group of deterministic strategies. We
use simulation to estimate the daily average maintenance cost of such randomized strategies.
3. Two-step technique (selective approximation). When the descriptive capacity is insufficient
(e.g., we can store at most 7 support vectors), we
separate all strategies into two classes: simple and
complex. For simple strategies, we use SVM to describe exactly. For complex strategies, we use the
above lookup-table-based technique to describe approximately (i.e., use a randomized strategy to approximate a group of complex deterministic strategies). In this way, we have simple deterministic
strategies and randomized strategies. We use simulation to estimate the performance of each such
strategy. Then we compare and select the best (deterministic or randomized) strategy as the solution.
For different values of C0 , we show the best strategy found by each technique in Table 6. We can
see the following interesting facts. First, selective approximation is the best. For 1 C0 8,
both SVM and the two-step technique (selective
approximation) find strategies with maintenance
cost strictly smaller than the look-up-table technique. For C0 = 9, all three techniques find the
JIA Q S et al. Sci China Ser F-Inf Sci | Nov. 2009 | vol. 52 | no. 11 | 1993-2005
2003
best strategy. The two-step technique beats SVM
when C0 = 1 and ties SVM for 2 C0 9. Second,
focusing on simply strategies reduces the search
space effectively. The best strategy is with performance 2.0620. Both SVM and the two-step technique find the best strategy for all C0 4. The
look-up-table technique, however, finds this best
strategy only when C0 = 9. If we consider only
strategies with complexity no greater than 4, there
are only 117 such strategies (refer to Table 5). By
focusing on these simple strategies, we reduce the
search space from 4096 to 117, which is only 2.86%
of the original search space, and still find the best
strategy.
Table 6  Comparison of SVM, lookup table, and two-step technique in Example 2

C0    SVM      Lookup table    Two-step
1     2.2200   2.2470          2.1900
2     2.1170   2.1890          2.1170
3     2.1170   2.1620          2.0620
4     2.0620   2.1600          2.0620
5     2.0620   2.1500          2.0620
6     2.0620   2.1410          2.0620
7     2.0620   2.1440          2.0620
8     2.0620   2.1250          2.0620
9     2.0620   2.0620          2.0620
6 Conclusions
In this paper, we focus on strategy optimization for controlled Markov process with descriptive complexity constraint. We first show that it is difficult to quantify the cardinal performance difference between simple and complex strategies. We show that two existing methods in practice cannot handle the descriptive complexity constraint well. One method solves the problem without the descriptive complexity constraint first; if the solution strategy is complex, the method approximates the complex strategy by a simple one. We show that this method may not produce simple strategies with good performance. Another method searches among strategies with predetermined structures and thus may not find the best among the simple strategies. Then, we provide an upper bound for the ordinal performance of the best simple strategy. After that, we propose to consider the descriptive complexity constraint directly, and develop the selective approximation to best utilize the given memory space. By regarding the description of a strategy as a classification problem, we propose an SVM-based description mechanism and quantify the corresponding conditional KC. The numerical results demonstrate the effect of descriptive capacity on the strategies that can be explored, and show that the proposed SVM-based selective approximation can further improve the solution quality.
It should be pointed out that we assume the conditional KC of a strategy (conditioned on the SVM description mechanism) can be accurately calculated. When the state space is extremely large, we need to estimate this conditional KC, and the performance of our method then depends on the accuracy of this estimate. In order to obtain an accurate estimate of the conditional KC, problem information is needed.
We also point out some possible future research directions. Note that when the descriptive capacity is large, it could be infeasible to enumerate all the simple strategies. We then need to combine the selective approximation with stochastic optimization algorithms to find the best (or a good) simple strategy. Also note that we have shown that when the descriptive capacity increases, the performance of the solution strategy can be improved. An interesting research problem of practical interest is to conduct a sensitivity analysis, i.e., to quantify how much performance improvement can be achieved when the descriptive capacity increases.
We hope this work sheds some insight to strategy
optimization for controlled Markov process with
descriptive complexity constraint in general.
The authors would like to thank Prof. Y. C. Ho, Dr. L. Xia,
and three anonymous reviewers for the helpful comments on a
previous version of this manuscript.
1 Puterman M L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley and Sons,
Inc., 1994
2 Bertsekas D P. Dynamic Programming and Optimal Control.
Belmont, MA: Athena Scientific, 2007
3 Li M, Vitányi P. An Introduction to Kolmogorov Complexity
and Its Applications. 2nd ed. New York: Springer-Verlag New
York Inc., 1997
4 Ho Y C, Zhao Q C, Pepyne D L. The no free lunch theorems:
complexity and security. IEEE Trans Automat Contr, 2003,
48(5): 783–793
5 Vapnik V N. Statistical Learning Theory. New York: John
Wiley and Sons, Inc., 1998
6 Bertsekas D P, Tsitsiklis J N. Neuro-Dynamic Programming.
Belmont, MA: Athena Scientific, 1996
7 Gunn S. Support Vector Machines for Classification and Regression. ISIS Technical Report. 1998
8 Burges C J C. A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc, 1998, 2(2): 121–167
9 Ma J, Zhao Y, Ahalt S. OSU SVM Classifier Matlab Toolbox. Version 3.00. Available: http://www.ece.osu.edu/maj/osu svm/
10 Cho D I, Parlar M. A survey of maintenance models for multiunit systems. Eur J Oper Res, 1991, 51: 1–23
11 Dekker R. Applications of maintenance optimization models:
a review and analysis. Reliab Eng Syst Safe, 1996, 51: 229–240
12 Tan J S, Kramer M A. A general framework for preventive maintenance optimization in chemical process operations.
Comput Chem Eng, 1997, 21(12): 1451–1469
13 Wang H. A survey of maintenance policies of deteriorating systems. Eur J Oper Res, 2002, 139: 469–489
14 Xia L, Zhao Q C, Jia Q S. A structure property of optimal
policies for maintenance problems with safety-critical components. IEEE Trans Automat Sci Eng, 2008, 5(3): 519–531
15 Sun T, Zhao Q C, Luh P B, et al. Optimization of joint replacement policies for multi-part systems by a rollout framework. IEEE Trans Automat Sci Eng, 2008, 5(4): 609–619
16 Dekker R, Wildeman R E, Van Der Duyn Schouten F A. A review of multi-component maintenance models with economic
dependence. Math Method Oper Res, 1997, 45: 411–435
17 Van Der Duyn Schouten F A, Vanneste S G. Two simple control policies for a multi-component maintenance system. Oper
Res, 1993, 41: 1125–1136