Data-driven Meta-heuristic Search
Ke Tang
USTC-Birmingham Joint Research Institute in Intelligent Computation and Its Applications (UBRI)
School of Computer Science and Technology
University of Science and Technology of China
December 2014 @ CityU of Hong Kong
1 Outline
• A data-driven perspective on Meta-heuristic Search
• Speciation in DDMS
• Algorithm Selection in DDMS
• Identification of Interacting Decision Variables in DDMS
• Summary
3 A data-driven perspective on MS
• There are many well-known meta-heuristic search (MS) methods:
– Simulated Annealing
– Tabu Search
– Scatter Search
– Genetic Algorithms
– Evolution Strategies
– Evolutionary Programming
– Particle Swarm Optimizer
– Ant Colony Optimization
– Differential Evolution
– Estimation of Distribution Algorithms
– etc.
4 A data-driven perspective on MS
• Despite their different historical backgrounds, most MS methods
share a similar framework, i.e., they are Stochastic Generate-and-Test algorithms.
• An MS method iteratively samples the solution space, and can thus
be viewed as a data-generating process.
[Figure: sampling in the solution space produces individuals 1, …, n, each a vector (x1, …, xD) with an associated fitness]
5 A data-driven perspective on MS
• The data contain a great deal of information, such as:
– The candidate solutions (individuals)
– Their corresponding fitness
– The “origin” of an individual (e.g., which operator was applied
to which parents to generate it).
• Data-driven Meta-heuristic Search (DDMS): exploiting the data
generated by an MS method during its search process to enhance
its performance.
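To make the “data-generating” view concrete, here is a minimal sketch (not from the talk; the field names are illustrative) of a log that records exactly the three kinds of information listed above:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class HistoryEntry:
    x: Tuple[float, ...]            # candidate solution (x1, ..., xD)
    fitness: float                  # its evaluated fitness
    operator: Optional[str] = None  # which operator generated it ("origin")
    parents: Tuple[int, ...] = ()   # log indices of its parents

class SearchHistory:
    """Append-only log of every individual evaluated by an MS method."""
    def __init__(self):
        self.entries: List[HistoryEntry] = []

    def record(self, x, fitness, operator=None, parents=()):
        self.entries.append(HistoryEntry(tuple(x), float(fitness),
                                         operator, tuple(parents)))
        return len(self.entries) - 1  # index usable as a parent reference
```

Dropping the origin fields and stacking the entries row-wise yields exactly the n-by-(D+1) matrix representation mentioned in the summaries later in this talk.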
6 A data-driven perspective on MS
• Many existing works could be interpreted as DDMS:
– Surrogate Assisted Evolutionary Algorithms
– Many parameter adaptation/self-adaptation schemes
• Key questions of DDMS:
– Q1: Why would an MS benefit from historical data?
– Q2: What information is to be extracted from the data?
– Q3: How should the required information be extracted
from the data?
7 A data-driven perspective on MS
• The answers to Q1 and Q2 define the specific data analytics
problem that needs to be addressed.
• Data analytics problems in DDMS are likely to be intractable
(e.g., NP-hard) themselves.
• Or, they may introduce substantial computational overhead.
(No free lunch)
• Hence, the trade-off between this overhead and the benefit of using
historical data should always be kept in mind when
answering Q3.
8 Outline
• A data-driven perspective on Meta-heuristic Search
• Speciation in DDMS
• Algorithm Selection in DDMS
• Identification of Interacting Decision Variables in DDMS
• Summary
9 Speciation in DDMS
• Challenges brought by multimodal problems:
– There may be more than one optimum, all (roughly) equally good.
– The goal becomes finding multiple optima of a problem.
• Why?
– Provides the user with a range of choices (more informed decisions)
– Reveals insights into the problem (inspires innovation)
10 Speciation in DDMS
• When employing EAs to find multiple optima, a procedure
called speciation is usually required.
• Speciation: partitioning a population into a few species.
– Niche: a region of attraction on the fitness landscape
– Species: a group of individuals occupying the same niche
– Species seed: the best (fittest) individual of a species
11 Speciation in DDMS
• A typical speciation procedure [flowchart not reproduced in this text version]
12 Speciation in DDMS
• Most speciation methods rely on a sub-algorithm to determine
whether two individuals belong to the same species.
• Speciation methods
– Distance-based: Determines whether two individuals are of the same
species according to their distance
– Topology-based: Determines whether two individuals are of the same
species according to the fitness landscape topography
[Diagram: speciation methods divide into Distance-Based and Topology-Based; the Topology-Based branch includes Hill-Valley and Recursive Middling]
13 Distance-based Speciation
• Two individuals are assigned to the same species if their
distance is smaller than a predefined threshold called the niche
radius.
• Introduces an additional parameter (the niche radius), which is
difficult to tune.
• Makes strong assumptions, i.e., equally sized and spherically
shaped niches.
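For illustration, a minimal sketch of such a distance-based scheme (a greedy seed-first partitioning; the loop structure is illustrative, not a specific published method) is:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_based_speciation(population, fitness, niche_radius):
    """The fittest unassigned individual becomes a species seed; every
    unassigned individual within niche_radius of the seed joins it."""
    order = sorted(range(len(population)), key=lambda i: fitness[i],
                   reverse=True)                     # fittest first
    species, assigned = [], set()
    for i in order:
        if i in assigned:
            continue
        members = [j for j in order
                   if j not in assigned
                   and euclidean(population[i], population[j]) <= niche_radius]
        assigned.update(members)
        species.append(members)      # members[0] == i is the species seed
    return species
```

Both criticisms above are visible here: the result hinges on niche_radius, and a single radius forces every niche to be a sphere of the same size.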
14 Topology-based Speciation
– Hill-Valley (HV)
– Recursive Middling (RM)
• These make weaker assumptions than distance-based methods.
• But they sample new points in order to capture the landscape topography:
– When more FEs are spent on speciation, fewer are available for the
evolutionary algorithm to converge.
– Not very attractive, especially when fitness evaluation is costly.
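For reference, a minimal sketch of the classic Hill-Valley test is shown below (maximization assumed; the number of interior samples is a parameter). Each interior point costs one extra fitness evaluation, which is precisely the overhead criticized above:

```python
def hill_valley_same_niche(x, y, f, fx, fy, num_samples=3):
    """Return True if no valley is detected on the segment between x and y.
    fx and fy are the (already known) fitness values of x and y; each of
    the num_samples interior points costs one additional FE."""
    f_min = min(fx, fy)
    for k in range(1, num_samples + 1):
        t = k / (num_samples + 1.0)
        point = [xi + t * (yi - xi) for xi, yi in zip(x, y)]
        if f(point) < f_min:   # a dip below both endpoints: a valley
            return False       # x and y occupy different niches
    return True
```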
15 History-based Topological Speciation
• Research question: could topology-based speciation be made FE-free, so that its benefits can be better appreciated?
• Approach: History-Based Topological Speciation (HTS)
• Capture landscape topography exclusively based on search
history.
16 History-based Topological Speciation
• Topology-based speciation methods can be interpreted as testing a
sequence of points X between two individuals a and b.
– X is an infinite sequence and cannot be tested directly.
– RM “approximates” X by sampling a few points on the segment ab.
• Basic idea of HTS: approximate X using only historical data/points.
• What is a “good” approximation? [Figure contrasting bad and good approximations not reproduced]
17 History-based Topological Speciation
• Conceptually, HTS follows a two-step procedure:
1. Construct a finite, discrete approximate sequence
2. Test the approximate sequence to reach a final decision (trivial)
• More formally, finding the best approximation can be stated as an
optimization problem over candidate approximate sequences [formal
statement not reproduced in this text version].
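The exact formulation and its solution are in the cited paper; purely to convey the flavor of step 1, here is a hedged sketch in which the approximate sequence consists of already-evaluated points that project onto the interior of segment ab and lie within a tolerance of it (the tolerance test and ordering are simplifications, not the paper's definition):

```python
import math

def approximate_sequence(a, b, history, tol):
    """Select history points near segment ab, ordered along the segment.
    'history' is a list of (x, fitness) pairs; no new FEs are consumed."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ab_len2 = sum(d * d for d in ab) or 1e-12
    selected = []
    for x, fx in history:
        t = sum((xi - ai) * d for xi, ai, d in zip(x, a, ab)) / ab_len2
        if 0.0 < t < 1.0:                  # projects strictly between a and b
            proj = [ai + t * d for ai, d in zip(a, ab)]
            if math.dist(x, proj) <= tol:  # close enough to the segment
                selected.append((t, fx))
    selected.sort()                        # order along the segment
    return [fx for _, fx in selected]

def same_niche(fa, fb, approx_seq):
    """Hill-valley-style test on the approximate sequence (maximization):
    a species boundary is declared only if some point dips below both ends."""
    return all(fx >= min(fa, fb) for fx in approx_seq)
```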
18 History-based Topological Speciation
19 History-based Topological Speciation
20 HTS: Experiments
21 HTS: Experiments
• Different methods are integrated into the same
evolutionary framework for comparison
– Crowding Differential Evolution with Species Conservation
• Benchmark functions
– F1–F6: six two-dimensional functions with various properties
(number of optima: 2–10)
– F7–F10: MMP functions in 4, 8, 16, and 32 dimensions, respectively
(number of optima: 48)
– F11: a composition of 50 random 32-dimensional shifted rotated
ellipsoidal sub-functions coupled via the max operator
(number of optima: 50)
• The goal is to find all optima of the benchmark functions
22 HTS: Experiments
• Performance measure: the distance error at the last generation
is used to measure the performance of each algorithm.
• Win/Draw/Lose of HTS versus every other method (11 benchmark functions):

DIS-: 9/0/2   DIS: 2/5/4   DIS+: 4/5/2   HV1: 9/0/2
HV3: 9/0/2    HV5: 9/0/2   RM: 9/0/2     RM*: 1/3/7
– Both the t-test and the Wilcoxon rank-sum test are used
– A difference is considered statistically significant if it is asserted
by both tests at the 0.05 significance level
– A draw is counted when no statistically significant difference is observed
23 Summary of HTS
• Q1: Why would an MS benefit from historical data?
A1: Because species can be formed without setting any parameter or
consuming any additional FEs.
• Q2: What information is to be extracted from the data?
A2: A few clusters of individuals (i.e., a species tag for each individual)
• Q3: How should the required information be extracted from the data?
A3: By finding an approximation to a line segment using only previously
evaluated candidate solutions.
• Representation of data: individuals and their fitness (e.g., an n-by-(D+1)
matrix).
• Generalization is not required/considered.
24 For more details
• L. Li and K. Tang, “History-Based Topological Speciation for Multimodal
Optimization,” IEEE Transactions on Evolutionary Computation, in press
(Early Access).
• P. Yang, K. Tang and X. Lu, “Improving Estimation of Distribution
Algorithm on Multi-modal Problems by Detecting Promising Areas,” IEEE
Transactions on Cybernetics, accepted on 22 August 2014.
25 Outline
• A data-driven perspective on Meta-heuristic Search
• Speciation in DDMS
• Algorithm Selection in DDMS
• Identification of Interacting Decision Variables in DDMS
• Summary
26 PAP - Background
A scenario frequently encountered in the real world:
• A number of optimization problems
• A time budget T
• A number of optimization algorithms (e.g., GA, ES, EP, EDA,
DE, PSO, …)
We want to obtain the best (or as good as possible) solutions for
all the problems within T.
27 PAP - Background
• Intuitively, the total time budget T can be used for two purposes:
(1) to identify the best algorithm
(2) to search for the best solution
• In general, the more time we spend on (2), the better the solution
we will achieve.
• Different problems may favor different algorithms, and finding the
best algorithm for a problem can be very time-consuming.
28 PAP - Background
General thoughts:
• Arbitrarily pick an algorithm for every problem?
– T will be used solely to search for solutions
– Too risky
• Carefully identify the best algorithm for each problem?
– A lot of time will be used for algorithm selection
– The time left for searching for good solutions might be insufficient
• Try to find a single algorithm suitable for all problems?
– Sounds like a good trade-off
– But the advantages of having a set of different algorithms are not fully utilized
29 PAP - Background
• How about establishing a good “portfolio” of algorithms (i.e.,
a combination of multiple algorithms) for all problems?
• Advantages:
– Makes use of the strengths of different algorithms, rather than
putting all the eggs (time) into a single basket (algorithm).
– Hopefully not too time-consuming, since only one portfolio is
needed for all problems.
30 PAP - Background
• Algorithm Portfolios “invest” limited time in multiple
algorithms, fully utilizing their respective advantages
so as to maximize the expected utility of a problem-solving episode.
• Analogy to economics: one allocates one's money to
different financial assets (stocks, bonds, etc.) in order to
maximize expected returns while minimizing risk.
• Population-based Algorithm Portfolios (PAP)
– Conceptually similar to Algorithm Portfolios
– Aims to solve a set of problems rather than a single one
– Focuses on population-based algorithms (e.g., EAs)
31 PAP - Background
• The general framework of PAP:
1. Select the constituent algorithms from a pool of candidate algorithms
2. Construct a concrete PAP instantiation with the constituent algorithms
3. Apply the PAP instantiation to each problem
4. Output the best solution obtained for each problem
32 PAP - Background
• Which candidate algorithms should serve as constituent
algorithms depends on how a PAP instantiation is built:
– A PAP instantiation maintains multiple sub-populations.
– Each sub-population is evolved with a constituent algorithm.
– Information is shared among sub-populations by periodically
activating a migration scheme.
33 PAP - Background
• Pseudo-code of a PAP instantiation: [not reproduced in this text version; a sketch follows]
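Purely as an illustration of the loop just described (and not the paper's exact pseudo-code), here is a minimal Python sketch for a minimization objective f, where `evolve_one_generation` (the constituent EA's interface) and the simple elite-swap `migrate` scheme are hypothetical stand-ins:

```python
def pap_instantiation(constituent_eas, subpops, f, max_gen,
                      migration_interval, migration_size):
    """One sub-population per constituent EA; periodically share elites."""
    for gen in range(1, max_gen + 1):
        for ea, pop in zip(constituent_eas, subpops):
            ea.evolve_one_generation(pop, f)     # hypothetical EA interface
        if gen % migration_interval == 0:
            migrate(subpops, f, migration_size)
    return min((ind for pop in subpops for ind in pop), key=f)

def migrate(subpops, f, migration_size):
    """Copy each sub-population's best individuals into every other
    sub-population, replacing the worst members (one scheme of many)."""
    elites = [sorted(pop, key=f)[:migration_size] for pop in subpops]
    for i, dst in enumerate(subpops):
        for j, donors in enumerate(elites):
            if i != j:
                dst.sort(key=f)
                dst[-migration_size:] = [list(e) for e in donors]
    return subpops
```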
34 EPM-PAP
On Choosing Constituent Algorithms
• Let F = {f_k | k = 1, 2, …, n} be a given problem set and A = {a_j | j = 1, 2, …, m}
be a set of candidate EAs. Choosing constituent algorithms for PAP is formulated
as seeking the subset $\tilde{A}$ = {a_i | i = 1, 2, …, l} of A that leads to the best
overall performance on F:

$$\tilde{A}_{\mathrm{opt}} = \arg\max_{\tilde{A} \subseteq A} U(\tilde{A}, F, T)$$

• The most straightforward approach: enumerate all possible subsets and employ a
procedure like statistical racing to find the best one.
→ Even more time-consuming than selecting a single algorithm!
35 EPM-PAP
• Recall that we expect a good PAP instantiation to underperform any
candidate EA (say, a_j) only with small probability.
• Let $P_{i,j}^{k}$ denote the probability that constituent algorithm a_i
outperforms candidate a_j on problem f_k. Assuming independence between
constituent algorithms, the risk that the portfolio underperforms a_j
on f_k is:

$$R_{jk} = \prod_{i=1}^{l} \left(1 - P_{i,j}^{k}\right)$$

• Averaging over all problems and all candidate EAs, we get:

$$R = \frac{1}{mn} \sum_{j=1}^{m} \sum_{k=1}^{n} \prod_{i=1}^{l} \left(1 - P_{i,j}^{k}\right) \qquad (1)$$
36 EPM-PAP
What is an Estimated Performance Matrix (EPM)?
• A matrix that records the performance of each candidate EA.
• For each a_j, the corresponding EPM, denoted EPM_j, is an r-by-n matrix.
• This matrix is obtained by running a_j on each of the n problems r times.
• Each element of EPM_j is the objective value of the best solution that a_j
obtained on a problem in a single run.
• Since each element of EPM_j is obtained with a small portion of T, it can be
viewed as a conservative estimate of the solution quality achievable by
running a_j with the full budget T on the same problem.
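As a sketch of how such a matrix might be assembled (where `run_ea`, a helper performing a single run of one EA on one problem under a small FE budget and returning the best objective value found, is hypothetical):

```python
def build_epm(run_ea, ea, problems, r, fes_per_run):
    """EPM for one candidate EA: an r-by-n matrix whose entry [run][k]
    is the best objective value found in one independent run on the
    k-th problem, using only a small slice of the total budget T."""
    return [[run_ea(ea, problem, fes_per_run) for problem in problems]
            for _ in range(r)]
```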
37 EPM-PAP
• With the help of some statistical tests, the EPMs provide all the
information needed to calculate Eq. (1).
• Good news:
– No need to compare the performance of all possible subsets
with a tedious procedure like statistical racing.
– Estimating the performance of each single candidate EA is
sufficient for selecting the constituent algorithm subset.
38 EPM-PAP
Detailed steps for choosing constituent algorithms (a code sketch follows this list):
1. Apply each candidate EA a_j to each problem for r independent
runs. The final population obtained in each run is stored.
2. Construct an EPM for each a_j based on the quality of the best
solution it obtained in each run.
3. Enumerate all possible subsets of A and calculate the corresponding R
using Eq. (1) and the EPMs.
4. Select the subset with the smallest R as the constituent
algorithms for PAP.
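A minimal sketch of steps 3 and 4 follows. Here $P_{i,j}^{k}$ is estimated as the fraction of pairwise run comparisons in which algorithm i's result beats algorithm j's on problem k; the paper derives these quantities with the help of statistical tests on the EPMs, so this plain empirical estimate is a simplification:

```python
from itertools import combinations

def p_beat(epm_i, epm_j, k):
    """Empirical estimate of P_{i,j}^k for a minimization problem k:
    the fraction of run pairs in which algorithm i beats algorithm j."""
    runs_i = [row[k] for row in epm_i]
    runs_j = [row[k] for row in epm_j]
    wins = sum(a < b for a in runs_i for b in runs_j)
    return wins / (len(runs_i) * len(runs_j))

def select_constituents(epms, n_problems, subset_size):
    """Enumerate all subsets of the given size and pick the one
    minimizing the risk R of Eq. (1)."""
    m = len(epms)
    best_subset, best_r = None, float("inf")
    for subset in combinations(range(m), subset_size):
        total = 0.0
        for j in range(m):                  # average over all candidate EAs
            for k in range(n_problems):     # ... and over all problems
                prod = 1.0
                for i in subset:
                    prod *= 1.0 - p_beat(epms[i], epms[j], k)
                total += prod
        r = total / (m * n_problems)
        if r < best_r:
            best_subset, best_r = subset, r
    return best_subset, best_r
```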
39 EPM-PAP: Experiments
• 4 candidate EAs: CMA-ES, G3PCX, SaNSDE, wPSO
• Benchmark problems:
– 13 numerical problems from a classical benchmark suite
– 14 numerical problems from the CEC2005 benchmark suite
– Dimension: 30
40 EPM-PAP: Experiments
• Total fitness evaluations (FEs) for each problem: 400000, 800000,
and 1200000 (referred to below as T1, T2, and T3), respectively.
• 25 independent runs on each problem.
• For convenience of implementation, all constituent algorithms of a
PAP instantiation evolve for the same number of generations.
• Parameters of the constituent algorithms are not fine-tuned.
• migration_interval = MAX_GEN/20, migration_size = 1
• PAPs with 2 and 3 constituent algorithms are considered.
41 EPM-PAP: Experiments
• Wilcoxon test results (significance level 0.05); “w-d-l” stands
for “win-draw-lose”:

           Budget  SaNSDE  wPSO     G3PCX   CMA-ES   F-Race   Intra-AOTA
EPM-PAP-2  T1      8-14-5  17-10-0  21-6-0  8-13-6   9-14-4   6-15-6
EPM-PAP-2  T2      7-14-6  16-10-1  20-7-0  9-14-4   7-15-5   5-18-4
EPM-PAP-2  T3      6-15-6  17-9-1   21-6-0  10-14-3  7-14-6   6-18-3
EPM-PAP-3  T1      9-11-7  19-7-1   21-5-1  10-10-4  10-13-4  5-17-5
EPM-PAP-3  T2      8-17-2  17-9-1   20-7-0  9-12-6   9-12-6   5-20-2
EPM-PAP-3  T3      9-16-2  17-10-0  21-6-0  9-14-4   9-14-4   6-20-1

42 EPM-PAP: Experiments
• Performance ranking of all possible EPM-PAP-2 and EPM-PAP-3 instantiations:

With 2 constituent algorithms:
Rank  T1                T2                T3
1     SaNSDE + CMA-ES   SaNSDE + CMA-ES   SaNSDE + CMA-ES
2     SaNSDE + wPSO     SaNSDE + wPSO     SaNSDE + wPSO
3     wPSO + CMA-ES     wPSO + CMA-ES     wPSO + CMA-ES
4     G3PCX + CMA-ES    G3PCX + CMA-ES    G3PCX + CMA-ES
5     wPSO + G3PCX      wPSO + G3PCX      wPSO + G3PCX
6     SaNSDE + G3PCX    SaNSDE + G3PCX    SaNSDE + G3PCX

With 3 constituent algorithms:
Rank  T1                    T2                    T3
1     SaNSDE+wPSO+CMA-ES    SaNSDE+wPSO+CMA-ES    SaNSDE+wPSO+CMA-ES
2     SaNSDE+G3PCX+CMA-ES   SaNSDE+G3PCX+CMA-ES   SaNSDE+G3PCX+CMA-ES
3     SaNSDE+wPSO+G3PCX     SaNSDE+wPSO+G3PCX     wPSO+G3PCX+CMA-ES
4     wPSO+G3PCX+CMA-ES     wPSO+G3PCX+CMA-ES     SaNSDE+wPSO+G3PCX

43 EPM-PAP: Experiments
• Success rates of the EPM-based selection procedure: how likely
was it to select the best constituent algorithm subset?

           Budget  SR1   SR2
EPM-PAP-2  T1      40%   88%
EPM-PAP-2  T2      56%   100%
EPM-PAP-2  T3      72%   100%
EPM-PAP-3  T1      16%   84%
EPM-PAP-3  T2      36%   88%
EPM-PAP-3  T3      56%   100%

44 Summary of EPM-PAP
• Q1: Why would an MS benefit from historical data?
A1: Because a better subset of algorithms can be identified.
• Q2: What information is to be extracted from the data?
A2: An m-dimensional binary vector (indicating which candidate EAs are selected)
• Q3: How should the required information be extracted from the data?
A3: Invest additional FEs to accumulate statistically meaningful estimates
of the performance of the algorithms.
• Representation of data: the quality of the solutions found by each candidate
algorithm on each problem (i.e., the EPMs).
• It is implicitly assumed that performance achieved with a small number of
FEs generalizes to cases with a larger number of FEs.
45 For more details
• F. Peng, K. Tang, G. Chen and X. Yao, “Population-based Algorithm
Portfolios for Numerical Optimization,” IEEE Transactions on Evolutionary
Computation, 14(5): 782-800, October 2010.
• K. Tang, F. Peng, G. Chen and X. Yao, “Population-based Algorithm
Portfolios with automated constituent algorithms selection,” Information
Sciences, 279: 94-104, September 2014.
46 Outline
• A data-driven perspective on Meta-heuristic Search
• Speciation in DDMS
• Algorithm Selection in DDMS
• Identification of Interacting Decision Variables in DDMS
• Summary
47 Background
• Although EAs have achieved great success in the domain of
optimization, most reported studies use small-scale
problems (e.g., numerical optimization with fewer than 100
decision variables).
• Most existing EAs suffer from the “curse of dimensionality”.
• On the other hand, large scale problems have emerged in many
areas.
48 An example
• The Bird's Nest stadium (a China–Switzerland collaboration)
• The irregular ordering of its beams posed a problem that the
CAD tools of the time could not solve.
49 Large Scale Optimization Problems
• Research target of LSGO:
To scale EAs up to problems at least one order of magnitude
larger than the state-of-the-art (i.e., with about 1000 variables).
• What makes large scale problems difficult?
– The solution space often grows exponentially with
problem dimensionality.
– Problem complexity may grow with dimensionality,
e.g., the number of local optima.
– The number of candidate search directions often grows
exponentially, so EAs may fail to find the promising ones.
50 EACC-G
• Basic (and old) idea: divide-and-conquer.
• Cooperative Coevolution (CC) is an ideal approach for
implementing this idea:
– Decompose the objective problem into several sub-problems;
– Evolve each sub-problem separately using EAs;
– Combine the solutions to all sub-problems to form the
solution to the original problem.
• By decomposition, we mean dividing the D decision variables
into a few groups.
51 EACC-G
• More formally, the above approach is named EACC-G, and it
involves a predefined number of cycles.
• Each cycle consists of the following steps:
– Split the D decision variables into m groups, each containing s variables.
– Optimize each sub-problem with an EA.
– Solutions for each sub-problem are evaluated by combining them with the
best solutions obtained for the other sub-problems.
52 EACC-G
• The key question: how to decompose?
• If a problem contains a nonseparable component, we say the
decision variables in this component are interacting variables.
• Intuitively, interacting variables should be grouped together by
the decomposition procedure.
• The simplest way to decompose is to group the decision
variables randomly.
– Sounds too straightforward to work properly.
– But it is not as “silly” as it seems.
53 Benefit of Random Grouping
• The probability of EACC-G assigning two interacting variables x_i
and x_j to the same group for at least k of the N cycles is:

$$P_k = \sum_{r=k}^{N} \binom{N}{r} \left(\frac{1}{m}\right)^{r} \left(1 - \frac{1}{m}\right)^{N-r}$$

where N is the number of cycles and m is the number of groups.
• For example, given a 1000-D problem with m = 10 (and N = 50, the
value that reproduces the quoted probabilities), P_1 = 0.9948 and
P_2 = 0.9662.
• So even the simple random grouping strategy has some chance to
group two interacting variables together.
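P_k is a binomial tail probability, so the quoted numbers are easy to verify numerically:

```python
from math import comb

def p_at_least_k(N, m, k):
    """Probability that two fixed interacting variables (probability 1/m
    of landing in the same group per cycle) co-occur in at least k of
    the N independent random-grouping cycles."""
    return sum(comb(N, r) * (1 / m) ** r * (1 - 1 / m) ** (N - r)
               for r in range(k, N + 1))

print(p_at_least_k(50, 10, 1))   # ~0.9948, as on the slide
print(p_at_least_k(50, 10, 2))   # ~0.9662
```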
54 EACC-G
• With the random grouping scheme, each cycle of EACC-G
becomes (see the sketch below):
– Randomly split the D decision variables into m groups, each containing s
variables.
– Optimize each sub-problem with an EA.
– Solutions for each sub-problem are evaluated by combining them with the
best solutions obtained for the other sub-problems.
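A minimal sketch of one such cycle is given below; `optimize_subproblem` (running the subordinate EA over one group of variables while all other variables stay fixed at their best-known values) is a hypothetical helper, and D is assumed to equal m * s:

```python
import random

def eacc_g_cycle(context, D, m, f, optimize_subproblem):
    """One EACC-G cycle with random grouping. 'context' is the current
    best full solution (a length-D list) and is updated in place."""
    indices = list(range(D))
    random.shuffle(indices)                 # random grouping
    s = D // m                              # assumes D == m * s
    groups = [indices[g * s:(g + 1) * s] for g in range(m)]
    for group in groups:
        # Candidate settings for this group are evaluated by plugging them
        # into 'context', i.e., combining them with the best solutions
        # obtained for the other sub-problems.
        best_values = optimize_subproblem(f, context, group)  # hypothetical
        for idx, val in zip(group, best_values):
            context[idx] = val
    return context
```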
55 Experimental Studies
• Test suite: 13 minimization problems (1000-dimensional).
• Baseline: applying Differential Evolution (DE; here the SaNSDE variant)
to the problems directly.
• DECC-G: using the same DE as the basic optimizer within the CC framework.
• The number of FEs was set to 5e+06 for all algorithms.
• Results of 25 independent runs were collected for each problem.
56 Experimental Studies: Results (Unimodal)
Comparison between DECC-G and SaNSDE on functions f1–f7
(unimodal), with dimension D = 1000, averaged over 25 runs:

Function (D = 1000)  SaNSDE    DECC-G
f1                   6.97E+00  2.17E-25
f2                   1.24E+00  5.37E-14
f3                   6.43E+01  3.71E-23
f4                   4.99E+01  1.01E-01
f5                   3.31E+03  9.87E+02
f6                   3.93E+03  0.00E+00
f7                   1.18E+01  8.40E-03
57 Experimental Studies: Results (Multimodal)
Comparison between DECC-G and SaNSDE on functions f8–f13
(multimodal), with dimension D = 1000, averaged over 25 runs:

Function (D = 1000)  SaNSDE    DECC-G
f8                   -372991   -418983
f9                   8.69E+02  3.55E-16
f10                  1.12E+01  2.22E-13
f11                  4.80E-01  1.01E-15
f12                  8.97E+00  6.89E-25
f13                  7.41E+02  2.55E-21
58 Drawbacks of Random Decomposition
• The group size needs to be predefined, which is rather difficult.
• All groups are assumed to be of the same size, which is
probably unreasonable.
• The nature of random grouping limits the chance of
placing all interacting variables in the same group.
59 Variable Interaction Learning
• A bottom-up grouping approach:
1. Start by treating each decision variable as its own group
2. Learn the interactions between variables
3. Merge interacting variables/groups into the same group
4. Go to step 2 until a stopping criterion is met
• Benefits
– No need to specify the number of groups.
– Groups can be of different sizes.
– Once the learning phase finishes, there is no need to re-group the
decision variables.
60 Variable Interaction Learning
• How to learn the interactions?
– If two solution vectors x and x′ differ only in the i-th dimension,
and the i-th and j-th decision variables are NOT interacting,
then changing the value of the j-th decision variable (identically in
both vectors) will NOT change the relative order of f(x) and f(x′).
• Hence, we may say that the i-th and j-th variables are interacting
if such a change to the j-th variable DOES flip the relative order of
f(x) and f(x′).
• Every interaction learned by this mechanism is correct.
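A minimal sketch of this test follows: perturb dimension i, then apply an identical perturbation to dimension j in both vectors and check whether the order of the two fitness values flips (four FEs per test; the perturbation sizes are illustrative):

```python
def detected_interaction(f, x, i, j, delta_i, delta_j):
    """Return True iff changing x_j flips the relative order of f across
    a change in x_i: sound evidence that variables i and j interact.
    A False result proves nothing (the interaction may just be missed)."""
    x1 = list(x)
    x2 = list(x)
    x2[i] += delta_i                # x1 and x2 differ only in dimension i
    before = f(x1) < f(x2)
    x1[j] += delta_j                # identical change to dimension j ...
    x2[j] += delta_j                # ... in both vectors
    after = f(x1) < f(x2)
    return before != after          # order flipped => interaction
```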
61 Variable Interaction Learning
62 CCVIL: A Two-Stage Algorithm
Cooperative Coevolution with Variable Interaction Learning
1. Initialization: randomly initialize a population of solutions, and
randomly choose an individual from the population.
2. Learning stage: repeat a number of learning cycles; each
learning cycle consists of two steps:
(1) Randomly permute the sequence of decision variables
(2) Scan over the permuted sequence, checking the interaction
between each pair of successive variables. If evidence of
interaction is discovered, mark the two variables as
belonging to the same group.
3. Optimization stage:
(1) Categorize the decision variables according to the
information obtained in the learning stage
(2) Solve the problem using the CC framework
63 No Free Lunch: The Learning Overhead
• The learning stage costs FEs, so a trade-off between learning and
evolution (optimization) needs to be set.
• An appropriate setting of the learning cycles can deal with both
separable and non-separable functions.
• Termination conditions for the learning stage:
– If no interactions are learned after Kˇ cycles, the function is treated as
separable and the learning stage terminates.
– If any interaction is learned within the first Kˇ cycles, the function is
treated as non-separable. In this case, the learning stage stops only when:
• all dimensions have been combined into one group, or
• 60% of the FEs have been consumed by the learning stage.
64 CCVIL: Experimental Results
[results not reproduced in this text version]
65 Summary of VIL
• Q1: Why would an MS benefit from historical data?
A1: Because interacting variables become more likely to be grouped together.
• Q2: What information is to be extracted from the data?
A2: A binary “interaction” matrix (D-by-D)
• Q3: How should the required information be extracted from the data?
A3: Invest additional FEs to perform tests between variables.
• Representation of data: individuals and their fitness (e.g., an n-by-(D+1)
matrix).
• Generalization is not required/considered.
66 For more details
• Z. Yang, K. Tang and X. Yao, “Large Scale Evolutionary Optimization
Using Cooperative Coevolution,” Information Sciences, 178(15):
2985-2999, 2008.
• W. Chen, T. Weise, Z. Yang and K. Tang, “Large-Scale Global Optimization
using Cooperative Coevolution with Variable Interaction Learning,” in
Proceedings of the 11th International Conference on Parallel Problem
Solving From Nature (PPSN), Kraków, Poland, September 11–15, 2010, pp.
300–309.
67 Outline
• A data-driven perspective on Meta-heuristic Search
• Speciation in DDMS
• Algorithm Selection in DDMS
• Identification of Interacting Decision Variables in DDMS
• Summary
68 Summary
• Data-driven MS uses data analytics approaches to extract
useful information from the data generated during search.
• Three examples of DDMS have been introduced.
• Different contexts in MS may induce significantly different data
analytics problems, where much work remains to be done.
69 Collaborators
• Collaborators at UBRI (ubri.ustc.edu.cn)
– Mr. Lingxi Li (HTS)
– Dr. Fei Peng (EPM-PAP)
– Prof. Xin Yao (EPM-PAP)
– Prof. Guoliang Chen (EPM-PAP)
– Mr. Wenxiang Chen (CCVIL)
– Dr. Thomas Weise (CCVIL)
– Dr. Zhenyu Yang (CCVIL)
70 Thanks for your time!
Q&A?