
Lecture 5: Fitness Landscape Models and Simulated Annealing Search
A fitness landscape is a geometry that maps economic inputs/strategies/market factors to profitability.
Each landscape has a peak that gives you maximum returns. Sewall Wright used fitness landscapes to illuminate evolution.
In standard economics/calculus, the landscape is a U-shaped (convex) cost curve or a globally concave
profit function. To get to the bottom or top you differentiate with respect to the control variables, set the derivatives to 0, and solve.
Negative second derivatives (or negative definite matrices of second partials) guarantee you are at the max; positive ones, at the min. If the problem is too complicated
for a direct solution, do a gradient search using a hill-climbing algorithm: "keep going up or down until you get
(close enough) to where you want to be."
But the world does not always generate nicely shaped landscapes.
Why? 1) Too much information relative to the time available to process it and determine the landscape
-- "bounded rationality" à la Herb Simon. Example: hiring someone. Say there are 10 attributes that matter for
doing a good job, where a candidate either has an attribute (1) or not (0). If CONFIGURATIONS -- the mix of
attributes -- matter, there are a lot of cases to explore. Configurations enter because someone who has lots of, say,
energy and basic knowledge but little experience may dominate someone with lots of experience and basic
knowledge but limited energy. 00011 10101 has some fitness, say 0.5; 10010 01100 has another fitness, 0.3.
There are 2^10 = 1024 cases to explore. If it took you half an hour to assess each, working 40 hours a week, you
would spend about three months before deciding on the ideal candidate.
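The arithmetic of the hiring example can be checked directly; this sketch just recomputes the figures in the text (10 binary attributes, half an hour per configuration, 40-hour weeks):

```python
# Configuration spaces blow up exponentially: 10 binary attributes give
# 2**10 possible candidate profiles to assess.
n_attributes = 10
cases = 2 ** n_attributes        # every 0/1 mix of attributes
hours = cases * 0.5              # half an hour to assess each configuration
weeks = hours / 40               # full-time 40-hour weeks

print(cases, hours, weeks)   # 1024 512.0 12.8
```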
Professor S, an economics professor who fled Poland in the WW2 era, told me he could not buy groceries in the US
because there were too many options. His wife would send him to pick up cheese, bread, coffee, cereal. An hour or
two later he would return home, flustered. There were 12 different cheeses, 10 brands of coffee, 17 breads (some
packaged, some store baked), and rolls, and .... The optimization problem drove him mad. If stores had only one
brand of a good, as in Poland, or shortages of items, your choice was simple: buy whatever they had.
Is too much choice stressing us out or not? The Guardian, Stuart Jeffries, 21 October 2015: "From jeans to
dating partners and TV subscriptions to schools, we think the more choices we have the better. But too many options
create anxiety and leave us less satisfied. Could one answer lie in a return to the state monopolies of old?”
The Paradox of Choice is more complex. See "Can There Ever Be Too Many Options? A Meta-Analytic Review of
Choice Overload," Scheibehenne, Greifeneder, Todd (Journal of Consumer Research, 2010): The choice overload
hypothesis states that an increase in # of options to choose from may lead to … decrease in the motivation to choose
or the satisfaction with the finally chosen option. A number of studies found strong instances of choice overload in the
lab and in the field, but others found no such effects or found that more choices may instead facilitate choice and
increase satisfaction. In a meta-analysis of 63 conditions from 50 published and unpublished experiments (N=5,036),
we found a mean effect size of virtually zero but considerable variance between studies. While further analyses
indicated several potentially important preconditions for choice overload, no sufficient conditions could be identified.
Riefer, Peter S., et al., "Coherency-maximizing exploration in the supermarket," Nature Human Behaviour (2017), use
big data on consumers to see how much consumers search after they have been buying a product for a long time, even when
new alternatives appear. The longer you buy Mother's Cookies, the less likely you are to use the coupon for trying
Grandma's cookies. Interpreted as brand loyalty due to subjective choice, where "tastes" move toward your choice.
Paper on Choice in Course Offerings: How does it differ among majors? Is the course shopping period a good idea?
Baumol-Gomory's Global Trade and Conflicting National Interests analyzes trade with "retainable industries" --
those where scale economies/tech advantages allow the first country in the industry to dominate. With N goods, 2
countries, and comparative advantage requiring each country to produce at least one good, there are 2^N - 2 equilibria, with zones
of economic conflict where some equilibria are good for one country but not the other.
Why 2^N - 2? The N industries can be assigned to either country as long as at least one industry goes to each country.
Absent the constraint there are 2^N possible assignments/combinations. But two of the combinations -- all to country
one or all to country two -- are invalid. The number of assignments is thus 2^N - 2. If N = 3, one country ("France") could
have industry A or B or C, or 2 of the industries (A&B, B&C, A&C), but by comparative advantage the other country
has to have at least one industry. This gives 6 possible equilibria, and lo and behold: for N = 3, 2^3 - 2 = 6.
The general formula is (N choose 1) + (N choose 2) + ... + (N choose N-1), which you can show equals 2^N - 2.
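The 2^N - 2 count can also be verified by brute-force enumeration; this sketch assigns each of N industries to one of two countries and drops the two degenerate assignments:

```python
from itertools import product

# Count valid assignments of N industries to 2 countries when each country
# must get at least one industry (the Baumol-Gomory retainable-industries setup).
def valid_assignments(n):
    splits = product([0, 1], repeat=n)       # industry i goes to country 0 or 1
    # drop the two splits where one country gets every industry
    return [s for s in splits if 0 < sum(s) < n]

print(len(valid_assignments(3)))   # 6, matching 2^3 - 2
```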
Rugged landscapes are often too complex to determine good rules for finding the optimum. And landscapes may change
over time, so today's "perfect mate" turns into tomorrow's disaster. Because fitness landscapes are complex in
economics (think of the many attributes of firms in a changing economic world) and in biology, the landscape "has
mostly been used as a superficial metaphor because we know little about its structure" (J. Arjan G.M. de
Visser and Joachim Krug, Nature Reviews Genetics, 2014). They trace modern thinking to John Maynard Smith, "Natural
selection and the concept of a protein space," Nature 225, 563-564 (1970), which developed a discrete genotypic
space with mutational pathways that pass through functional genotypes.
The 7-dimensional Calabi-Yau space below shows a complicated landscape, simplified to 4 dimensions (the diagram
rotates on a computer).
The definition of a landscape requires knowledge of the appropriate maximand, which can be difficult to pin down,
given the accounting decisions and protocols used to report profits and the confusions of personal utilities.
A landscape also needs a distance metric, which depends on where you can move, à la cellular automata. If
the allowable move is one step, you get one landscape; if it is two steps, the number of neighbors grows --> a
different landscape. Similarly, if you change the measure of inputs you get a different landscape.
One way to categorize a landscape is by its number of local optima -- points where all immediate neighbors give lower
outcomes. The greater the number of local optima, the more rugged the landscape. Ruggedness makes searching for
the best point difficult, especially if some high values sit in isolated areas with no high neighbors.
Search algorithms explore landscapes. On a concave landscape, pick a point. Examine neighboring points
-- incremental single-attribute changes -- and follow the steepest path up. Derivatives/gradient search -- hill-climbing.
Example: The space of strategies is (a,b,c,d,e), where you choose 0/1 for each letter and the value at any point is the
sum of the terms, so the max is (1,1,1,1,1) with a value of 5. Pick a starting point -- (0,1,0,0,0). Look at the neighbors
and notice that (1,1,0,0,0) does better, so you choose that point. Check again and again and you reach (1,1,1,1,1).
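The hill-climbing example above can be sketched in a few lines; here fitness is the sum of the five bits, so steepest ascent marches straight to (1,1,1,1,1):

```python
# Steepest-ascent hill climbing on the additive 5-bit landscape from the text:
# fitness = sum of the bits, unique peak at (1,1,1,1,1) with value 5.
def fitness(point):
    return sum(point)

def neighbors(point):
    # single-attribute flips define the neighborhood
    for i in range(len(point)):
        flipped = list(point)
        flipped[i] = 1 - flipped[i]
        yield tuple(flipped)

def steepest_ascent(start):
    current = start
    while True:
        best = max(neighbors(current), key=fitness)
        if fitness(best) <= fitness(current):
            return current            # no neighbor improves: at a (here global) max
        current = best

print(steepest_ascent((0, 1, 0, 0, 0)))   # (1, 1, 1, 1, 1)
```

On a concave landscape like this one, the greedy climb always reaches the global peak; on rugged landscapes it can stall at a local optimum, which motivates the alternatives discussed below.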
There is no best algorithm for all rugged landscapes. Wolpert and Macready show that for any pair of algorithms you can devise a landscape where
Algorithm B works better than A, and conversely, so there is no generally ideal algorithm. An estimate tuned to one landscape fails on another: if 4 is the max, the best estimate is 4, but 4
will not work if the max is 30. Uncertainty about the landscape --> seek procedures that do well over many landscapes
rather than one that is best for the landscape you think you are on. Good to "explore a lot before making a decision."
Many algorithms seek to find the max, but the path to the goal may also be important. The path can define a
different maximand. If you are searching over the space of dates, and you go out with people in real time, searching
for the perfect person could leave you unhappy most of your life. The algorithm that yields (0.7, 0.5, 0.8, 0.6, 0.6)
is inferior to an algorithm that yields (0.1, 0.2, 0.3, 0.1, 1.0) if the goal is the perfect 1.0. But if you are 92 when you meet the
perfect 1.0, the total score over the search period is 3.2 for the first search vs 1.7 for the second. At reasonable discount rates
and life spans, you are unlikely to reach the future 1.0s that would make searching for perfection the best strategy.
Kauffman’s NK Model
Consider three landscapes with N-input vectors, where each input represents some attribute or characteristic that
determines value. Add a geometry/moving rule that determines the nearness of the vectors to each other:
Random landscape: randomly assigns profits to the entire N-input vectors, so that there is no pattern and only
a complete search finds the global optimum.
Example: Profits depend on 10 factors, each of which is 0/1. A point has N near neighbors that you reach
by modifying one value. Any of the N + 1 points in a neighborhood (the point plus its N neighbors) could be the local maximum, so absent any
information, each point has a 1/(N+1) chance of being a local maximum. With N = 10, any combination has a 1/11
chance of being a local max, giving an expected 2^10 × (1/11) ≈ 93 local maxima, unconnected in the space.
000 0000 111 might be assigned 89; 100 0000 111 might be 6; 010 0000 111 might be 25; etc
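The expected count of roughly 93 local maxima can be checked by simulation; this sketch draws an independent uniform value for each of the 1024 points and counts the points that beat all ten one-bit neighbors:

```python
import random

# A random landscape over 10-bit strings: each point gets an independent
# uniform fitness, so by symmetry each point is the best of its 11-point
# neighborhood with probability 1/11, and we expect about 2**10/11 ≈ 93
# local maxima.
random.seed(0)
N = 10
value = [random.random() for _ in range(2 ** N)]   # one random fitness per point

def is_local_max(p):
    # neighbors differ in exactly one bit: p XOR (1 << i)
    return all(value[p] > value[p ^ (1 << i)] for i in range(N))

n_local_maxima = sum(is_local_max(p) for p in range(2 ** N))
print(n_local_maxima)   # typically close to 93
```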
Concave landscape: Sum the values of the individual inputs, where each input is either 1 or 0. The max will be
(111 1111 111) for 10 inputs. The nearest points will be vectors with nine 1s and one 0; the 2nd-nearest, vectors with eight 1s and two 0s.
Correlated landscapes are in between: nearby points have similar values. There are bundles of inputs
that work together. A correlated landscape is defined relative to a moving or neighbor rule. A neighborhood
consists of the points you reach by flipping one element of your vector.
The NK landscape builds a correlated landscape from a simple model.
N = # of factors that define your policy -- an N-dimensional vector (0/1, 0/1, ..., 0/1).
Profit associated with each factor: w(s_n).
Total profits are the sum: W(s) = Σ w(s_n) (Kauffman divides by N to get an average).
Think of each element as a person you hired for a team project, where you had a choice of a person with
personality 0 or 1. If W were a measure of your % of wins for a sports team, this makes the % of wins depend on the
individual production of each worker. But the production of each person is affected by the other persons on the team. This is
captured by a second parameter K = # of other factors that affect the profitability of any factor: w(s_n) = f(0/1
attribute of factor n, 0/1 attributes of K other factors). There are 2^(K+1) combinations that determine the profit contribution of each factor: the
2 choices of the factor's own 0/1 attribute times the 2^K possible combinations of the other factors. For example, the profitability of
SALES depends on RD and HR but not on other departments.
The K parameter builds in non-linear interactions. If you have N = 6 and K = 2, the profit contribution of
element a depends on a and 2 other elements but not on the other 3 elements. Assume that the value of a depends on a, e,
and f, and that the values of e and f depend on a and c. Then the profits associated with a, e, and f vary when you change a, but the
values of b, c, and d do not change. Scale the measure so that the profit of each unit is between 0 and 1. The starting point is:
Attribute:               a    b    c    d    e    f
Value:                   1    1    0    0    1    0
Contribution to profit: .6   .3   .5   .2   .7   .3    -- total 2.6
Now change a from 1 to 0 and get changes in the profitability of a, e, f:
Attribute:               a    b    c    d    e    f
Value:                   0    1    0    0    1    0
Contribution to profit: .5   .3   .5   .2   .9   .7    -- total 3.1
Switching a to 0 reduces its own contribution to profits but raises the contributions of e and f. The landscape's structure
comes from the consistent contributions to profits of b, c, d. The true MP of a is 3.1 - 2.6 = 0.5, not the -0.1 change in a's
individual output.
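The worked example above can be turned into a small NK sketch. The dependence pattern below (each factor depends on itself plus the next two factors, circularly) and the random contribution draws are illustrative assumptions, not Kauffman's only option:

```python
import random

# Minimal NK landscape with N = 6, K = 2. Each factor's contribution is a
# random draw in [0, 1] keyed by its own bit plus the bits of the K factors
# it depends on, so there are 2**(K+1) possible contributions per factor.
rng = random.Random(1)
N, K = 6, 2
deps = [[i, (i + 1) % N, (i + 2) % N] for i in range(N)]  # circular dependence
table = {}   # (factor, relevant bits) -> contribution, filled lazily

def contribution(i, s):
    key = (i, tuple(s[j] for j in deps[i]))
    if key not in table:
        table[key] = rng.random()
    return table[key]

def W(s):
    return sum(contribution(i, s) for i in range(N))   # Kauffman divides by N

s1 = (1, 1, 0, 0, 1, 0)
s2 = (0, 1, 0, 0, 1, 0)   # flip factor a (position 0)
# Flipping a changes a's own contribution AND the contributions of the factors
# that depend on a (here factors 4 and 5), but leaves factors 1, 2, 3 unchanged.
print(W(s1), W(s2))
```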
In a world of team production/complementary inputs it is difficult to assess an individual's MP. Most measures of
athletes are individual, but basketball has the +/- statistic that records the team's net points while the player is on the court.
Papers/citations are the usual measure of productivity for scientists but do not work that well when papers have multiple
authors. How would you assess the contributions of Arrow, Samuelson, and Mo from their paper "Optimizing Laughs: the
economics of the Three Stooges"?
K is the parameter that tunes the ruggedness of the landscape.
K = 0 is a concave landscape, since each factor affects profits independently.
K = N-1 gives a random landscape: the value of each combination is unrelated to any other. If you change one
item, the values of ALL the others can change, so neighboring configurations have unrelated values. The number of changed contributions can be as large as N.
Example: N = 5, K = 4
(1,1,0,0,1) has profits (.5, .5, .6, .7, .2) --->2.5
(0,1,0,0,1) has profits (.3, .7, .9, .1, .3) --> 2.3.
Contributions to profits at all points are based on random draws, unrelated to what a, b, c, d, e do.
Intermediate K = K* leaves the values of roughly N - K* factors alone when you flip one element, so you have a stable base of about (N-K*)/N percent of your value
function. Configurations that give big/little values cluster as neighbors, so you can't decompose the search for the global optimum
into a set of separable choices. This makes decentralization less advantageous and ORGANIZATION more important.
As K rises from 0 to N-1, you get more local peaks and the height of the average peak falls --> loss of potential
incremental gain. The chance of an isolated global peak grows.
In basketball with 5 players, imagine the coach takes one player off and puts you in, and you outscore the person
opposite you, 5 to 0. Does that mean you are a star? What if your team was outscored by the opponent? There are lots of
possible explanations. If you are tall but slow and everyone else is slow, maybe the team just has the wrong configuration.
II. Propositions and Applications
1) Going slow can be best. Take a quadratic profits function (David Kane):
Profits = Σ a_iX_i + Σ b_iX_i^2 + Σ c_ijX_iX_j, where the coefficients a, b, c are randomly selected from the interval -1 to 1. The
moving rule is to shift $ to K other inputs. Consider several search algorithms:
STEEPEST ASCENT: You look at all of your neighbors, defined as one flip away from you: if you are
00000, you look at 10000, 01000, 00100, 00010, and 00001, and you choose the one with the highest value. Then
you repeat. Rush up the hill as fast as you can -- works if the landscape is concave.
NEXT ASCENT: You look at your neighbors in some order and pick the first one that gives you a better
solution. If you are 00000 you look at 10000. If it gives you a gain, you choose it and start over. If
10000 does not add value, you look at 01000. If that one gives you a gain, you choose it and start again. If not, you
look at 00100. This widens the # of points you look at; you go up with smaller positive gains.
RANDOM MUTATION ASCENT: Randomly flip a locus and, whenever you get a gain, take the new point as your
starting point. (With continuous functions you use derivatives of various forms to search in the direction of gains.)
MEDIAN ASCENT: pick the midpoint of the gainers.
LEAST ASCENT: pick the smallest gain.
Who does best? LEAST ASCENT
(Figure: performance of the algorithms plotted against the number of connections per input.)
Why? With many local maxima, going up quickly risks getting stuck at one of them and failing to find the
global maximum. By going slowly you explore the landscape more and do not lock yourself in at the outset.
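A sketch in the spirit of Kane's comparison lets you experiment with these strategies. Assumptions here: 10 binary inputs, coefficients uniform on [-1, 1], and since the inputs are 0/1 the b·x^2 term folds into the linear term, so it is dropped; the setup is illustrative, not Kane's exact code:

```python
import random

# Random quadratic profits over binary inputs: Profits = sum a_i x_i + sum c_ij x_i x_j.
def make_profit(n, rng):
    a = [rng.uniform(-1, 1) for _ in range(n)]
    c = {(i, j): rng.uniform(-1, 1) for i in range(n) for j in range(i + 1, n)}
    def profit(x):
        return (sum(a[i] * x[i] for i in range(n))
                + sum(c[i, j] * x[i] * x[j] for (i, j) in c))
    return profit

def climb(x, profit, pick):
    # repeatedly move to an improving one-flip neighbor chosen by `pick`
    while True:
        flips = [tuple(1 - b if k == i else b for k, b in enumerate(x))
                 for i in range(len(x))]
        better = [f for f in flips if profit(f) > profit(x)]
        if not better:
            return x                     # stuck at a local maximum
        x = pick(better, profit)

steepest = lambda opts, f: max(opts, key=f)   # take the biggest gain
least = lambda opts, f: min(opts, key=f)      # take the smallest positive gain

rng = random.Random(42)
profit = make_profit(10, rng)
start = tuple(rng.randint(0, 1) for _ in range(10))
print(profit(climb(start, profit, steepest)), profit(climb(start, profit, least)))
```

Which strategy wins on any single draw varies; averaging over many random landscapes and starting points is what reveals the advantage of going slowly.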
2) If K is large you have a highly connected system where you need to change many elements to move to
better outcomes/higher peaks in the landscape --> discontinuous change.
Sweden had problems reforming its big welfare state toward more market-oriented outcomes; the US has
struggled to develop a medical system with costs comparable to those in other advanced countries; the world has
wobbled in building a safe, less predatory financial system. The Federal Reserve Bank of Dallas and others argue for smaller banks,
but others argue that it is better to have big banks because of economies of scale and knowledge.
After the fall of the Soviet empire some economists recommended a "big bang" move to markets (given the failure of
mild reforms to improve economies much during communist rule), which generally did not work, while China's
more gradual move from state planning to markets spurred huge growth.
Industrial relations focuses on the landscape for HIGH PERFORMANCE workplaces. The current wisdom is
that you must change many things simultaneously to benefit from being a "good" employer. You should have: JOB
ROTATION; TRAINING; EMPLOYEE INVOLVEMENT COMMITTEES; PROFIT-SHARING. One way of
identifying a connected system -- a correlated landscape -- is that derivatives vary with the values of other variables. Here
is a graph of the effect of "shared capitalism" (profit sharing/all-employee stock options/ESOPs) on quits (from
Freeman, Blasi, Mackin, Kruse, "Creating a Bigger Pie?" in the NBER volume on shared capitalism):
Figure 4: Contingent Effects of Shared Capitalism on Likely Turnover
(Figure: probability of being very likely to search hard for a new job in the next 12 months, 0% to 25%, plotted against the shared capitalism index, 0 to 10, for six groups: very closely supervised, no HRM policies; average supervision, no HRM policies; closely supervised, covered by HRM policies; not closely supervised, no HRM policies; average supervision, covered by HRM policies; not closely supervised, covered by HRM policies.)
3. Big K means that it takes longer to improve/you have fewer fitter neighbors.
Start with 00000 and assume that a policy with 1 in the last digit gets high profits only if you have 11001; then
you need three changes to benefit from adopting policy 5. With more connections -- larger K -- there are fewer
fitter neighbors and it takes longer to find them. Here are graphs showing: a) the (ln) time it takes to find a
better neighbor (which rises with K), and b) the # of fitter neighbors found over time (smaller the larger is K).
4. Big K --> discontinuous changes -- long periods with no improvement followed by bursts of gains, because you
need many changes to improve. If changes are random, they will occur infrequently. Say the chance of a change in a
particular factor is 50% per year and you need 5 changes to improve. The probability of all 5 changes
occurring in a year is 0.5^5 = 0.03125, so once every 32 years you get a big change. With small K, you may need only
two changes, so improvements will be more gradual.
A clinical trials simulation (Eppstein, Horbar, Buzas, Kauffman, 2012) compares the RCT, where you accept a practice if it
passes a statistical test for differences across many parts of the landscape, vs. the "quality improvement collaborative (QIC),"
where you accept changes that work in neighboring locales. "We find that a search strategy modeled after QICs yields
robust improvement in simulated patient outcomes."
Their QIC agents adopt the feature observed to yield higher survival at their locale, without regard to
statistical significance, and generally get “greater improvements in health outcomes under a wide range of conditions
than one modeled after RCTs.”
Why? “Due to a combination of reduced sensitivity to sample size (ie accepting things that work but do not
meet significance due to small n) and an increased ability for agents to respond differently in different local contexts”
(ease in finding local optimum). But “Because the exact nature of the true clinical fitness landscape is unknown, we
have examined the sensitivity of the QIC and RCT search strategies to some important model assumptions.”
Simulated Annealing [1] (aka the value of going in the wrong direction)
You have to go down before you go up -- it is easy to get locked into a local optimum on a rugged landscape. You check out
potential mates at your local high school and pick the best for you. But maybe there are more compatible folks at
college. The only way to escape is to go the wrong way: drop Mr./Ms. HS Compatible and search anew.
Simulated annealing is an algorithm for going in the wrong direction to get off a local peak and find something better.
NIST defines it as “technique to find a good solution to an optimization problem by trying random variations of the
current solution. A worse variation is accepted as the new solution with a probability that decreases as the computation
proceeds. The slower the cooling schedule, or rate of decrease, the more likely the algorithm is to find an optimal or
near-optimal solution.”
Say you want to minimize the cost function f(s). You start at S. You move to a different point s with probability:
P(s) = exp(-δ/t) -- the probability of going in the wrong, higher-cost direction, where the cost increase is δ = f(s) - f(S) > 0.
For instance, to find wages that reduce the cost of production you might raise wages in the hope that you attract better
workers/motivate current workers to work better ("efficiency wage theory"), even though in the short term costs rise: δ > 0.
t (temperature) measures the willingness to go the wrong way. Bigger t --> more willing to go the wrong way.
If δ is large, so that the move gives a big increase in cost, the probability you choose to go the wrong way is small:
you are more likely to take a small wrong-direction step that wiggles you out of a local optimum than a large one.
[1] In metallurgy, annealing involves heating and controlled cooling of a material to increase the size of its crystals and reduce defects. Heat
causes atoms to become unstuck from their initial positions and wander randomly through states of higher energy; slow cooling gives them
a chance of finding configurations with internal energy lower than the initial one.
As t --> ∞, P --> 1 for a positive deviation, so you almost surely accept a wrong-way point; as t --> 0, P --> 0 and you accept only improvements.
Example:
t = 5:   δ = 10 --> P(s) = exp(-2) = .14;    δ = 25 --> P(s) = exp(-5) = .007
t = 10:  δ = 10 --> P(s) = exp(-1) = .37;    δ = 25 --> P(s) = exp(-2.5) = .08
Holding δ fixed, as t rises, P(s) goes up; holding t fixed, as δ goes up, P(s) falls.
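These acceptance probabilities are easy to recompute directly from P(s) = exp(-δ/t):

```python
import math

# Acceptance probabilities for a wrong-way (cost-increasing) move, for the
# temperatures and cost increases in the example.
for t in (5, 10):
    for delta in (10, 25):
        print(f"t={t}, delta={delta}: P = {math.exp(-delta / t):.3f}")
```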
Note that if the point you pick lowers cost, so that δ < 0, you simply move to that point and continue. So
the algorithm never rejects a better solution; it only reduces the rejection rate of worse solutions.
Start with a big t and then reduce it so that you search widely at the outset to find potential peaks, then narrow in on
the best one. t determines how much Simulated Annealing explores new areas vs areas of known profitability.
Since the algorithm always accepts a new point that is better, you will approach a local optimum. Since you sometimes
accept a move in the other direction, you can escape from local optima. The algorithm makes no assumptions about
the function to be optimized, so it is robust across different landscapes.
Here are two descriptions of the algorithm, where a(t) reduces t over time and thus measures the extent to which you
reduce your willingness to search widely as you proceed:
https://www.mathworks.com/help/gads/examples/minimization-using-simulated-annealing-algorithm.html?
requestedDomain=www.mathworks.com
We know that simulated annealing works well by trying it on problems with known solutions and seeing how close it comes to the
optimum. It works well on the traveling salesman problem; http://www.math.uu.nl/people/beukers/anneal/anneal.html gives an applet
for this. You pick cities on a grid and the annealing program searches for the minimum distance to travel to reach all
of them. Another test problem is minimizing the Rosenbrock function, which has a minimum at (1,1) that is hard to find because it lies
in a banana-shaped valley. In Mathematica, simulated annealing is implemented as NMinimize[f, vars, Method ->
"SimulatedAnnealing"]. See http://www.sph.umich.edu/csg/abecasis/class/2006/615.19.pdf and the Mathematica discussion.
The Wikipedia discussion is good; http://katrinaeg.com/simulated-annealing.html is also a nice discussion.
A problem with this and other searches: there is no theory telling us when one works better or worse than the others.