An Application of Genetic Algorithms to Uplift Modelling

David P. Hofmeyr
Master of Science by Coursework
University of Edinburgh
2011
Declaration
I declare that this thesis was composed by myself and that the work contained
therein is my own, except where explicitly stated otherwise in the text.
(David P. Hofmeyr )
Abstract
This paper tackles the problem of uplift modelling, i.e. modelling change
in behaviour as a direct result of treatment, using randomised methods, namely
evolutionary algorithms, for both variable generation and variable selection. We
give a detailed description of the evolutionary methods involved, as well as some
of the key aspects of uplift modelling, such as the Qini coefficient and some current
modelling methods. We then apply this evolutionary approach to an example
problem published by Kevin Hillstrom in his blog (MineThatData) and discuss
how our results compare favourably with those from the winning submission.
I would like to thank my supervisor, Dr. Nicholas J.
Radcliffe, for his support, guidance and expertise,
without which this dissertation would not have been
possible. His extensive knowledge in interests we share
has been a great boon, and has aided hugely my
understanding of these subjects.
I would also like to thank Stochastic Solutions Ltd. for
the use of their uplift software.
Contents
Abstract
1 Introduction:
2 Uplift Modelling
2.1 The Basics of Uplift Modelling
2.2 Significance Based Uplift Trees
2.3 Evaluating Uplift Models (The Qini Coefficient)
3 Evolutionary Algorithms (Genetic Algorithms)
3.1 The Evolutionary Process
3.2 Genetic Programs
3.2.1 Trees
3.2.2 Terminals and Functions
3.2.3 The Initial Population
3.2.4 Evolutionary Operators for Genetic Programs
3.2.5 A Word on LISP and Symbolic Expressions
3.3 The Power of Genetic Algorithms
3.3.1 The Schema Theorem
3.3.2 Forma Analysis
3.3.3 Convergence of Genetic Algorithms
4 Application of Genetic Algorithms to Uplift Modelling and the Hillstrom Challenge
4.1 Application of Genetic Algorithms to Uplift Modelling
4.1.1 The Genetic Programming Approach
4.1.2 The Polynomial Approach:
4.2 The Hillstrom Challenge and Results
4.2.1 The Data and Questions
4.2.2 Summary of Results from Winning Submission
4.2.3 Results of the Genetic Algorithm Approach
5 Conclusion:
A Variables and Models
A.1 Men’s Model. Genetic Programming Approach
A.2 Women’s Model. Genetic Programming Approach
A.3 Men’s Model. Fitness Proportionate Selection. Combined Genetic Programming and Uplift Tree
A.4 Women’s Model. Fitness Proportionate Selection. Combined Genetic Programming and Uplift Tree
Structure. Beginning in chapter 2, we direct our attention to the basic components of uplift modelling, including the inspiration for its development, its application, some of the mechanical aspects of model building and a means for evaluating
models. In chapter 3 we turn our focus to the topic of evolutionary algorithms
(specifically genetic algorithms): their medium, structure, use and application,
with particular emphasis on genetic programming. In chapter 4 we merge the
two and show directly how the evolutionary processes described in chapter 3 can
be applied to the field of uplift modelling. We go on to apply these methods to
a challenge published by Kevin Hillstrom in his blog, MineThatData, and compare our results with those from the winning submission. Finally we conclude in
chapter 5 with a discussion of our findings, the shortcomings of our method and
advice for future implementation.
Chapter 1
Introduction:
Uplift modelling is a class of predictive modelling techniques in its relative infancy
compared with many others. It follows a lineage of modelling methods in the field
of customer relations management, particularly in the field of direct marketing,
dating back to the introduction of data mining in the 1950s.
Aimed at predicting the incremental impact of a targeted action, uplift modelling
draws on the shortcomings of its predecessors in this area. These pre-existing
techniques do well, through the use of control groups, to measure the incremental
effects of an action after the event but fail to model it predictively.
Because it is a new discipline, and because the second-order
nature of incrementality is difficult to model, there is a shortage of documented modelling techniques that predict incremental impact. In this paper we attempt
to build uplift models using evolutionary algorithms.
Inspired by John Holland’s Adaptation in Natural and Artificial Systems, the
practice of evolving solutions by “genetic” interaction has itself evolved well beyond its canonical form as presented therein.
Genetic programming allows us to build models of unbounded complexity, and
we hope to utilise this power in building uplift models. First, by developing the
models directly within the genetic programming paradigm in its pure form and
second, by combining it with an existing modelling technique in an interactive
evolutionary process, we hope to obtain a useful and effective way of developing
uplift models.
We are given a useful metric for comparing uplift models, namely the Qini coefficient, and using this we will be able to compare our findings with models built
using deterministic methods alone, as well as with some built using a simpler
evolutionary process.
Chapter 2
Uplift Modelling
Uplift is defined as a measure of the change in behaviour of an individual (or
group of individuals) as a direct result of some action. For example, in the context of marketing, the uplift associated with some campaign can be understood
as the incremental sales volume generated by it.
Uplift modelling is concerned with identifying individuals or subsets of a population for which a (usually binary) influence variable has the greatest incremental
impact. As is always the case with predictive modelling, accuracy is paramount
(i.e. we seek to have manageable error in the model), but in addition to this we
seek to isolate the effect of the influence variable and to have its predicted
effect be as non-uniform as possible. A fairly simple regression model will handle
such a variable and be able to isolate its effect adequately. However, in this case
(for a binary influence variable) the two submodels will be parallel (i.e. the predicted effect of the variable will be uniform across the population) and so will be
unable to identify those for whom the effect is greatest.
Uplift modelling was born out of the failure of traditional marketing strategy
models to recognise the distinction between “targeting people who are likely to
buy if they are included in a campaign” and “targeting people who are only likely
to buy if they are included in a campaign.” Modelling on the former will at best
do well to measure and assess incremental sales as a result of some campaign. On
the other hand, understanding the latter allows one to actually maximise them.
The traditional models concentrate on associating purchases with treatment by
some marketing campaign. While this method can certainly give evidence that a
customer was in some way influenced by the campaign, there is no guarantee of
it. It may be that such a customer would have bought whether or not they were
targeted by the campaign and so any increment in purchasing is not recognised.
Moreover, the potential for negative influencing by campaigns is completely ignored. While it may seem absurd that targeted marketing could significantly
negatively influence a potential customer’s likelihood of purchasing, this phenomenon is well documented ([1]).
In this section we will discuss some of the basic ideas behind Uplift modelling,
modelling methods that preceded it (so-called response models) and some of their
shortcomings as well as a metric for evaluating Uplift models (namely the Qini
coefficient) which will be used in our analysis later on.
2.1 The Basics of Uplift Modelling
Consider a population P divided into two disjoint subpopulations, T and C. Suppose then that members of T are exposed to some treatment, while members of C
are not (we refer to T as the treated population and C as the control population).
We will begin with the binary case and consider the outcome variable O ∈ {0, 1},
where O = 1 is seen as the “desirable” outcome.
A conventional response model attempts to model
P(O = 1 | x; T),
that is, the probability of an individual, described by the variables x, returning
the desirable outcome given that they are in subpopulation T (i.e. were treated).
Notice that this technique ignores the subpopulation C (i.e. those not treated).
Uplift models, on the other hand, instead model
U(x) := P(O = 1 | x; T) − P(O = 1 | x; C),
the increase in probability that an individual will return the desired response
given that they are in subpopulation T (i.e. were treated) over the relevant probability were they in subpopulation C (i.e. were not treated).
When O is continuous (or at least discrete but not binary) then we can instead
consider expectations. In this case, uplift models attempt to model
U(x) := E[O | x; T] − E[O | x; C],
the increase in expected value of the outcome O given that the individual,
described by the variables x, is in subpopulation T over that were they in subpopulation C.
The difficulty in modelling this arises when we realise that the uplift for an
individual is not an observable quantity, since no individual can be in both T and
C. The most obvious way to model this is to fit a response model for each of the
terms independently (that is, a model for individuals in T and another for those
in C) and then to subtract the one from the other. In theory there is nothing
intrinsically wrong with this method; in practice, however, it is not always reliable,
and can in fact be quite poor ([1]).
Instead we can consider models that impose a segmentation on the population
and base expectations on averages within those segments. In this case we encounter difficulties surrounding reliability, as estimates for every segment need to
be statistically robust and thus considerably more observations are required than
in some other modelling methods.
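The segment-based idea above can be sketched as follows. This is a minimal illustration rather than the method used later in this paper: the function name `segment_uplift` and the record layout `(segment, treated, outcome)` are assumptions made for the example. Within each segment the uplift estimate is simply the treated response rate minus the control response rate, and a segment lacking either group is skipped because its uplift cannot be estimated.

```python
from collections import defaultdict


def segment_uplift(records):
    """Estimate uplift per segment as treated rate minus control rate.

    records: iterable of (segment, treated, outcome) tuples, where treated
    is a bool and outcome is numeric (0/1 in the binary case).
    The record layout is an assumption made for this sketch.
    """
    # For each segment keep [sum of outcomes, count] for T and for C.
    sums = defaultdict(lambda: {"T": [0, 0], "C": [0, 0]})
    for segment, treated, outcome in records:
        cell = sums[segment]["T" if treated else "C"]
        cell[0] += outcome
        cell[1] += 1

    uplift = {}
    for segment, cells in sums.items():
        (t_sum, t_n), (c_sum, c_n) = cells["T"], cells["C"]
        if t_n == 0 or c_n == 0:
            continue  # uplift is not estimable without both groups
        uplift[segment] = t_sum / t_n - c_sum / c_n
    return uplift
```

Because every segment needs enough treated and control observations for both rates to be stable, the data requirement grows with the number of segments, which is the reliability concern raised above.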
2.2 Significance Based Uplift Trees
When we consider a segmentation of the population as suggested above, a natural
choice of model structure is tree-based models, as they are intrinsically based on
segments. In fact the only packaged uplift modelling software currently available
is based on a tree building algorithm, the basics of which will be discussed herein.
The key features of the tree based model are:
• Significance based splitting:
The goal of generating useful splits during the growth of the tree model is
to both maximise the difference in uplift between the two subpopulations
and minimise the difference in their sizes. In general, however, these two objectives
are at odds with one another, and satisfying both
is not trivial. The significance-based splitting criterion fits a
linear model to each of a set of potential splits and uses the significance of the
interaction term as the measure for selection. The outcome variable is modelled
as
$O_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$,
where $O_{ij}$ is the expected outcome returned by an individual in
subpopulation $i$ (either treated (T) or control (C)) on side $j$ (L or R) of
the split being considered. $\mu$ is a constant related to the mean overall
outcome across the entire population. $\alpha_i$ relates to the effect of treatment
in subpopulation $i$ (naturally we set $\alpha_C$ to be zero). $\beta_j$ quantifies the effect
of the split by indicating the base difference in outcome between the two
sides. Finally, $\gamma_{ij}$ measures the interaction between the treatment and the
split.
There is no loss of generality in setting
$\beta_L = \gamma_{CL} = \gamma_{CR} = \gamma_{TL} = 0$,
which leaves $\gamma_{TR}$ as the difference in uplift between the subpopulations
on either side of the split, which is precisely the quantity of interest. The usefulness of taking the significance of this estimate rather than its magnitude
as a selection criterion lies in the fact that the t-statistic on which it is
based implicitly combines both magnitude and population size.
• Variance-based pruning
Modelling uplift is more difficult than modelling many other
quantities for three reasons: overall uplift due to the treatment is often
small compared with other relationships in the data; the control population
is often significantly smaller than the treated; and uplift is second-order in
nature (i.e. it is a measure of the change in outcome, rather than a
basic measure of magnitude of an observable quantity), so larger
errors tend to arise in estimation. As a result, stability of models is often
difficult to achieve.
Radcliffe ([1]) recommends resampling the training population k times (k =
8 is common), training on the first sample and evaluating the stability with
reference to the other k − 1 samples. At each node in the model, the uplift
is measured in the first sample and its standard deviation is estimated from
the other k − 1 samples. Any split for which a child node exhibits standard
deviation greater than some threshold is discarded, along with the subtree
descending from it.
• Bagging
The practice of bagging is a further concession to the rife instability inherent
in building uplift models. It refers to building a number of models (typically
10 or 20) as described above, each using different resamplings, and basing
predictions on the averages achieved over all the models. Radcliffe ([1])
notes that this approach can succeed in cases where building only a single
tree failed.
• Pessimistic qini-based variable selection
The major considerations regarding variable selection in any model building technique are reducing dimensionality (particularly to prevent overfitting on
the data used for building the model), avoiding multicollinearity (strong
correlations between predictor variables can lead to unstable results and
to misinterpretations of the relationships in the data that
are recognised by the model), improving model quality and stability, and
improving model interpretability.
Again the stability issue is more apparent in uplift modelling than in general.
Basing variable selection on pessimistic qini-based estimates refers to ranking the candidate variables according to a quality measure (the Qini coefficient, see below) and choosing either a fixed number of variables, or all
which are above a specified threshold level. Radcliffe ([1]) recommends an
adjustment to the basic Qini coefficient to be more pessimistic and, in doing
so, reduce the likelihood of selecting variables which might lead to instability in the model. As in the case of bagging, this is done by resampling the
training population repeatedly and subtracting a multiple of the standard
deviation of Qini values obtained from the initial Qini estimate.
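Two of the ingredients listed above lend themselves to a short sketch. The code below is an illustration under stated assumptions, not the implementation used by the packaged software: the function names, the `cells` dictionary layout and the default `multiple` are all hypothetical. The first function computes the t-statistic of the interaction term γ_TR for a single candidate split, using the fact that for this four-cell linear model the least-squares estimate of γ_TR is the difference between the uplift on the right and the uplift on the left, with variance equal to the residual variance times the sum of reciprocal cell sizes. The second applies the pessimistic adjustment by subtracting a multiple of the standard deviation of resampled Qini values from the initial estimate.

```python
import math
import statistics

CELLS = (("T", "L"), ("T", "R"), ("C", "L"), ("C", "R"))


def interaction_t_statistic(cells):
    """t-statistic of gamma_TR in O_ij = mu + alpha_i + beta_j + gamma_ij.

    cells maps (group, side) to a list of observed outcomes, with group in
    {"T", "C"} and side in {"L", "R"}. Every cell is assumed non-empty and
    the residual variance is assumed to be positive.
    """
    means = {}
    n_total = 0
    sse = 0.0            # residual sum of squares around the cell means
    inverse_sizes = 0.0  # sum of 1 / n_ij over the four cells
    for key in CELLS:
        values = cells[key]
        mean = sum(values) / len(values)
        means[key] = mean
        sse += sum((v - mean) ** 2 for v in values)
        n_total += len(values)
        inverse_sizes += 1.0 / len(values)

    # Difference in uplift between the right and left sides of the split.
    gamma_tr = (means["T", "R"] - means["C", "R"]) - (means["T", "L"] - means["C", "L"])
    residual_variance = sse / (n_total - 4)  # four fitted parameters
    return gamma_tr / math.sqrt(residual_variance * inverse_sizes)


def pessimistic_qini(initial_qini, resampled_qinis, multiple=1.0):
    """Initial Qini estimate minus a multiple of the resampled standard deviation."""
    return initial_qini - multiple * statistics.stdev(resampled_qinis)
```

A split with a large uplift difference but a tiny cell receives a small t-statistic because of the reciprocal cell sizes in the denominator, which is how the criterion balances magnitude against population size.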
2.3 Evaluating Uplift Models (The Qini Coefficient)
We have discussed a little of what is desirable in an uplift model, but
given some collection of models how do we determine which is “best”? There is a
tendency in practice to resort to ad hoc methods, most often graphical. Several
authors suggest comparing uplift in the top k deciles to the overall uplift; however, different choices of k can reverse the comparison ([2]), and without some
convention for choosing a cut-off level this method does not appear very robust.
This method also ignores the opposite end of the spectrum, those for which uplift
is considerably lower than the overall, or is even negative. Moreover, it is very
possible, and has been observed in practice ([2]), that decile segmentation does
not generate any statistically significant differences and a more coarse segmentation is needed.
Radcliffe ([2]) proposes the Qini coefficient, an analogue of the Gini coefficient
([11]) for uplift. Where the Gini coefficient measures the (anti-)correlation between outcome and targeting depth (based on an ordering induced by the model),
the Qini instead measures the (anti-)correlation between uplift rate and targeting depth.
Let M be a model that induces a total ordering on a population, divided into T
and C as above. Define $f_M^T : [0, 1] \to \mathbb{R}$, where $f_M^T(x)$ is the response rate at depth x,
according to the ordering induced by M for the treated population. Similarly,
define $f_M^C$ for the control population. (Note that for a finite population, f will
not be defined over the continuous interval [0, 1]; in such a case we consider the
piecewise continuous approximation formed by spreading the probability density
associated with discrete scores over the equivalent interval of continuous ranks.)
We can then define the predicted uplift pointwise, for a given depth x,
\[ u_M(x) := f_M^T(x) - f_M^C(x). \]
And with this we can define the cumulative outcome and uplift rates
\[ F_M^T(x) = \int_0^x f_M^T(y)\,dy, \]
\[ F_M^C(x) = \int_0^x f_M^C(y)\,dy, \]
\[ U_M(x) := F_M^T(x) - F_M^C(x). \]
Consider a model R which imposes a random ordering on the population.
We have
\[ E[U_R(x)] = \bar{u}x, \quad \forall x \in [0, 1], \]
where $\bar{u} = \int_0^1 u(x)\,dx$ is the average uplift in the population. The Qini coefficient, Q, is defined by the area between $U_M$ and this diagonal, i.e. the average
excess cumulative uplift resulting from the model M, normalised by the equivalent value for the best possible ordering.
Radcliffe acknowledges ([2]) that a model inducing such a best possible ordering
might not even be theoretically achievable: for example, no model can separate
individuals described identically by the predictor variables whose outcomes are
independent of treatment. To incorporate this, he also defines q0, which is based
on the “best case” model with no negative uplift.
Note that the transformation from Q to q0 is monotone, and so the choice of
one over the other is essentially down to interpretation.
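The cumulative uplift curve and the area it encloses with the diagonal can be sketched numerically as below. This is a rough illustration only, resting on several assumptions that are not taken from ([2]): the function names are hypothetical, the treated and control groups are each ranked separately by the model score, depths are taken on an evenly spaced grid, the area is approximated with the trapezium rule, and the normalisation by the best possible ordering is omitted, so the value returned is an unnormalised Qini-style area rather than Q or q0.

```python
def cumulative_uplift(scores_t, outcomes_t, scores_c, outcomes_c, steps=10):
    """Return (depths, U) where U[i] approximates U_M at depth depths[i].

    Each group is ranked by descending score; the cumulative rate at depth x
    is the outcome total in the top x fraction divided by the group size.
    """
    def cumulative_rate(scores, outcomes, depth):
        ranked = [o for _, o in sorted(zip(scores, outcomes), key=lambda p: -p[0])]
        top = int(round(depth * len(ranked)))
        return sum(ranked[:top]) / len(ranked)

    depths = [i / steps for i in range(steps + 1)]
    uplift = [
        cumulative_rate(scores_t, outcomes_t, x) - cumulative_rate(scores_c, outcomes_c, x)
        for x in depths
    ]
    return depths, uplift


def unnormalised_qini(depths, uplift):
    """Trapezium-rule area between the uplift curve and the random diagonal."""
    overall = uplift[-1]  # cumulative uplift at full depth
    excess = [u - overall * x for x, u in zip(depths, uplift)]
    return sum(
        (excess[i] + excess[i + 1]) / 2 * (depths[i + 1] - depths[i])
        for i in range(len(depths) - 1)
    )
```

A curve that lies exactly on the diagonal gives an area of zero, matching the expected behaviour of the random ordering R described above.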
Chapter 3
Evolutionary Algorithms
(Genetic Algorithms)
The concept behind evolutionary algorithms stems from our recognition of nature’s ability to produce (evolve) species and individuals which appear well adapted
to their environment. Regardless of how harsh an environment may be, life finds
a way to propagate, and even flourish.
The evolutionary algorithm in its essence is a search method (heuristic) which
draws on ideas akin to Darwinian evolution to produce increasingly appealing
(fit) solutions to a problem. The power of evolutionary algorithms (specifically
genetic algorithms) was initially formalised in John Holland’s pioneering work
([3]), in which he describes the notion of schema analysis. Schema analysis operates within the genotype (representation) space and shows how individuals sharing
similarities in genetic structure (schemata) which display sufficiently above average fitness* have a tendency to proliferate. While this has given us tremendous
insight into the usefulness of genetic algorithms, a crucial limitation of schema
analysis lies in the specificity of the genotype space, that is the requirement that
“chromosomes” (representations) be linearly arranged strings of a fixed number
of genes, each of which comes from a predefined set of alleles. Greene ([4]) discusses the usefulness of considering nonlinear arrangements of chromosomes and
how Holland’s Schema Theorem can be extended to the case of a general connected graph arrangement. Forma analysis ([5]) tackles very similar ideas from
a much more general perspective and allows for a far more universal application.
Formae refer to equivalence classes implied by any equivalence relation over the
phenotype (solution) space, and so similarities between subscribing individuals
need not be restricted to schemata, thus many of the limitations associated with
schema analysis are handily avoided.
In this chapter we will introduce and briefly discuss some important issues and
ideas in the scope of evolutionary algorithms that are relevant to this paper. We
will then delve into a slightly more detailed account of Genetic Programming, a
particular type of evolution. Finally we will touch on some theory showing the
effectiveness of genetic algorithms.
* The level of fitness above average depends on the nature of the schema in terms
of its length and degree of specificity.
3.1 The Evolutionary Process
In this section we outline the broad structure of the evolutionary process
associated with the algorithms. We begin with some necessary terminology.
Definition The Search Space, S, is the set of all possible solutions to a problem.
S is also referred to as the solution space and the phenotype space.
Definition The Representation Space, G, is the set of all (gene) representations
of members of S.
G is also referred to as the genotype space and the chromosome space.
Definition The Solution Map, F : G → S, associates each element of the representation space with its associated solution.
F should be onto, i.e. $\forall s \in S : F^{-1}(\{s\}) \neq \emptyset$. In other words, every possible
solution has a representation in G. Ideally we would also want F to be one
to one, i.e. $\forall s \in S : |F^{-1}(\{s\})| = 1$, where | · | denotes the cardinality of a set.
However, this uniqueness of representations is, in general, not satisfied.
Definition The Evaluation Function, $E : S \to \prod_{i \in I} P_i$, where I is some index
set, maps from the solution space to a product of totally ordered sets.
For the purposes of this paper we only consider real valued evaluation functions and refer to E(s) as the fitness of s. Weise devotes a significant part of ([6])
to showing how higher dimensional evaluation functions can be reduced by the
use of Pareto Dominance. In the field of optimisation these higher dimensional
functions usually arise in the presence of multi-objective problems.
Definition An Evolutionary Process is a 5-tuple, (S, G, O, F, E), where S is the
solution space, G is the representation space, F is the solution map, E is the
evaluation function and O is a collection of so-called evolutionary operators.
Evolutionary operators take collections of elements from the representation
space (or solution space), along with some parameter(s) (from some predefined
control set) and return a collection of elements from the representation space
(or solution space). The nature of the control set depends on the operator and
should be designed in a way that, for a given operator, collection of operands and
parameters the output is unique.
In this paper we will consider three types of evolutionary operators. Firstly,
selection operators, which take an entire population of individuals and return
some subset of that population. Selection operators can be used to determine
breeding pools (collections of individuals chosen to undergo recombination) as
well as for culling populations in order to control the population size.
Definition An adjustment to the culling operator referred to as elitism ensures
that individuals defined as elite cannot be selected for removal from the population.
In general elitism refers specifically to retaining the “fittest” member of the
population; however, we prefer this slightly more general definition, which allows
retention of any predefined set of individuals.
Secondly, recombination operators, which take pairs of individuals and return
(according to the control parameters) some number of “child” solutions. Much
of ([5]) is devoted to useful properties of recombination operators in the field of
forma analysis. In general these properties require background beyond the scope
of this paper, however one such property which does not require any background
knowledge and which is important in the context of convergence is the notion of
purity.
Definition A recombination operator, R, is said to be Pure if
$\forall g \in G,\ c \in C_R : R(g, g, c) = g$,
where $C_R$ denotes the control set of R. That is, a recombination operator is pure if the offspring of (genetically) identical parents are identical to those parents, regardless of the control parameters.
Finally, mutation operators, which take single individuals and return, according again to control parameters, another single individual. In general mutation
will change the make-up of an individual, and may result in an individual which
possesses characteristics otherwise not present in the population. It is this that
is both the bane and the boon of genetic algorithms. Mutation allows for a much
broader search of the solution space and so the power of an algorithm is greatly
amplified by it. On the other hand, it is a great thorn in the side of analysts as
understanding convergence becomes far more difficult.
Note: Recombination and Mutation are also referred to as genetic operators.
The mechanism of the evolution process is the iterative application of evolutionary operators to some initial subset of the representation space. The age of
the computer has allowed us to perform absurdly many simple tasks in a short
space of time and it is this power that allows us to utilise evolutionary algorithms
in a meaningful way. Of course we cannot emulate the vastness of time that life
has taken to evolve from single celled organisms to the biologically diverse and
complex world we live in, but we are able to observe improvements in even the
most complex of problems.
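The iterative application of selection, recombination and mutation described in this section can be sketched as a short loop. Everything named here is an assumption made for illustration: `evolve`, `one_point` and `flip` are hypothetical helpers, individuals are tuples of bits, selection is simple truncation to the fitter half, and elitism retains only the single fittest individual. The random number generator plays the role of the control parameters, so that for a fixed seed the output of each operator is determined.

```python
import random


def evolve(population, fitness, recombine, mutate, generations, rng):
    """Apply selection, recombination and mutation for a number of generations."""
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: never cull the fittest
        ranked = sorted(population, key=fitness, reverse=True)
        breeding_pool = ranked[: max(2, len(ranked) // 2)]  # truncation selection
        children = []
        while len(children) < len(population) - 1:
            mother, father = rng.sample(breeding_pool, 2)
            children.append(mutate(recombine(mother, father, rng), rng))
        population = [elite] + children  # population size is kept constant
    return population


def one_point(a, b, rng):
    """One-point crossover; pure, since one_point(g, g, rng) == g for any cut."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip(g, rng, rate=0.2):
    """With probability rate, flip one randomly chosen bit."""
    if rng.random() >= rate:
        return g
    i = rng.randrange(len(g))
    return g[:i] + (1 - g[i],) + g[i + 1:]
```

With fitness taken to be the number of ones in the string, for example `evolve(start, sum, one_point, flip, 30, random.Random(0))`, the best fitness in the population can never decrease from one generation to the next, because the elite individual is always carried forward.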
3.2 Genetic Programs
Genetic programs are a very specific type of genetic algorithm in which the individuals undergoing evolution are, themselves, computer programs. The relevance
of genetic programming to this project is that the variables built in our evolution of uplift models can be expressed as tree-based structures like those used to
describe the computer programs herein. As such, the mechanical workings of the
evolutionary processes of the two are essentially the same.
Koza ([7]) discusses how extremely varied problems can be solved by the discovery of a computer program which produces a particular output from a collection of inputs. It will be this ability to build highly non-smooth transformations
(ones which might not be possible using conventional modelling methods nor
simpler evolutionary processes) that will be useful in constructing models and
variables in our application of genetic programming to uplift modelling.
3.2.1 Trees
As mentioned above, the medium for representation of computer programs is via
tree graphs and we will now devote a small amount of time to defining the nature
of these trees and how the evolution thereof takes place.
Definition A tree is a connected directed graph with no cycles. That is, it is a
graph in which any two nodes are connected by a unique path*.
*We use path to mean a sequence of nodes, with an edge connecting each pair of consecutive nodes, that does not visit the same node more than
once. This is sometimes referred to as a simple path.
In genetic programming we consider hierarchical trees, in which order of nodes
is important and we refer to immediate superiors (i.e. those adjacent and higher
up the hierarchy) as parent nodes, and immediate inferiors as child nodes.
Definition We refer to a node with no child nodes as a terminal, a node with
no parent nodes as a root and a node with at least one parent and at least one
child node as an intermediate node.
Definition The descendants of a node N include all the children of N as well as
all descendants of those children.
Definition A subtree of a tree T subtended by a node N is the tree consisting of
N and all the descendants of N, connected in the same manner as they are in T.
Definition The depth of a tree is the greatest distance (in terms of number of
nodes passed through) along a path from a root node to a terminal node.
Figure 3.1: Tree
In the above tree, the node labelled “1” is the root node; it has 2 child nodes,
“2” and “3”. Nodes “2”, “4” and “5” are terminal nodes. “3” is an intermediate
node and is the parent of “4” and “5”. The tree has depth 3 and size 5.
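The definitions above can be checked against the example tree with a small sketch. The representation is an assumption made for illustration: each node is a pair of a label and a list of child nodes, and depth is counted in nodes along the longest path from the root to a terminal, as in the definition given earlier.

```python
def size(tree):
    """Number of nodes in the tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def depth(tree):
    """Number of nodes on the longest root-to-terminal path."""
    _, children = tree
    return 1 + max((depth(child) for child in children), default=0)


def terminals(tree):
    """Labels of the nodes that have no children, left to right."""
    label, children = tree
    if not children:
        return [label]
    return [t for child in children for t in terminals(child)]


# The tree described in the text: "1" has children "2" and "3";
# "3" has children "4" and "5".
figure_tree = ("1", [("2", []), ("3", [("4", []), ("5", [])])])
```

For `figure_tree`, `size` returns 5, `depth` returns 3 and `terminals` returns the labels "2", "4" and "5", in agreement with the description of the figure.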
3.2.2 Terminals and Functions
Of crucial importance to the evolutionary process is the correct selection of potential terminals and intermediate (and root) nodes. In order to have the output
of our programs vary according to input (something which is obviously desirable
as a constant output is far from interesting) we include in the set of potential
terminals a collection of variables to represent the inputs, as well as any necessary
interpretations thereof (such as the state of some system in the presence of the
input variables). One can also include numerical values, the boolean values True
and False, etc. Especially in the case where different inputs range on very different scales it is sometimes useful to include a constant terminal which can relate
one to the other by some operator (e.g. dividing the one with the larger scale
or multiplying the one with the smaller by some appropriately sized constant).
Intermediate nodes contain elements from a set of functions, usually comprised
of some collection of:
• arithmetic operators (+, -, *, etc.),
• mathematical functions (trigonometric functions, logarithms, etc.),
• boolean operators (AND, OR, etc.),
• conditional operators (IF-THEN-ELSE, etc.),
• relations (=, ≤, etc.),
• any functions that induce iteration or recursion, etc.
The representation space is then the collection of all potential structures that
can be composed recursively from the collection of terminals (T ) and functions
(F ). We will denote this space G(T,F ) .
The terminal and function sets should also be designed so as to satisfy the closure
and sufficiency properties.
Definition A terminal, function pair (T, F ) is said to be closed if each f ∈ F is
defined over T as well as all possible outputs of members of F .
An example of a non-closed pair is any with function set containing both −
and / since for any t1 , t2 ∈ T , t1 /(t2 − t2 ) does not exist. In other words, the
function / is not defined for the pair of values (t1, t2 − t2) = (t1, 0). Mathematical
functions such as square root and logarithm are not defined for negative values,
and so here too possible non-closure is a problem. In such instances it is common
to replace these functions with ones that are equivalent where the original function is
defined, but take on some other form elsewhere (often merely taking the value
zero).
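A minimal sketch of such replacement functions is given below. The names `protected_div`, `protected_log` and `protected_sqrt` are hypothetical, and returning zero outside the original domain is just the simple convention mentioned above; other fallback values are equally possible.

```python
import math


def protected_div(a, b):
    """Ordinary division where defined; zero when the divisor is zero."""
    return a / b if b != 0 else 0.0


def protected_log(a):
    """Natural logarithm for positive arguments; zero otherwise."""
    return math.log(a) if a > 0 else 0.0


def protected_sqrt(a):
    """Square root for non-negative arguments; zero otherwise."""
    return math.sqrt(a) if a >= 0 else 0.0
```

With these in the function set, an expression such as t1 / (t2 − t2) evaluates to zero rather than failing, so every function is defined on every possible output of the others.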
Definition A terminal, function pair (T, F ) is sufficient for a problem P if the
solution to P lies in G(T,F ) .
This point may appear trivial; however, identifying the requirements of
the problem is not always straightforward. Borrowing an example from ([7]), Kepler’s
Third Law, which was discovered in 1618, states that the cube of a planet’s
distance from the sun is proportional to the square of its period around the sun.
If F = {+, −} then a computer program that predicts the period of a planet
around the sun could not result. Similarly, without knowledge of Kepler’s Law,
one would not know that distance from the sun is the sole predictive variable
when determining the period of a planet and so might construct a terminal set
comprised of only information about the planet itself; say diameter, density and
rotational speed, which would not be sufficient for the problem.
3.2.3 The Initial Population
The genetic programming evolutionary process produces increasingly complex individuals due to the nature of the evolutionary operators used. Initialising the
process with the inaugural population is an important step in the process and
there are two commonly used methods for generating initial structures, namely
“grow” and “full”.
The “grow” method creates trees of varying shape for which the distance between
a single root node and any terminal node is no greater than some predetermined
depth. Starting with a randomly selected root node (from the set T ∪ F ), whenever a node is added that is in F , if the current depth is strictly less than the
maximum, attach a number of randomly selected nodes (from T ∪ F ) equal to the
number of arguments that the node takes. If the depth is equal to the maximum,
instead append nodes only from T .
The “full” method creates trees which are dense, i.e. the distance from the
single root node to every terminal node is equal to some predetermined depth.
They are more uniform in shape and size than grown trees, with variations only
arising as a result of functions taking a different number of arguments. They are
constructed in the same way as ”grown” trees, except when a function node is
added, if the current distance to the root is strictly less than the predetermined
depth then only nodes from F are appended.
The “ramped half-and-half” method suggested by Koza ([7]) generates an initial population using a combination of the two generative methods. For every
depth between 2 and some chosen maximum, the method generates an equal
number of grown and full trees of that depth and adds them to the population.
This creates an initial population varied in both size and shape.
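The three initialisation methods can be sketched in Python as follows. This is only an illustrative sketch: trees are represented as nested tuples, the terminal and function sets are hypothetical, and for "full" trees the root is assumed to be drawn from F so that every terminal sits at the chosen depth.

```python
import random

# Hypothetical terminal and function sets; each function maps to its arity.
TERMINALS = ["x", "y", "1"]
FUNCTIONS = {"+": 2, "*": 2, "neg": 1}


def make_tree(max_depth, full, rng, depth=0):
    """Build a tree as nested tuples; a terminal is a plain string."""
    if depth == max_depth:
        return rng.choice(TERMINALS)  # at the depth limit: terminals only
    if full:
        name = rng.choice(list(FUNCTIONS))  # "full": functions only above the limit
    else:
        name = rng.choice(TERMINALS + list(FUNCTIONS))  # "grow": anything in T u F
    if name not in FUNCTIONS:
        return name
    return (name,) + tuple(
        make_tree(max_depth, full, rng, depth + 1) for _ in range(FUNCTIONS[name])
    )


def depth_of(tree):
    """Distance from the root to the deepest terminal."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth_of(child) for child in tree[1:])


def ramped_half_and_half(max_depth, per_depth, rng):
    """For each depth from 2 to max_depth, add equal numbers of grown and full trees."""
    population = []
    for d in range(2, max_depth + 1):
        for _ in range(per_depth):
            population.append(make_tree(d, full=False, rng=rng))
            population.append(make_tree(d, full=True, rng=rng))
    return population


population = ramped_half_and_half(max_depth=4, per_depth=3, rng=random.Random(0))
print(len(population))
```

Passing the random generator explicitly keeps a run reproducible, which is convenient when comparing initial populations.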
3.2.4
Evolutionary Operators for Genetic Programs
For the purposes of this paper we consider the following recombination operator:
R : G × G × Z₊² → G; where R(g1 , g2 , n, m) is g1 with the subtree subtended
by the n(mod size(g1 ))th node replaced with the subtree of g2 subtended by the
m(mod size(g2 ))th node. Here the size of a tree is the number of nodes it contains,
and the nodes are numbered depth first (since in general the control operands are
chosen randomly, the numbering is in this case arbitrary. In practice the set of
control operands is bounded since generating random elements from an infinite
set is problematic). Notice that in this case the control set is Z₊².
Figure 3.2: Crossover
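A minimal Python sketch of this recombination operator follows, reading R(g1, g2, n, m) as g1 with one of its subtrees replaced by a subtree of g2. Trees are assumed to be nested tuples of the form (function, child, ...) with terminals as strings, and the helper names are hypothetical.

```python
def children(tree):
    """Child subtrees; a terminal (a plain string) has none."""
    return () if isinstance(tree, str) else tree[1:]


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in children(tree))


def subtree(tree, i):
    """Subtree rooted at the i-th node in depth-first (pre-order) numbering."""
    if i == 0:
        return tree
    i -= 1
    for child in children(tree):
        n = size(child)
        if i < n:
            return subtree(child, i)
        i -= n
    raise IndexError("node index out of range")


def replace(tree, i, new):
    """Copy of tree with the subtree rooted at node i replaced by new."""
    if i == 0:
        return new
    i -= 1
    kids = list(children(tree))
    for k, child in enumerate(kids):
        n = size(child)
        if i < n:
            kids[k] = replace(child, i, new)
            return (tree[0],) + tuple(kids)
        i -= n
    raise IndexError("node index out of range")


def recombine(g1, g2, n, m):
    """R(g1, g2, n, m): graft the m-th subtree of g2 onto the n-th node of g1."""
    donor = subtree(g2, m % size(g2))
    return replace(g1, n % size(g1), donor)


g1 = ("+", ("*", "a", "b"), "c")
g2 = ("-", "d", ("/", "e", "f"))
print(recombine(g1, g2, 1, 2))
```

Because the indices are reduced modulo the tree sizes, any pair of non-negative integers is a valid control operand, and neither parent is modified.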
Mutation will be defined as:
M : G × G × Z → G; where M(g, h, n) is g with the subtree subtended by the
n(mod size(g))th node replaced with h. We have taken a slight liberty in the
notation here since in general the size of h will depend on g and n in that we
will restrict the depth of h to the depth of the subtree it replaces. We would
prefer not to have operands depend on one another in this way, however there
should be no loss of interpretation. Furthermore, because again the operands are
chosen randomly, it would be possible to insert some pruned version of h in the
case where it is larger than the desired size. This would bias its size upwards,
but would actually result in mutated trees closer in size to the original. The
control set in this case is G × Z.
Figure 3.3: Mutation
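The mutation operator can be sketched in Python along the same lines. In this sketch the replacement subtree h is generated by a "grow"-style routine whose depth limit is the depth of the subtree being replaced; the terminal and function sets and all helper names are hypothetical.

```python
import random

TERMINALS = ["a", "b", "c"]   # hypothetical terminal set
FUNCTIONS = {"+": 2, "*": 2}  # hypothetical function set with arities


def children(t):
    return () if isinstance(t, str) else t[1:]


def size(t):
    return 1 + sum(size(c) for c in children(t))


def depth(t):
    kids = children(t)
    return 1 + max(depth(c) for c in kids) if kids else 0


def subtree(t, i):
    """Subtree rooted at the i-th node in depth-first numbering."""
    if i == 0:
        return t
    i -= 1
    for c in children(t):
        n = size(c)
        if i < n:
            return subtree(c, i)
        i -= n
    raise IndexError("node index out of range")


def replace(t, i, new):
    """Copy of t with the subtree rooted at node i replaced by new."""
    if i == 0:
        return new
    i -= 1
    kids = list(children(t))
    for k, c in enumerate(kids):
        n = size(c)
        if i < n:
            kids[k] = replace(c, i, new)
            return (t[0],) + tuple(kids)
        i -= n
    raise IndexError("node index out of range")


def grow(max_depth, rng):
    """Random tree whose depth does not exceed max_depth."""
    if max_depth == 0:
        return rng.choice(TERMINALS)
    name = rng.choice(TERMINALS + list(FUNCTIONS))
    if name not in FUNCTIONS:
        return name
    return (name,) + tuple(grow(max_depth - 1, rng) for _ in range(FUNCTIONS[name]))


def mutate(g, n, rng):
    """M(g, h, n) with h generated no deeper than the subtree it replaces."""
    i = n % size(g)
    h = grow(depth(subtree(g, i)), rng)
    return replace(g, i, h)


print(mutate(("+", ("*", "a", "b"), "c"), 1, random.Random(1)))
```

Limiting the depth of h in this way means a mutated tree is never deeper than its parent.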
3.2.5
A Word on LISP and Symbolic Expressions
LISP (LISt Processing) is a programming language based on two types of entities: atoms and lists. Atoms in LISP are things like constants, like the number
7, and variables, like TIME. Lists in LISP are ordered sets of items enclosed in
parentheses. Examples of lists are (A 3 TIME) and (< 2 7).
A Symbolic Expression (S-expression) is a list or atom in LISP. S-expressions
are the only syntactic form in LISP, and so the programs of LISP are themselves
S-expressions.
LISP evaluates constant atoms to themselves (e.g., 7), and variable atoms to their
current value (e.g., TIME). LISP evaluates lists by treating the first element as a
function and applying it to the evaluations of the remaining elements of the list.
These elements may be atoms, or lists themselves.
For example, the S-expression (< (+ 1 (* 2 3)) 4) is evaluated by LISP as applying the comparison function “<” to the entities (+ 1 (* 2 3)) and 4, which it
evaluates in turn. The first is evaluated as applying the numerical operator “+”
to the entities 1 and (* 2 3). The second of these is evaluated as applying the
numerical operator “*” to the entities 2 and 3. Working back up this chain, LISP
evaluates *(2, 3) = 2 * 3 = 6; +(1, 6) = 1 + 6 = 7; and finally <(7, 4) = 7 < 4
= False.
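The evaluation rule just described can be mimicked in a few lines of Python. This is only a sketch of the rule, not of a LISP interpreter: S-expressions are written as nested tuples, and the function table is a hypothetical one containing just the operators used in the example.

```python
import operator

# Hypothetical function table; only the operators needed for the example.
FUNCS = {"+": operator.add, "*": operator.mul, "<": operator.lt}


def evaluate(expr, env=None):
    """Evaluate an S-expression written as nested Python tuples."""
    env = env or {}
    if isinstance(expr, tuple):
        func = FUNCS[expr[0]]                       # first element names the function
        args = [evaluate(arg, env) for arg in expr[1:]]
        return func(*args)
    if isinstance(expr, str):
        return env[expr]                            # variable atom: current value
    return expr                                     # constant atom: itself


print(evaluate(("<", ("+", 1, ("*", 2, 3)), 4)))
print(evaluate(("*", "TIME", 2), {"TIME": 5}))
```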
Symbolic Expressions as Trees
In order for the programs we are evolving as trees to be understood by a computer,
we need to express them in a language that the computer can interpret. There is
a simple transformation from the tree structure to S-expression as follows:
If the root is a terminal
return the atom equivalent
else
create a list with first element equivalent to the root
for each subtree subtended by child nodes, repeat process
append.
Example Consider the tree depicted below
Figure 3.4: Example Tree
Root node “+” is not a terminal
create list (+)
first child “*” is not a terminal
create list (*)
first child “a” is a terminal
append (* a)
second child “b” is a terminal
append (* a b)
append (+ (* a b))
second child “Q” is not a terminal
create list (Q)
first child “*” is not a terminal
create list (*)
first child “c” is a terminal
append (* c)
second child “d” is a terminal
append (* c d)
append (Q (* c d))
append (+ (* a b) (Q (* c d)))
So the tree above has S-expression (+ (* a b) (Q (* c d))), where Q represents
a function which takes only one argument.
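The transformation can be written as a short recursive function. The sketch below assumes a hypothetical Node class with a label and a list of children, and reproduces the example tree above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node; a node without children is a terminal."""
    label: str
    children: list = field(default_factory=list)


def to_sexpr(node):
    """Return the S-expression of the tree rooted at node as a string."""
    if not node.children:
        return node.label  # a terminal becomes an atom
    parts = [node.label] + [to_sexpr(child) for child in node.children]
    return "(" + " ".join(parts) + ")"


tree = Node("+", [
    Node("*", [Node("a"), Node("b")]),
    Node("Q", [Node("*", [Node("c"), Node("d")])]),
])
print(to_sexpr(tree))  # (+ (* a b) (Q (* c d)))
```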
3.3
The Power of Genetic Algorithms
As mentioned before, the power of genetic algorithms was initially formalised
by John Holland in the form of schema analysis. In this section we will cover
briefly the notion of schema and how the Schema Theorem shows how improvements in the evolution arise. We will then introduce Formae as alternative ideas
to schema and how forma analysis yields similar improvement results. Finally,
we will touch on some ideas of convergence of evolutionary algorithms from a
topological perspective.
3.3.1
The Schema Theorem
Definition Let G be a representation space in which individuals are strings of
some fixed length. A Schema, H, is a template which defines a subset of G
based on its members sharing genes at specific positions. The defining length
of a schema, δ(H), is the difference in index between the first and last specified
position. The order of a schema, o(H), is the number of specified positions.
For example, if G is the collection of binary strings of length 6, then 1∗0∗∗0 (where ∗ marks an unspecified position)
is the schema consisting of all elements of G with 1 in the first position and zero
in the third and sixth positions. It has order 3 and defining length 6 − 1 = 5.
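These quantities are easy to compute. The sketch below writes the schema of the example as the string "1*0**0", using "*" for an unspecified position (an assumed notation), and checks its order, its defining length and the number of strings that belong to it.

```python
from itertools import product

WILDCARD = "*"  # assumed symbol for an unspecified position


def order(schema):
    """Number of specified positions."""
    return sum(ch != WILDCARD for ch in schema)


def defining_length(schema):
    """Difference in index between the first and last specified positions."""
    fixed = [i for i, ch in enumerate(schema) if ch != WILDCARD]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, string):
    """True if the string agrees with the schema at every specified position."""
    return len(schema) == len(string) and all(
        s == WILDCARD or s == c for s, c in zip(schema, string)
    )


schema = "1*0**0"
members = ["".join(bits) for bits in product("01", repeat=6)
           if matches(schema, "".join(bits))]
print(order(schema), defining_length(schema), len(members))  # 3 5 8
```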
Theorem 3.3.1. The Schema Theorem:
Let N_H(t) be the number of members of a population displaying schema H at
generation (iteration) t. Let f_H(t) be the average fitness of those displaying H at
generation t and f(t) be the average fitness in the entire population. Then
\[ E[N_H(t+1)] \ge N_H(t)\,\frac{f_H(t)}{f(t)}\Bigl(1 - \sum_{\omega \in \Omega} d_\omega\Bigr) \]
where d_ω represents the likelihood of disruption of the schema H by the evolutionary operator ω, for each operator in the set Ω being applied in the evolution.
The Schema Theorem in its usual form requires that selection be fitness proportionate, and also assumes certain things about the operators in use such as
transmission (c.f. [5]) of genes by recombination operators. For many recombination operators, such as the commonly used one-point crossover (c.f. [3]), the
likelihood of disruption is proportional to the defining length of the schema and
similarly the likelihood of disruption is generally proportional to the order of the
schema. Thus short schemata of low order which are highly fit have the greatest
likelihood of proliferation.
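To make the effect of defining length and order concrete, the following Python sketch evaluates the lower bound of the Schema Theorem for two schemata. The disruption estimates are assumptions taken from the usual textbook treatment rather than from this text: one-point crossover applied with probability p_c is taken to disrupt a schema with probability at most p_c δ(H)/(l − 1) on strings of length l, and bitwise mutation at rate p_m with probability approximately o(H) p_m. All numbers are illustrative.

```python
def schema_bound(n_h, f_h, f_bar, disruptions):
    """Lower bound N_H * (f_H / f) * (1 - sum of d_w) on E[N_H(t + 1)]."""
    return n_h * (f_h / f_bar) * (1.0 - sum(disruptions))


def one_point_crossover_disruption(defining_length, string_length, p_c):
    """Assumed estimate: p_c * delta(H) / (l - 1)."""
    return p_c * defining_length / (string_length - 1)


def mutation_disruption(order, p_m):
    """Assumed estimate: o(H) * p_m, valid for small p_m."""
    return order * p_m


# Two schemata with the same count and fitness advantage on strings of length 6.
long_schema = schema_bound(
    10, 1.2, 1.0,
    [one_point_crossover_disruption(5, 6, 0.6), mutation_disruption(3, 0.01)],
)
short_schema = schema_bound(
    10, 1.2, 1.0,
    [one_point_crossover_disruption(1, 6, 0.6), mutation_disruption(2, 0.01)],
)
print(round(long_schema, 2), round(short_schema, 2))  # 4.44 10.32
```

Under these assumptions the long schema is only guaranteed about 4.4 expected copies in the next generation, while the short one is guaranteed about 10.3, even though both are equally fit.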
In the case that tournament selection is used (in which pairs or groups of individuals are selected to “compete” with one another directly and those of higher
fitness are given some predefined probability of victory; see section 4.1.1),
more knowledge about the distribution of fitness is needed for an analytic result.
With only knowledge of the average fitness we can argue the absurd case that all
the fitness among members of a schema H is attributable to a single individual,
and so if that schema is of any considerable size, the likelihood of a randomly
chosen member of H “beating” some other random individual not in H is approximately (1 − β), where β is the competition bias. We can, however, see how if the
average fitness of members of a certain schema, H, is above the average in the
overall population and the distribution is not too skewed, then that schema will
have a tendency to proliferate in future generations (as described in the schema
theorem). With more thought we realise that in any case a distribution of fitness
among H members which is not too skewed is far more desirable, since we then
believe more strongly that above average fitness arises from the schema rather
than the high fitness being incidental.
It can easily be shown that if pk is the k th percentile of fitnesses among members of P − H, then H tends to proliferate if the proportion of H members with
fitness above pk is sufficiently above 50/k. The problem with this result is that
it is far less precise than the schema theorem in that the inequality will generally
be quite slack. Furthermore, this result is meaningless for quantiles below the
median since 50/k > 1 for k < 50.
Holland (among others) describes the short length, highly fit schemata as “building blocks”, the combinations of which are intended to produce highly effective
individuals within the search space. This is well described by David Goldberg,
another proponent of genetic algorithms:
“Short, low order, and highly fit schemata are sampled, recombined [crossed
over], and resampled to form strings of potentially higher fitness. In a way by
working with these particular schemata [the building blocks], we have reduced the
complexity of our problem; instead of building high-performance strings by trying
every conceivable combination, we construct better and better strings from the
best partial solutions of past samplings.”
∼ Goldberg 1989
3.3.2
Forma Analysis
It has been shown that the choice of describing subsets of the representation
space by schema (with regards to the improvement shown by the schema theorem) is essentially arbitrary, and in fact a very similar result is provable where
H instead describes any subset of the representation space, provided the disruption coefficients are tractable for the chosen subset ([9]). The importance of
this result is that, where in the case of schemata there may be no recognisable
similarities within the solution space between subscribing members, here the designer/implementer of the algorithm is able to both identify, via the progress
of the algorithm, regions of the search space which might be fruitful as well as
prescribe subsets to be exploited based on prior knowledge of the problem at hand.
Forma analysis is concerned with defining a collection of equivalence relations on the search space, the equivalence classes of which give rise to genetic
representations of members of the search space.
Definition An equivalence relation ∼ on a set X is a relation which is
1. reflexive: x ∼ x ∀x ∈ X
2. symmetric: x ∼ y ⇒ y ∼ x ∀x, y ∈ X
3. transitive: x ∼ y and y ∼ z ⇒ x ∼ z ∀x, y, z ∈ X
The canonical equivalence relation is the comparison “=”, since trivially equality satisfies reflexivity (x = x), symmetry (if x = y then y = x) and transitivity
(if x = y and y = z then x = z).
If ∼ is an equivalence relation on the set X, then for x ∈ X we define the
equivalence class of x with respect to ∼ as
[x]∼ = {y ∈ X|x ∼ y}
The intersection of two equivalence relations, ∼1,2 = ∼1 ∩ ∼2 is defined by
x ∼1,2 y iff x ∼1 y and x ∼2 y.
Given a collection of equivalence relations E we can use this to define the
span of E; $S(E) := \{\bigcap_{\sim \in I} \sim \;|\; I \in \mathcal{P}(E)\}$, where P(E) denotes the power set of
E. If we consider equivalence relations over the search space S then E induces
a representation space in which the representation of an s ∈ S is given by the
vector of equivalence classes to which s belongs, with respect to the elements of E.
Radcliffe refers to the equivalence classes created by elements of S(E) as formae. Ideally one designs E in a way that S(E) separates the points of S; i.e.
$\forall s \in S : [s]_{\bigcap_{\sim \in E} \sim} = \{s\}$, or equivalently, for all s, t ∈ S with s ≠ t, ∃ ∼ ∈ E s.t. s ≁ t.
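A small Python sketch may help fix these ideas. It represents each equivalence relation by a hypothetical "key" function, so that two points are related exactly when their keys agree, builds the induced representation as a vector of class labels, and checks whether a collection of relations separates the points of a toy search space.

```python
def represent(s, relations):
    """Vector of equivalence-class labels of s, one per relation in E."""
    return tuple(key(s) for key in relations)


def forma(space, relations, s):
    """Equivalence class of s under the intersection of the given relations."""
    target = represent(s, relations)
    return [t for t in space if represent(t, relations) == target]


def separates(relations, space):
    """True if no two distinct points of the space share every class."""
    seen = set()
    for s in space:
        r = represent(s, relations)
        if r in seen:
            return False
        seen.add(r)
    return True


# Toy search space {0, ..., 11} with two relations: equal remainder mod 3 and mod 4.
space = list(range(12))
mod3 = lambda s: s % 3
mod4 = lambda s: s % 4

print(forma(space, [mod3], 5))           # [2, 5, 8, 11]
print(separates([mod3], space))          # False
print(separates([mod3, mod4], space))    # True
```

In this toy case neither relation separates the space alone, but their intersection does, so every point has a unique representation (s mod 3, s mod 4).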
By allowing the freedom to choose the elements of E, with some restrictions, the
designer of the evolutionary algorithm is able to predefine regions of the search
space which are believed to potentially possess useful solutions. This definition
also allows for improved interpretation of results as the equivalence relations used
are generally more meaningful than schema at a higher level.
In the context of genetic programming, where the variable size and shape
of individuals in the representation space precludes the use of schema analysis,
forma analysis potentially allows for a better understanding of a similar progress
result to the schema theorem. In particular, we may wish to consider a forma,
say F , members of which contain a specific subtree. We would like to analyse
its tendency to exist and grow in future generations. The difficulty arises in
understanding the disruption rates as additional knowledge about the members
of this forma is required, such as the maximum size of the members of F (or
at least an upper bound thereof). In some cases one might wish to impose
a maximum size on the programs being evolved for the sake of computational
efficiency, in which case the maximum size of members of F at least has an upper
bound and so pessimistic disruption rates with respect to the genetic operators
defined previously can easily be determined.
3.3.3
Convergence of Genetic Algorithms
Even though genetic algorithms are widely used in the field of optimisation, in
general problems to which they are applied are not actually solved to optimum.
Instead the algorithms are often used in highly unstable and large spaces as partial search algorithms in which we rely on progress suggested by analogues of the
schema theorem to find solutions which are ”better” than those achievable with
more analytically based algorithms.
In spite of their often being used for improvement rather than optimisation,
it is important to understand their convergence properties (should they exist), in
particular when they can guarantee the discovery of an (at least local) optimum
within the search space.
Holland shows in ([3]) that if crossover is the only genetic operator employed
then the evolutionary process described as a stochastic process has a stationary
distribution, however we seek a more general result.
Rudolph ([10]) also uses a stochastic process representation to show that in
the presence of mutation, genetic algorithms like those which Holland describes
will never converge to the global optimum. However, he also shows that a modification of the algorithm in which the best solution within the population is always
maintained will converge to the global optimum.
Chapter 4
Application of Genetic
Algorithms to Uplift Modelling
and the Hillstrom Challenge
In this section we will show how genetic algorithms can be applied to the problem
of uplift modelling. We will then apply these algorithms to a specific problem,
posed by Kevin Hillstrom in his MineThatData blog ([12]), for which uplift modelling is ideally suited, and compare our results with those from the winning
entry.
4.1
Application of Genetic Algorithms to Uplift
Modelling
In order to implement a genetic algorithm, we need to describe firstly each element
of the evolutionary process and secondly any adjustments and additional steps
and methods used.
4.1.1
The Genetic Programming Approach
• The Solution Space
The solution space S will consist of all possible variables with values in
⟨X ∪ Q, (+, −, ∗, /)⟩*, the set generated by X (the collection of all values
the basic descriptive variables of the problem can assume) and Q with the
operators (+, -, *, /), capable of at most separating data points which are
discernible by at least one basic descriptive variable. (In other words if
two data points are described by the same set of variables x, then every
member of S will assign the same value to them.)
*If variables take on values in R − Q, we tend in practice to only consider
their rational approximations, in which case S can be described as all rational valued variables. However as a concession to being concise regarding
the sufficiency of the problem, we choose to be more explicit.
• The Representation Space
Our intention is to generate composite variables (of the basic descriptive
variables defined in the problem) for use in building uplift models. As we
do not wish to restrict the level of complexity these variables entail, we
will use genetic programming to evolve them as symbolic expressions. The
terminal set T will be the collection of basic variables of the problem as
well as constant terms which relate the scale of the variables where applicable and those which divide the variables in an apparently useful way
(e.g. measures of the “centre” (usually mean or median) of variables, or
in the case of low order discrete variables, values which separate explicitly
the different values). In the case where these variables are non quantitative, we convert them using collections of indicator variables into discrete
numerical variables. The function set F will consist of the basic binary
numerical operators (+, -, *, /), with division safeguarded by the added
identity a/0 = 0 ∀ a ∈ R, the relations which exploit the ordered structure
of R (=, ≤, <, ≥, >), accepting the convention of numerical binary logic
variables True = 1 and False = 0, and the conditional operator (IF-THEN-ELSE).
It is easy to see that (T, F ) is closed. To show that it is sufficient, we
can adopt the absurd argument that if a genetic program can manifest the
constant 1 and the function set contains +, − and / then every rational
number can occur. Explicitly the expression (= a a) for any a ∈ T has
the value 1, and so we can generate every rational number and hence every
number in the set generated by Q ∪ X.
• The Evolutionary Operators
The evolutionary operators will contain recombination and mutation as described in section 3.2.4 and the selection operators described below.
Our selection operators both for culling of the population and for choosing
breeding pairs will be based on so-called tournament selection defined as:
Cβ : G × G × [0, 1] → G; which returns the fitter of the two individuals
if the final operand is less than the selection bias β ∈ [0, 1], otherwise it
returns the less fit. In general the final operand is selected from a U (0, 1)
distribution and so this can be understood as a competition in which the
fitter individual has a probability β of winning.
The culling operator is then defined by:
Kβ : P(G) × Z² × [0, 1] → P(G); where
Kβ (P, n, m, r) = P − {C1−β (P [n], P [m], 1 − r)}
that is it removes the “loser” of the competition between the nth and mth
members of a population (if n is greater than the size of P then we consider
n(mod|P |), and similarly for m).
The operator for selecting a breeding pair is:
Bβ : P(G) × Z⁴ × [0, 1]² → G × G; where
Bβ (P, n, m, o, p, r, s) = (Cβ (P [n], P [m], r), Cβ (P [o], P [p], s))
that is it returns the winners of two tournaments between members of P
indexed by the integer operands.
• The Solution Map
The solution map F takes each g ∈ G and assigns it the variable g(x ), the
output of g for a given vector of basic descriptive variables x .
• The Evaluation Function
The evaluation function E is defined by E(s) = q0 (s), where q0 (s) is the
“no negative-uplift” adjusted Qini coefficient from the ordering imposed on
the data set by the variable s.
• In addition to the usual structure of the evolutionary process, we include a
dynamic adjustment to the evaluation function as follows:
At regular intervals we select subsets of the population (using various selection methods) to be passed into a significance based uplift-tree building
algorithm. In the event of discovering a “better” model than has so far
been found (we will formalise what we mean by better in the next section),
we will adjust the fitness of the variables used in the model by a factor
according to how “good” the model is. The intention is that by adjusting
the fitness of variables which have proven useful in building models, we
will extend their longevity, increase their likelihood of reproducing and also
increase their likelihood of being selected for future models.
• Finally, we will grant temporary elitism to the variable with the highest
basic fitness level (as described by the static evaluation function) and those
variables included in the “best” model found so far.
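The three selection operators defined above can be sketched in Python as follows. The fitness function is passed in as a parameter and, like the function names, is a hypothetical stand-in; in the setting above it would be the evaluation function E. Indices are reduced modulo the population size, as in the definition of the culling operator.

```python
def compete(a, b, u, beta, fitness):
    """C_beta: the fitter of a and b if u < beta, otherwise the less fit."""
    fitter, weaker = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return fitter if u < beta else weaker


def cull(population, n, m, r, beta, fitness):
    """K_beta: remove the loser of a tournament between members n and m."""
    a = population[n % len(population)]
    b = population[m % len(population)]
    loser = compete(a, b, 1 - r, 1 - beta, fitness)
    survivors = list(population)
    survivors.remove(loser)  # removes one occurrence of the loser
    return survivors


def breeding_pair(population, n, m, o, p, r, s, beta, fitness):
    """B_beta: the winners of two independent tournaments."""
    k = len(population)
    return (
        compete(population[n % k], population[m % k], r, beta, fitness),
        compete(population[o % k], population[p % k], s, beta, fitness),
    )


# Toy population in which each individual is simply its own fitness value.
identity = lambda x: x
population = [3, 1, 4, 1, 5]
print(cull(population, 0, 4, 0.5, 0.8, identity))
print(breeding_pair(population, 0, 1, 2, 3, 0.1, 0.95, 0.8, identity))
```

Note that culling calls the tournament with bias 1 − β and operand 1 − r, so the individual it returns, and which is then removed, is the less fit one with probability β.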
4.1.2
The Polynomial Approach:
• The Solution Space
The Solution space consists of all two dimensional polynomial transformations of the basic descriptive variables of degree 4 with zero constant term.
As before non quantitative variables will be redefined numerically.
• The Representation Space
The representation space consists of all strings in R¹⁴. Note that $\sum_{i=0}^{4}(i+1) - 1 = 14$ is
the number of two dimensional combinations of order less than or
equal to 4 excluding the zero order constant.
• The Evolutionary Operators
Crossover will be defined as follows:
C : G × G × [0, 1]¹⁴ → G; where C(g, h, x)_i = x_i g_i + (1 − x_i) h_i. We select each
[0, 1] operand from a U (0, 1) distribution and so C can be seen as selecting
a point randomly between the two points in G.
Mutation applies the following function to each element of an individual
in G independently:
Mα : R × [0, 1] × R → R; where Mα (x, λ, r) = x + r if λ is less than the
mutation rate α ∈ [0, 1], and Mα (x, λ, r) = x otherwise. We choose r from
a N (0, 1) distribution, and λ from a U (0, 1) distribution.
Selection operators will be as in the Genetic Programming approach above.
• The Solution Map
The solution map F assigns the elements of g ∈ G as coefficients in a
polynomial in S.
• The Evaluation Function
The evaluation function will be as in the Genetic Programming approach
above.
• As before we include a few adjustments to the standard evolutionary process. Because the polynomials being built are two dimensional, we simultaneously evolve $\binom{N}{2}$ separate populations, where N is the number of basic
variables in the problem.
• At regular intervals we combine the populations and select a subset probabilistically (according to fitness) to be passed into a significance based
uplift-tree algorithm. As before, whenever a new ”best” model is found,
the fitnesses of the variables used are adjusted.
• Once again we grant temporary elitism to the variables with the highest
basic fitness level in each population, as well as those included in the ”best”
model so far.
• Initial coefficients were chosen randomly between −10 and 10. Because
the models are evaluated based on an ordering, the scale of coefficients is
arbitrary and so we could have chosen any range centered at 0.
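The representation and operators of the polynomial approach can be sketched in Python as follows. The ordering of the 14 coefficients (by total degree, then by the power of the first variable) is an assumption made for illustration, as are the function names.

```python
import random

# Exponent pairs (i, j) of the monomials a^i * b^j with 1 <= i + j <= 4.
# The ordering is an assumption; any fixed ordering would serve.
EXPONENTS = [(i, d - i) for d in range(1, 5) for i in range(d + 1)]


def crossover(g, h, x):
    """C(g, h, x)_i = x_i * g_i + (1 - x_i) * h_i, with each x_i in [0, 1]."""
    return [xi * gi + (1 - xi) * hi for gi, hi, xi in zip(g, h, x)]


def mutate_gene(value, lam, r, alpha):
    """M_alpha(x, lambda, r): add r if lambda is below the mutation rate alpha."""
    return value + r if lam < alpha else value


def mutate(g, alpha, rng):
    """Apply M_alpha to each coefficient independently."""
    return [mutate_gene(v, rng.random(), rng.gauss(0.0, 1.0), alpha) for v in g]


def polynomial(g, a, b):
    """Solution map: value at (a, b) of the polynomial with coefficients g."""
    return sum(c * a**i * b**j for c, (i, j) in zip(g, EXPONENTS))


rng = random.Random(0)
g = [rng.uniform(-10, 10) for _ in EXPONENTS]
h = [rng.uniform(-10, 10) for _ in EXPONENTS]
child = mutate(crossover(g, h, [rng.random() for _ in EXPONENTS]), 0.1, rng)
print(len(EXPONENTS), polynomial(child, 1.0, 2.0))
```

Each coefficient of the child lies between the corresponding coefficients of its parents before mutation is applied.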
4.2
The Hillstrom Challenge and Results
In March 2008 Kevin Hillstrom made available, through his blog, MineThatData,
a dataset describing two email campaigns and a control group and issued a challenge to analyse the data with respect to a set of questions. In this section we
will cover briefly the nature of the data set, the questions posed and a summary
of the results from the winning submission ([8]). We will then discuss the results
obtained using our genetic algorithm, its shortcomings and where it was able to
produce improved results on those published.
4.2.1
The Data and Questions
The data represent 64,000 entries each describing a customer. A third of the
customers were chosen randomly to receive an email called the Men’s Email, a
second randomly chosen third to receive the Women’s Email, and the remaining
customers served as a control group, receiving neither email.
The individuals were described by three outcome (dependent) variables:
- Visit: A binary variable indicating whether the customer visited the site within
a two week outcome period.
- Conversion: A binary variable indicating whether or not the customer
purchased at the site during that period. Obviously Visit being false implies
Conversion is false.
- Spend: A real valued variable indicating the amount spent during the outcome
period. Obviously spend = 0 whenever Conversion is false.
They were also described by 8 independent variables (referred to previously as
basic descriptive variables):
- Recency: An integer valued variable indicating the number of months since each
customer’s most recent purchase prior to the outcome period.
- History Segment: A categorical variable dividing the population according to disjoint ranges of money spent in the preceding year.
- History: The actual monetary amount spent by each customer in the preceding
year.
- Mens: A binary variable indicating if the customer had purchased in the Men’s
department in the last year.
- Womens: As for Mens, but regarding the Women’s department.
- Zip Code: Classifies customers according to either Urban, Suburban or Rural.
- Newbie: A binary variable indicating if the customer made their first purchase
within the previous year.
- Channel: Describes the channels through which the customer bought previously
(either by Phone, Internet, or both).
Finally, a treatment variable indicates which customers were treated with the Mens-Email,
which with the Womens-Email and which formed part of the control group.
The questions posed in the challenge were as follows:
1. Which e-mail campaign performed the best, the Mens version, or the Womens
version?
2. How much incremental sales per customer did the Mens version of the e-mail
campaign drive? How much incremental sales per customer did the Womens version of the e-mail campaign drive?
3. If you could only send an e-mail campaign to the best 10,000 customers, which
customers would receive the e-mail campaign? Why?
4. If you had to eliminate 10,000 customers from receiving an e-mail campaign,
which customers would you suppress from the campaign? Why?
5. Did the Mens version of the e-mail campaign perform different than the Womens version of the e-mail campaign, across various customer segments?
6. Did the campaigns perform different when measured across different metrics,
like Visitors, Conversion, and Total Spend?
7. Did you observe any anomalies, or odd findings?
8. Which audience would you target the Mens version to, and the Womens version to, given the results of the test? What data do you have to support your
recommendation?
Questions 3 and 4 relate directly to uplift modelling since uplift models seek
to maximise incremental sales by means of imposing an order structure on a
population according to their incremental spend behaviour as a result of some
treatment. We will focus our analysis on developing reliable uplift models and in
doing so implicitly answer these two questions. The remainder of the questions
above refer more to measuring incremental sales than to maximising them, and
so we defer in those cases to the results obtained by Radcliffe in his winning
submission, summarised in the next section.
4.2.2
Summary of Results from Winning Submission
Radcliffe uses uplift modelling to tackle three different formulations of the problem posed by Hillstrom. He successfully analyses the effectiveness of the campaigns as well as identifies those for which they were most (and least) effective.
A brief summary of these analyses will be presented here.
It was found that both campaigns had a positive effect overall, however in the
case of the Women’s Email, there appeared to be segments of the population for
which the campaign served to decrease spend, rather than increase
it. The average spend (among those who purchased) increased dramatically with
the Women’s campaign, but decreased with the Men’s. That the Men’s campaign
drove more incremental sales overall was attributed to increasing purchase rate
sufficiently to compensate for this decrease.
The three formulations of the problem are driven by the three response variables present in Hillstrom’s data; modelling incremental visit frequency, modelling
incremental purchase frequency and finally modelling incremental spend volume.
It was found that the Men’s campaign outperformed the Women’s on all three
counts.
Table 4.1: Uplift summary
As can be seen in the above table, the Men’s campaign outperformed the
Women’s with an increase in visit rate of 7.66% (over 4.52% for the Women’s),
in purchase frequency of 0.68% (over 0.31% for the Women’s) and an increase in
average spend by 77c (over 42c for the Women’s).
While these monetary amounts per head sound small, because of the low overall
purchase rate, in the case of the Men’s campaign it in fact more than doubled
average spend.
Splitting the population into random segments (of 10% each) showed, however, that
these estimates are quite unstable, likely because of the low frequency of
both visits and purchases. This poses a potential problem when identifying the
best (and worst) 10,000 candidates.
The final formulation (regarding increased spend behaviour) is the one most
closely associated with campaign success in general (and implicitly includes the
other two) and is the objective we will focus on in our evolutionary approach,
and so will form the main focus henceforth. Details of the other two models can
be found in ([8]).
The objective is to build a reliable uplift model and choose the 10,000 population members with the highest spend uplift and the 10,000 with the lowest (or
most negative, should that be the case).
Two approaches were considered; one a direct continuous approach and the other
based on creating a binary outcome variable ’spend over x’, for some amount
x. The continuous approach faces the difficulty of massive skewness in the spend
variable, where close to 99% of people have zero spend. Tree models, however, are
fairly well equipped to handle skewness in the variables and this method proved
superior for the Women’s campaign. The binary response model was preferable
for the Men’s.
The model for the Men’s campaign is described as follows:
The model assigns 1 point for each of the following criteria
• historic spend over $160.
• historic spend over $350.
• customer is a multi-channel user.
The model splits the population into 4 segments according to this score, with
the highest being those multi-channel users with historic spend over $350. The
spend behaviour identified by this model is summarised below.
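The scoring rule of the Men’s model is simple enough to state as a short Python function. The argument names are hypothetical, and the thresholds are those listed above.

```python
def mens_model_score(history: float, multichannel: bool) -> int:
    """One point for each criterion met, giving a score from 0 to 3."""
    score = 0
    if history > 160:
        score += 1
    if history > 350:
        score += 1
    if multichannel:
        score += 1
    return score


print(mens_model_score(400.0, True))   # 3: the highest segment
print(mens_model_score(100.0, False))  # 0: the lowest segment
```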
Table 4.2: Spend behaviour identified by Men’s model
Figure 4.1: Spend uplift by score for Men’s model
The q0 for the model is 19.80% over the entire data set. Following is the Qini
graph described by the model ordering.
Figure 4.2: Qini graph for Men’s model
The model for the Women’s campaign predicts the incremental spend directly,
however is not as easy to describe. The following table allows for some interpretation by showing the average model score assigned to each bin of the variables
used.
Table 4.3: Mean score by binned variables for Women’s model
The table shows that the model favours especially those with high historical
spend (above $500), and to a lesser extent those with relatively low historical
spend, multichannel users and newbies.
The model has a q0 value of 118.00% on the training set and 60.30% on the
validation. The figure below shows the average uplift by model score band.
Figure 4.3: Spend uplift by score band for Women’s model
Selecting the best 10,000 customers to target naturally begins with those scoring 3 according to the Men’s model. This amounts to 5,249 people, and estimating
their uplift from the 3,498 not included in the Women’s campaign we estimate
the average spend uplift at $1.54.
Next are those with score 2 according to the Men’s model. This includes more
people than are needed to select 10,000 and so we consider the segregation of this
segment:
Table 4.4: Spend uplift for different segments with score 2 by Men’s model
We can see that Web users with historic spend between $160 and $350 provide
the highest average uplift. This adds a further 4,684 people with estimated uplift
of $1.61. The remaining 67 customers are then chosen randomly from the Phone
users with historic spend between $160 and $350. The estimated average uplift
overall for these 10,000 people is $1.58.
Next, the worst 10,000 customers were chosen as follows. Because the Men’s campaign performed better, the worst 10,000 according to that model were selected.
The bottom score (0) identifies 32,237 people with an average uplift estimated
at 55c. Of those, trying to select the 10,000 with the lowest score according to
the Women’s model (estimated uplift of 44c on average, corresponding roughly
to bands 0 to 4) results in 12,952 people. Estimating the uplift as a result of the
Men’s campaign among these by removing those who received the Women’s email
comes to just 5c. So the choice of 10,000 people to exclude from the campaign is
to select 10,000 randomly from those with a score of 0 from the Men’s model and
a score < 5 from the Women’s model.
From the Men’s model these are people who
• have historical spend less than $160
• are not multichannel
and from the Women’s model, indicated by Table 4.3
• are not newbies
• use the Phone channel.
If we consider this subpopulation exactly, we find 8,023 people for whom it
was estimated that the Men’s Email would depress their spend by an average of
just over 25c.
4.2.3 Results of the Genetic Algorithm Approach
Because of the high computational time of running genetic algorithms it is costly
to generate statistically robust results; however, we believe those presented herein
are sufficient for drawing meaningful conclusions and for making relevant comparisons with the results discussed above.
For each technique employed, we did ten runs and include a summary of those
as well as a detailed account of the best found in each case.
For both the Men’s and Women’s emailing campaigns we used the following
common parameters and methodology:
• Maximum Population size: 100. Initial tests showed overwhelming homogeneity in the upper regions of the population (among the fittest individuals). This appeared to be due to large redundant subtrees in many
population members, changes within which did not affect the individuals’
performances, meaning that recombining these individuals often produced
offspring essentially equivalent to them. To combat this we first tried
pruning methods to remove these subtrees, but these proved unnecessarily
costly for the length of our runs and so were abandoned. Instead
we excluded individuals whose basic fitness was equal to that of any existing
population member. We recognise full well that this method limits the
power of the algorithm by reducing the region of the search space that is
accessible, but for the purposes of this problem it proved to be the most
effective method in terms of computational cost. Population sizes tended to
be between 60 and 80 members, with the majority near the middle of this
range.
We generated initial populations using the ramped-half-and-half method
described in section 3.2.3 for depths ranging from 2 to 6.
• Terminal Set consisted of the variables history (historic spend), recency
(most recent purchase in months), newbie (indicator variable to show whether
the customer’s first purchase was within the last year), mens (indicator variable to show whether the customer bought in the men’s department within
the last year), womens (similar to mens for women’s department), binHist
(a binned version of history with cuts at 50, 100, 200 and 400), NumZip (a
numerical version of zip code. Rural = 0, Suburban = 1, Urban = 2) and
NumChannel (numerical version of channel. Web = 0, Phone = 1, Multichannel = 2). In addition we included the constant values 2 (to separate
Multichannel and Urban explicitly, and to allow us to obtain ascending low
order constants more rapidly than we would otherwise; 0 and 1 occur
frequently with the use of numerical True and False and so weren’t included
explicitly), 6 (a measure of the centre of the recency variable) and 200 (a
measure of the centre of the history variable).
• Competitive bias for selecting breeding pairs = 0.7.
• Competitive bias for survival = 0.95.
• Mutation rate = 0.05.
• We terminated the process after 1500 iterations.
• The results for each method used are summarised in graphical form as
follows:
The “Average” line indicates the average q0 value over the entire data set
for the best model found in each run. The dotted lines show this average
adjusted by one standard deviation to give an indication of a reasonable
range to expect using this method. The “Maximum” and “Minimum” lines
show the highest and lowest of these q0 values. Finally, the “Lower Average”
line shows average q0 of the worst split from the models selected from each
run. This line is intended to give an indication of an extreme worst case
scenario using this method.
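The derived terminals described in the Terminal Set item above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the runs: the text gives the cut points for binHist but not the bin labels or which side of each cut is inclusive, so the integer codes 0 to 4 and the handling of boundary values below are assumptions.

```python
from bisect import bisect_right

# Cut points and codes as given in the Terminal Set description.
HISTORY_CUTS = (50, 100, 200, 400)
ZIP_CODES = {"Rural": 0, "Suburban": 1, "Urban": 2}
CHANNEL_CODES = {"Web": 0, "Phone": 1, "Multichannel": 2}
CONSTANTS = (2, 6, 200)  # constant terminals listed in the text


def bin_hist(history):
    """Binned history: 0 below 50, then 1, 2, 3, and 4 from 400 upwards.

    Assumption: bins are labelled 0..4 and a value equal to a cut point
    falls into the higher bin.
    """
    return bisect_right(HISTORY_CUTS, history)


def derived_terminals(history, zip_code, channel):
    """Return the three derived terminal values for one customer."""
    return {
        "binHist": bin_hist(history),
        "NumZip": ZIP_CODES[zip_code],
        "NumChannel": CHANNEL_CODES[channel],
    }


example = derived_terminals(250.0, "Urban", "Multichannel")
```

With these codes, comparing NumZip or NumChannel with the constant 2 isolates the Urban and Multichannel customers, which is the purpose given for that constant.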
Method 1: Building Models Directly via Genetic Algorithms Initially
we used our genetic algorithms directly to build uplift models. Observing the
effectiveness of models arising directly from the evolutionary process is important
in understanding the relevance of evolutionary algorithms in the field of uplift
modelling at a basic level.
Genetic Programming Approach
• Mens
Figure 4.4: Men’s Mailing. Building models directly via Genetic Programming
We can see that the q0 values are a vast improvement over that for the
model built deterministically and presented in section 4.2.2: even on the
worst splits, in the latter part of the evolution these exceed 19.80%. The
average q0 on the entire data set is approximately 31%.
The Symbolic Expression form of the model does not interpret well due to
its complexity, and so instead we (as in section 4.2.2) defer to relationships
with the basic variables independently. An explicit expression of the model
can be found in the appendix.
The model ranges from 0 to 4 taking only integer values, and Table 4.5
below shows a summary of the spend behaviour associated with each score.
Table 4.5: Spend behaviour identified by Men’s model
While a higher Qini value indicates that we are better able to separate the population according to uplift, the degree of improvement is not obvious
from it. Table 4.5 shows this uplift in absolute terms, and so we
can already compare the outcome of this model with the deterministic one.
The model identifies two entire segments (amounting to 2,912 customers
between the Men’s Emailing group and the control set, a full 4,375 in the
entire data set) for whom the estimated uplift is significantly above any
segment identified before.
Figure 4.5: Spend uplift by score for Men’s model
The Qini graph below also shows graphically this improvement over the
deterministic model.
Figure 4.6: Qini graph for Mens model (GP_Model_Mens; uplift in pp against proportion treated)
Table 4.6 below shows the average model score for ranges of the basic variables. We can see that the model favours particularly those with high
historic spend (over $400), newbies, those who have shopped in the men’s
department within the previous year, those who haven’t shopped in the
women’s department in the previous year, Multichannel users and those
who live in urban areas.
Table 4.6: Mean model score by binned variables
• Womens
Figure 4.7: Women’s Mailing. Building models directly via Genetic Programming
While the winning submission to Hillstrom’s challenge did not explicitly
include a Qini coefficient for the model over the entire data set, it reported
q0 values of 118% and 60.30% on the training and validation sets respectively.
Though certainly the value over the entire data set needn’t be the average
of these two numbers, experience from experimenting shows that it is a reasonable estimate for the sake of making comparisons. The Genetic Program
evolution was able to produce individual models with comparable q0 values
(in the region of 88%), but they were discarded by the algorithm for having
too high variance over the validation splits. The resulting models, though
not as powerful as the deterministic one, do offer apparently improved
robustness. The best model found using this method had a q0 of 67.47%
and a lower estimate (average minus standard deviation over the validation
splits) of 55.08%.
The model takes on integer values between −6 and 12 inclusive. Table 4.7
below shows that the model identified a large segment of the population
for whom the uplift is strongly negative (−$0.93). We can see that uplift
is not strictly increasing with model score, and simply reordering the scores
would vastly improve the Qini coefficient. Reordering the binned segments
in the table below improves this to a value of 81.17%. Simply reordering
the values should not affect the robustness of the model; in fact combining
segments, as we have done in a few cases here, should improve it.
Table 4.7: Spend behaviour identified by Women’s model
Figure 4.8: Spend uplift by score for Women’s model
The Qini graph below shows clearly this bad ordering by the model in that
the gradient is not uniformly decreasing.
Figure 4.9: Qini graph for Womens model (left) and adjusted version (right)
(panels: GP_Model_Womens and score2; uplift in pp against proportion treated)
Our interpretation of the model is similar to that offered for the deterministic model in that it emphasises most strongly those with high historical
spend and newbies as well as multichannel users. We have included more
comparisons than before and observe strong emphasis on those who have
shopped in the women’s department previously and those who have not
shopped in the men’s department.
Table 4.8: Mean model score by binned variables
For both campaigns the algorithm was able to build useful models which
compare well with those built using deterministic methods. In the case of the
Women’s campaign, the final model found a useful segmentation of the population but failed to order the segments correctly. It is perhaps an oversight not to
have included in the process (at least in the evaluation function) a transformation
which asserts increasing ordering by uplift, as models which produce even more
useful segmentations might have been lost.
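A transformation of the kind suggested here, which relabels the score segments so that they are ordered by estimated uplift, could be sketched as follows. This is a hypothetical illustration and not part of the algorithm used in this work; the example scores and uplift values are invented for demonstration and are not taken from Table 4.7.

```python
def reorder_scores_by_uplift(segment_uplift):
    """Map each original score to its rank when segments are sorted by
    estimated uplift (lowest uplift gets rank 0).

    `segment_uplift` maps an original model score to the estimated uplift
    of the segment with that score. Ties are broken by the original score.
    """
    ordered = sorted(segment_uplift,
                     key=lambda score: (segment_uplift[score], score))
    return {score: rank for rank, score in enumerate(ordered)}


# Illustrative values only: uplift is not monotone in the original score.
example_uplift = {-6: 0.40, 0: -0.93, 5: 1.10, 12: 0.75}
mapping = reorder_scores_by_uplift(example_uplift)
adjusted = {score: mapping[score] for score in example_uplift}
```

After relabelling, a higher adjusted score always corresponds to a higher estimated uplift on the data used to build the mapping, so the Qini curve drawn from the adjusted scores has a non-increasing gradient on that data.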
We now turn our attention to questions 3 and 4 of the Hillstrom challenge
using these models. Again our focus is on the Men’s campaign due to its being
most effective. We naturally choose those with score ≥ 3 first (4,375 individuals,
for whom we estimate uplift from those not receiving the Women’s Email to be
$2.38). Adding those with score of 2 by the Men’s model gives us an additional
7,406, giving us more than 10,000 overall. Choosing a random 5,625 from this
group will give us a collection of 10,000 customers for whom we estimate the
average uplift to be $1.79. This is an improvement of 21c over that found using
the deterministic modelling methods.
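The selection just described takes whole segments in order of attractiveness and fills the remainder of the quota at random from the first segment that does not fit. A minimal sketch follows; the function name fill_quota and the use of integer customer identifiers are assumptions made for illustration, while the segment sizes are those quoted above.

```python
import random


def fill_quota(segments, quota, rng=None):
    """Select `quota` customers from segments ordered best first.

    Each segment is a (name, members) pair. Whole segments are taken while
    they fit; the first segment that does not fit is sampled at random.
    Returns a list of (segment name, number taken) pairs and the selection.
    """
    rng = rng or random.Random()
    selected, taken = [], []
    for name, members in segments:
        remaining = quota - len(selected)
        if remaining <= 0:
            break
        if len(members) <= remaining:
            chosen = list(members)
        else:
            chosen = rng.sample(list(members), remaining)
        selected.extend(chosen)
        taken.append((name, len(chosen)))
    return taken, selected


# Segment sizes from the text: 4,375 with score >= 3 and 7,406 with score 2.
top = list(range(4375))
second = list(range(4375, 4375 + 7406))
taken, selected = fill_quota([("score >= 3", top), ("score 2", second)],
                             quota=10000, rng=random.Random(0))
```

With these sizes the whole of the first segment is taken and 5,625 customers are drawn at random from the second, as in the text.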
To find the worst 10,000 people to target, we begin with the worst segment identified by the Men’s model. As this amounts to more than half the population,
however, we need to consider the Women’s model as well in order to separate out
the worst group. Choosing the lowest segment according to the Women’s model
(score < −4) gives us 6,841 customers with an estimated average uplift (from the
4,512 who did not receive the Women’s email) of −55c. The next lowest segment
from the Women’s model corresponds to a score between 2 and 3. This comes to
5,892 people in total with estimated uplift caused by the Men’s Email of −41c.
Selecting the requisite 5,488 randomly to satisfy the 10,000 cut off gives us a sample for whom the Men’s Email decreased spending by an average of 47c. This
compares very favourably to the sample found using the deterministic model for
whom decreased spend as a result of the Men’s campaign was estimated at 25c
on average.
Polynomial approach As we restricted the polynomials both in dimension
and order we do not expect as good results as those from the Genetic Programming approach. We restricted them thus to indicate the usefulness of polynomials
at a basic level for building uplift models. Because of their simplistic form the
computational cost compared with genetic programming is far lower. If we do not
restrict the order of these polynomials, then they too would become increasingly
complex and so the cost benefit will be mitigated (and potentially lost completely). Later on we will combine the two dimensional polynomials to increase
the dimensionality of the models built to better understand their usefulness.
• Mens
The best polynomial model found for the Men’s campaign is given by
\[
\begin{aligned}
&0.27x + 2.84x^2 - 3.5x^3 + 0.6x^4 + 3.62y - 4.83xy + 0.09x^2y - 0.67x^3y \\
&\quad - 2.21y^2 + 3.22xy^2 + 4.93x^2y^2 - 2.92y^3 - 3.95xy^3 - 2.09y^4
\end{aligned}
\]
where x = binHist and y = NumZip. The q0 value for this model is
29.58%, which is comparable with those built via genetic programming and
an improvement over the deterministic model.
• Womens
For the Women’s campaign the best model is given by
\[
-2.2053351821x + 1.96775115108x^2 - 3.62381607147x^3 + 0.674810214017x^4
\]
where here x = binHist. The q0 coefficient is 61.29%. It should be noted
that a polynomial which generated the same output over the data set arose
in all 10 runs, sometimes appearing already in the initial population. While
the model offers some merit with a fairly high Qini value, it appears that
building continuous models of low dimension using discrete variables in this
way might not be appropriate as a random search would have performed at
least as well in this instance.
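The remark about discrete variables can be illustrated by evaluating the Women’s polynomial on every value its single input can take. In the sketch below the coefficients are those given above, but the coding of binHist as the integers 0 to 4 is an assumption (the text specifies only the cut points). Since a Qini curve depends only on the ordering induced by the scores, a model of this kind can do no more than choose an ordering of the five bins.

```python
# Coefficients of the Women's polynomial, constant term first (no constant).
WOMENS_COEFFS = (0.0, -2.2053351821, 1.96775115108,
                 -3.62381607147, 0.674810214017)


def poly(x, coeffs=WOMENS_COEFFS):
    """Evaluate a one-variable polynomial by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Assumed coding of binHist: one integer per bin (cuts at 50, 100, 200, 400).
BIN_CODES = range(5)
scores = {b: poly(b) for b in BIN_CODES}

# The model can only produce as many distinct scores as there are bins,
# so it is equivalent to one of the possible orderings of those bins.
ranking = sorted(BIN_CODES, key=scores.get)
n_distinct = len(set(scores.values()))
```

With only five bins there are at most 5! = 120 strict orderings, which is consistent with the observation that an equivalent polynomial sometimes appeared already in the initial population.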
In order to consider the usefulness of this method we will have to wait until
we combine the polynomials to produce higher dimensional models, since we have
observed that building low dimensional models of discrete variables in this way
potentially does not perform better than random search.
Method 2: Combining the Genetic Algorithms with Significance-Based
Uplift Tree We now consider a combined evolutionary algorithm which couples the genetic evolutions used in Method 1 with an algorithm for building
significance-based uplift trees.
A warm up period of 120 iterations was used, whereafter models were built every
20 iterations. We allow the initial warm up period so that models are only built
once more interesting variables should begin to emerge. By allowing the population to evolve between each model building we hope to achieve more dynamism
with lower computational cost.
Models were built over the entire data set and validated by randomly splitting
the data in half 10 times and evaluating the Qini coefficient of the model on (both
halves of) each split. The selection of one model over another was based on the
average minus standard deviation of the varying Qini values. By incorporating
both the location (average) and spread (standard deviation) of these values we
hope to obtain models which are both useful and robust in the face of variability.
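The selection criterion described above (average minus standard deviation of the Qini values over both halves of ten random splits) can be sketched as follows. The function metric stands in for the Qini evaluation of a fixed model on a subset of the data and is a placeholder; the use of the sample standard deviation is also an assumption, since the text does not say which form was used.

```python
import random
import statistics


def lower_estimate(records, metric, n_splits=10, rng=None):
    """Average minus standard deviation of `metric` over random half splits.

    `metric(subset)` is a placeholder for evaluating the Qini coefficient
    of an already-built model on `subset`. Both halves of every split are
    evaluated, giving 2 * n_splits values in total.
    """
    rng = rng or random.Random()
    values = []
    for _ in range(n_splits):
        shuffled = list(records)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        values.append(metric(shuffled[:half]))
        values.append(metric(shuffled[half:]))
    return statistics.mean(values) - statistics.stdev(values)


# Toy usage: a metric that varies with the split is penalised for its spread.
toy = lower_estimate(list(range(100)), metric=statistics.mean,
                     rng=random.Random(1))
```

A model whose Qini value is high on average but varies widely between splits therefore scores lower than one with a slightly lower but stable value.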
Whenever a better model was found, adjustments to the fitnesses of individuals
used in the model were done by multiplying their current fitness by (1 + Qlow ),
where Qlow represents the average minus standard deviation of Qini values discussed above for that model. Such adjustments were only made in the event of a
strictly superior (by our evaluation metric) model being found. By incorporating
the level of “goodness” of the model, variables useful in the better models will
have their fitness adjusted by a greater factor. It could be argued that this increase is too large, however as the general fitness of the population increases, it
is necessary to improve longevity.
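The fitness adjustment can be sketched as below. The dictionary of fitness values keyed by variable name is a hypothetical representation, and Qlow is assumed to be expressed as a fraction (for example 0.45 rather than 45%).

```python
def reward_model_variables(fitness, used_variables, q_low, best_q_low):
    """Multiply the fitness of variables used in a strictly better model.

    `fitness` maps variable names to current fitness and is updated in
    place. `q_low` is the average minus standard deviation of the Qini
    values for the new model (assumed to be a fraction). Returns the best
    `q_low` seen so far.
    """
    if q_low <= best_q_low:
        return best_q_low            # not strictly better: no adjustment
    for name in used_variables:
        fitness[name] *= 1 + q_low   # better models give a larger factor
    return q_low


fitness = {"a": 2.0, "b": 1.0}
best = reward_model_variables(fitness, ["a"], q_low=0.5, best_q_low=0.2)
# A later model that is not strictly better leaves all fitnesses unchanged.
best = reward_model_variables(fitness, ["b"], q_low=0.4, best_q_low=best)
```

Because the factor is applied multiplicatively each time a variable appears in a new best model, variables that are reused across successive improvements accumulate a large advantage over newly generated individuals.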
Passing the Entire Population into the Model Building Algorithm At
each 20th iteration we allow the model building algorithm selection from the
entire population of individuals. The results are as follows:
• Mens
Figure 4.10: Men’s Mailing. Full population passed into tree building algorithm
This method shows a considerable improvement over method 1 and an improvement over the results using the deterministic modelling techniques in
the winning entry. The fairly low range of best model Qini coefficients suggests reliability of this method in finding good results, and the closeness
of lower average to absolute average suggests that these models are fairly
robust.
In spite of these encouraging points, we can see from the graph that little
improvement is made beyond about iteration 900. When passing the entire
population into the tree building algorithm, there will be a tendency to
select the same individuals more often since the trees are built using greedy
algorithms, which is a likely cause of this stagnation.
• Womens
Figure 4.11: Women’s Mailing. Full population passed into tree building algorithm
In the case of the Women’s campaign, unlike the Men’s above, this method
produced quite varied results, as we see from the final values for the different
lines. That said, it still represents a significant improvement in terms of power
over method 1 and, especially in the case of the best model found through this
method, presents a more stable model than was found via deterministic
methods.
The potential for the process to become bogged down by having the model
building algorithm select the same variables repeatedly is realistic, and the graph
for the Men’s campaign (Figure 4.10) indicates this might have affected progress
already. It is useful then to explore the option of only passing parts of the
population into the algorithm.
Passing the Fittest 10 Population Members Into the Model Building
Algorithm Instead of passing the entire population into the model building
algorithm as above, we now restrict this selection to only the ten fittest members.
By restricting the selection thus, we hope to “force” the algorithm to consider
newly generated highly fit individuals. Immediately, however, we have a concern
that our fitness adjustments based on finding better models will at least for a
period outweigh the effects of improved fitness through evolution and as a result
the 10 individuals with the highest (adjusted) fitness might remain unchanged for
extended periods (until “evolution catches up”). This could result in a step-like
process including extended periods of non-improvement and essentially wasted
computation time.
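A minimal sketch of this restriction is given below. The population is represented hypothetically as a list of variable names with a dictionary of adjusted fitness values; the numbers are invented for illustration.

```python
def fittest_members(population, fitness, k=10):
    """Return the k members with the highest (adjusted) fitness."""
    return sorted(population, key=fitness, reverse=True)[:k]


# Illustrative population: 20 variables with fitness 1.00, 1.01, ..., 1.19.
names = [f"v{i}" for i in range(20)]
adjusted = {name: 1.0 + 0.01 * i for i, name in enumerate(names)}
before = fittest_members(names, adjusted.get)

# Suppose the current top 10 were used in a better model and were rewarded
# by a factor of (1 + Qlow) with Qlow = 0.5, then a new individual arrives
# whose basic fitness is higher than any unadjusted value.
for name in before:
    adjusted[name] *= 1.5
names.append("new")
adjusted["new"] = 1.25
after = fittest_members(names, adjusted.get)
```

In this toy case the newcomer has a higher basic fitness than every other individual, yet it is still excluded from the top ten because of the earlier adjustments, which is the concern raised above.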
• Mens
Figure 4.12: Men’s Mailing. Fittest 10 passed into tree building algorithm
The lack of improvement after about iteration 400 could be a result of
the concerns raised above. For runs of this length (1500 iterations) this
method appears inferior to the previous one in this instance. Furthermore,
the higher variability of q0 values on different splits, indicated by the lower
average line, does not instil confidence that (even given more time) this
method will find an adequately robust model.
• Womens
Figure 4.13: Women’s Mailing. Fittest 10 passed into tree building algorithm
In the case of the Women’s campaign there is very little to choose between
this and the previous method in terms of outcome. The results here are
marginally better, and factoring in the lower computational cost (associated
with passing fewer variables to the model building algorithm) it appears
preferable. However, the inconsistent results in the Men’s case give us pause
in recommending it ultimately.
While we are able to increase the likelihood of newly spawned individuals who
are highly fit being included in the models, in doing so we exclude a large part
of the population almost absolutely (except in the rare event that for a period
significantly more tournaments are lost by fitter individuals). Further, while we
expect variables with a high basic fitness to be useful in building models, it is
often the case that combinations of “weaker” individuals will vastly outperform
them when combined in an uplift tree, and this method almost excludes this
possibility entirely.
Passing the Fittest 10 of a Random Sample of Size 20 Into the Model
Building Algorithm In order to hopefully improve the dynamism of the process we consider generating random samples of size 20 from the population and
selecting the fittest 10 of those to be passed into the model building algorithm.
We hope that by doing this a greater variety of individuals will be considered by
the model building algorithm while maintaining an emphasis on fitter individuals.
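This selection scheme can be sketched as follows, assuming a uniform random sample without replacement (the text does not specify how the sample of 20 is drawn). The function and argument names are illustrative.

```python
import random


def fittest_of_random_sample(population, fitness, sample_size=20, k=10,
                             rng=None):
    """Draw a uniform random sample and keep its k fittest members."""
    rng = rng or random.Random()
    pool = list(population)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    return sorted(sample, key=fitness, reverse=True)[:k]


# Toy usage: individuals are integers and fitness is the value itself.
chosen = fittest_of_random_sample(range(100), fitness=lambda i: i,
                                  rng=random.Random(1))
```

Unlike taking the fittest ten of the whole population, the members passed on change from one model build to the next even when the fitness ranking of the population does not.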
• Mens
Figure 4.14: Men’s Mailing. Fittest 10 of a Random Sample of size 20 passed
into tree building algorithm
The graph shows continued improvement throughout the run, unlike in the
previous method; however, ultimately there is only marginal improvement,
and the results are still worse than those from passing the entire population into the model building algorithm. The decrease in computational
cost when passing so many fewer variables into the algorithm is certainly not
negligible, however, and this method does perhaps offer merit.
• Womens
Figure 4.15: Women’s Mailing. Fittest 10 of a Random Sample of size 20 passed
into tree building algorithm
For the women’s campaign this method shows a vast improvement over
those considered so far. The expected range is fairly narrow, suggesting
reliability, and the lower average is also reasonably proximal to it indicating
robustness in the models built.
Ideally we would have a single method which provides a good model for both
campaigns, and we persevere with our ideas of improving dynamism of the process.
Passing a Probabilistic Sample of Sizes 10 and 15 Into the Model Building Algorithm An alternative approach to increasing the diversity of individuals passed into the model building algorithm is to select subsets of the population
using a fitness proportionate selection method. The benefit of this over the last
method is that when the fitness levels in the population are fairly close there is a
greater emphasis on diversity of selection, but as the fitnesses become more varied (as the fitness adjustments are greater) there is more of an emphasis on fitter
individuals. These fitter individuals later on represent those which have already
proven useful in building models as well as newly introduced highly (basically)
fit individuals. This shift in emphasis will hopefully prove useful for progress of
the algorithm.
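A fitness proportionate sample of this kind can be sketched as repeated roulette-wheel selection without replacement. This is one plausible reading of the description rather than the exact procedure used; it assumes non-negative fitness values and falls back to uniform choice if all remaining weights are zero.

```python
import random


def fitness_proportionate_sample(population, fitness, k=15, rng=None):
    """Select k distinct members with probability proportional to fitness."""
    rng = rng or random.Random()
    pool = list(population)
    weights = [fitness(member) for member in pool]
    if any(w < 0 for w in weights):
        raise ValueError("fitness values must be non-negative")
    chosen = []
    while pool and len(chosen) < k:
        if sum(weights) > 0:
            index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        else:
            index = rng.randrange(len(pool))  # all weights zero: uniform
        chosen.append(pool.pop(index))
        weights.pop(index)
    return chosen


# Toy usage: near-equal fitnesses give a diverse, almost uniform selection.
sample = fitness_proportionate_sample(range(80),
                                      fitness=lambda i: 1.0 + 0.001 * i,
                                      k=15, rng=random.Random(3))
```

When fitness values are close the draw is nearly uniform, and as the adjustments described earlier spread the values apart the same procedure concentrates on the fitter individuals, which is the shift in emphasis described above.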
• Mens
Figure 4.16: Men’s Mailing. Probabilistic Sample of size 10 passed into tree
building algorithm
Figure 4.17: Men’s Mailing. Probabilistic Sample of size 15 passed into tree
building algorithm
• Womens
Figure 4.18: Women’s Mailing. Probabilistic Sample of size 10 passed into tree
building algorithm
Figure 4.19: Women’s Mailing. Probabilistic Sample of size 15 passed into tree
building algorithm
For both men’s and women’s campaigns this method has generated improved
results. In the case of the men’s this manifested as a shift upwards while maintaining a similar level of reliability, whereas in the women’s, the narrowing of the
ranges suggests improved reliability of the models found. Allowing the model
building algorithm slightly more selection (15 over 10 variables) appears to produce better results. As higher numbers are passed in, however, there will be a
tendency for the algorithm to resemble the method of passing the entire population into the algorithm, both in terms of results and computational cost. We
choose only to consider these two sample sizes as they illustrate an improvement over the previous methods for the same number of variables (10, where this
method surpasses the previous 2) and an improvement in quality over all previous
methods with (at least) comparable computational costs, shown here by the 15
variable probabilistic sample method.
Explicit Development of Variables Used in Models We now consider a
method which indirectly uses forma analysis. The process follows the same path
as above, except each time a better model is found we initialise new parallel populations based around the variables used. For each such variable, we generate
a population using the ramped-half-and-half method and then “breed” (apply
recombination to) each member with that variable. We then evolve those populations independently for 50 iterations, after which we combine them along with
the base population and then cull the population down to 100 individuals. We
hope that by investigating the search space around these useful variables, elements of them which proved useful will be picked up on and developed and that
consequent models will improve as a result.
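The steps of this method can be outlined as below. The sketch is schematic: random_individual, crossover, evolve and fitness are placeholder callables standing in for the genetic programming operators described in chapter 3, the size of each parallel population is not stated in the text and is assumed here to be 100, and culling is assumed to keep the fittest individuals.

```python
import random


def develop_variables(base_population, model_variables, *, random_individual,
                      crossover, evolve, fitness, pop_size=100,
                      side_iterations=50, rng=None):
    """Develop the variables used in a newly found better model.

    For each such variable a fresh population is generated, every member is
    recombined with the variable, and the result is evolved independently.
    All populations are then merged with the base population and culled.
    """
    rng = rng or random.Random()
    combined = list(base_population)
    for variable in model_variables:
        fresh = [random_individual(rng) for _ in range(pop_size)]
        bred = [crossover(member, variable, rng) for member in fresh]
        combined.extend(evolve(bred, side_iterations))
    combined.sort(key=fitness, reverse=True)  # assumed: cull keeps fittest
    return combined[:pop_size]


# Toy usage with numbers as "individuals" and averaging as "crossover".
toy_rng = random.Random(0)
toy_base = [toy_rng.random() for _ in range(100)]
toy_result = develop_variables(
    toy_base, [0.9, 0.8],
    random_individual=lambda r: r.random(),
    crossover=lambda a, b, r: (a + b) / 2,
    evolve=lambda population, iterations: population,
    fitness=lambda value: value,
    rng=toy_rng,
)
```

Each better model found triggers one extra evolution of 50 iterations per variable used, which is the source of the additional computational cost of this method.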
• Mens
Figure 4.20: Men’s Mailing. Explicit development of variables used in models.
• Womens
Figure 4.21: Women’s Mailing. Explicit development of variables used in models.
This method unfortunately produced worse results for both campaigns and,
especially given the increased computational cost over the previous method, does
not seem to offer any use for us.
Method 7: Polynomial Models Finally, even though the polynomial models
did not individually represent powerful models, this is potentially due to their being restricted to two dimensions, and we hope that combining the polynomials as
variables in a tree model will produce improved results. For this method we simultaneously evolve independent populations, each one representing a two element
combination from the basic variable set. As before, every twenty iterations (after
100) a model is built of the variables being evolved. For selection we combine
all populations and select a fitness proportionate sample of size 20, recognising
the usefulness of this selection method from before and also the increased total
population size suggesting a larger sample might be useful.
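The number of parallel populations implied by taking every two element combination of the basic variables can be counted as follows. The sketch assumes that the basic variable set is the eight variables of the terminal set listed earlier; the text does not state this explicitly.

```python
from itertools import combinations

# Assumed basic variable set: the eight variables of the terminal set.
BASIC_VARIABLES = ("history", "recency", "newbie", "mens", "womens",
                   "binHist", "NumZip", "NumChannel")

# One independently evolved polynomial population per pair of variables.
PAIRS = list(combinations(BASIC_VARIABLES, 2))


def n_populations(n_variables):
    """Number of two-element combinations of n_variables variables."""
    return n_variables * (n_variables - 1) // 2


population_counts = {n: n_populations(n) for n in (8, 16, 32)}
```

Under this assumption there are 28 populations, and the count grows quadratically with the number of basic variables, so the cost advantage of the simpler polynomial form may shrink quickly on richer data sets.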
• Mens
Figure 4.22: Men’s Mailing. Polynomial method, probabilistic sample of size 20
passed into tree building algorithm
The narrow range is a result of 6 of the 10 runs finding models which
produce the same output exactly over the data set. In fact, as can be seen
by the contraction of the lines around iteration 400, all ten of the runs found
such models during the process, and only in the case of 4 of them was any
progress made beyond that. In none of the runs was there any progress
made after about iteration 700, and overall this method does not appear to
offer any improvement in spite of the lower computational cost (a cost which,
for larger basic variable sets, might not be lower due to the high number of
parallel populations being evolved in that case).
• Womens
Figure 4.23: Women’s Mailing. Polynomial method, probabilistic sample of size
20 passed into tree building algorithm
Unlike the Men’s above, for the Women’s campaign this method produced
extremely varied results. That said, the levels are significantly lower than
for all other methods, and furthermore the lower average line suggests that
the models being built are not very robust.
The polynomial method, while faster in computation, offered neither powerful
nor robust models compared with other models already found and as such does
not appear as useful as the Genetic Programming method in general.
The Best Individual Models Found We found that for both the Men’s and
the Women’s campaigns, models built by selecting fitness proportionate samples
from the population regularly and passing those into the tree building algorithm
produced the best results. Though the optimal size of sample to be chosen each
time will vary according to the nature of the data being modelled, for the purposes
of this particular problem, samples of size 15 gave us both powerful and robust
models. The variables used in the model expressed as Symbolic Expressions do
not lend themselves well to interpretation, nor do the models themselves, and so
we will rather investigate the relationships between the basic variables and the
output of the model to hopefully gain insight into what the model is telling us
at a basic level. The explicit forms of variables and models can be found in the
appendix.
• Mens
The model assigns one of 8 possible values according to the terminal values
in the significance-based uplift tree, which is a full tree of depth 4. Table
4.9 below shows a summary of the spend behaviour in each segment. The
model identified a small subset of the population (562 individuals, 374 of
whom did not receive the Women’s Email) for whom the uplift is estimated
to be $7.66. A further 1,939 have estimated uplift higher than any of the
segments identified in models so far presented in this paper. The model also
identified a large segment of the population who were negatively affected
by the campaign, something which previous models failed to do directly for
the Men’s campaign. This model certainly appears to offer a far superior
segmentation of the population for generating increased uplift.
Table 4.9: Spend behaviour identified by Men’s model
Figure 4.24: Uplift by score for Men’s model
The q0 value for the model is 76.45%, a considerable improvement over
the model found only using the genetic programming evolution. The lower
range of q0 is 44.99% which suggests the model is fairly robust.
The Qini graph for this model (Figure 4.25) shows graphically this improvement. We can see from the graph that by targeting only about 35% of
the population we could generate the same total uplift as if we targeted the
entire population. Furthermore, by excluding a full 30% of the population
we could increase the overall uplift considerably.
Figure 4.25: Qini Graph for Mens Model
As before we look to relationships between the output of the model and the
individual basic variables for insight on how we can interpret the model.
Table 4.10 below tells a similar story to previous models, except that now
there is a far more varied emphasis on different bands of historical spend.
Where before the models only identified strongly with very high historical
spend, this model shows other bands which have a strong relationship, particularly the range $216.20 to $287.10. There is also a stronger relationship
with customers who purchased recently (recency of 1 and 2 particularly).
The emphasis on newbies and shoppers in the men’s department is dampened in this model compared with earlier ones. Other emphases are similar.
Table 4.10: Mean model score by binned variables
It is interesting that some emphases on intermediate historic spend values
(i.e. those not very high nor very low) were shown by this model, and yet
it appears fairly robust. One would expect that these intermediate ranges
would be more severely affected by varying splits.
• Womens
The uplift tree for the Women's model has the same shape as that for the Men's,
and so it too divides the population into 8 segments and assigns the scores as
seen in Table 4.11. The model identified two extremal segments with large
positive and negative uplift respectively. The magnitude of these exceeds
the extremal uplift estimates from both the deterministic and pure genetic
programming approaches giving us confidence in its being a more powerful
model.
Table 4.11: Spend behaviour identified by Women’s model
Figure 4.26: Uplift by score for Women’s model
The q0 for this model is 117.38%, considerably higher than previous models
(for the full data set), and the lower range estimate is 68.48%. This suggests
quite variable values over the different splits, but not to an extent as to cause
concern about its reliability.
[Figure placeholder: plot titled "Qini (Probabilistic_15_Womens)"; x-axis: proportion treated; y-axis: uplift (pp); values reported in the figure: uQini 0.57, q 23.9769%, q0 113.65%]
Figure 4.27: Qini Graph for Women's Model
The Qini graph shows that we could generate the same total uplift with just
under 30% of the population as the whole, and furthermore by excluding
certain segments could generate an extra 50% of the total uplift already
enjoyed as a result of the campaign.
The relationships with individual variables tell a similar story to the earlier
models built on the Women's Emailing campaign, with an added emphasis
on low recency and a lesser one on high historical spend (though still
strong). Though the relationships are similar, there are subtle differences
which seem to have produced a stronger model.
Table 4.12: Mean model score by binned variables
To compare these models with those built earlier on a relatable term, we will
again identify the 10,000 best and worst individuals. Using the same method as
before, we first select the 2,501 individuals who had scores above 3 according to the Men's
model, and estimate their uplift, from those not included in the Women's
campaign, as being $4.02 on average. If we then select at random a further 7,499
individuals from those who were assigned the next highest score by the Men's model,
we achieve an average uplift among the 10,000, as a result of the
Men's campaign, of $2.74. This represents a considerable improvement over both
the deterministic method and the pure genetic program, of $1.16 and 95c respectively.
To see that this model presents a preferred selection of individuals to exclude,
we first consider the lowest scores for the Men's model. These represent 12,611
individuals. The Men's and Women's models are not correlated by uplift, and so
we first select the three segments according to the Women's model which present the
lowest uplift by the Men's campaign; these are associated with Women's scores of
−2.17405, 0.09473 and 0.91522. These give us a sample of size 8,346 for whom we
estimate the average uplift caused by the Men's campaign to be -82c. Selecting
the remaining 1,664 individuals from those with Women's scores of −0.33134 and
−0.1977, we estimate their uplift as 75c. Overall we have found a sample of size
10,000 for whom we estimate the Men's Emailing decreased spend by an average of
56c, which represents only a marginal improvement (9c difference on average)
over the models built using genetic programming alone.
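The overall figure for the excluded sample is a size-weighted average of the two group estimates. The short sketch below reproduces that arithmetic from the group sizes and uplift values as stated in the text (which sum to 10,010 rather than exactly 10,000); the function name is an illustrative assumption.

```python
from typing import List, Tuple


def blended_uplift(groups: List[Tuple[int, float]]) -> float:
    """Size-weighted mean of per-group uplift estimates (in dollars)."""
    total_n = sum(n for n, _ in groups)
    return sum(n * uplift for n, uplift in groups) / total_n


# (group size, estimated uplift in dollars), as quoted in the text
worst_groups = [(8346, -0.82), (1664, 0.75)]
worst_average = blended_uplift(worst_groups)  # roughly -0.56, i.e. a 56c decrease
```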
Chapter 5
Conclusion:
We have shown how genetic algorithms can be applied to the field of uplift modelling to evolve models which are able to detect the second order nature of the
quantity and model it accurately and effectively. By placing an emphasis on
validity, we found that evolving these models as Symbolic Expressions using genetic programming alone produced models which compared favourably, in terms
of their power (which we evaluate via the Qini coefficient) and their robustness,
with models built using deterministic model building techniques.
Next, we combined the genetic program approach with a model building algorithm which builds uplift trees based on splits determined by the significance of
the interaction between uplift and the division of the population imposed by the
split. We applied a dynamic adjustment to the fitness of the programs being
evolved based on their use in these models. We progressively considered different methods of variable selection to be input into the model building algorithm,
adjusting our method according to potential pitfalls of the previous selection
methods. Ultimately we found that fitness proportionate selection produced the
best models overall.
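Fitness proportionate selection of candidate variables can be sketched as a roulette wheel in which each variable's chance of being drawn is proportional to its fitness. The sketch below is a minimal illustration, assuming non-negative fitness values with at least one positive value remaining at every draw, and sampling without replacement; the function name, the variable names and the fitness values are hypothetical and do not describe the exact procedure used in this work.

```python
import random
from typing import Dict, List


def select_variables(fitness: Dict[str, float], k: int,
                     rng: random.Random) -> List[str]:
    """Draw up to k distinct variables, each with probability
    proportional to its fitness among those not yet chosen."""
    pool = dict(fitness)
    chosen: List[str] = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


# Hypothetical fitness values for five evolved variables
example_fitness = {"v1": 0.9, "v2": 0.4, "v3": 0.1, "v4": 0.6, "v5": 0.2}
sample = select_variables(example_fitness, k=3, rng=random.Random(0))
```

Fitter variables are drawn more often, but weaker ones still have a chance of reaching the model building algorithm, which keeps some diversity in the inputs.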
We found that combining the heuristic genetic program with the deterministic
model building algorithm produced further improvements. In particular, by having
the two aspects interact with one another, through probabilistic variable selection according to fitness and a dynamic adjustment of these fitness
levels according to the variables' usefulness in building models, we believe that
this method offers a reliable and powerful uplift modelling technique.
Where the Qini coefficient offers little interpretability, we were able to compare
our models with deterministic models on a monetary level (with reference to
an example in the field of marketing, proposed by Kevin Hillstrom in his blog,
MineThatData). By identifying the best and worst 10,000 customers to target
using a choice of two marketing campaigns, we found an estimated 13% improvement ($1.79 over $1.58 average uplift) in the best 10,000 by the base genetic
program and an estimated 73% improvement ($2.74 over $1.58 average uplift)
using the combined method. In terms of identifying the worst customers to target, the base genetic program identified 10,000 individuals for whom the better
of the two campaigns appeared to have caused a decrease in average spend of
47c (compared with a decrease of 25c found by the deterministic model). For
the combined method we identified a sample of 10,000 for whom we estimate the
campaign decreased spend by a further 9c, to 56c. While these amounts per head
seem marginal, compared with the overall uplift rate for the better of the two
campaigns, $0.77, we see that they are relatively significant, and where a
campaign has a larger overall influence the modelling techniques should produce
proportionately larger effects.
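The percentage improvements quoted above follow directly from the average uplift figures. The sketch below reproduces them, and also expresses the combined method's best-10,000 uplift as a multiple of the overall uplift rate of $0.77; the function and variable names are illustrative assumptions.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


deterministic = 1.58  # average uplift, best 10,000, deterministic model
gp_only = relative_improvement(1.79, deterministic)   # about 0.13
combined = relative_improvement(2.74, deterministic)  # about 0.73

# Best-10,000 uplift of the combined method relative to the overall rate
combined_vs_overall = 2.74 / 0.77  # about 3.6 times the overall uplift
```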
To take this method further, we would recommend that, when using genetic programming alone, the evaluation function take the Qini coefficient
for the best ordering of the segments identified by the model, as it is very possible
that useful models are otherwise being lost in the process. When applying the combined
heuristic plus deterministic modelling method, because the optimal sample size
of variables to pass into the model building algorithm will vary according to the
nature of the problem, it would be useful to consider an adaptive method which
increases or decreases the sample size according to the progress of the process.
References:
1. N. J. Radcliffe and P. D. Surrey. Real-World Uplift Modelling with Significance-Based Uplift Trees. Portrait Technical Report, Stochastic Solutions, 2011. Available at http://www.stochasticsolutions.com/pdf/sig-based-up-trees.pdf.
2. N. J. Radcliffe and P.D. Surrey. Quality Measures for Uplift Models.
3. J. H. Holland. Adaptation in Natural and Artificial Systems. University
of Michigan Press, 1975.
4. W.A. Greene. A Non-Linear Schema Theorem for Genetic Algorithms. University of New Orleans Computer Science Department. Available at http://citeseerx.ist.psu.edu/
5. N. J. Radcliffe. The Algebra of Genetic Algorithms. Annals of Maths and
Artificial Intelligence, vol. 10, 1994.
6. T. Weise. Global Optimization Algorithms - Theory and Application. Available at http://www.it-weise.de/projects/book.pdf.
7. J. R. Koza. Genetic Programming: On the Programming of Computers by
Means of Natural Selection. MIT Press, 1992.
8. N. J. Radcliffe. Hillstrom's MineThatData email analytics challenge: An approach using uplift modelling. Technical report, Stochastic Solutions Limited,
2008. Available at http://stochasticsolutions.com/pdf/HillstromChallenge.pdf.
9. N. J. Radcliffe. Equivalence Class Analysis of Genetic Algorithms. Complex Systems, 1991.
10. G. Rudolph. Convergence Properties of Canonical Genetic Algorithms. IEEE
Transactions on Neural Networks, 1994.
11. C. Gini. Variabilità e mutabilità. Reprinted in E. Pizetti and T. Salvemini, editors,
Memorie di Metodologica Statistica. Libreria Eredi Virgilio Veschi
(Rome), 1912.
12. http://minethatdata.blogspot.com, March 20th, 2008.
Appendix A
Variables and Models
A.1
Men’s Model. Genetic Programming Approach
(+ (* mens newbie) (* (+ (* womens NumZip) (≥ (* (+ (* womens) (≥ (*
NumZip NumZip) binHist)) 6 NumZip) binHist)) mens))
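As a reading aid, the expression above can be evaluated with a small interpreter. The sketch below rests on assumptions that are not stated in this appendix: `+` and `*` accept any number of arguments, comparisons return 1 or 0, `if` treats any non-zero condition as true and evaluates only the chosen branch, and zero-argument forms such as `(> )` are rejected because their meaning is not specified here. The function names and the example customer record are hypothetical.

```python
import math
import operator
from typing import Dict, List, Union

Expr = Union[str, float, list]

COMPARISONS = {
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
    "<=": operator.le, "≤": operator.le,
    ">=": operator.ge, "≥": operator.ge,
}


def tokenize(text: str) -> List[str]:
    """Split an S-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Expr:
    """Build a nested list from a token list (consumes the tokens)."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    try:
        return float(token)
    except ValueError:
        return token  # operator or variable name


def evaluate(expr: Expr, env: Dict[str, float]) -> float:
    """Evaluate a parsed expression against one customer's variables."""
    if isinstance(expr, float):
        return expr
    if isinstance(expr, str):
        return float(env[expr])
    op, raw_args = expr[0], expr[1:]
    if op == "if":  # lazy: only the selected branch is evaluated
        cond, then_branch, else_branch = raw_args
        chosen = then_branch if evaluate(cond, env) != 0 else else_branch
        return evaluate(chosen, env)
    args = [evaluate(arg, env) for arg in raw_args]
    if not args:
        raise ValueError(f"zero-argument form ({op} ) is not handled in this sketch")
    if op == "+":
        return float(sum(args))
    if op == "*":
        return float(math.prod(args))
    if op in ("-", "−"):
        return -args[0] if len(args) == 1 else args[0] - sum(args[1:])
    if op == "/":
        return args[0] / args[1]
    if op in COMPARISONS:
        return 1.0 if COMPARISONS[op](args[0], args[1]) else 0.0
    raise ValueError(f"unknown operator: {op}")


def run(text: str, env: Dict[str, float]) -> float:
    return evaluate(parse(tokenize(text)), env)


A1_MENS = ("(+ (* mens newbie) (* (+ (* womens NumZip) (≥ (* (+ (* womens) "
           "(≥ (* NumZip NumZip) binHist)) 6 NumZip) binHist)) mens))")

# Hypothetical customer record
customer = {"mens": 1, "newbie": 0, "womens": 1, "NumZip": 2, "binHist": 3}
score = run(A1_MENS, customer)
```

The lazy `if` matters for the recurring pattern `(if (= 0 (* x)) 0 (/ a x))` in the later variables, which guards each division against a zero denominator.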
A.2
Women’s Model. Genetic Programming Approach
(+ (* newbie mens) (if (≤ (≤ NumZip (> )) (if (= 0 (* womens)) 0 (/ 2 womens)))
(− (if (= 0 (* newbie)) 0 (/ NumZip newbie)) (* (+ mens) 6)) 0) NumZip (if
newbie (* (− (+ (* mens) (if newbie (+ (* binHist) NumZip (if newbie (if (+
NumZip) 6 0) 0)) 0)) (+ NumZip (if newbie (if (− (≤ (≤ NumZip (> )) (if (= 0
(* womens)) 0 (/ 2 womens))) (≤ mens (− NumZip binHist))) 6 0) 0)))) 0))
A.3
Men’s Model. Fitness Proportionate Selection. Combined Genetic Programming and
Uplift Tree
Variable1:
(− (if (= 0 (* (= (= (< (− NumZip womens) (− womens newbie)) (if history
newbie 0)) (> )))) 0 (/ (+ (= (> (≤ (> ) (− binHist recency)) (≥ )) (≥ 2
NumZip)) history) (= (= (< (− NumZip womens) (− womens newbie)) (if history newbie 0)) (> )))) (if (= 0 (* (− (− history NumChannel) (− )))) 0 (/ (−
binHist NumZip) (− (− history NumChannel) (− )))))
Variable2:
(− (if (= 0 (* (= (= (< (− ) (− womens newbie)) (if newbie binHist 0)) (> ))))
0 (/ (+ history history) (= (= (< (− ) (− womens newbie)) (if newbie binHist
0)) (> )))) (if (= 0 (* (− (− (if (= 0 (* (= (= (< ) (if history newbie 0)) (> ))))
0 (/ (+ (= (> (≤ (> ) (− mens womens)) (≥ )) (> )) history) (= (= (< ) (if
history newbie 0)) (> )))) (if (= 0 (* (− (− history (= )) (− (if (= 0 (* newbie))
0 (/ (if (= 0 (* (if 2 (if NumZip NumZip 0) 0))) 0 (/ (+ 200) (if 2 (if NumZip
NumZip 0) 0))) newbie)))))) 0 (/ (− binHist (if recency history 0)) (− (− history
(= )) (− (if (= 0 (* newbie)) 0 (/ (if (= 0 (* (if 2 (if NumZip NumZip 0) 0))) 0
(/ (+ 200) (if 2 (if NumZip NumZip 0) 0))) newbie))))))) history))) 0 (/ (− (−
(≤ recency 6) (if (= 0 (* recency)) 0 (/ NumZip recency))) NumZip) (− (− (if
(= 0 (* (= (= (< ) (if history newbie 0)) (> )))) 0 (/ (+ (= (> (≤ (> ) (− mens
womens)) (≥ )) (> )) history) (= (= (< ) (if history newbie 0)) (> )))) (if (= 0
(* (− (− history (= )) (− (if (= 0 (* newbie)) 0 (/ (if (= 0 (* (if 2 (if NumZip
NumZip 0) 0))) 0 (/ (+ 200) (if 2 (if NumZip NumZip 0) 0))) newbie)))))) 0 (/
(− binHist (if recency history 0)) (− (− history (= )) (− (if (= 0 (* newbie))
0 (/ (if (= 0 (* (if 2 (if NumZip NumZip 0) 0))) 0 (/ (+ 200) (if 2 (if NumZip
NumZip 0) 0))) newbie))))))) history))))
Variable3:
(− (if (= 0 (* (= (= (< (− NumZip womens) (− womens newbie)) (if history
newbie 0)) (> )))) 0 (/ (+ (= (> (= (> (≤ (> ) (if 0 0 0)) (≥ (≤ ) (− NumZip
womens))) (> )) (≥ (≤ ) (− recency newbie))) (− womens)) history) (= (= (<
(− NumZip womens) (− womens newbie)) (if history newbie 0)) (> )))) (if (= 0
(* (− (− history NumChannel) (− )))) 0 (/ (− binHist NumZip) (− (− history
NumChannel) (− )))))
Variable4:
(− (if (= 0 (* (= (= (< (− NumZip womens) (− womens newbie)) (if history
newbie 0)) (> )))) 0 (/ (+ (= (> )) binHist) (= (= (< (− NumZip womens) (−
womens newbie)) (if history newbie 0)) (> )))) (if (= 0 (* NumChannel)) 0 (/ (−
(if (+ history) (− (if mens NumChannel 0) (if (= 0 (* (− (if womens (if (if mens
NumZip 0) NumZip 0) 0)))) 0 (/ (− history) (− (if womens (if (if mens NumZip
0) NumZip 0) 0))))) 0) NumZip) NumChannel)))
Variable5:
(− (if (= 0 (* (= (= (< (− NumZip womens) (− womens newbie)) (if history
newbie 0)) (> )))) 0 (/ (+ (= (= (< (> ) (− newbie)) womens) (> )) 2) (=
(= (< (− NumZip womens) (− womens newbie)) (if history newbie 0)) (> ))))
(if (= 0 (* (− (− history NumChannel) history))) 0 (/ (− (if (+ history) (− (if
mens NumChannel 0) (if (= 0 (* (− (− history (* history binHist)) history))) 0
(/ (− (if recency (if newbie (− (if newbie (− (if mens (if womens history 0) 0)
womens) 0) NumZip) 0) 0) history) (− (− history (* history binHist)) history))))
0) NumZip) (− (− history NumChannel) history))))
Variable6:
(− (if (= 0 (* (= (= (< (− NumZip womens) (− womens newbie)) (if history
newbie 0)) (> )))) 0 (/ (+ (= (> (≤ (> ) (if 0 0 0)) (≥ (≤ recency NumChannel)
(− NumZip womens))) (> )) history) (= (= (< (− NumZip womens) (− womens
newbie)) (if history newbie 0)) (> )))) (if (= 0 (* (− (− history NumChannel)
history))) 0 (/ (− binHist NumZip) (− (− history NumChannel) history))))
Variable7:
(− (if (= 0 (* (if 2 mens NumZip))) 0 (/ (+ 200 mens) (if 2 mens NumZip))) (if
(= 0 (* (− mens (= (= (< (− newbie womens) (− womens (− (if mens NumChannel 0) (if (= 0 (* (− (if womens (if (if mens NumZip 0) NumZip 0) 0)))) 0 (/
(− history) (− (if womens (if (if mens NumZip 0) NumZip 0) 0))))))) (if history
newbie binHist)) (> ))))) 0 (/ (− 200 NumZip) (− mens (= (= (< (− newbie
womens) (− womens (− (if mens NumChannel 0) (if (= 0 (* (− (if womens (if
(if mens NumZip 0) NumZip 0) 0)))) 0 (/ (− history) (− (if womens (if (if mens
NumZip 0) NumZip 0) 0))))))) (if history newbie binHist)) (> ))))))
Model:
A.4
Women’s Model. Fitness Proportionate Selection. Combined Genetic Programming
and Uplift Tree
Variable1:
(− (if (= 0 (* (= (* mens) (= womens (− (+ NumZip (> (* (− newbie (if (if
mens 200 0) (+ NumZip) 0))) (if womens history 0))) (= womens (= womens
mens))))))) 0 (/ newbie (= (* mens) (= womens (− (+ NumZip (> (* (− newbie (if (if mens 200 0) (+ NumZip) 0))) (if womens history 0))) (= womens (=
womens mens))))))) (> mens (= recency (< (* newbie binHist) (* (< newbie
NumChannel))))))
Variable2:
(− (− (if (= 0 (* (= womens (= (= recency newbie) mens)))) 0 (/ newbie (=
womens (= (= recency newbie) mens)))) (if (if (+ (if (= 0 (* (≤ (− recency 6)
(* binHist NumChannel)))) 0 (/ newbie (≤ (− recency 6) (* binHist NumChannel))))) mens 200) (+ binHist) 0)) (if (if recency (if (> mens (= 2 NumZip)) 200
0) 0) (+ 200) 0))
Variable3:
(+ (* binHist (if newbie 2 0)) (− (if (if (> ) mens 200) (+ binHist) 0)))
Variable4:
(+ NumChannel (* (− newbie (if (* (− newbie (if (if (+ 200) (if 200 200 0) 0)
(+ NumZip) 0)) (> 6 recency)) (+ 200) 0)) 2))
Variable5:
(+ NumChannel (* (− newbie (if (if (> mens (= 2 NumZip)) (if mens 200 0) 0)
(+ 200) 0)) 2) (* (− womens (if (if (> mens (= 2 NumZip)) (if mens 200 0) 0)
(+ 200) 0)) (+ NumZip)))
Variable6:
(if (= 0 (* (* binHist recency))) 0 (/ (≤ (= ) (≤ )) (* binHist recency)))
Variable7:
(if (= 0 (* (* binHist))) 0 (/ (if (= 0 (* (= womens (= (= recency newbie)
mens)))) 0 (/ newbie (= womens (= (= recency newbie) mens)))) (* binHist)))
Model: