Dummy Variable Coding - Research

Hutcheson, G. D. (2011). Dummy Variable Coding. In L. Moutinho and G. D.
Hutcheson, The SAGE Dictionary of Quantitative Management Research. Pages
98-102.
Dummy Variable Coding
Introduction
Dummy coding is used to represent categorical variables (e.g. sex, geographic location, ethnicity) in a
way that enables their use in a number of statistical analyses. Models such as OLS regression and
other generalized linear models such as logistic regression and the proportional odds model require
variables to be measured on a continuous (numeric) scale, a requirement which is, unfortunately, not
met by all social science data. It is possible, however, to include dichotomous, or binary, data in a
model if it is appropriately coded. The ability to include dichotomous data enables variables such as
male-female, dead-alive, rich-poor, passed-failed, high-low etc., to be used in predictive models.
Multi-category categorical explanatory variables such as drinking levels (high, medium and low),
location (Europe, North America, South America, Africa), educational level (unqualified, high school,
university), and different treatments (treatment 1, treatment 2, treatment 3, etc.) can also be included if
they are appropriately coded. The process of transforming discontinuous data into a form which can be
entered into a regression model is called dummy coding. There are a number of methods of dummy
coding data; however, only two of the more common methods, indicator and deviation coding (also
known as treatment and sum coding), will be discussed in detail here.
Key Features
• Ordered and unordered categorical variables can be appropriately represented in regression equations.
• Dummy variable coding enables any ANOVA design to be represented using a regression equation.
• Different coding schemes are available to enable different comparisons to be made.
• It helps with understanding more complex model interpretations (e.g. MNL and proportional odds models).
Treatment Coding
Perhaps the easiest dummy coding method to explain is treatment coding (also known as indicator
coding) as it involves simply transforming categorical data into a number of dichotomies. For this
technique the dichotomy is coded 0 and 1 (either explicitly in the data set, or internally by software) to
indicate the presence or absence of a particular attribute. For example, in the case of gender, if a code
of 0 refers to 'not female', a code of 1 refers to 'female'. Alternatively, if a code of 1 refers to 'male', a
code of 0 refers to 'not male'. Of course, someone who is designated 'not female' will be 'male', which
allows a comparison to be made between male and female, even though we have, technically speaking,
only coded for the presence of one of the categories. By using the values 0 and 1 we are merely
describing the presence or absence of a particular attribute, rather than defining its level.
Table 1 shows a categorical variable (the different categories are designated by the letters A, B, C and
D) with four categories that have been recoded into a series of dichotomies. In the table, the original
variable (A, B, C, D) is represented as four separate dummy variables, each indicating the presence or
absence of a particular category. For example, dummy variable 'dum_gpA' only records whether or not
the original variable is equivalent to A, dummy variable 'dum_gpB' only records whether or not the
original variable is equivalent to B, and so on. The information contained in the original variable is
now represented by a number of discrete dummy variables. This method of coding is commonly
called treatment coding.
Table 1: Treatment Coding
Original Codes   Dummy Codes
Group            dum_gpA   dum_gpB   dum_gpC   dum_gpD
A                1         0         0         0
B                0         1         0         0
C                0         0         1         0
D                0         0         0         1
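The recoding in Table 1 can be sketched in a few lines of Python. This is only a minimal illustration, not the routine used by any particular statistics package: the function name treatment_dummies is hypothetical, and the column labels simply follow Table 1.

```python
def treatment_dummies(values, categories):
    """Return one row of 0/1 indicator codes per observation.

    Each row has one entry per category: 1 if the observation belongs
    to that category and 0 otherwise.
    """
    return [[1 if value == category else 0 for category in categories]
            for value in values]


categories = ["A", "B", "C", "D"]
group = ["A", "B", "C", "D"]  # one observation per category, as in Table 1

codes = treatment_dummies(group, categories)

# Print a header row followed by one coded row per observation.
print("Group", *("dum_gp" + c for c in categories))
for value, row in zip(group, codes):
    print(value, *row)
```

Each row contains exactly one 1, marking the category to which that observation belongs.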
Table 1 shows how a four category variable (Group) can be represented as a number of dummy
variables ('dum_gpA' to 'dum_gpD'). It should be noted, however, that the use of dummy variables in
regression models is not completely straightforward because the inclusion of all of them at the same
time leads to a situation where perfect multicollinearity exists. For example, the dummy variable
representing group D ('dum_gpD') in Table 1 can be perfectly predicted from the other dummy
variables and is therefore redundant. If any of the three dummy variables 'dum_gpA' to 'dum_gpC'
have the value of 1, then 'dum_gpD' necessarily takes the value of 0, and if the three dummy
variables 'dum_gpA' to 'dum_gpC' all have the value of 0, then 'dum_gpD' necessarily takes the value
of 1. It is, therefore, not possible to estimate parameters for all of the dummy variables in Table 1. In
general, if we have j categories, a maximum of j-1 dummy variables can be entered into a model. The
dummy variable which is omitted is called the reference category and is the category against which
other dummy variables are compared. It should be noted that the choice of reference category is often
quite arbitrary, although sometimes there will be reasons that a particular reference category is chosen.
For example, when comparing the effectiveness of a number of procedures, it might make sense to
compare each with the 'standard' procedure currently used (see Hardy, 1993, for a more in-depth
discussion of reference category choice). The choice of reference category does not affect the model fit
as this remains the same no matter which category is designated as the reference. Changing the
reference category merely alters which comparisons are made between the dummy variables.
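The redundancy described above, and the step of dropping a reference category, can be checked with a short sketch. The observations below are made up purely for illustration, and the choice of D as the reference category is an arbitrary assumption.

```python
categories = ["A", "B", "C", "D"]
group = ["A", "C", "D", "B", "D", "A"]  # made-up observations for illustration

# Full indicator coding: one 0/1 column per category.
full = [[1 if g == c else 0 for c in categories] for g in group]

# dum_gpD is redundant: it always equals 1 minus the sum of the other three.
redundant = all(row[3] == 1 - sum(row[:3]) for row in full)
print("dum_gpD predictable from the others:", redundant)

# Drop the column for the chosen reference category (here D, an arbitrary
# choice) so that only j - 1 dummy variables remain.
reference = "D"
keep = [i for i, c in enumerate(categories) if c != reference]
reduced = [[row[i] for i in keep] for row in full]

for g, row in zip(group, reduced):
    print(g, row)  # rows for the reference category are all zeros
```

Observations in the reference category are identified by zeros on all of the remaining dummy variables, so no information is lost by omitting the fourth column.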
Example
The following example of treatment coding is taken from Hutcheson and Moutinho (2008). The
regression model here predicts the price of whiskey using type of company ownership as an
explanatory variable. The three-category variable indicating ownership has been coded into two
dummy variables ('private' and 'state') with the reference category being 'private-state partnership'.
The OLS regression model for this analysis is:
Price of whiskey = α + β1 private + β2 state
The regression analysis provides two parameters to represent the ownership variable; one that
indicates private ownership and one that indicates state ownership. These parameters are shown in
Table 2 below (this analysis was obtained using the software package R; see R Development Core
Team, 2008).
Table 2: Regression model parameters (treatment coding)
Coefficients:
                        Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)               4.7120       0.1023    46.078    < 2e-16
Ownership[T.private]      0.1970       0.1446     1.362    0.18439
Ownership[T.state]       -0.4170       0.1446    -2.883    0.00763
In Table 2 the ownership variable has been dummy-coded into two elements 'Ownership[T.private]'
and 'Ownership[T.state]'. The 'T' indicates that treatment coding has been used. The estimate of
0.1970 indicates the comparison between 'private ownership' and the reference category of
'state-private partnership'. Compared to companies that are owned by a state-private partnership, those
owned privately charge 0.1970 units more for their whiskey. Similarly, those companies that are state
owned charge 0.4170 units less than companies that are owned by a state-private partnership. The t
values also indicate that the difference between state-owned and state-private partnership is significant
at the 0.01 level.
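The interpretation above can be made concrete by substituting the dummy codes into the regression equation. The sketch below uses the estimates reported in Table 2; the variable names are illustrative, and rounding to four decimal places is an assumption made only to tidy the output.

```python
# Estimates copied from Table 2 (treatment coding, reference = partnership).
intercept = 4.7120
beta_private = 0.1970
beta_state = -0.4170

# Dummy codes (private, state) for each ownership category.
dummy_codes = {
    "partnership": (0, 0),  # reference category: both dummies are 0
    "private": (1, 0),
    "state": (0, 1),
}

# Predicted price = intercept + beta_private * private + beta_state * state
predicted = {
    ownership: round(intercept + beta_private * private + beta_state * state, 4)
    for ownership, (private, state) in dummy_codes.items()
}

for ownership, price in predicted.items():
    print(ownership, price)
```

The intercept is therefore the predicted price for the reference category, and each coefficient is the difference between one category and that reference.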
Sum Coding
It is sometimes appropriate to compare each category with an average value from all categories, rather
than a specific category. This is possible using a different dummy coding scheme where the codes are
assigned according to the scheme laid out in Table 3.
Table 3: Sum Coding
Original Codes   Dummy Codes
Group            dum_gpA   dum_gpB   dum_gpC
A                1         0         0
B                0         1         0
C                0         0         1
D                -1        -1        -1
Using these codes, each category can be compared to the average of all categories. The only thing to
have changed from the treatment coding above is that comparisons are now made against an average
of all categories rather than a specific identified category. As with the treatment coding method
discussed above, only j-1 dummy variables can be entered into a model due to the problems associated
with multicollinearity.
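The sum coding scheme can be generated in much the same way as treatment coding. The following is a minimal sketch in which the function name sum_codes is hypothetical and the last category listed is assumed to be the one coded -1 throughout.

```python
def sum_codes(categories):
    """Return a dict mapping each category to its row of sum (deviation) codes.

    The last category listed is the omitted one and is coded -1 on every
    dummy variable; the others are coded 1 on their own dummy and 0 elsewhere.
    """
    coded = categories[:-1]  # j - 1 categories receive their own dummy
    table = {c: [1 if c == d else 0 for d in coded] for c in coded}
    table[categories[-1]] = [-1] * len(coded)
    return table


codes = sum_codes(["A", "B", "C", "D"])
for category, row in codes.items():
    print(category, row)

# Each dummy column sums to zero across the categories.
column_sums = [sum(col) for col in zip(*codes.values())]
print(column_sums)
```

Because every column sums to zero across the categories, the parameters are expressed as deviations around a central value rather than as differences from a single reference category.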
Example
Using the same data and model as above, sum coding is used to represent the ownership variable in an
OLS regression model predicting the price of whiskey. The model parameters for this analysis are
shown in Table 4:
Table 4: Regression model parameters (sum coding)
Coefficients:
                            Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                  4.63867      0.05904    78.567    < 2e-16
Ownership[S.partnership]     0.07333      0.08350     0.878    0.38754
Ownership[S.private]         0.27033      0.08350     3.238    0.00318
In Table 4 the ownership variable has been dummy-coded into two elements
'Ownership[S.partnership]' and 'Ownership[S.private]'. The 'S' indicates that sum coding has been
used. The estimate of 0.07333 indicates the comparison between 'partnership' ownership and the average of
all ownership categories. Compared to the average, those companies owned by a partnership charge
0.07333 units more for their whiskey. Similarly, those companies that are privately owned charge
0.27033 units more for their whiskey compared to the average. The parameter for the omitted category
(state ownership) is not shown, but under sum coding it equals the negative of the sum of the other two,
-(0.07333 + 0.27033) = -0.34367. The t values also indicate that the
difference between privately owned companies and the average is significant at the 0.005 level.
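The link between the two coding schemes can be checked by hand. The sketch below starts from the treatment-coded estimates reported earlier (intercept 4.7120, private 0.1970, state -0.4170) and derives the sum-coded parameters from them. It assumes that the 'average of all categories' is the unweighted mean of the three category means, which is what sum coding gives in a model with a single categorical explanatory variable.

```python
# Predicted mean price per ownership category, derived from the
# treatment-coded estimates (reference category = partnership).
group_means = {
    "partnership": 4.7120,
    "private": 4.7120 + 0.1970,
    "state": 4.7120 - 0.4170,
}

# Under sum coding the intercept is the unweighted mean of the category means.
grand_mean = sum(group_means.values()) / len(group_means)

# Each parameter is that category's deviation from the grand mean.
deviations = {k: m - grand_mean for k, m in group_means.items()}

print("intercept", round(grand_mean, 5))
for ownership, dev in deviations.items():
    print(ownership, round(dev, 5))
```

The intercept and the partnership and private deviations agree with the sum-coded estimates reported above (4.63867, 0.07333 and 0.27033), which illustrates that the two schemes describe the same fitted model in different ways.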
Alternative coding schemes
There are a number of coding schemes available for representing categorical data, many of which are
available in software packages. Interested readers are directed to Fox (2002) and Aguinis (2004) for
further information on these coding schemes.
Conclusion
Dummy variable coding is an important part of data manipulation as it enables categorical variables to
be included in a wide variety of statistical models. Its use greatly increases the utility of regression
models, and understanding how the coding operates helps greatly with the interpretation of the models.
Further Reading
Aguinis, H. (2004). Regression Analysis for Categorical Moderators. Guilford Press.
Fox, J. (2002). An R and S-Plus Companion to Applied Regression. London: Sage Publications.
Hardy, M. A. (1993). Regression with Dummy Variables. London: Sage Publications.
Hutcheson and Sofroniou, 1999
Hutcheson and Moutinho, 2008
R Development Core Team (2008). R: A Language and Environment for Statistical Computing.
Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org
Graeme Hutcheson
Manchester University