
Using the Propensity Score
Method to Estimate Causal
Effects: A Review and
Practical Guide
Organizational Research Methods
00(0) 1-39
© The Author(s) 2012
Reprints and permission:
sagepub.com/journalsPermissions.nav
DOI: 10.1177/1094428112447816
http://orm.sagepub.com
Mingxiang Li1
Abstract
Evidence-based management requires management scholars to draw causal inferences. To estimate causal effects, researchers generally rely on observational data sets and regression models in which the independent variables have not been exogenously manipulated; however, using such models on observational data sets can produce biased estimates of treatment effects. This article
introduces the propensity score method (PSM)—which has previously been widely employed in
social science disciplines such as public health and economics—to the management field. This
research reviews the PSM literature, develops a procedure for applying the PSM to estimate the causal effects of intervention, elaborates on the procedure using an empirical example, and discusses the
potential application of the PSM in different management fields. The implementation of the PSM in
the management field will increase researchers’ ability to draw causal inferences using observational
data sets.
Keywords
causal effect, propensity score method, matching
Management scholars are interested in drawing causal inferences (Mellor & Mark, 1998). One
example of a causal inference that researchers might try to determine is whether a specific management practice, such as group training or a stock option plan, increases organizational performance.
Typically, management scholars rely on observational data sets to estimate causal effects of the
management practice. Yet, endogeneity—which occurs when a predictor variable correlates with the
error term—prevents scholars from drawing correct inferences (Antonakis, Bendahan, Jacquart, &
Lalive, 2010; Wooldridge, 2002). Econometricians have proposed a number of techniques to deal
1
Department of Management and Human Resources, University of Wisconsin-Madison, Madison, WI, USA
Corresponding Author:
Mingxiang Li, Department of Management and Human Resources, University of Wisconsin-Madison, 975 University Avenue,
5268 Grainger Hall, Madison, WI 53706, USA
Email: [email protected]
Downloaded from orm.sagepub.com at Vrije Universiteit 34820 on January 30, 2014
with endogeneity—including selection models, fixed effects models, and instrumental variables, all
of which have been used by management scholars. In this article, I introduce the propensity score
method (PSM) as another technique that can be used to calculate causal effects.
In management research, many scholars are interested in evidence-based management (Rynes,
Giluk, & Brown, 2007), which ‘‘derives principles from research evidence and translates them into
practices that solve organizational problems’’ (Rousseau, 2006, p. 256). To contribute to evidence-based management, scholars must be able to draw correct causal inferences. Cox (1992) defined a
cause as an intervention that brings about a change in the variable of interest, compared with the
baseline control model. A causal effect can be simply defined as the average effect due to a certain
intervention or treatment. For example, researchers might be interested in the extent to which training influences future earnings. While field experiments are one approach that can correctly estimate causal effects, in many situations they are impractical. This has prompted scholars to rely on observational data, which makes it difficult to gauge unbiased causal effects. The PSM is a technique that, if used appropriately, can increase scholars’ ability to draw
causal inferences using observational data.
Though widely implemented in other social science fields, the PSM has generally been overlooked by management scholars. Since it was introduced by Rosenbaum and Rubin (1983), the
PSM has been widely used by economists (Dehejia & Wahba, 1999) and medical scientists (Wolfe
& Michaud, 2004) to estimate causal effects. More recently, finance scholars (Campello, Graham,
& Harvey, 2010), sociologists (Gangl, 2006; Grodsky, 2007), and political scientists (Arceneaux,
Gerber, & Green, 2006) have implemented the PSM in their empirical studies. A Google Scholar
search in early 2012 showed that over 7,300 publications cited Rosenbaum and Rubin’s classic
1983 article that introduced the PSM. An additional Web of Science analysis indicated that over
3,000 academic articles cited this influential article. Of these citations, 20% of the publications
were in economics, 14% were in statistics, 10% were in methodological journals, and the remaining 56% were in health-related fields. Despite the widespread use of the PSM across a variety of
disciplines, it has not been employed by management scholars, prompting Gerhart’s (2007) conclusion that ‘‘to date, there appear to be no applications of propensity score in the management
literature’’ (p. 563).
This article begins with an overview of the counterfactual model, experiments, regression, and endogeneity. This section illustrates why the counterfactual model is important for estimating causal
is followed by a short review of the PSM and a discussion of the reasons for using the PSM. The third
section employs a detailed example to illustrate how a treatment effect can be estimated using the
PSM. The following section presents a short summary on the empirical studies that used the PSM in
other social science fields, along with a description of potential implementation of the PSM in the
management field. Finally, this article concludes with a discussion of the pros and cons of using the
PSM to estimate causal effects.
Estimating Causal Effects Without the Propensity Score Method
Evidence-based practice uses quantitative methods to identify reliable effects that practitioners and administrators can draw on to develop and adopt effective policy interventions.
Because the application of specific recommendations derived from evidence-based research is
not costless, it is crucial for social scientists to draw correct causal inferences. As pointed out
by King, Keohane, and Verba (1994), ‘‘we should draw causal inferences where they seem appropriate but also provide the reader with the best and most honest estimate of the uncertainty of that
inference’’ (p. 76).
Counterfactual Model
To better understand causal effect, it is important to discuss counterfactuals. In Rubin’s causal model
(see Rubin, 2004, for a summary), Y1i and Y0i are the potential earnings for individual i when i receives (Y1i) or does not receive (Y0i) training. The fundamental problem of making a causal inference is how to reconstruct the outcomes that are not observed, sometimes called counterfactuals, because they are not what happened. Conceptually, either the treatment or the nontreatment outcome is not observed and hence is ‘‘missing’’ (Morgan & Winship, 2007). Specifically, if i received training at time t, the earnings for i at t + 1 are Y1i. But if i instead had not received training at time t, the potential earnings for i at t + 1 would be Y0i. The effect of training can then be expressed simply as Y1i − Y0i. Yet, because it is impossible for i to simultaneously receive (Y1i) and not receive (Y0i) the training, scholars need to find other ways to overcome this fundamental problem. One can also understand this fundamental issue as the ‘‘what-if’’ problem: What if individual i had not received training? Hence, reconstructing the counterfactuals is crucial for estimating unbiased causal effects.
The counterfactual model shows that it is impossible to calculate individual-level treatment
effects, and therefore scholars have to calculate aggregated treatment effects (Morgan & Winship,
2007). There are two major versions of aggregated treatment effects: the average treatment effect
(ATE) and the average treatment effect on the treated group (ATT). A simple definition of the ATE
can be written as
ATE = E(Y1i | Ti = 1, 0) − E(Y0i | Ti = 1, 0),     (1.1a)
where E(.) represents the expectation in the population. Ti denotes the treatment with the value of 1
for the treated group and the value of 0 for the control group. In other words, the ATE can be defined
as the average effect that would be observed if everyone in the treated and the control groups
received treatment, compared with if no one in either group received treatment (Harder, Stuart, &
Anthony, 2010). The definition of ATT can be expressed as
ATT = E(Y1i | Ti = 1) − E(Y0i | Ti = 1).     (1.1b)
In contrast to the ATE, the ATT refers to the average difference that would be found if everyone in
the treated group received treatment compared with if none of these individuals in the treated group
received treatment. The value for the ATE will be the same as that for the ATT when the research
design is experimental.1
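These definitions can be illustrated with a small simulation. The sketch below (pure NumPy; all numbers are illustrative assumptions, not data from this article) builds the full potential-outcome table, so the ATE and ATT can be computed directly, and shows that under randomized assignment the naive group-mean difference recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential earnings with (y1) and without (y0) training for every individual.
y0 = rng.normal(20_000, 4_000, n)
y1 = y0 + 3_000                      # an assumed constant treatment effect of 3,000

# Randomized assignment: treatment is independent of (y0, y1).
t = rng.integers(0, 2, n)

# Aggregated effects from the full potential-outcome table (never observable in practice).
ate = (y1 - y0).mean()               # average treatment effect
att = (y1 - y0)[t == 1].mean()       # average treatment effect on the treated

# Under randomization, the naive group-mean difference recovers both.
naive = y1[t == 1].mean() - y0[t == 0].mean()

print(ate, att, naive)
```

With a constant effect and randomized assignment, the ATE and ATT coincide exactly, and the naive contrast differs from them only by sampling error.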
Experiment
There are different ways to estimate treatment effects other than PSM. Of these, the experiment is
the gold standard (Antonakis et al., 2010). If the participants are randomly assigned to the treated or
the control group, then the treatment effect can simply be estimated by comparing the mean difference between these two groups. Experimental data can generate an unbiased estimator for causal
effects because the randomized design ensures the equivalent distributions of the treated and the
control groups on all observed and unobserved characteristics. Thus, any observed difference in outcomes can be caused only by the treatment. Because randomized experiments can successfully reconstruct counterfactuals, the causal effect generated by an experiment is unbiased.
Regression
In situations when the causal effects of training cannot be studied using an experimental design,
scholars want to examine whether receiving training (T) has any effect on future earnings (Y). In this
case, scholars generally rely on potentially biased observational data sets to investigate the causal
effect. For example, one can use a simple regression model by regressing future earnings (Y) on
training (T) and demographic variables such as age (x1) and race (x2):

Y = β0 + β1x1 + β2x2 + τT + ε.     (1.2)
Scholars then interpret the results by saying ‘‘ceteris paribus, the effect due to training is τ.’’ They typically assume τ is the causal effect due to the management intervention. Indeed, regression and structural equation modeling (SEM) (cf. Duncan, 1975; James, Mulaik, & Brett, 1982) remain dominant approaches for estimating treatment effects.2 Yet, regression cannot detect whether the cases are comparable in terms of distribution overlap on observed characteristics. Thus, regression models are unable to reconstruct counterfactuals. One can easily find many empirical studies that seek to estimate causal effects by regressing an outcome variable on an intervention dummy variable. The findings of these studies, which used observational data sets, could be wrong because they did not adjust for the covariate distributions between the treated and control groups.
Endogeneity
In addition to the nonequivalence of distribution between the control and treated groups, another
severe error that prevents scholars from calculating unbiased causal effects is endogeneity. This
occurs when the predictor T correlates with the error term ε in Equation 1.2. A number of review articles
have described the endogeneity problem and warned management scholars of its biasing effects
(e.g., Antonakis et al., 2010; Hamilton & Nickerson, 2003). Endogeneity arises from three main sources: measurement error, simultaneity, and omitted variables. Measurement error in the explanatory variables typically attenuates the estimated regression coefficients. Simultaneity occurs when at least one of the predictors is determined jointly with the dependent variable; a classic example is the estimation of price in a supply and demand model (Greene, 2008). An omitted variable problem appears when one fails to control for variables that correlate with both the explanatory and the dependent variables.
Of these three sources of endogeneity, the omitted variable bias has probably received the most
attention from management scholars. Returning to the earlier training example, suppose the
researcher only controls for demographic variables but does not control for an individual’s ability.
If training correlates with ability and ability correlates with future earnings, omitting ability will induce a correlation between the training dummy T and the residuals ε. This violates the assumption of strict exogeneity for linear regression models, and thus the estimated causal effect (τ) in Equation 1.2 will be biased. If the omitted variable
is time-invariant, one can use the fixed effects model to deal with endogeneity (Allison, 2009). Beck,
Brüderl, and Woywode’s (2008) simulation showed that the fixed effects model provided correction
for biased estimation due to the omitted variable.
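The omitted-ability example can be sketched as a simulation. In the hypothetical data-generating process below (all coefficients are invented for illustration), ability drives both training take-up and earnings; regressing earnings on training alone then overstates the training effect, while controlling for ability recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data-generating process: ability drives both training and earnings.
ability = rng.normal(0, 1, n)
train = (ability + rng.normal(0, 1, n) > 0).astype(float)  # abler people train more often
earnings = 2_000 * train + 3_000 * ability + rng.normal(0, 1_000, n)

def ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

biased = ols(train.reshape(-1, 1), earnings)[1]                 # ability omitted
unbiased = ols(np.column_stack([train, ability]), earnings)[1]  # ability controlled

print(biased, unbiased)   # the first estimate is far larger than the true 2,000
```

The omitted variable loads the effect of ability onto the training dummy, which is exactly the correlation between T and the residuals described above.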
One can also view nonrandom sample selection as a special case of the omitted variable problem.
Taking the effect of training on earnings as an example, one can only observe earnings for individuals who are employed. Employed individuals could be a nonrandom subset of the population. One
can write the nonrandom selection process as Equation 1.3,

D = αZ + u,     (1.3)
where D is the latent selection variable (equal to 1 for employed individuals), Z represents a vector of variables
(e.g., education level) that predicts selection, and u denotes disturbances. One can call Equation 1.2
the substantive equation and Equation 1.3 the selection equation. Sample selection bias is likely to materialize when there is a correlation between the disturbances of the substantive equation (ε) and the selection equation (u) (Antonakis et al., 2010, p. 1094; Berk, 1983; Heckman, 1979). When there is a correlation between ε and u, the Heckman selection model, rather than the PSM, should be used to calculate
causal effect (Antonakis et al., 2010). To correct for sample selection bias, one first fits the selection model using a probit or logit model. The predicted values from the selection model are then used to compute the density and distribution values, from which the inverse Mills ratio (λ), the ratio of the density value to the distribution value, is calculated. Finally, the inverse Mills ratio is included in the substantive Equation 1.2 to correct for the bias in τ due to selection.
For more information on two-stage selection models, readers can consult Berk (1983).
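As a sketch of the quantity described above: the inverse Mills ratio is the standard normal density over the standard normal distribution function, evaluated at the predicted probit index from the selection equation. In a full Heckman two-step estimation this value would then be added as a regressor to Equation 1.2.

```python
import math

def inverse_mills_ratio(z):
    """lambda(z) = phi(z) / Phi(z): the standard normal density over the
    standard normal CDF, evaluated at the predicted probit index z."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return pdf / cdf

# The ratio shrinks as selection becomes near-certain (large z) and grows
# when an observation is selected despite a low predicted probability.
print(inverse_mills_ratio(0.0))   # 0.7979 (= sqrt(2/pi))
print(inverse_mills_ratio(2.0))
```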
The Propensity Score Method
Having briefly reviewed existing techniques for estimating causal effects, I now discuss how PSM
can help scholars to draw correct causal inferences. The PSM is a technique that allows researchers
to reconstruct counterfactuals using observational data. It does this by reducing two sources of bias
in the observational data: bias due to lack of distribution overlap and bias due to different density
weighting (Heckman, Ichimura, Smith, & Todd, 1998). A propensity score can be defined as the
probability of study participants receiving a treatment based on observed characteristics. The PSM
refers to a procedure that uses propensity scores and a matching algorithm to calculate the causal effect.
Before moving on, it is useful to conceptually differentiate the PSM from Heckman’s (1979) ‘‘selection model.’’ The selection model handles the probability of treatment assignment indirectly, through instrumental variables. Thus, the probability calculated using the selection model requires one or
more variables that are not censored or truncated and that can predict the selection. For example,
if one wanted to study how training affects future earnings, one must consider the self-selection
problem, because wages can only be observed for individuals who are already employed. Using the
predicted probability calculated from the first stage (Equation 1.3), one can compute the inverse
Mills ratio and insert this variable into the wage prediction model to correct for selection bias. In contrast to the predicted probability calculated in the Heckman selection model, propensity scores are
calculated directly only through observed predictors. Furthermore, the propensity scores and the predicted probabilities calculated using Heckman selection have different purposes in estimating causal
effects: The probabilities estimated from the Heckman model generate an inverse Mills ratio that can
be used to adjust for bias due to censoring or truncation, whereas the probabilities calculated in the
PSM are used to adjust covariate distribution between the treated group and the control group.
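To make the contrast concrete, here is a minimal sketch of the propensity-score side: a logistic model of treatment on observed covariates only, fit by plain gradient descent. The covariates, coefficients, and sample are invented for illustration; in practice one would use standard logit or probit routines.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Illustrative observed covariates (age, education) driving treatment take-up.
X = np.column_stack([rng.normal(35, 10, n), rng.normal(12, 3, n)])
logit = -4 + 0.05 * X[:, 0] + 0.15 * X[:, 1]
T = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

# Fit a logistic model T ~ X by gradient descent; no selection equation and
# no instruments -- only observed predictors, as the PSM requires.
Z = np.column_stack([np.ones(n), (X - X.mean(0)) / X.std(0)])  # intercept + standardized X
w = np.zeros(Z.shape[1])
for _ in range(2_000):
    p = 1 / (1 + np.exp(-Z @ w))
    w -= 0.5 * Z.T @ (p - T) / n        # gradient step on the negative log-likelihood

pscore = 1 / (1 + np.exp(-Z @ w))        # each unit's estimated probability of treatment
print(pscore.min(), pscore.max())
```

Unlike the predicted probabilities feeding the inverse Mills ratio, these scores are used directly to align the covariate distributions of the two groups.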
Reasons for Using the PSM
Because there are many methods that can estimate causal effects, why should management scholars
care about the PSM? One reason is that most publications in the management field rely on observational data. Such large data sets are relatively inexpensive to obtain, but they are almost always observational rather than experimental. By adjusting the covariate distributions between the treated and control groups, the PSM allows scholars to reconstruct counterfactuals using observational data. If the strongly ignorable assumption, discussed in the next section, is satisfied, then the PSM can produce an unbiased causal effect using observational data sets.
Second, mis-specified econometric models using observational data sometimes produce biased estimators. One source of such bias is a lack of distribution overlap between the two samples, and regression analysis cannot reveal the extent of that overlap. Cochran (1957, pp.
265-266) illustrated this problem using the following example: ‘‘Suppose that we were adjusting for
differences in parents’ income in a comparison of private and public school children, and that the
private-school incomes ranged from $10,000–$12,000, while the public-school incomes ranged
from $4,000–$6,000. The covariance would adjust results so that they allegedly applied to a mean
income of $8,000 in each group, although neither group has any observations in which incomes are
at or near this level.’’ The PSM can easily detect a lack of covariate distribution overlap between two groups and adjust the distributions accordingly.
Third, linear or logistic models have been used to adjust for confounding covariates, but such
models rely on assumptions regarding functional form. For example, one assumption required for
a linear model to produce an unbiased estimator is that it does not suffer from the aforementioned
problem of endogeneity. Although the procedure to calculate propensity scores is parametric, using
propensity scores to compute causal effect is largely nonparametric. Thus, using the PSM to calculate the causal effect is less susceptible to the violation of model assumptions. Overall, when one is
interested in investigating the effectiveness of a certain management practice but is unable to collect
experimental data, the PSM should be used, at least as a robustness check on the findings estimated by parametric models.
Overview of the PSM
The concept of subclassification is helpful for understanding the PSM. Simply comparing the mean
difference of the outcome variables in two groups typically leads to biased estimators, because the
distributions of the observational variables in the two groups may differ. Cochran’s (1968) subclassification method first divides an observational variable into n subclasses and then estimates the
treatment effect by comparing the weighted means of the outcome variable in each subclass. He used
two approaches to demonstrate the effectiveness of subclassification in reducing bias in observational studies. First, he used an empirical example (death rate for smoking groups with country of
origin and age as covariates) to show that when age was divided into two classes more than half the
effect of the age bias was removed. Second, he used a mathematical model to derive the proportion
of bias that can be removed through subclassification. For different distribution functions, using five
or six subclasses will typically remove 90% or more of the bias shown in the raw comparison. With
more than six subclasses, only small amounts of additional bias can be removed. Yet, subclassification is difficult to utilize if many confounding covariates exist (Rubin, 1997).
To overcome the difficulty of estimating the treatment effects using Cochran’s technique, Rosenbaum and Rubin (1983) developed the PSM. The key objective of the PSM is to replace the many
confounding covariates in an observational study with one function of these covariates. The function
(or the propensity score) captures the likelihood of study participants receiving a treatment based on
observed covariates. The estimated propensity score is then used as the only confounding covariate
to adjust for all of the covariates that go into the estimation. Because the propensity score summarizes all covariates in a single variable, and Cochran found that five blocks can remove about 90% of the bias in a raw comparison, stratifying the propensity score into five blocks can generally remove much of the imbalance in the observed covariates between the treated group and the control group.
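Cochran's subclassification logic, applied to the propensity score, can be sketched as follows. In this simulation (the true effect of 1,000 and all other numbers are illustrative assumptions), a single confounder drives both treatment and the outcome; the raw comparison is heavily biased, while stratifying on the propensity score into five blocks and weighting by the treated share in each block removes most of that bias:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# One confounder drives both treatment take-up and the outcome.
x = rng.normal(0, 1, n)
p_true = 1 / (1 + np.exp(-x))            # (here, known) propensity score
T = (rng.uniform(size=n) < p_true).astype(int)
y = 1_000 * T + 2_000 * x + rng.normal(0, 500, n)

naive = y[T == 1].mean() - y[T == 0].mean()     # confounded raw comparison

# Stratify the propensity score into five blocks and weight each
# within-stratum contrast by the share of treated units in it.
edges = np.quantile(p_true, [0, .2, .4, .6, .8, 1])
strata = np.clip(np.searchsorted(edges, p_true, side="right") - 1, 0, 4)
att = 0.0
for s in range(5):
    m = strata == s
    diff = y[m & (T == 1)].mean() - y[m & (T == 0)].mean()
    att += diff * (m & (T == 1)).sum() / (T == 1).sum()

print(naive, att)   # stratification removes most, not all, of the bias
```

Consistent with Cochran's result, five blocks leave only a small residual bias relative to the raw comparison.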
Central to understanding the PSM is the balancing score. Rosenbaum and Rubin (1983) defined
a balancing score as a function of the observed covariates such that the conditional distribution of X given the balancing score is the same for the treated group and the control group. Formally, a balancing score b(X) satisfies X ⊥ T | b(X), where X is a vector of the observed covariates, T represents the treatment assignment, and ⊥ denotes independence. Rosenbaum and Rubin argued that the propensity score is a type of balancing score. They further proved that the finest balancing score is b(X) = X, that the coarsest balancing score is the propensity score, and that any score finer than the propensity score is a balancing score.
Rosenbaum and Rubin (1983) also introduced the strongly ignorable assumption, which implies
that given the balancing scores, the distributions of the covariates between the treated and the control
groups are the same. They further showed that treatment assignment is strongly ignorable if it satisfies the condition of unconfoundedness and overlap. Unconfoundedness means that conditional on
observational covariates X, the potential outcomes (Y1 and Y0) are not influenced by treatment assignment (Y1, Y0 ⊥ T | X). This assumption simply asserts that the researcher can observe all variables that need to be adjusted. The overlap assumption means that given covariates X, a person with those X values has a positive probability of being assigned to either the treated group or the control group (0 < pr(T = 1 | X) < 1).
The strongly ignorable assumption rules out systematic, pretreatment, unobserved differences between the treated and the control subjects that participate in the study (Joffe & Rosenbaum, 1999).
Given the strongly ignorable assumption, the ATT defined in Equation 1.1b can be estimated using the balancing score. Because the propensity score e(x) is one form of balancing score, one can estimate the ATT by subtracting the average outcome of the control group from that of the treated group at a particular propensity score. Thus, Equation 1.1b can be rewritten as ATT = E{Y | T = 1, e(x)} − E{Y | T = 0, e(x)}.
If there are unobserved variables that simultaneously affect the treatment assignment and the outcome variable, the treatment assignment is not strongly ignorable. One can compare the failure of
the strongly ignorable assumption with endogeneity in the mis-specified econometric models. One
can view this as the omitted or unmeasured variable problem (cf. James, 1980). Specifically, when
one calculates the propensity scores, one or more variables that may affect treatment assignment and
outcomes are omitted. For example, suppose an unobserved variable partially determines treatment
assignment. In this case, two individuals with the same values of observed covariates will receive the
same propensity score, despite the fact that they have different values of unobserved covariates and,
thus, should receive different propensity scores. If the strongly ignorable assumption is violated, the
PSM will produce biased causal effects.
Estimating Causal Effects With the Propensity Score Method
If the treatment assignment is strongly ignorable, scholars can use the PSM to remove the difference
in the covariates’ distributions between the treated and the control groups (Imbens, 2004). This section details how scholars can apply the PSM to compute causal effects. Generally speaking, four
major steps need to take place to estimate causal effect (Figure 1): (1) Determine observational covariates and estimate the propensity scores, (2) stratify the propensity scores into different strata and
test the balance for each stratum, (3) calculate the treatment effect by selecting appropriate methods
such as matched sampling (or matching) and covariance adjustment, and (4) conduct a sensitivity
test to justify that the estimated ATT is robust.
To demonstrate how scholars can use the proposed procedure listed in Figure 1 to gauge causal
effect, I analyze three sources of data sets that have been widely used by economists (Dehejia &
Wahba, 1999, 2002; Heckman & Hotz, 1989; Lalonde, 1986; Smith & Todd, 2005). These data
sets include both experimental and observational data. Given that the unbiased treatment effect
can be computed from the experimental design, it is possible to compare the discrepancy between
the estimated ATT using observational data and the unbiased ATT calculated from the experimental design.
The National Supported Work Demonstration (NSW) data were collected using an experimental
design in which individuals were randomly chosen to provide data on work experience for a period
of around 6 to 18 months in the years from 1975 to 1977. This federally funded program randomly
selected qualified individuals for training positions so that they could get paying jobs and accumulate work experience. The other set of qualified individuals was randomly assigned to the control
group, where they had no opportunity to receive the benefit of the NSW program. To ensure that
the earnings information from the experiment included calendar year 1975 earnings, Lalonde
(1986) chose participants who were assigned to treatment after December 1975. This procedure
reduced the NSW sample to 297 treated individuals and 425 control individuals for the male
Figure 1. Steps for estimating treatment effects.

Step 1: Estimating propensity scores. Determine the observational covariates (including high-order covariates) and estimate the propensity scores (PScore) using (1) a logit/probit model, (2) an ordinal probit model, (3) a multinomial logit model, or (4) a hazard model.

Step 2: Stratifying and balancing propensity scores. Stratify the PScore into different strata and test the balance of the covariates within each stratum. If a covariate is not balanced, return to Step 1; if the covariates are balanced, proceed to Step 3.

Step 3: Estimating the causal effect. Estimate the causal effect using (1) matched sampling (stratified matching, nearest neighbor matching, radius matching, or kernel matching) or (2) covariate adjustment.

Step 4: Sensitivity test. Assess the robustness of the estimated effect using (1) multiple comparison groups, (2) specification tests, (3) instrumental variables, or (4) Rosenbaum bounds.

Note: PScore = propensity score.
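Of the matching estimators listed in Step 3, nearest neighbor matching is perhaps the simplest to sketch. The toy function below (an illustration, not code from the article) pairs each treated unit with the control whose propensity score is closest, matching with replacement, and averages the matched outcome differences to estimate the ATT:

```python
import numpy as np

def nn_match_att(pscore, T, y):
    """ATT by one-to-one nearest-neighbor matching with replacement:
    each treated unit is paired with the control unit whose propensity
    score is closest, and the mean matched difference is returned."""
    treated = np.flatnonzero(T == 1)
    controls = np.flatnonzero(T == 0)
    # index of the nearest control for every treated unit
    nearest = controls[np.abs(pscore[treated, None] - pscore[None, controls]).argmin(axis=1)]
    return (y[treated] - y[nearest]).mean()

# Toy example: two treated units matched to their closest controls.
pscore = np.array([0.8, 0.3, 0.75, 0.35, 0.5])
T = np.array([1, 1, 0, 0, 0])
y = np.array([10.0, 8.0, 7.0, 6.0, 9.0])
print(nn_match_att(pscore, T, y))   # matches 0.8->0.75 and 0.3->0.35: ((10-7)+(8-6))/2 = 2.5
```

Radius and kernel matching differ only in how the comparison controls are chosen and weighted; the stratified variant follows the subclassification logic described earlier.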
participants. Dehejia and Wahba (1999, 2002) reconstructed Lalonde’s original NSW data by
including individuals who attended the program early enough to obtain retrospective 1974 earning
information. The final NSW sample includes 185 treated and 265 control individuals.
Lalonde’s (1986) observational data consisted of two distinct comparison groups in the years
between 1975 and 1979: the Panel Study of Income Dynamics (PSID-1) and the Current Population Survey–Social Security Administration File (CPS-1). Initiated in 1968, the PSID is a nationally representative longitudinal database that interviews individuals and families for information on the dynamics of employment, income, and earnings. The CPS, a monthly survey conducted by the Bureau of the Census for the Bureau of Labor Statistics, provides comprehensive information on the
unemployment, income, and poverty of the nation’s population. Lalonde further extracted four data sets (denoted PSID-2, PSID-3, CPS-2, and CPS-3) that are designed to resemble the treatment group based on
simple pre-intervention characteristics (e.g., age or employment status). Table 1a reports the details of the data sets and the definitions of the variables.

Table 1a. Description of Data Sets and Definition of Variables

Data set (sample size): Description
NSW Treated (185): National Supported Work Demonstration (NSW) data were collected using an experimental design, in which qualified individuals were randomly assigned to training positions to receive pay and accumulate experience.
NSW Control (260): Experimental control group; the set of qualified individuals randomly assigned to this group had no opportunity to receive the benefit of the NSW program.
PSID-1 (2,490): Nonexperimental control group; 1975-1979 Panel Study of Income Dynamics (PSID), all male household heads under age 55 who did not classify themselves as retired in 1975.
PSID-2 (253): Subset of PSID-1 comprising those not working in the spring of 1976.
PSID-3 (128): Subset of PSID-2 comprising those not working in the spring of 1975.
CPS-1 (15,992): Nonexperimental control group; 1975-1979 Current Population Survey (CPS), all participants under age 55.
CPS-2 (2,369): Subset of CPS-1 comprising all men who were not working when surveyed in March 1976.
CPS-3 (429): Subset of CPS-2 comprising all unemployed men in 1976 whose income in 1975 was below the poverty line.

Variable: Definition
Treatment: Set to 1 if the participant comes from the NSW treated data set, 0 otherwise
Age: Age of the participant (in years)
Education: Number of years of schooling
Black: Set to 1 for Black participants, 0 otherwise
Hispanic: Set to 1 for Hispanic participants, 0 otherwise
Married: Set to 1 for married participants, 0 otherwise
Nodegree: Set to 1 for participants with no high school degree, 0 otherwise
RE74: Earnings in 1974
RE75: Earnings in 1975
RE78: Earnings in 1978, the outcome variable
Step 1: Estimating the Propensity Scores
To calculate a propensity score, one first needs to determine the covariates. Heckman, Ichimura, and
Todd (1997) demonstrated that the quality of the observational variables has a significant impact on
the estimated results. Having knowledge of relevant theory, institutional settings, and previous
research is beneficial for scholars in specifying which variables should be included in the model (Smith & Todd, 2005). To appropriately represent the theory, scholars need to specify not only the observational covariates but also high-order covariates such as quadratic effects and interaction effects.
From a methodological perspective, researchers need to add high-order covariates to achieve strata
balance. The process of adding high-order covariates will be discussed in the section detailing how
to obtain a balance of propensity scores in each stratum. A recent development called boosted regression can also be implemented to calculate propensity scores (McCaffrey, Ridgeway, & Morral, 2004). Boosted regression can simplify the process of achieving balance in each stratum. Appendix A provides further discussion on this technique.

Table 1b. Summary Statistics

| Sample | Statistic | Age | Education | Black | Hispanic | Married | Nodegree | RE74 | RE75 | N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NSW Treated | M | 25.82 | 10.35 | 0.84 | 0.06 | 0.19 | 0.71 | 2,095.57 | 1,532.06 | 185 |
| | SD | 7.16 | 2.01 | 0.36 | 0.24 | 0.39 | 0.46 | 4,886.62 | 3,219.25 | |
| NSW Control | M | 25.05 | 10.09 | 0.83 | 0.11 | 0.15 | 0.83 | 2,107.03 | 1,266.91 | 260 |
| | SD | 7.06 | 1.61 | 0.38 | 0.31 | 0.36 | 0.37 | 5,687.91 | 3,102.98 | |
| | SB | 10.73 | 14.12 | 4.39 | 17.46 | 9.36 | 30.40 | 0.22 | 8.39 | |
| PSID-1a | M | 34.85 | 12.12 | 0.25 | 0.03 | 0.87 | 0.31 | 19,428.75 | 19,063.34 | 2,490 |
| | SD | 10.44 | 3.08 | 0.43 | 0.18 | 0.34 | 0.46 | 13,406.88 | 13,596.95 | |
| | SB | 100.94 | 68.05 | 147.98 | 12.86 | 184.23 | 87.92 | 171.78 | 177.44 | |
| PSID-1Mb | M | 30.96 | 11.14 | 0.70 | 0.05 | 0.45 | 0.41 | 11,386.48 | 9,528.64 | 1,103 |
| | SD | 9.46 | 2.59 | 0.49 | 0.22 | 0.42 | 0.49 | 9,326.64 | 8,222.72 | |
| | SB | 61.35 | 34.29 | 33.13 | 3.48 | 64.37 | 62.44 | 124.79 | 128.07 | |
| Percentage reduction in SB | | 39.22 | 49.61 | 77.61 | 72.94 | 65.06 | 28.98 | 27.36 | 27.82 | |

Note: SB = standardized bias estimated using Formula 2.1; N = number of cases.
a PSID-1: All male household heads under age 55 who were not classified as retired.
b PSID-1M is the subsample of PSID-1 that is matched to the treatment group (NSW Treated).
Steiner, Cook, Shadish, and Clark (2010) replicated a prior study to show the importance of appropriately selecting covariates. They summarized three strategies for covariate selection: First, select covariates that are correctly measured and modeled. Second, choose covariates that reduce selection bias; these are covariates that are highly correlated with the treatment (best predict treatment) and with the outcomes (best predict outcomes). Finally, if there is no theoretically or empirically sound prior guidance for covariate selection (e.g., the research question is very new), scholars can measure a rich set of covariates to increase the likelihood of including covariates that satisfy the strongly ignorable assumption.
After specifying the observational covariates, the propensity scores can be estimated using these
observational variables. This article summarizes four different approaches that can be used to estimate the propensity scores. If there is only one treatment (e.g., training), then one can use a logistic
model, probit model, or prepared program.3 If treatment has more than two versions (e.g.,
individuals receive several doses of medicine), then an ordinal logistic model can be used (Joffe
& Rosenbaum, 1999). The treatment must be ordered based on certain threshold values. If there
is more than one treatment and the treatments are discrete choices (e.g., Group 1 receives payment,
Group 2 receives training), the propensity scores can be estimated using a multinomial logistic
model. Receiving treatment does not need to happen at the same time. For many treatments, a decision needs to be made regarding whether to treat now or to wait and treat later. The decision to treat
now versus later is driven by the participants’ preferences. Under this condition, one can use the Cox
proportional hazard model to compute the propensity scores. Li, Propert, and Rosenbaum (2001)
demonstrated that the hazard model has properties similar to those of propensity scores.
Except for the Cox model that uses partial likelihood (PL) and does not require us to specify the
baseline hazard function, the estimating technique used in the aforementioned models is maximum
likelihood estimation (MLE) (see Greene, 2008, Chapter 16, for more information on MLE). The
logistic models and the hazard model all assume a latent variable (Y*) that represents an underlying
propensity or probability to receive treatment. Long (1997) argues that one can view a binary outcome variable as a latent variable. When the estimated probability is greater than a certain threshold or cut point (t), one observes the treatment (Y* > t; T = 1). For an ordinal logistic model, one can understand the latent variable as having multiple thresholds and observe the treatment according to those thresholds (e.g., t1 < Y* < t2; T = 2). The multinomial logistic model can simply be viewed as a model that simultaneously estimates a binary model for all possible comparisons among outcome categories (Long, 1997), but it is more efficient to use a multinomial logistic model than to estimate multiple binary models. It is somewhat tricky to generate the predicted probability from the Cox model because it is semiparametric, with no assumption about the baseline distribution. Two alternative choices can be used to better derive probabilities from a survival model: (1) one can rely on a parametric survival model that specifies the baseline hazard, or (2) one can transform the data in order to use a discrete-time model.
To illustrate how to calculate propensity scores, this study employed treatment group data from
the NSW and control group data from the observational data extracted from the PSID-2. Following
Dehejia and Wahba (1999), I selected age, education, no degree, Black, Hispanic, RE74, RE75, age squared, RE74 squared, RE75 squared, and RE74 × Black as covariates to calculate the propensity scores.
To compute propensity scores, one can first run a logistic or probit model using a treatment
dummy (whether an individual received training) as the dependent variable and the aforementioned covariates as the independent variables. Propensity scores can be obtained by calculating
the fitted value from the logistic or probit models (use -predict mypscore, p- in STATA). Readers can refer to Hoetker (2007) for more information on calculating probabilities from logit or probit models. Appendix B reports the calculated propensity scores for a randomly selected sample (n = 50) from the combined NSW and PSID-2 data set. Readers can obtain the data for Appendix B, NSW Treated, and PSID-2 from the author.
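The logistic step can also be sketched outside of a canned routine. The following is a minimal pure-Python illustration (not the article's code; the function name, tuning values, and toy data are mine): it fits a logistic model of the treatment dummy on the covariates by gradient ascent on the log-likelihood, and the fitted probabilities are the propensity scores.

```python
import math

def estimate_propensity_scores(X, t, lr=0.5, epochs=3000):
    """Fit P(T = 1 | X) with a logistic model by gradient ascent on the
    log-likelihood; return the fitted probabilities (propensity scores).
    X: list of covariate vectors; t: list of 0/1 treatment indicators."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)                      # intercept followed by slopes
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))   # current fitted probability
            err = ti - p                     # score (gradient) contribution
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * xi[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return [1.0 / (1.0 + math.exp(-(w[0] + sum(wj * xj
            for wj, xj in zip(w[1:], xi))))) for xi in X]

# Toy data: treatment is more likely at higher covariate values, so the
# fitted propensity scores should rise with x.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
t = [0, 0, 0, 1, 1, 1]
scores = estimate_propensity_scores(X, t)
```

In practice one would use a prepared routine (e.g., -pscore- in STATA), but the sketch makes explicit that a propensity score is nothing more than a fitted probability from a treatment-assignment model.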
Step 2: Stratifying and Balancing the Propensity Scores
After estimating the propensity scores, the next step is to subclassify them into different strata such
that these blocks are balanced on propensity scores. The number of balanced propensity score blocks
depends on the number of observations in the data set. As discussed previously, five blocks are a good
starting point to stratify the propensity scores (Rosenbaum & Rubin, 1983). One then can test the balance of each block by examining the distribution of covariates and the variance of propensity scores.
The t test and the test for standardized bias (SB) are two widely used techniques to ensure the balance
of the strata (Rosenbaum & Rubin, 1985). The t-test compares whether the means of covariates differ
between the treated and the matched control groups. The SB approach calculates the difference of sample means in the treated and the matched control groups as a percentage of the square root of the average sample variance in both groups. To conduct the SB test, scholars need to compare values
calculated before and after matching. The formula used to calculate the SB value can be written as
SB_match = 100 × |X̄_1M − X̄_0M| / √(0.5 (V_1M(X) + V_0M(X))),    (2.1)

where X̄_1M (V_1M) and X̄_0M (V_0M) are the means (variances) for the treated group and the matched control group, respectively. In addition to these two widely used tests, the Kolmogorov-Smirnov two-sample test can also be used to investigate the overlap of the covariates between the treated and the control groups.
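Formula 2.1 is straightforward to compute. A minimal sketch (the function name is mine, not the article's) using sample means and variances:

```python
from statistics import mean, variance

def standardized_bias(treated, control):
    """Standardized bias (Formula 2.1): the absolute difference in sample
    means as a percentage of the square root of the average sample variance."""
    diff = abs(mean(treated) - mean(control))
    pooled = (0.5 * (variance(treated) + variance(control))) ** 0.5
    return 100.0 * diff / pooled

# Example: means 3 and 4, both sample variances 2.5,
# so SB = 100 * 1 / sqrt(2.5), roughly 63.2.
sb = standardized_bias([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```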
Balanced strata between the treated and the matched control group ensure the minimal distance in
the marginal distributions of the covariates. If any pretreatment variable is not balanced in a particular block, one needs to subclassify the block into additional blocks until all blocks are balanced. To
obtain strata balance, researchers sometimes need to add high-order covariates and recalculate the
propensity scores. Rosenbaum and Rubin (1984) detailed the process of cycling between checking
for balance within strata and reformulating the propensity model. Two guidelines for adding
high-order covariates have been proposed: (1) When the variances of a critical covariate are found
Downloaded from orm.sagepub.com at Vrije Universiteit 34820 on January 30, 2014
12
Organizational Research Methods 00(0)
to differ dramatically between the treatment and the control group, the squared terms of the covariate
need to be included in the revised propensity score model and (2) when the correlation between two
important covariates differs greatly between the groups, the interaction of the covariates can be
added to the propensity score model.
Appendix B shows a simple example of stratifying data into five blocks after calculating the propensity scores. For this illustration, I stratified the 50 cases into five groups. I first identified the
cases with propensity scores smaller than 0.05, which were classified as unmatched. When the propensity scores were smaller than 0.2 but larger than 0.05, I coded this as block 1 (Block ID ¼ 1).
When the propensity scores were smaller than 0.4 but larger than 0.2, this was coded as block 2. This
process was repeated until I had created five blocks, and then I conducted the t-test within each block to detect any significant difference in propensity scores between the treated and control groups. The t values for each block were added in the columns next to the Block ID column. Overall, the t-test
reveals that the difference of propensity scores between the treated and control groups is statistically
insignificant. If the t-test shows that there are statistically significant differences in propensity
scores, one should either change threshold value of propensity scores in each block or change the
covariates to recalculate the propensity scores.
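The stratify-then-test loop just described can be sketched as follows (a hedged illustration; the cutoffs mirror the 0.05/0.2/0.4/0.6/0.8 blocks used above, and the helper names are mine):

```python
from statistics import mean, variance

def assign_blocks(pscores, cutoffs=(0.05, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Assign each propensity score to a block; scores at or below the
    first cutoff are classified as unmatched (block 0)."""
    blocks = []
    for p in pscores:
        block = 0
        for b in range(1, len(cutoffs)):
            if cutoffs[b - 1] < p <= cutoffs[b]:
                block = b
        blocks.append(block)
    return blocks

def welch_t(x, y):
    """Welch t statistic for the difference in means between two samples;
    within a balanced block it should be statistically insignificant."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return (mean(x) - mean(y)) / se2 ** 0.5

blocks = assign_blocks([0.03, 0.10, 0.30, 0.50, 0.70, 0.90])
```

If the t statistic within a block is large, one would split the block further or revise the propensity score model, exactly as described above.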
When the propensity scores in each stratum are balanced, all covariates in each stratum should
also achieve equivalence of distribution. To confirm this, one can conduct the t-test for each observational variable. To illustrate how balance of propensity scores within strata helps to achieve distribution overlap for other covariates, Appendix B reports the values for one continuous variable,
age. One can conduct the t-test to ensure that there is no age difference between the treated and control groups within each stratum. The column Tage reports the t-test for age within the strata. After
balancing each block’s propensity scores, the age difference between the treated and control groups
in each block became statistically insignificant. I recommend that readers use a prepared statistical package to stratify propensity scores, as a program can simultaneously categorize propensity scores
and conduct balance tests. For instance, one can use the -pscore- program in STATA (Becker &
Ichino, 2002) to estimate, stratify, and test the balance of propensity scores.
To further illustrate how the PSM can achieve strata balance, I replicated the aforementioned two
procedures for the combined experimental data set and each of the observational data sets in Table
1a. Following Dehejia and Wahba’s (1999) suggestions on choice of covariates, I first computed
propensity scores for each data set. Then, the propensity scores were stratified and tested for the balance within each stratum. When the propensity scores achieved balance within each stratum, I
plotted the means of propensity scores in each stratum for each matched data set. Figure 2 provides
evidence that the means of the propensity scores are almost the same for each sample within each
balanced block.
To demonstrate the effectiveness of the PSM in adjusting for the balance of other covariates,
Table 1b summarizes the means, standard deviations, and SB of the matched sample. Comparing the
results between the matched and unmatched samples, one can see that the difference of most
observed characteristics between the experimental design and the nonexperimental design reduces
dramatically. For instance, PSID-1 of Table 1b reports that the absolute SB values range from 12.86
to 184.23 (before using propensity score matching), but PSID-1M of Table 1b shows that the absolute minimum value of SB is 3.48 and the absolute maximum value of SB is 128.07.
Furthermore, the t-test and the Kolmogorov-Smirnov two-sample test were conducted to examine the balance of each variable. As reported in Table 2, for the PSID-1 sample, except for RE74 in Block
3, one cannot see a p value smaller than 0.1. For simplicity, Table 2 uses only continuous variables
that have been included for estimating the propensity scores to illustrate the effectiveness of the PSM
in increasing the distribution overlap between the treated group and the matched control group.
Overall, Table 2 shows strong evidence that after obtaining balance of propensity scores within a
stratum, the covariates achieve overlap in terms of distribution. To preserve space, Table 1b and
Figure 2. Means of propensity scores in balanced strata. Each panel plots the mean propensity score (0 to 1) against the balanced Block ID for the treated and control groups of one matched sample: PSID-1, PSID-2, PSID-3, CPS-1, CPS-2, and CPS-3.
Note: PSID = Panel Study of Income Dynamics; CPS = Current Population Survey–Social Security Administration File.
Table 2. Test of Strata Balance (PSID-1)

| Block ID | t-test: Age | t-test: Education | t-test: RE74 | t-test: RE75 | KS: Age | KS: Education | KS: RE74 | KS: RE75 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.800 | 0.995 | 0.283 | 0.685 | 0.566 | 1.000 | 0.697 | 0.984 |
| 2 | 0.856 | 0.319 | 0.632 | 0.627 | 0.998 | 0.894 | 0.983 | 0.998 |
| 3 | 0.834 | 0.765 | 0.077 | 0.641 | 0.832 | 1.000 | 0.044 | 0.851 |
| 4 | 0.853 | 0.378 | 0.744 | 0.874 | 0.954 | 0.999 | 0.949 | 0.754 |
| 5 | 0.341 | 0.816 | 0.711 | 0.113 | 0.613 | 0.844 | 0.512 | 0.026 |
| 6 | 0.353 | 0.196 | 0.888 | 0.956 | 0.950 | 0.942 | 0.466 | 0.878 |
| 7 | 0.603 | 0.574 | 0.791 | 0.747 | 0.280 | 0.828 | 1.000 | 1.000 |

Note: The table reports the p value of each variable for each stratum between the National Supported Work Demonstration (NSW) Treated and matched control groups. PSID-1 = 1975-1979 Panel Study of Income Dynamics (PSID), all male household heads under age 55 who were not classified as retired in 1975. KS = Kolmogorov-Smirnov two-sample test between NSW Treated and matched control groups.
Table 2 report statistics only for PSID-1. Readers can get a full version of these two tables by contacting the author. The aforementioned evidence generally supports the conclusion that the covariates are balanced for the treated and control groups.
Step 3: Estimating the Causal Effect
Because the data sets include an experimental design, one can compute the unbiased causal effect.
Table 3 shows the estimated results of training on earnings in 1978 (RE78). The first row of Table 3
reports the benchmark values calculated using the experimental data. The unadjusted result
($1,794.34) was calculated by subtracting the mean of RE78 in the control group (NSW Control) from the mean of RE78 in the treated group (NSW Treated). The adjusted estimation ($1,676.34)
was computed by using regression, controlling for all observational covariates. Because the experimental data compiled by Lalonde (1986) does not achieve the same distribution between the treated
and control groups (Table 1b), this article uses the causal effect value calculated by the adjusted estimation as the benchmark value. From Table 3 column 1, it is obvious that if there are substantial
differences among the pretreatment variables (as shown in Table 1b), using the mean difference
to estimate the causal effect is strongly biased (it ranges from –$15,204.78 to $1,069.85). In Table
3 column 2, a simple linear regression model was used to gauge the adjusted training effects. Column 2 shows that the estimated treatment effects (with a range from $699.13 to $1,873.77) are more
reliable than those calculated using the mean differences.
In addition to mean difference and regression, PSM can also be used to effectively estimate the
ATT. When the propensity scores are balanced in all strata, one can use two standard techniques to
compute the ATT: matched sampling (e.g., stratified matching, nearest neighbor matching, radius
matching, and kernel matching) and covariance adjustment. Matched sampling or matching is a
technique used to sample certain covariates from the treated group and the control group to obtain
a sample with similar distributions of covariates between the two groups.4 Rosenbaum (2004) concluded that propensity score matching can increase the robustness of the model-based adjustment
and avoid unnecessarily detailed description. The quality of the matched samples depends on the
covariate balance and the structure of the matched sets (Gu & Rosenbaum, 1993).
Ideally, exact matching on all confounding variables is the best matching approach because the
sample distribution of all confounding variables would be identical in the treated and control groups.
Unfortunately, exact matching on a single confounding variable will reduce the number of final
matched cases. Supposing that there are k confounding variables and each variable has three levels, there will be 3^k patterns of levels on which to obtain perfectly matched samples. Thus, it is impractical to use the
exact matching technique to get the identical distribution of confounding variables between the two
groups. The PSM is more appropriate than exact matching because it reduces the covariates from
k-dimensional to one-dimensional. Rosenbaum and Rubin (1983) also showed that the PSM not only
simplified the matching algorithm, but also increased the quality of the matches.
Stratified Matching
After achieving strata balance, one can apply stratified matching to calculate the ATT. In each
balanced block, the average differences in the outcomes of the treated group and the matched control
group are calculated. The ATT will be estimated by the mean difference weighted by the number of
treated cases in each block. The ATT can be expressed as
ATT = Σ_{q=1}^{Q} [ (Σ_{i∈I(q)} Y_i^T) / N_q^T − (Σ_{j∈I(q)} Y_j^C) / N_q^C ] (N_q^T / N^T),    (2.2)

where Q denotes the number of blocks with balanced propensity scores, N_q^T and N_q^C refer to the number of cases in the treated and the control groups for matched block q, Y_i^T and Y_j^C represent the observed outcomes for case i in the matched treated group q and case j in the matched control group q, respectively, and N^T stands for the total number of cases in the treated group.
Table 3. Estimation Results

| Sample | (1) Unadjusted a | (2) Adjusted b | (3) Stratified ATT c | (4) N d | (5) NN ATT | (6) N d | (7) Radius ATT e | (8) N d | (9) Kernel ATT f | (10) N d | (11) Covariate Adjustment ATT g | (12) N d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NSW | 1,794.34 | 1,676.34 (638.68) | | | | | | | | | | |
| PSID-1 h | −15,204.78 | 751.95 (915.26) | 1,637.43 (805.43) | 1,288 | 1,654.57 (1,174.63) | 248 | 1,871.44 (5,837.10) | 37 | 1,507.10 (826.11) | 1,153 | 1,952.23 (791.45) | 1,288 |
| PSID-2 h | −3,646.81 | 1,873.77 (1,060.56) | 1,467.04 (1,461.75) | 308 | 1,604.09 (1,092.40) | 231 | 1,519.60 (2,110.71) | 77 | 1,712.18 (1,226.90) | 297 | 1,593.32 (1,476.54) | 308 |
| PSID-3 h | 1,069.85 | 1,833.13 (1,159.78) | 1,843.20 (981.42) | 250 | 1,522.23 (1,920.24) | 217 | 1,632.74 (1,598.12) | 167 | 1,776.37 (1,425.32) | 245 | 1,583.41 (1,866.46) | 250 |
| CPS-1 i | −8,497.52 | 699.13 (547.64) | 1,488.29 (716.79) | 4,563 | 1,600.74 (957.05) | 280 | 1,890.13 (1,993.50) | 102 | 1,513.78 (726.47) | 4,144 | 1,634.81 (515.58) | 4,563 |
| CPS-2 i | −3,821.97 | 1,172.70 (645.86) | 1,676.43 (796.62) | 1,438 | 1,638.74 (1,014.64) | 271 | 1,775.99 (2,286.23) | 79 | 1,590.49 (736.85) | 1,416 | 1,550.90 (625.04) | 1,438 |
| CPS-3 i | −635.03 | 1,548.24 (781.28) | 1,505.49 (1,065.52) | 508 | 1,376.65 (1,129.24) | 273 | 1,307.63 (2,821.56) | 53 | 1,166.93 (864.38) | 493 | 1,572.09 (943.65) | 508 |
| Mean j | −5,122.71 | 1,313.15 | 1,602.98 | | 1,566.17 | | 1,666.26 | | 1,544.47 | | 1,647.80 | |
| Variance j | 35,078,950.9 | 270,327.32 | 21,084.82 | | 10,712.11 | | 51,101.52 | | 45,779.09 | | 23,016.46 | |

Note: Bootstrap with 100 replications was used to estimate standard errors for the propensity score matching; standard errors in parentheses. NN = nearest neighbor.
a The mean difference between the treatment group (NSW Treated) and the corresponding control group (NSW Control, PSID-1 to CPS-3).
b Least squares regression: regress RE78 (earnings in 1978) on age, treatment dummy, education, no degree, Black, Hispanic, RE74 (earnings in 1974), and RE75 (earnings in 1975).
c Stratify blocks based on propensity scores, and then use Formula 2.2 to estimate the ATT (average treatment effect on the treated).
d The total number of observations, including observations in NSW Treated and the corresponding matched control group.
e Radius value ranges from .0001 to .0000025.
f For kernel matching, when the number of cases is small, use a narrower bandwidth (.01) instead of .06.
g Use regression with weights defined by the number of treated observations in each balanced propensity score block.
h Observational covariates: age, treatment dummy, education, no degree, Black, Hispanic, RE74, and RE75. Higher-order covariates: age squared, RE74 squared, RE75 squared, RE74 × Black.
i Observational covariates: same as h; higher-order covariates: age squared, education squared, RE74 squared, RE75 squared, education × RE74.
j Mean and variance are calculated using the estimated ATT for each technique.
After stratifying the data into different blocks, one can calculate the ATT using the data listed in Appendix B. First, one can compute Σ_{i∈I(1)} Y_i^T (the summation of the outcome variable in each block over the treated cases, denoted as Y_i^T in Appendix B) and Σ_{j∈I(1)} Y_j^C (the summation of the outcome variable in each block over the control cases, denoted as Y_j^C in Appendix B). For example, in block 1 the summation of the outcome for the two treated cases is 49,237.66, and the summation of the outcome for the five control cases is 31,301.69. The number of cases in the treated group (N_1^T) and the control group (N_1^C) for matched block 1 is 2 and 5, respectively. One can then calculate the ATT for each block; for instance, ATT_{q=1} (for block 1) = 49,237.66/2 − 31,301.69/5 = 18,358.49. After computing the ATT for each block, one can obtain the weighted ATTs using the weight given by the fraction of treated cases in each block. For example, the weight for block 1 is 0.08 (two treated cases in block 1 divided by 25 treated cases in total). The final ATT is estimated by taking the summation of the weighted ATTs ($1,702.321), which means that individuals who received training will, on average, earn around $1,702 more per year than their counterparts who did not obtain governmental training. The estimated ATT using simple regression is $2,316.414. Comparing this with the true treatment effect in Table 3 ($1,676.34), one can see that the PSM produces an ATT substantively similar to the actual causal effect, given that the propensity scores of every block are balanced.
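Formula 2.2 and the weighting scheme just illustrated can be sketched in a few lines (the function and variable names are mine, not the article's):

```python
def stratified_att(blocks, treated, outcomes):
    """ATT via stratified matching (Formula 2.2): the per-block difference
    in mean outcomes, weighted by each block's share of all treated cases.
    blocks: block id per case (0 = unmatched); treated: 1/0 flags; outcomes: Y."""
    n_treated = sum(t for b, t in zip(blocks, treated) if b > 0)
    att = 0.0
    for q in sorted(set(b for b in blocks if b > 0)):
        yt = [y for b, t, y in zip(blocks, treated, outcomes) if b == q and t == 1]
        yc = [y for b, t, y in zip(blocks, treated, outcomes) if b == q and t == 0]
        if yt and yc:  # a balanced block contains both treated and control cases
            att += (sum(yt) / len(yt) - sum(yc) / len(yc)) * len(yt) / n_treated
    return att

# Two blocks: block 1 difference is 10 - 5 = 5 (weight 1/3),
# block 2 difference is 10 - 5 = 5 (weight 2/3), so the ATT is 5.
att = stratified_att([1, 1, 1, 2, 2, 2], [1, 0, 0, 1, 1, 0], [10, 4, 6, 8, 12, 5])
```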
I also conducted another simulation, randomly selecting 200 cases from NSW and PSID-2 50 times. The average ATT calculated by the PSM is $1,376.713, whereas the average ATT computed by regression analysis is $709.039. Clearly, the PSM produces an ATT closer to the true causal
effects than does the ordinary least squares (OLS). I further examined the balance test for each of
these 50 randomly drawn data sets. Thirteen of 50 data sets did not achieve strata balance. The average ATT calculated by the PSM was $979.612, and the average ATT calculated by OLS was
$697.626. For the remaining 37 data sets that achieved strata balance, the average ATT calculated
by the PSM was $1,516.23, and the average ATT calculated by OLS was $713.04. Therefore,
achieving balance of propensity scores in each stratum is very important for obtaining a less biased
estimator of causal effect.
I also provided SPSS code in Appendix C and STATA code in Appendix D, which readers can
adjust appropriately to other statistical packages for stratified matching. The codes show how to fit
the model with the logit model, calculate propensity scores, stratify propensity scores, conduct the
balance test, and compute the ATT using stratified matching. It is also convenient to implement the
procedure in Excel after calculating the propensity scores using other statistical packages. Readers
who are interested in Excel calculation can contact the author directly to obtain the original file for
the calculation in Appendix B. Moreover, Appendix E presents a table of prewritten PSM software in R, SAS, SPSS, and STATA so that readers can conveniently find an appropriate statistical package. Combining NSW Treated with the other observational data sets, column 3 of Table 3
further details the estimated ATT using stratified matching. Column 3 shows that the lowest estimated result is $1,467.04 (PSID-2) and the highest estimation of the treatment effect is $1,843.20
(PSID-3). Overall, stratified matching produces an ATT relatively close to the unbiased ATT
($1,676.34).
Nearest Neighbor and Radius Matching
Nearest neighbor (NN) matching computes the ATT by selecting n comparison units whose propensity scores are nearest to the treated unit in question. In radius matching, the outcome of the control
units matches with the outcome of the treated units only when the propensity scores fall in the predefined radius of the treated units. A simplified formula to compute the estimated treatment effect
using the nearest neighbor matching or the radius matching technique can be written as
ATT = (1/N^T) Σ_{i∈T} ( Y_i^T − (1/N_i^C) Σ_{j∈C} Y_j^C ),    (2.3)

where N^T is the number of cases in the treated group and N_i^C is a weighting scheme that equals the number of control-group cases matched to treated case i under a specific algorithm (e.g., for nearest neighbor matching, N_i^C will be the n comparison units with the closest propensity scores). For more information, readers can consult Heckman et al. (1997).
For NN matching, one can randomly draw either backward or forward matches. For example, in
Appendix B, for case 7 (propensity score ¼ 0.101), one can draw forward matches and find the control case (case 2) with the closest propensity score (0.109). Drawing backward matches, one can find
case 1 with the closest propensity score (0.076). After repeating this for each treated case, one can
calculate the ATT using Formula 2.3. For radius matching, one needs to specify the radius first. For
example, suppose one sets the radius at 0.01, then the only matched case for case 7 is case 2, because
the absolute value of the difference of the propensity scores between case 7 and case 2 is 0.008
(|0.101 – 0.109|), smaller than the radius value 0.01. One can repeat this matching procedure for each
of the treated cases and use Formula 2.3 to estimate the ATT. In Table 3, column 5 reports the estimated ATT using NN matching, which produced an ATT with a range from $1,376.65 (CPS-3) to
$1,654.57 (PSID-1). Column 7 describes the estimated ATT using the radius matching, which generated an ATT with a range from $1,307.63 (CPS-3) to $1,890.13 (CPS-1).
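The backward/forward search described above amounts to taking, for each treated case, the control case with the smallest propensity score distance. A one-nearest-neighbor sketch (matching with replacement; the function name is mine):

```python
def nn_att(treated_ps, treated_y, control_ps, control_y):
    """ATT via single nearest-neighbor matching with replacement, a special
    case of Formula 2.3 with one comparison unit per treated case."""
    diffs = []
    for p, y in zip(treated_ps, treated_y):
        # index of the control case with the closest propensity score
        j = min(range(len(control_ps)), key=lambda k: abs(control_ps[k] - p))
        diffs.append(y - control_y[j])
    return sum(diffs) / len(diffs)

# A treated case with score 0.101 matches the control at 0.109
# (distance 0.008), mirroring the Appendix B illustration above.
att = nn_att([0.101, 0.300], [10.0, 20.0], [0.076, 0.109, 0.500], [5.0, 8.0, 30.0])
```

Radius matching replaces the `min` step with the set of controls whose distance falls below a prespecified radius, averaging their outcomes instead.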
Kernel Matching
Kernel matching is another nonparametric estimation technique that matches all treated units with
a weighted average of all controls. The weighting value is determined by distance of propensity
scores, bandwidth parameter hn, and a kernel function K(.). Scholars can specify the Gaussian
kernel and the appropriate bandwidth parameter to estimate the treatment effect using the
Formula 2.4
ATT = (1/N^T) Σ_{i∈T} { Y_i^T − Σ_{j∈C} Y_j^C K((e_j(x) − e_i(x))/h_n) / Σ_{k∈C} K((e_k(x) − e_i(x))/h_n) },    (2.4)

where e_j(x) denotes the propensity score of case j in the control group, e_i(x) denotes the propensity score of case i in the treated group, and e_j(x) − e_i(x) represents the distance between the propensity scores.
When one applies kernel matching, one downweights the case in the control group that has a long
distance from the case in the treated group. The weight function K(.) in Equation 2.4 takes large values when e_j(x) is close to e_i(x). To show how this happens, suppose one chooses the Gaussian density function K(z) = (1/√(2π)) e^(−z²/2), where z = (e_j(x) − e_i(x))/h_n, sets h_n = 0.05, and wants to match treated case 14 with control cases 10 and 11 (Appendix B). One can then compute the z values for case 10 ([0.282 − 0.312]/0.05 = −0.6) and case 11 ([0.313 − 0.312]/0.05 = 0.02). The weights for cases 10 and 11 are 0.33 (K(−0.6)) and 0.40 (K(0.02)), respectively. Clearly, the weight is low for case 10 (0.33), which has a long propensity score distance from treated case 14 (0.282 − 0.312 = −0.03), whereas the weight is relatively large for case 11 (0.40), which has a short distance from case 14 (0.313 − 0.312 = 0.001). For more information on kernel matching, readers can refer to Heckman et
al. (1998). In Table 3, column 9 shows the results for the kernel matching. The estimated ATT using
the kernel matching technique ranges from $1,166.93 (CPS-3) to $1,776.37 (PSID-3).
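The weighting in Formula 2.4 can be sketched as follows (Gaussian kernel; function names are mine, and the bandwidth default matches the 0.05 of the worked example):

```python
import math

def gaussian(z):
    """Gaussian kernel K(z) = exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def kernel_att(treated_ps, treated_y, control_ps, control_y, h=0.05):
    """ATT via kernel matching (Formula 2.4): each treated outcome is
    compared with a kernel-weighted average of all control outcomes, so
    controls with distant propensity scores are downweighted."""
    total = 0.0
    for p, y in zip(treated_ps, treated_y):
        weights = [gaussian((pc - p) / h) for pc in control_ps]
        counterfactual = sum(w * yc for w, yc in zip(weights, control_y)) / sum(weights)
        total += y - counterfactual
    return total / len(treated_y)

# As in the worked example, a control at 0.282 gets weight K(-0.6), about
# 0.33, against a treated case at 0.312, while one at 0.313 gets K(0.02),
# about 0.40.
w_far, w_near = gaussian(-0.6), gaussian(0.02)
```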
Covariance Adjustment
Covariance adjustment is a type of regression adjustment that weights the regression using propensity scores. The matching process does not consider the variance in the observational variables
because the PSM can balance the difference in the pretreatment variables in each block. Therefore,
the observational variables in the balanced strata do not contribute to the treatment assignment and
the potential outcome. Although each block has a balanced propensity score, the pretreatment variables may not have exactly the same distributions between the treatment group and the control
group. Table 2 provides evidence that although the propensity scores are balanced in each stratum,
the distributions of some variables do not fully overlap. For example, RE74 is statistically different between the treated and the matched control group for PSID-1.
Covariate adjustment is achieved by using a matched sample to regress the treatment outcome on
the covariates with appropriate weights for unmatched cases and duplicated cases. Dehejia and
Wahba (1999) estimated the causal effect by conducting within-stratum regression, taking a
weighted sum over the strata. Imbens (2000) proposed that one can use the inverse of one minus the
propensity scores as the weight for each control case and the inverse of propensity scores as the
weight for each treated case. Rubin (2001) provided additional discussion on covariate adjustment.
Unlike matched sampling, covariance adjustment is a hybrid technique that combines nonparametric propensity matching with parametric regression. Column 11 of Table 3 reports the results of the covariance adjustment, which were produced by regressing RE78 on all observational variables, weighted by the number of treated cases in each block. This approach generates an ATT ranging from $1,550.90 (CPS-2) to $1,952.23 (PSID-1).
Researchers have suggested two ways to calculate the variance of the nonparametric estimators of the ATT. First, Imbens (2004) suggested that one can estimate the variance by calculating each of five components5 included in the variance formula; the asymptotic variance can generally be estimated consistently by computing each of these components with kernel methods. The bootstrap is the second nonparametric approach to calculating the variance (Efron & Tibshirani, 1997). Efron and Tibshirani (1997) argued that 50 bootstrap replications can produce a good estimate of the standard errors, yet a much larger number of replications is needed to determine the bootstrap confidence interval. In Table 3, 100 bootstrap replications were used to calculate the standard errors for the matching techniques. In addition to calculating the variance nonparametrically, one can also compute it parametrically if covariate adjustment is used to produce the ATT. In Table 3, the standard errors for the covariate adjustment technique (Column 11) were generated by linear regression.
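The bootstrap logic can be sketched in a few lines. The snippet below resamples each group with replacement 100 times and takes the standard deviation of the re-estimated statistics; the data and the simple difference-in-means estimator are hypothetical stand-ins for a full matching estimator (the article's own computations were done in STATA).

```python
import random

def boot_se(treated, control, estimator, reps=100, seed=0):
    """Bootstrap standard error for a two-group estimator: resample
    each group with replacement, re-estimate, and take the standard
    deviation of the estimates across replications."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        t = [treated[rng.randrange(len(treated))] for _ in treated]
        c = [control[rng.randrange(len(control))] for _ in control]
        stats.append(estimator(t, c))
    mean = sum(stats) / reps
    return (sum((s - mean) ** 2 for s in stats) / (reps - 1)) ** 0.5

# Hypothetical outcomes; the estimator is a plain difference in means.
treated = [9.0, 11.0, 10.0, 12.0, 8.0]
control = [5.0, 7.0, 6.0, 4.0, 8.0]
diff = lambda t, c: sum(t) / len(t) - sum(c) / len(c)
se = boot_se(treated, control, diff, reps=100)
```

Resampling the two groups separately keeps every replication well defined; fixing the seed makes the replications reproducible.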
Choosing Techniques
This article has reviewed different techniques for gauging the ATT. The performance of these
strategies differs case by case and depends on data structure. Dehejia and Wahba (2002)
demonstrated that when there is substantial overlap in the distribution of propensity scores
(or balanced strata) between the treated and control groups, most matching techniques will
produce similar results. Imbens (2004) remarked that there are no fully applicable versions
of tools that do not require applied researchers to specify smoothing parameters. Specifically,
little is still known about the optimal bandwidth, radius, and number of matches. That being
said, scholars still need to consider particular issues in choosing the techniques that their
research will employ.
For nearest neighbor matching, it is important to determine how many comparison units match
each treated unit. Increasing comparison units decreases the variance of the estimator but increases
the bias of the estimator. Furthermore, one needs to choose between matching with replacement and
Downloaded from orm.sagepub.com at Vrije Universiteit 34820 on January 30, 2014
Figure 3. Choosing techniques. [The figure groups the estimators and their tuning trade-offs. Under matched sampling: nearest neighbor matching (number of matched neighbors ↑: bias ↑, variance ↓; matching without replacement: bias ↑, variance ↓), radius matching (maximum value of the radius ↑: bias ↑, variance ↓), and kernel matching (weighting by a kernel function, e.g., Gaussian; bandwidth ↑: bias ↑, variance ↓). Given balanced strata: stratified matching (weighting by the fraction of treated cases within each stratum) and covariate adjustment (weighting by the number of treated cases in each stratum or by the inverse of the propensity score for treated cases).]
matching without replacement (Dehejia & Wahba, 2002). When there are few comparison units, matching without replacement forces us to match treated units to comparison units with quite different propensity scores. This enhances the likelihood of bad matches (increasing the bias of the estimator), but it also decreases the variance of the estimator. Thus, matching without replacement decreases the variance of the estimator at the cost of increasing the estimation bias. In contrast, because matching with replacement allows one comparison unit to be matched more than once, each treatment unit can be paired with its nearest comparison unit, minimizing the distance between the treatment unit and the matched comparison unit. This reduces the bias of the estimator but increases its variance.
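Nearest neighbor matching with replacement can be sketched compactly; the (propensity, outcome) pairs below are hypothetical.

```python
def nn_att(treated, control):
    """Nearest neighbor matching with replacement on the propensity
    score: each (score, outcome) treated case is matched to the
    control case with the closest score; controls may be reused."""
    diffs = []
    for ts, ty in treated:
        cs, cy = min(control, key=lambda c: abs(c[0] - ts))
        diffs.append(ty - cy)
    return sum(diffs) / len(diffs)

treated = [(0.8, 15.0), (0.5, 10.0), (0.3, 8.0)]
control = [(0.79, 9.0), (0.45, 6.0), (0.1, 1.0)]
print(nn_att(treated, control))   # -> 4.0
```

Note that the control case at 0.45 is matched twice here, which is exactly what matching with replacement permits; matching without replacement would force the third treated case onto the distant control at 0.1.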
In regard to radius matching, it is important to choose the maximum value of the radius. The larger the radius, the more matches can be found. More matches typically increase the likelihood of finding bad matches, which raises the bias of the estimator but decreases its variance. As far as kernel matching is concerned, choosing an appropriate bandwidth is also crucial because a wider bandwidth produces a smoother function at the cost of tracking the data less closely. Typically, a wider bandwidth increases the chance of bad matches, so the bias of the estimator will also be high. Yet the additional comparison units drawn in by a wider bandwidth will also decrease the variance of the estimator. Figure 3 summarizes the issues that scholars need to consider before choosing appropriate techniques.
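The bandwidth trade-off in kernel matching can be illustrated with a Gaussian kernel; the data are hypothetical.

```python
import math

def kernel_att(treated, control, bandwidth=0.1):
    """Kernel matching: each treated case is compared with a weighted
    average of ALL control outcomes, with Gaussian weights that decay
    with the distance in propensity scores. A wider bandwidth spreads
    weight over more distant (worse) matches: bias up, variance down."""
    diffs = []
    for ts, ty in treated:
        weights = [math.exp(-((cs - ts) / bandwidth) ** 2 / 2)
                   for cs, _ in control]
        counterfactual = (sum(w * cy for w, (_, cy) in zip(weights, control))
                          / sum(weights))
        diffs.append(ty - counterfactual)
    return sum(diffs) / len(diffs)

treated = [(0.8, 15.0), (0.5, 10.0)]
control = [(0.78, 9.0), (0.52, 6.0), (0.2, 1.0)]
narrow = kernel_att(treated, control, bandwidth=0.05)
wide = kernel_att(treated, control, bandwidth=0.5)
```

With the narrow bandwidth, nearly all of the weight falls on the closest control, mimicking nearest neighbor matching; widening the bandwidth pulls in the distant low-outcome control and shifts the estimate.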
For organizational scholars, I recommend using stratified matching and covariate adjustment for the following reasons. First, these two techniques do not require scholars to choose specific smoothing parameters, so estimating the ATT with them requires minimal statistical knowledge. Second, the weighting parameters can be easily constructed from the data: One can use a similar weighting parameter (the number of treated cases in each block) for both techniques. For stratified matching, one counts the treated cases in each stratum and then computes the proportion of treated cases. For covariate adjustment, one can use the number of treated cases as weights in the regression model. Finally, the performance of these two approaches (Table 3) is close to that of the other matching techniques. Overall, these two techniques are not only relatively simple but can also produce a reliable ATT.
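The recommended stratified estimator can be written compactly. The sketch below assumes equal-width strata on the unit interval and uses hypothetical (treated, outcome, propensity) cases; in practice the strata come from the balancing step.

```python
def stratified_att(cases, n_strata=5):
    """Stratified matching: divide cases into propensity score strata,
    take the within-stratum difference in mean outcomes, and weight
    each stratum by its share of all treated cases."""
    total_treated = sum(1 for t, _, _ in cases if t == 1)
    att = 0.0
    for s in range(n_strata):
        lo, hi = s / n_strata, (s + 1) / n_strata
        stratum = [(t, y) for t, y, e in cases if lo <= e < hi]
        t_ys = [y for t, y in stratum if t == 1]
        c_ys = [y for t, y in stratum if t == 0]
        if not t_ys or not c_ys:
            continue  # a stratum without both groups contributes nothing
        weight = len(t_ys) / total_treated
        att += weight * (sum(t_ys) / len(t_ys) - sum(c_ys) / len(c_ys))
    return att

# Hypothetical cases falling into two of the five strata.
cases = [
    (1, 15.0, 0.85), (1, 14.0, 0.9), (0, 10.0, 0.82),  # stratum [0.8, 1.0)
    (1, 7.0, 0.25), (0, 2.0, 0.3), (0, 3.0, 0.22),     # stratum [0.2, 0.4)
]
print(stratified_att(cases, n_strata=5))   # -> 4.5
```

Each within-stratum difference is 4.5 here, and the treated-case weights (2/3 and 1/3) sum the contributions back to the overall ATT.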
Step 4: Sensitivity Test
The sensitivity test is the final step used to investigate whether the causal effect estimated from the
PSM is susceptible to the influence of unobserved covariates. Ideally, when an unbiased causal
effect is available (e.g., the benchmark ATT estimated from the experimental design), scholars can
compare the ATT generated by the PSM with the unbiased ATT to assure the accuracy of the PSM.
However, in most empirical settings, an unbiased ATT is not available. Rosenbaum (1987) proposed
that multiple comparison groups are valuable in detecting the existence of important unobserved
variables. For example, one can use multiple control groups to match the treated group to calculate
multiple treatment effects. One can have a sense of the reliability of the estimated ATT comparing
the effect size of different treatment effects. Table 3 reports results for such sensitivity test by drawing on multiple groups. One can compare the ATT for between PSID-1 and other data sets to confirm
the effectiveness of stratified matching. Alternatively, one can match two control groups. If the
results show that causal effects are statistically different between these two control groups, then one
can conclude that the strongly ignorable assumption is violated.
In practice, however, scholars will ordinarily not have multiple comparison groups or an unbiased causal effect gauged from experimental data. How then can one conduct a sensitivity test? Three approaches—changing the specification of the equation, using an instrumental variable, and Rosenbaum Bounding (RB)—can be implemented. To conduct a sensitivity test by changing the specification of the equation, scholars first change the specification by dropping or adding high-order covariates such as quadratic or interaction terms. After changing the specification, scholars should recalculate the propensity scores and the causal effect. Comparing the newly calculated causal effect with the originally computed causal effect reveals how reliable the original estimate is. This technique is similar to Dehejia and Wahba's (1999) suggestion of selecting based on observables. Selecting based on observables informs researchers whether the treatment assignment is strongly ignorable, the precondition for the PSM to produce an unbiased estimate.
Table 4a shows the sensitivity analysis when I dropped the higher-order pretreatment variables. Using only the observational variables, Column 1 shows that the estimated results of stratified matching range from $813.20 (PSID-2) to $1,348.56 (CPS-1). Column 3 summarizes the estimates from the nearest neighbor technique: The lowest estimated causal effect is $996.59 (PSID-2) and the highest is $1,855.61 (PSID-3). Column 5 reports the results of radius matching, with a range from $835.68 (PSID-1) to $2,110.03 (PSID-2). In Column 7 of Table 4a, the estimated ATTs range from $831.12 (PSID-1) to $1,778.12 (PSID-2). Finally, covariate adjustment shows treatment effects ranging from $1,342.50 (CPS-1) to $2,328.20 (PSID-1). It is important to emphasize that after dropping the high-order covariates, the balancing property is not satisfied for all the matched control samples.
When one lacks an unbiased estimator and multiple comparison groups, the instrumental variable (IV) method is another technique that can be used to assess the bias of the causal effects estimated by the PSM. DiPrete and Gangl (2004) argued that IV estimation can produce a consistent and unbiased estimate of the causal effect when the IVs are appropriately chosen, but this method generally reduces the efficiency of the causal estimators and introduces some uncertainty because of its reliance on additional assumptions. Usually, for public policy studies, a grouping variable that divides the sample into a number of disjoint groups can be selected as an instrumental variable.6 For example, Angrist, Imbens, and Rubin (1996) used the draft lottery number as the instrumental variable to estimate the causal effect of Vietnam War veteran status on mortality. The rationale behind using lottery numbers is that they correlate with the treatment variable (whether one served in the military) because individuals with low lottery numbers were more likely to be called to serve. At the same time, the lottery number is a random number that does not correlate with the error term. Thus the lottery number serves as a good instrument for the endogenous variable—serving in the Vietnam War. One
Table 4a. Sensitivity Test

| Sample | Stratified ATT (1) | N (2) | Neighbor ATT (3) | N (4) | Radius ATT (5) | N (6) | Kernel ATT (7) | N (8) | Covariate Adj. ATT (9) | N (10) |
|---|---|---|---|---|---|---|---|---|---|---|
| PSID-1 | 1,342.40 (763.09) | 1,345 | 1,545.52 (1,093.77) | 257 | 835.68 (3,877.08) | 21 | 831.12 (805.65) | 1,260 | 2,328.20 (693.69) | 1,345 |
| PSID-2 | 813.20 (1,081.68) | 369 | 996.59 (1,643.11) | 232 | 2,110.03 (2,999.31) | 17 | 1,778.12 (1,000.81) | 357 | 2,145.41 (1,143.55) | 369 |
| PSID-3 | 1,035.09 (1,091.28) | 270 | 1,855.61 (1,703.87) | 229 | 1,764.55 (1,269.51) | 219 | 1,724.97 (1,283.44) | 269 | 1,535.83 (1,400.24) | 270 |
| CPS-1 | 1,348.56 (651.14) | 5,961 | 1,765.35 (869.69) | 380 | 1,194.55 (1,855.94) | 129 | 1,186.89 (578.68) | 5,851 | 1,342.50 (470.60) | 5,961 |
| CPS-2 | 1,301.86 (714.36) | 1,747 | 1,108.86 (995.48) | 297 | 1,296.92 (2,341.93) | 79 | 1,049.00 (654.90) | 1,742 | 1,570.37 (478.94) | 1,747 |
| CPS-3 | 1,077.56 (707.68) | 557 | 1,346.78 (1,019.54) | 284 | 868.22 (2,752.29) | 53 | 1,269.21 (704.80) | 554 | 1,357.84 (685.77) | 557 |
| Mean | 1,153.11 | | 1,436.45 | | 1,306.55 | | 1,344.99 | | 1,713.36 | |
| Variance | 46,267.12 | | 120,918.64 | | 141,108.36 | | 254,592.59 | | 176,117.80 | |

Note: All sensitivity tests used only the observational covariates: age, education, no degree (no high school degree), Black, Hispanic, RE74 (earnings in 1974), and RE75 (earnings in 1975). No high-order covariates were included. Bootstrap with 100 replications was used to estimate the standard errors for the propensity score matching. ATT = average treatment effect on the treated. Standard errors in parentheses.
Table 4b. Sensitivity Test

| G | PSID-1 p-critical(a) | PSID-1 Lower Bound | PSID-1 Upper Bound | CPS-2 p-critical(a) | CPS-2 Lower Bound | CPS-2 Upper Bound |
|---|---|---|---|---|---|---|
| 1.00 | 0.042 | 216.997 | 1,752.880 | 0.006 | 641.387 | 2,089.060 |
| 1.05 | 0.074 | 57.226 | 1,941.530 | 0.013 | 468.296 | 2,262.150 |
| 1.10 | 0.119 | –26.215 | 2,090.720 | 0.025 | 320.627 | 2,413.840 |
| 1.15 | 0.177 | –188.640 | 2,293.670 | 0.044 | 196.642 | 2,545.930 |
| 1.20 | 0.246 | –343.541 | 2,478.540 | 0.072 | 43.579 | 2,741.260 |
| 1.25 | 0.325 | –455.599 | 2,627.530 | 0.110 | –4.340 | 2,894.800 |
| 1.30 | 0.409 | –621.988 | 2,778.500 | 0.157 | –112.684 | 3,039.860 |

Note: G = the odds ratio that individuals will receive treatment.
a. The Wilcoxon signed-rank test gives the significance test for the upper bound.
can compare the estimate of the causal effect from the PSM with the IV estimators to determine the
accuracy of the estimators calculated by the PSM. Unfortunately, the limited number of covariates in
these data sets prevents me from using the IV approach to conduct the sensitivity analysis. Readers
who are interested in this topic can find examples from Angrist et al. (1996) and DiPrete and Gangl
(2004). Wooldridge (2002) provides further theoretical background on how IV can be used when one
suspects the failure of a strongly ignorable assumption.
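For a single binary instrument such as a draft-lottery indicator, the IV estimate reduces to the Wald ratio cov(z, y)/cov(z, d). The sketch below uses fabricated numbers in which the true effect of d on y is 2 but a confounder inflates the naive comparison; it is illustrative only.

```python
def wald_iv(z, d, y):
    """Wald/IV estimator with a single instrument z:
    beta = cov(z, y) / cov(z, d). With a valid instrument this
    recovers the causal effect of d on y even when d is endogenous."""
    n = len(z)
    mz, md, my = sum(z) / n, sum(d) / n, sum(y) / n
    cov_zy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y)) / n
    cov_zd = sum((zi - mz) * (di - md) for zi, di in zip(z, d)) / n
    return cov_zy / cov_zd

# Hypothetical data: y = 2*d + u, where the confounder u also pushes
# units into treatment; z shifts d but is independent of u.
z = [1, 1, 0, 0]
d = [1, 1, 0, 1]
y = [2.0, 6.0, 0.0, 6.0]
print(wald_iv(z, d, y))   # -> 2.0
```

The naive treated-minus-control contrast in these data is about 4.7, so the instrument removes the confounding, at the usual price of wider standard errors and extra assumptions.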
Finally, Rosenbaum (2002, Chapter 4) proposed a bounding approach to test for the existence of hidden bias, which can make the estimated treatment effect biased. Suppose u1i and u0j are unobserved characteristics for individuals i and j in the treated group and the control group, respectively. G
refers to the effect of these unobserved variables on treatment assignment. The odds ratio that individuals receive treatment can be written simply as G = exp(u1i – u0j). If the unobserved variables u1i and u0j are uninformative, then the assignment process is random (G = 1) and the estimated ATT and confidence intervals are unbiased. When the unobserved variables are informative, the confidence intervals of the ATT become wider and the likelihood of finding support for the null hypothesis increases. The RB sensitivity test varies the effect of the unobserved variables on treatment assignment to determine the point at which the significance test leads one to accept the null hypothesis. DiPrete and Gangl (2004) implemented the procedure in STATA for testing continuous outcomes; however, their program works only for one-to-one matching. Becker and Caliendo (2007) also implemented this method in STATA, but for testing dichotomous outcomes.
Table 4b presents an example of the RB test. The table reports the test only for PSID-1 and CPS-2 because the t values for the ATT estimated using stratified matching show strong evidence of a treatment effect in these samples. By varying the value of G, Table 4b reports the p value as well as the upper and lower bounds of the ATT. The Wilcoxon signed-rank test generates a significance test at a given level of hidden bias specified by the parameter G (DiPrete & Gangl, 2004). As Table 4b shows, the estimated ATT is very sensitive to hidden bias. For PSID-1, when the critical value of G is between 1.05 and 1.10 (i.e., the unobserved variables cause the odds ratio of assignment to the treated group rather than the control group to be about 1.10), one needs to question the conclusion of a positive effect of training on salary in the year 1978. For the CPS-2 sample, when the critical value of G is between 1.20 and 1.25, one should question the positive effect of training on future salary. Yet a critical value of G of 1.25 in CPS-2 does not mean that one will not observe the positive effect of training on future earnings; it only means that if unobserved covariates shifted the odds of treatment assignment by a factor of 1.25 and almost perfectly determined future salary within each matched case, the estimated salary effect would include zero. RB thus presents a worst-case scenario in which treatment assignment is influenced by unobserved covariates. This sensitivity test conveys important information about how the uncertainty involved in matching estimators can undermine the conclusions of matched sampling analyses. The simple test in Table 4b reveals that the estimated causal effect of training is quite sensitive to hidden biases that could influence the odds of treatment assignment.
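For readers who want to experiment with the logic of RB, the sketch below implements a simplified sign-test version of the bounds rather than the Wilcoxon signed-rank version behind Table 4b: under hidden bias of at most G, the worst-case probability that the treated member of a matched pair has the larger outcome is G/(1 + G), and the binomial tail under that probability bounds the p value. The pair counts are hypothetical.

```python
import math

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def rosenbaum_bound(n_pairs, n_positive, gamma):
    """Upper-bound p value for the sign test on matched pairs when
    hidden bias may shift the odds of treatment by up to gamma.
    gamma = 1 reproduces the usual (no hidden bias) sign test."""
    p_plus = gamma / (1.0 + gamma)  # worst-case chance of a positive pair
    return binom_tail(n_pairs, n_positive, p_plus)

# Hypothetical example: 20 matched pairs, 16 with a positive difference.
for g in (1.0, 1.5, 2.0):
    print(g, rosenbaum_bound(20, 16, g))
```

As G grows, the worst-case p value rises, mirroring the pattern in Table 4b: the G at which the bound crosses the significance threshold is the degree of hidden bias the conclusion can tolerate.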
Future Applications of the Propensity Score Method
To my knowledge, no publications in the management field have implemented the PSM in an empirical setting, yet other social science fields have applied it empirically. Thus, before offering suggestions for applying the PSM to the field, I provide an overview of how scholars in relevant social science fields (e.g., economics, finance, and sociology) employ the PSM in their empirical studies. Most applications of the PSM come from economists' evaluations of public policy (e.g., Dehejia & Wahba, 1999; Lechner, 2002). Early implementations of the PSM were intended to examine whether the technique effectively reduces bias stemming from the heterogeneity of participants. Economists generally agreed that the PSM is appropriate for examining causal effects using observational data. A recent application by Couch and Placzek (2010), for example, used the PSM to calculate the ATT without any concern regarding the legitimacy of the technique. Combining the PSM with the average difference-in-difference approach, Couch and Placzek (2010) found that mass layoffs decreased earnings by 33%.
To provide a concise overview of the PSM in other social science fields, I conducted a Web of Science search for articles that cited Rosenbaum and Rubin's 1983 paper. Because most citations came from health-related fields, I limited the search to fields relevant to management: economics, sociology, and business finance. Overall, in early 2012, I found 674 articles in these three fields that have cited Rosenbaum and Rubin's article. Fewer than 100 articles were
published before 2002, yet around 300 articles were published between 2009 and 2011. I first randomly selected one to two empirical studies from each of four top economics journals: American Economic Review, Econometrica, Quarterly Journal of Economics, and Review of Economic Studies. I then randomly selected one to two empirical articles from two top sociology journals: American Journal of Sociology and American Sociological Review. Finally, I randomly selected one to two studies from three top finance journals: Journal of Finance, Journal of Financial Economics, and Review of Financial Studies. Table 5 summarizes the data, analytical techniques, and key findings of these empirical articles employing the PSM.
Given that management scholars have relied on observational data sets, using the PSM will be fundamentally helpful in discovering the effectiveness of management interventions in areas such as strategy, entrepreneurship, and human resource management. For strategy scholars, future research can use the PSM to examine whether firms that adopt long-term incentive plans (e.g., stock options and stock ownership) increase overall performance. Clearly, the data used in this type of study are not experimental; future research can use the PSM to adjust the distributions between firms using long-term incentive policies and those that have not adopted such policies. Indeed, the PSM can be widely used by strategy scholars who want to examine the outcomes of particular strategies. For example, one can examine whether duality (the practice of the CEO also serving as chairman of the board) has real implications for stock price and long-term performance.
The PSM can also be used in entrepreneurship research. Wasserman (2003) documented the paradox of success: Founders were more likely to be replaced by professional managers when they had led their firms to an important breakthrough (e.g., the receipt of additional funding from an external source). Future research can further explore this question by investigating which types of funding lead to turnover in the top management teams of newly founded firms. For example, scholars can examine whether funding received from venture capitalists (VCs) has a different effect on executive turnover than funding obtained from a Small Business Innovation Research (SBIR) program. Similarly, using the PSM, scholars can examine how other interventions, such as a business plan, affect entrepreneurial performance. Like strategy scholars, entrepreneurship researchers can apply the PSM to many other questions.
The PSM can also be widely implemented by strategic human resource management (SHRM) scholars. A major interest in the SHRM literature is whether HR practices contribute to firm performance. One can implement the PSM to investigate whether a given HR practice (e.g., downsizing) contributes to firm performance. When the strongly ignorable assumption is satisfied, the PSM provides an opportunity for HR scholars to document a less biased effect size of HR practices on firm performance: HR researchers can adjust the distributions of the observational variables and then estimate the ATT of the HR practices on firm performance. In conclusion, the PSM is an effective technique for scholars to reconstruct counterfactuals using observational data sets.
Discussion
Research in other academic fields has documented the effectiveness of the PSM. Yet, like other methods, the PSM has its strengths and weaknesses. The first advantage of the PSM is that it simplifies the matching procedure: The PSM reduces k-dimensional observable variables to a single dimension, so scholars can match observational data sets with k-dimensional covariates without sacrificing many observations or worrying about computational complexity. Second, the PSM eliminates two sources of bias (Heckman et al., 1998): bias from nonoverlapping supports and bias from different density weighting. The PSM increases the likelihood of achieving distributional overlap between the treated and control groups. Moreover, this technique reweights nonparticipant
Table 5. Empirical Studies Applying the Propensity Score Method (PSM)

Angrist (1998), Econometrica
Data: Military data from the Defense Manpower Data Center; earnings data from the Social Security Administration.
Analytical technique: Because of nonrandom selection in the labor market, propensity score matching and instrumental variables were used to examine the effect of voluntary military service on earnings.
Key findings: Soldiers serving in the military in the early 1980s were paid more than comparable civilians. Military service increased the employment rate for veterans after service. Military service led to only a modest long-run increase in earnings for non-White veterans but reduced the civilian earnings of White veterans.

Campello, Graham, and Harvey (2010), Journal of Financial Economics
Data: 1,050 chief financial officers (CFOs) were surveyed.
Analytical technique: CFOs were asked to report whether their firms were credit constrained. Asset size, ownership form, and credit ratings were used to predict propensity scores. Average treatment effects of constrained credit were estimated by comparing the difference in spending between constrained and unconstrained firms.
Key findings: Credit-constrained firms burned more cash, sold more assets to fund their operations, drew more heavily on lines of credit, and planned deeper cuts in spending. In addition, the inability to borrow forced many firms to bypass lucrative investment opportunities.

Couch and Placzek (2010), American Economic Review
Data: State administrative files from Connecticut.
Analytical technique: Propensity score matching on observable variables was used to reduce individual heterogeneity.
Key findings: Propensity score estimators of the average treatment effect on the treated (ATT) combined with the average difference-in-difference showed that earnings losses were 33% at the time of mass layoff and 12% six years later.

Drucker and Puri (2005), Journal of Finance
Data: Combined data sets from multiple databases on seasoned equity issuers, including credit rating, stock return, lending history, and insurance history.
Analytical technique: Propensity score matching was used to match loans, with propensity scores calculated from observational variables including credit rating, firm industry, and other variables.
Key findings: Overall, underwriters (commercial banks and investment banks) engaged in concurrent lending and provided discounts. In addition, concurrent lending helped underwriters build relationships, which increased their probability of receiving current and future business.

Frank, Akresh, and Lu (2010), American Sociological Review
Data: New Immigrant Survey, with around 1,000 cases.
Analytical technique: An ordinal logistic model was used to calculate propensity scores, which were then used to estimate the effect of skin color on earnings.
Key findings: An average difference of $2,435.63 between lighter- and darker-skinned individuals; in other words, darker-skinned individuals earn around $2,500 less per year than their counterparts.

Gangl (2006), American Sociological Review
Data: Survey of Income and Program Participation (SIPP) and European Community Household Panel (ECHP).
Analytical technique: Difference-in-difference propensity score matching.
Key findings: Strong evidence that post-unemployment losses are largely permanent; the effect is particularly significant for older and high-wage workers as well as for female employees.

Grodsky (2007), American Journal of Sociology
Data: A number of sources, including representative samples of students who completed high school in 1972, 1982, and 1992.
Analytical technique: In the first stage, propensity scores were used to adjust for selection on observational variables; in the second stage, the author examined the type of college a student would attend, controlling for the propensity scores.
Key findings: Evidence that a wide range of institutions engage in affirmative action for African American students as well as for Hispanic students.

Heckman, Ichimura, and Todd (1997), Review of Economic Studies
Data: The National Job Training Partnership Act (JTPA) study and the Survey of Income and Program Participation (SIPP).
Analytical technique: Propensity matching; nonparametric conditional difference-in-difference.
Key findings: After decomposing program evaluation bias into a number of components, selection bias due to unobservable variables was found to be less important than the other components. The matching technique can potentially eliminate much of the bias.

Lechner (2002), Review of Economic Studies
Data: Unemployed individuals in Zurich, a region of Switzerland, 1997-1999.
Analytical technique: A multinomial model was used to estimate propensity scores for discrete choices (basic training, further training, employment program, and temporary wage subsidy).
Key findings: The empirical evidence supports propensity score matching as an informative tool for adjusting for individual heterogeneity when individuals have multiple programs from which to select.

Malmendier and Tate (2009), Quarterly Journal of Economics
Data: Hand-collected list of the winners of CEO awards between 1975 and 2002.
Analytical technique: Propensity score matching was used to create a counterfactual sample of nonwinning CEOs; nearest neighbor matching, both with and without bias adjustment, identified the counterfactual sample.
Key findings: Award-winning CEOs underperform over the 3 years following the award; relative underperformance is between 15% and 26%.

Xuan (2009), Review of Financial Studies
Data: S&P executive compensation data between 1993 and 2002.
Analytical technique: Ordinary least squares was the major technique; the propensity score method was used as a robustness check to address the endogenous selection of CEOs.
Key findings: Specialist CEOs, defined as CEOs promoted from a particular division of their firm, negatively affect segment investment efficiency.
data to obtain equal distributions between the treated and control groups. Third, if treatment assignment is strongly ignorable, scholars can use the PSM on observational data sets to estimate an ATT that is reasonably close to the ATT calculated from experiments. Fourth, the matching technique is, by its nature, nonparametric. Like other nonparametric approaches, it does not suffer from problems that are prevalent in most parametric models, such as distributional assumptions, and it generally outperforms simple regression analysis when the true functional form of the regression is nonlinear (Morgan & Harding, 2006). Finally, the PSM is an intuitively sounder method for dealing with covariates than traditional regression analysis. For example, the idea that covariates in the treated group and the control group have the same distributions is much easier to understand than interpretations that "control all other variables at the mean" or invoke "ceteris paribus." Moreover, without appropriately adjusting for the covariate distributions, a regression can produce an ATT even when no meaningful ATT exists.
Despite its many advantages, the PSM also has limitations. Like other nonparametric techniques, the PSM generally lacks test statistics. Although the bootstrap can be used to estimate the variance, such techniques are not fully justified or widely accepted by researchers (Imbens, 2004). Hence, the use of the PSM may be limited because, while it can help scholars draw causal inferences, it is less helpful for drawing statistical inferences. Another key hurdle is that there are currently no established procedures for investigating whether treatment assignment is strongly ignorable. Heckman et al. (1998) demonstrated that the PSM cannot eliminate bias due to unobservable differences across groups: The PSM can reweight observational covariates, but it cannot deal with unobservable variables. Some unobservable variables (e.g., environmental context, region) can increase the bias of the ATT estimated using the PSM. Third, even when treatment assignment is strongly ignorable, the accuracy of the ATT estimated by the PSM depends on the quality of the observational data; thus, measurement error (cf. Gerhart, Wright, & McMahan, 2000) and nonrandom missing values can affect the estimated ATT. Finally, although there are a number of propensity score matching techniques, one can find little guidance on which types of matching techniques work best for different applications.
Overall, despite its shortcomings, the PSM can be employed by management scholars to investigate the ATT of management interventions. Appropriately used, the PSM can eliminate bias due to nonoverlapping distributions between the treatment and control groups. The PSM can also reduce the problem of unfair comparison. However, scholars must be mindful of data quality because the effectiveness of the PSM depends on the observational covariates. Research using objective measures will be an optimal setting for the PSM. In empirical settings with low-quality data, scholars can implement the nonparametric PSM as a robustness test to corroborate the parametric findings generated by traditional econometric models.
To draw meaningful and honest causal inferences, one must choose the technique that works best for testing the causal relationship at hand. When one has collected panel data and believes that the omitted variable is time-invariant, the fixed effects model is the best choice for removing bias due to that omitted variable (Allison, 2009; Beck et al., 2008). When one can find one or more valid instrumental variables, two-stage least squares (2SLS) can also correct the bias in causal effects estimated through conventional regression techniques. When the endogenous variable suffers only from measurement error and the reliability coefficient is known, one can use regression analysis and correct the bias with that coefficient. Almost no technique, including experimental design, is perfect for drawing an unbiased causal inference. Heckman and Vytlacil (2007) remarked that explicitly manipulating treatment assignment cannot always represent the real-world problem because experimentation naturally discards information contained in a real-world context that includes dropout, self-selection, and noncompliance.
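To make the fixed effects logic above concrete, the following sketch (a toy two-unit panel invented purely for illustration) shows how within-unit demeaning removes the bias from a time-invariant omitted variable:

```python
# Toy panel: y = 2*x + a_i, where a_i is a time-invariant unit effect
# correlated with x. Pooled OLS is badly biased; the within (fixed
# effects) estimator recovers the true slope of 2.
panel = {
    "A": {"x": [1, 2], "y": [12, 14]},   # unit effect a_A = 10
    "B": {"x": [3, 4], "y": [6, 8]},     # unit effect a_B = 0
}

def ols_slope(xs, ys):
    """Simple bivariate OLS slope."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Pooled regression ignores the unit effects entirely.
pooled_x = [x for u in panel.values() for x in u["x"]]
pooled_y = [y for u in panel.values() for y in u["y"]]
pooled = ols_slope(pooled_x, pooled_y)   # -2.0: even the sign is wrong

# Within estimator: demean x and y inside each unit first.
within_x, within_y = [], []
for u in panel.values():
    mx, my = sum(u["x"]) / len(u["x"]), sum(u["y"]) / len(u["y"])
    within_x += [x - mx for x in u["x"]]
    within_y += [y - my for y in u["y"]]
within = ols_slope(within_x, within_y)   # 2.0: the true effect
```

Demeaning subtracts each unit's averages, so anything constant within a unit (here, the omitted a_i) drops out before the slope is estimated.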
Sometimes a combination of techniques is recommended. For example, to alleviate extrapolation bias in regression models, Imbens and Wooldridge (2009) recommend using matching to
generate a balanced sample. Similarly, Rosenbaum and Rubin (1983) suggested that differences due to unobserved heterogeneity should be addressed after balancing the observed covariates. Additionally, the PSM can be incorporated into studies using a longitudinal design. Readers interested in estimating the ATT using longitudinal data can also refer to the nonparametric conditional difference-in-differences model (Heckman et al., 1997) and the semiparametric conditional difference-in-differences model (Heckman et al., 1998). To conclude, drawing the best causal inference requires choosing the appropriate method, and among the various techniques available, the PSM should be a serious candidate.
Conclusion
The purpose of this article is to introduce the PSM to the management field. The article makes several contributions to the organizational research methods literature. First, it not only advances management scholars' understanding of a neglected method for estimating causal effects but also discusses some of the technique's limitations. Second, by integrating previous work on the PSM, it provides a step-by-step flowchart that management scholars can easily implement in their empirical studies. The attached data set, together with the SPSS and STATA stratified matching code, helps management scholars calculate the ATT; readers can make context-dependent decisions and choose the matching algorithm that best serves their objectives. Finally, a brief review of applications of the PSM in other social science fields and a discussion of its potential usage in the management field provide an overview of how management scholars can employ the PSM in future empirical studies.
Appendix A
Boosted Regression
Boosted regression (or boosting) is a general, automated, data-mining technique that has shown
considerable success in using a large number of covariates to predict treatment assignment and fit
a nonlinear surface (McCaffrey, Ridgeway, & Morral, 2004). Boosting relies on a regression tree
using a recursive algorithm to estimate the function that describes the relationship between a set of
covariates and the dependent variable. The regression tree begins with a complete data set and then
partitions the data set into two regions by a series of if-then statements (Schonlau, 2005). For
example, if age and race are covariates, the algorithm can first split the data set into two regions
based on the condition of either of these two variables. The splitting algorithm continues recursively until the regression tree reaches the allowable number of splits. Friedman (2001) has shown
that boosted regression outperforms other methods in reducing prediction error. McCaffrey et al.
(2004) summarized three important advantages of the boosting technique. First, regression trees
are easy and fast to fit. Second, regression trees can handle different types of covariates, including continuous, nominal, and ordinal variables, as well as missing values. When boosted logistic regression is used to predict propensity scores, the different forms of the covariates generally produce exactly the same propensity score adjustment. Finally, the boosting technique is capable of handling many covariates, even those unrelated to treatment assignment or correlated with one another. Schonlau (2005) listed factors that favor the use of the boosting technique, including a large data set, suspected nonlinearities, more variables than observations, suspected interactions, correlated data, and ordered categorical covariates. He concluded that the boosting technique does not require scholars to specify interactions and nonlinearities explicitly; thus, it can simplify the procedure of computing propensity scores by reducing the burden of adding higher-order terms such as interactions.
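The recursive if-then partitioning that boosting builds on can be sketched in a few lines of Python. This is a bare-bones illustration of a single regression tree grown on invented data, not the boosting procedure itself or any particular package's implementation:

```python
def sse(ys):
    """Sum of squared errors around the mean (the split criterion)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def grow_tree(X, y, max_depth=2):
    """Recursively partition (X, y) with if-then splits, as a regression tree does."""
    if max_depth == 0 or len(set(y)) == 1:
        return {"leaf": sum(y) / len(y)}
    best = None
    for j in range(len(X[0])):                    # candidate covariate
        for t in sorted({row[j] for row in X}):   # candidate threshold
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return {"leaf": sum(y) / len(y)}
    _, j, t, left, right = best
    return {"var": j, "thr": t,
            "left": grow_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1),
            "right": grow_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1)}

def predict(tree, row):
    """Walk the if-then splits down to a leaf mean."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

# Invented toy data: one covariate (age), outcome = treatment indicator.
X = [[20], [25], [40], [45]]
y = [0, 0, 1, 1]
tree = grow_tree(X, y, max_depth=2)
```

Boosting would fit many such shallow trees sequentially, each to the residuals of the current fit, rather than a single tree as shown here.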
Appendix B. A Small Data Set for Manually Calculating Average Treatment Effect on the Treated Group (ATT)

Step 1: Case-level data.

Case ID  Treatment  Age  PScore  Block ID   Outcome
1        0          50   0.076   1          10,048.54
2        0          19   0.109   1          0
3        0          26   0.128   1          20,688.17
4        0          44   0.14    1          0
5        0          39   0.177   1          664.977
6        1          35   0.075   1          36,646.95
7        1          33   0.101   1          12,590.71
8        0          32   0.265   2          24,642.57
9        0          44   0.268   2          10,344.09
10       0          41   0.282   2          9,788.461
11       0          33   0.313   2          0
12       0          20   0.365   2          0
13       1          22   0.261   2          13,167.52
14       1          26   0.312   2          4,321.705
15       1          46   0.361   2          12,558.02
16       1          46   0.392   2          12,418.07
17       0          40   0.412   3          0
18       0          26   0.456   3          17,732.72
19       0          30   0.481   3          4,433.18
20       0          21   0.513   3          0
21       0          20   0.558   3          17,732.72
22       1          41   0.511   3          7,284.986
23       1          17   0.525   3          5,522.788
24       1          24   0.547   3          20,505.93
25       1          27   0.59    3          0
26       0          41   0.678   4          2,364.363
27       0          23   0.727   4          22,165.9
28       0          24   0.746   4          7,447.742
29       1          21   0.654   4          2,164.022
30       1          23   0.739   4          11,141.39
31       1          29   0.758   4          3,462.564
32       1          20   0.759   4          559.443
33       1          19   0.764   4          4,279.613
34       1          23   0.768   4          0
35       0          28   0.923   5          5,615.361
36       0          23   0.954   5          0
37       1          18   0.913   5          13,385.86
38       1          27   0.948   5          8,472.158
39       1          18   0.954   5          0
40       1          17   0.959   5          6,181.88
41       1          21   0.961   5          289.79
42       1          37   0.965   5          17,814.98
43       1          17   0.966   5          9,265.788
44       1          25   0.97    5          1,923.938
45       1          25   0.987   5          8,124.715
46       0          53   0.001   unmatched  11,821.81
47       0          52   0.003   unmatched  24,825.81
48       0          28   0.009   unmatched  33,987.71
49       0          41   0.013   unmatched  33,987.71
50       0          38   0.016   unmatched  54,675.88

Steps 2 and 3: Block-level balance tests and causal effect estimate.

Block ID  NqT  NqC  YiT         YjC         ATTq        Weight  ATTq x Weight
1         2    5    49,237.660  31,401.687  18,338.493  0.08    1,467.079
2         4    5    42,465.315  44,775.121  1,661.305   0.16    265.809
3         4    5    33,313.704  39,898.620  348.702     0.16    55.792
4         6    3    21,607.032  31,978.005  -7,058.163  0.24    -1,693.959
5         9    2    65,459.109  5,615.361   4,465.554   0.36    1,607.599

ATT = sum of ATTq x Weight over the five blocks = 1,702.321

Note: PScore = propensity score; Block ID = balanced block to which a case is assigned (cases 46-50, with PScore below .05, fall outside the blocks and remain unmatched); Tage/Tpscore = t statistics for age and the propensity score within each balanced block in Step 2 (the reported values range from 0.025 to 1.86, none reaching conventional significance); YiT = sum of the outcome variable for treated cases in each block; YjC = sum of the outcome variable for control cases in each block; NqT = number of treated cases in each block; NqC = number of control cases in each block; ATTq (q = 1, ..., 5) = YiT/NqT - YjC/NqC, the average treatment effect for each balanced block; Weight = number of treated cases in each block divided by the total number of treated cases in the sample (25).
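The Step 3 arithmetic in Appendix B can be replicated in a few lines of Python (an illustrative sketch; the block summaries below are the YiT, NqT, YjC, and NqC values reported in the appendix):

```python
# Block-level summaries from Appendix B: block -> (YiT, NqT, YjC, NqC).
blocks = {
    1: (49_237.660, 2, 31_401.687, 5),
    2: (42_465.315, 4, 44_775.121, 5),
    3: (33_313.704, 4, 39_898.620, 5),
    4: (21_607.032, 6, 31_978.005, 3),
    5: (65_459.109, 9, 5_615.361, 2),
}

total_treated = sum(nqt for _, nqt, _, _ in blocks.values())  # 25 treated cases

# ATTq = YiT/NqT - YjC/NqC for each block; each block's effect is then
# weighted by its share of treated cases (NqT / total treated).
att = sum((yit / nqt - yjc / nqc) * (nqt / total_treated)
          for yit, nqt, yjc, nqc in blocks.values())

print(round(att, 3))   # 1702.321, matching the ATT reported in Appendix B
```

This is exactly the computation the SPSS and Stata code in Appendices C and D performs, starting from the case-level data rather than the block summaries.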
Appendix C
SPSS Code for Stratified Matching
*Step 1: Calculate propensity score.
LOGISTIC REGRESSION VARIABLES TREATMENT
/METHOD=ENTER X1 X2 X3
/SAVE=PRED
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
RENAME VARIABLES (PRE_1=pscore).
The above code calculates the predicted probability of treatment from a number of observed variables (e.g., X1, X2, and X3). Readers can substitute their own variables accordingly.
*Step 2: Stratify into five blocks.
compute blockid=.
if (pscore <= .2) & (pscore > .05) blockid=1.
if (pscore <= .4) & (pscore > .2) blockid=2.
if (pscore <= .6) & (pscore > .4) blockid=3.
if (pscore <= .8) & (pscore > .6) blockid=4.
if (pscore > .8) blockid=5.
execute.
*Perform t test for each block.
*Split file first, and then execute the t test.
SORT CASES BY blockid.
SPLIT FILE SEPARATE BY blockid.
T-TEST GROUPS=treatment(0 1)
/MISSING=ANALYSIS
/VARIABLES=age pscore
/CRITERIA=CI(.95).
The above code first stratifies cases into five blocks and then carries out the t test within each block. SPSS has no "if" option for T-TEST, so the data must be split by block ID before the t test is conducted.
*Step 3: Perform stratification matching procedure.
*Calculate YiT and YjC in Appendix B.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=blockid treatment
/outcome_sum=SUM(outcome).
*Calculate NqT and NqC in Appendix B.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=blockid treatment
/N_BREAK=N.
*Calculate the total number of treated cases.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=
/N_Treatment=SUM(treatment).
COMPUTE ATTQ=outcome_sum/N_BREAK.
EXECUTE.
DATASET DECLARE agg_all.
AGGREGATE
/OUTFILE='agg_all'
/BREAK=treatment blockid
/N_Block_T=MEAN(N_BREAK)
/ATTQ_T=MEAN(ATTQ)
/N_Treatment=MEAN(N_Treatment).
DATASET ACTIVATE agg_all.
DATASET COPY agg_treat.
DATASET ACTIVATE agg_treat.
FILTER OFF.
USE ALL.
SELECT IF (treatment = 1).
EXECUTE.
DATASET ACTIVATE agg_all.
DATASET COPY agg_control.
DATASET ACTIVATE agg_control.
FILTER OFF.
USE ALL.
SELECT IF (treatment = 0 & blockid < 6).
EXECUTE.
DATASET ACTIVATE agg_control.
RENAME VARIABLES (N_Block_T ATTQ_T = N_Block_C ATTQ_C).
MATCH FILES /FILE=*
/FILE='agg_treat'
/RENAME (blockid N_Treatment treatment = d0 d1 d2)
/DROP= d0 d1 d2.
EXECUTE.
COMPUTE ATTQ=ATTQ_T-ATTQ_C.
EXECUTE.
COMPUTE weight=N_Block_T/N_Treatment.
EXECUTE.
COMPUTE ATTxweight=ATTQ*weight.
EXECUTE.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES OVERWRITEVARS=YES
/BREAK=
/ATTxweight_sum=SUM(ATTxweight).
DATASET CLOSE agg_all.
DATASET CLOSE agg_control.
DATASET CLOSE agg_treat.
This step computes each of the components in Equation 2.2. It first calculates the number of treated cases and the number of control cases in each matched block, and it then computes the sum of the outcome in each balanced block. The code next extracts the necessary components into two data sets, agg_control and agg_treat. Finally, the code matches these two data sets by block ID and estimates the ATT. The final result is displayed in the variable ATTxweight_sum.
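For readers working outside SPSS and Stata, Step 1 can be sketched in plain Python as well. The gradient-ascent logistic fit below is a hypothetical stand-in for the LOGISTIC REGRESSION command above; the data, learning rate, and iteration count are invented for illustration:

```python
import math

def fit_logit(x, t, lr=0.1, iters=5000):
    """Fit p(treatment=1 | x) = 1/(1+exp(-(b0+b1*x))) by gradient ascent
    on the log-likelihood; returns the fitted propensity score function."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        b0 += lr * sum(ti - pi for ti, pi in zip(t, p)) / n
        b1 += lr * sum((ti - pi) * xi for ti, pi, xi in zip(t, p, x)) / n
    return lambda xi: 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

# Toy sample: treatment becomes more likely as the covariate grows.
x = [0, 1, 2, 3, 4, 5]
t = [0, 0, 0, 1, 1, 1]
pscore = fit_logit(x, t)
# The fitted scores rise with x; cases would then be stratified into
# blocks by these scores, exactly as in Step 2 of the appendices.
```

In practice one would use a packaged logistic (or boosted) estimator with multiple covariates rather than this single-covariate sketch.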
Appendix D
STATA Code for Stratified Matching
*STEP 1: Get the propensity scores using logistic regression.
*Choose covariates appropriately.
logit treatment X1 X2 X3
*Calculate propensity scores.
predict pscore, p
*STEP 2: Subclassification.
gen blockid=.
replace blockid=1 if pscore<=.2 & pscore>.05
replace blockid=2 if pscore<=.4 & pscore>.2
replace blockid=3 if pscore<=.6 & pscore>.4
replace blockid=4 if pscore<=.8 & pscore>.6
replace blockid=5 if pscore>.8
*STEP 2: t test for balance in each block.
foreach var of varlist age pscore {
  forvalues i=1/5 {
    ttest `var' if blockid==`i', by(treatment)
  }
}
*STEP 3: Estimate causal effects using stratified matching.
sort blockid treatment
gen YTQ=.   // YiT in the Appendix B table
gen TTN=1   // NqT in the Appendix B table
gen YCQ=.   // YjC in the Appendix B table
gen TCN=1   // NqC in the Appendix B table
forvalues i=1/5 {
  *Get the sum of the outcome in each treated block.
  sum outcome if treatment==1 & blockid==`i'
  replace YTQ=r(sum) if blockid==`i'
  *Number of treated cases in each block.
  sum TTN if treatment==1 & blockid==`i'
  replace TTN=r(sum) if blockid==`i'
  *Get the sum of the outcome in each control block.
  sum outcome if treatment==0 & blockid==`i'
  replace YCQ=r(sum) if blockid==`i'
  *Number of control cases in each block.
  sum TCN if treatment==0 & blockid==`i'
  replace TCN=r(sum) if blockid==`i'
}
gen ATTQ=YTQ/TTN-YCQ/TCN
*Weights for ATT.
sum treatment
gen W=TTN/r(sum)
*Weighted ATT.
gen ATT=ATTQ*W
bysort blockid: gen id=_n
sum ATT if id==1
display "The ATT is `r(sum)'"
Appendix E
Software Packages for Applying the Propensity Score Method (PSM)

R
  Matching (Sekhon, 2007). Relies on an automated procedure to detect matches based on a number of univariate and multivariate metrics. It performs propensity matching, primarily 1:M matching, and allows matching with and without replacement.
  Download source: http://sekhon.berkeley.edu/matching/
  Document: http://cran.r-project.org/web/packages/Matching/Matching.pdf

  PSAgraphics (Helmreich & Pruzek, 2009). Provides enriched graphical tools to test within-strata balance. It also provides graphical tools to examine covariate distributions across strata.
  Download source: http://cran.r-project.org/web/packages/PSAgraphics/index.html

  twang (Ridgeway, McCaffrey, & Morral, 2006). Includes propensity score estimation and weighting. Generalized boosted regression is used to estimate propensity scores, simplifying the estimation procedure.
  Download source: http://cran.r-project.org/web/packages/twang/index.html

SAS
  Greedy matching (Kosanke & Bergstralh, 2004). Performs 1:1 nearest neighbor matching.
  Download source: http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas

  OneToManyMTCH (Parsons, 2004). Allows users to specify propensity score matching from 1:1 to 1:M.
  Download source: http://www2.sas.com/proceedings/sugi29/165-29.pdf

SPSS
  SPSS Macro for Pscore Matching (Painter, 2004). Performs nearest neighbor propensity score matching. It appears to do only 1:1 matching without replacement.
  Download source: http://www.unc.edu/~painter/SPSSsyntax/propen.txt

STATA
  Pscore (Becker & Ichino, 2002). Estimates propensity scores and conducts a number of matching procedures, such as radius, nearest neighbor, kernel, and stratified matching.
  Download source: http://www.lrz.de/~sobecker/pscore.html

  Psmatch2 (Leuven & Sianesi, 2003). Allows a number of matching procedures, including kernel matching and k:1 matching. It also supports common support graphs and balance testing.
  Download source: http://ideas.repec.org/c/boc/bocode/s432001.html
Acknowledgments
Special thanks to Barry Gerhart for his invaluable support and to Associate Editor James LeBreton and the anonymous reviewers for their constructive feedback. This article has also benefited from suggestions by Russ Coff, Jose Cortina, Cindy Devers, Jon Eckhardt, Phil Kim, and seminar participants at the 2011 AOM conference.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
1. Harder, Stuart, and Anthony (2010) argued that the propensity score method (PSM) can be used to estimate the average treatment effect on the treated group (ATT) and that subclassifying on the propensity score can be used to calculate the average treatment effect (ATE). However, economists have typically viewed the PSM as a technique for estimating the ATT (Dehejia & Wahba, 1999, 2002). Following Dehejia and Wahba (1999, 2002), the remaining sections regard the PSM as a way to calculate the ATT and use causal effects, treatment effects, and ATT interchangeably.
2. Psychology scholars have also extended this logic to develop the causal steps approach for drawing mediating causal inferences (e.g., Baron & Kenny, 1986). It is beyond the scope of this article to fully discuss mediation; interested readers can see LeBreton, Wu, and Bing (2008) and Wood, Goodman, Beckmann, and Cook (2008) for surveys.
3. Becker and Ichino (2002) have written a useful STATA program (pscore) to estimate the propensity score. The convenience of pscore is that the program can stratify propensity scores into a specified number of blocks and test the balance of propensity scores in each block. However, when there is more than one treatment, it is inappropriate to use pscore to estimate the propensity score.
4. Propensity score matching is one of many matched sampling techniques. One can use exact matching based simply on one or more covariates. For example, scholars may match samples based on standard industry classification (SIC) code and firm size rather than matching on propensity scores.
5. These components are the variance of the covariates in the control group, the variance of the covariates in the treated group, the mean of the covariates in the control group, the mean of the covariates in the treated group, and the estimated propensity score. The variances of the covariates in the treated and control groups are weighted by the propensity score.
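One widely cited balance summary built from such components is Rubin's (2001) B statistic, the standardized difference in mean propensity scores between groups. A minimal sketch, with invented scores (Rubin suggests |B| below roughly 0.25 as a sign of adequate balance):

```python
def rubin_b(treated, control):
    """B = (mean_t - mean_c) / sqrt((var_t + var_c) / 2),
    computed on estimated propensity scores."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):  # sample variance (n - 1 denominator)
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    return (mean(treated) - mean(control)) / (((var(treated) + var(control)) / 2) ** 0.5)

# Invented propensity scores: the groups are clearly imbalanced.
b = rubin_b([0.6, 0.7, 0.8], [0.4, 0.5, 0.6])
print(round(b, 6))   # 2.0, far above the conventional 0.25 threshold
```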
6. An instrumental variable (IV) is typically used by scholars under conditions of simultaneity. Because of the difficulty of finding a valid IV, it is not viewed as a general remedy for endogeneity problems.
References
Allison, P. (2009). Fixed effects regression models. Newbury Park, CA: Sage.
Angrist, J. (1998). Estimating the labor market impact of voluntary military service using social security data on
military applicants. Econometrica, 66, 249-288.
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120.
Arceneaux, K., Gerber, A., & Green, D. (2006). Comparing experimental and matching methods using a largescale voter mobilization experiment. Political Analysis, 14, 1-26.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological
research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social
Psychology, 51(6), 1173-1182.
Beck, N., Brüderl, J., & Woywode, M. (2008). Momentum or deceleration? Theoretical and methodological reflections on the analysis of organizational change. Academy of Management Journal, 51(3), 413-435.
Becker, S., & Caliendo, M. (2007). Sensitivity analysis for average treatment effects. Stata Journal, 7(1), 71-83.
Becker, S., & Ichino, A. (2002). Estimation of average treatment effects based on propensity scores. The Stata
Journal, 2, 358-377.
Berk, R. A. (1983). An introduction to sample selection bias in sociological data. American Sociological
Review, 48(3), 386-398.
Campello, M., Graham, J., & Harvey, C. (2010). The real effects of financial constraints: Evidence from a
financial crisis. Journal of Financial Economics, 97, 470-487.
Cochran, W. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13(3), 261-281.
Cochran, W. (1968). The effectiveness of adjustment by subclassification in removing bias in observational
studies. Biometrics, 24, 295-313.
Couch, K. A., & Placzek, D. W. (2010). Earnings losses of displaced workers revisited. American Economic
Review, 100, 572-589.
Cox, D. (1992). Causality: Some statistical aspects. Journal of the Royal Statistical Society, Series A (Statistics
in Society), 155, 291-301.
Dehejia, R., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of
training programs. Journal of the American Statistical Association, 94, 1053-1062.
Dehejia, R., & Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies.
Review of Economics and Statistics, 84, 151-161.
DiPrete, T. A., & Gangl, M. (2004). Assessing bias in the estimation of causal effects: Rosenbaum bounds on
matching estimators and instrumental variables estimation with imperfect instruments. Sociological
Methodology, 34, 271-310.
Duncan, O. D. (1975). Introduction to structural equation models. San Diego, CA: Academic Press.
Drucker, S., & Puri, M. (2005). On the benefits of concurrent lending and underwriting. Journal of Finance,
60(6), 2763-2799.
Efron, B., & Tibshirani, R. (1997). An introduction to the bootstrap. London: Chapman & Hall.
Frank, R., Akresh, I. R., & Lu, B. (2010). Latino Immigrants and the US racial order: How and where do they fit
in? American Sociological Review, 75(3), 378-401.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics,
29, 1189-1232.
Gangl, M. (2006). Scar effects of unemployment: An assessment of institutional complementarities. American
Sociological Review, 71(6), 986-1013.
Gerhart, B. (2007). Modeling human resource management and performance linkages. In P. Boxall, J. Purcell,
& P. Wright (Eds.), The Oxford handbook of human resource management (pp. 552-580). Oxford: Oxford
University Press.
Gerhart, B., Wright, P., & McMahan, G. (2000). Measurement error in research on the human resources and
firm performance relationship: Further evidence and analysis. Personnel Psychology, 53, 855-872.
Greene, W. (2008). Econometric analysis (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Grodsky, E. (2007). Compensatory sponsorship in higher education. American Journal of Sociology, 112(6),
1662-1712.
Gu, X., & Rosenbaum, P. (1993). Comparison of multivariate matching methods: Structures, distances, and
algorithms. Journal of Computational and Graphical Statistics, 2, 405-420.
Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15,
234-249.
Hamilton, B. H., & Nickerson, J. A. (2003). Correcting for endogeneity in strategic management research.
Strategic Organization, 1, 51-78.
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153-161.
Heckman, J., & Hotz, V. (1989). Choosing among alternative nonexperimental methods for estimating the
impact of social programs: The case of manpower training. Journal of the American Statistical
Association, 84, 862-874.
Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using experimental data.
Econometrica, 66, 1017-1098.
Heckman, J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. Review of Economic Studies, 64, 605-654.
Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs, part II: Using the marginal
treatment effect to organize alternative econometric estimators to evaluate social programs, and to forecast
their effects in new environments. Handbook of Econometrics, 6, 4875-5143.
Helmreich, J. E., & Pruzek, R. M. (2009). PSAgraphics: An R package to support propensity score analysis.
Journal of Statistical Software, 29, 1-23.
Hoetker, G. (2007). The use of logit and probit models in strategic management research: Critical issues.
Strategic Management Journal, 28(4), 331-343.
Imbens, G. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3),
706-710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. The
Review of Economics and Statistics, 86, 4-29.
Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation.
Journal of Economic Literature, 47(1), 5-86.
James, L. R. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65(4),
415-421.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models, and data. Thousand
Oaks, CA: Sage.
Joffe, M. M., & Rosenbaum, P. R. (1999). Invited commentary: Propensity scores. American Journal of
Epidemiology, 150, 327-333.
King, G., Keohane, R. O., & Verba, S. (1994). Designing social inquiry: Scientific inference in qualitative
research. Princeton, NJ: Princeton University Press.
Kosanke, J., & Bergstralh, E. (2004). gmatch: Match 1 or more controls to cases using the GREEDY algorithm.
Retrieved from http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas (accessed May 15,
2012)
Lalonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data.
American Economic Review, 76, 604-620.
LeBreton, J. M., Wu, J., & Bing, M. N. (2008). The truth(s) on testing for mediation in the social and organizational sciences. In C. E. Lance, & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban
legends (pp. 107-140). New York, NY: Routledge.
Lechner, M. (2002). Program heterogeneity and propensity score matching: An application to the evaluation of
active labor market policies. Review of Economics and Statistics, 84, 205-220.
Leuven, E., & Sianesi, B. (2003). PSMATCH2: Stata module to perform full Mahalanobis and propensity score
matching, common support graphing, and covariate imbalance testing [Statistical software components].
Boston, MA: Boston College.
Li, Y., Propert, K., & Rosenbaum, P. (2001). Balanced risk set matching. Journal of the American Statistical
Association, 96, 870-882.
Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA:
Sage.
Malmendier, U., & Tate, G. (2009). Superstar CEOs. The Quarterly Journal of Economics, 124(4),
1593-1638.
McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression
for evaluating causal effects in observational studies. Psychological Methods, 9, 403-425.
Mellor, S., & Mark, M. M. (1998). A quasi-experimental design for studies on the impact of administrative
decisions: Applications and extensions of the regression-discontinuity design. Organizational Research
Methods, 1(3), 315-333.
Morgan, S. L., & Harding, D. J. (2006). Matching estimators of causal effects—Prospects and pitfalls in theory
and practice. Sociological Methods & Research, 35, 3-60.
Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social
research. Cambridge, UK: Cambridge University Press.
Painter, J. (2004). SPSS Syntax for nearest neighbor propensity score matching. Retrieved from http://www.
unc.edu/~painter/SPSSsyntax/propen.txt (accessed May 15, 2012)
Parsons, L. (2004). Performing a 1: N case-control match on propensity score. Proceedings of the 29th Annual
SAS Users Group International Conference, SAS Institute, Montreal, Canada.
Ridgeway, G., McCaffrey, D., & Morral, A. (2006). Toolkit for weighting and analysis of nonequivalent
groups: A tutorial for the twang package. Santa Monica, CA: RAND Corporation.
Rosenbaum, P. (1987). The role of a second control group in an observational study. Statistical Science, 2,
292-306.
Rosenbaum, P. (2002). Observational studies. New York, NY: Springer-Verlag.
Rosenbaum, P. (2004). Matching in observational studies. In A. Gelman & X. Meng (Eds.), Applied Bayesian
modeling and causal inference from an incomplete-data perspective (pp. 15-24). New York, NY: Wiley.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
Rosenbaum, P., & Rubin, D. (1984). Reducing bias in observational studies using subclassification on the
propensity score. Journal of the American Statistical Association, 79, 516-524.
Rosenbaum, P., & Rubin, D. (1985). Constructing a control group using multivariate matched sampling
methods that incorporate the propensity score. American Statistician, 39, 33-38.
Rousseau, D. (2006). Is there such a thing as evidence-based management? Academy of Management Review, 31, 256-269.
Rubin, D. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal
Medicine, 127, 757-763.
Rubin, D. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3), 169-188.
Rubin, D. (2004). Teaching statistical inference for causal effects in experiments and observational studies.
Journal of Educational and Behavioral Statistics, 29, 343-367.
Rynes, S., Giluk, T., & Brown, K. (2007). The very separate worlds of academic and practitioner periodicals in
human resource management: Implications for evidence-based management. Academy of Management
Journal, 50(5), 987-1008.
Schonlau, M. (2005). Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal,
5, 330-354.
Sekhon, J. S. (2007). Multivariate and propensity score matching software with automated balance optimization: The matching package for R. Journal of Statistical Software, 10(2), 1-51.
Smith, J., & Todd, P. E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125, 305-353.
Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in
controlling for selection bias in observational studies. Psychological Methods, 15, 250-267.
Wasserman, N. (2003). Founder-CEO succession and the paradox of entrepreneurial success. Organization
Science, 14(2), 149-172.
Wolfe, F., & Michaud, K. (2004). Heart failure in rheumatoid arthritis: Rates, predictors, and the effect of
anti-tumor necrosis factor therapy. American Journal of Medicine, 116, 305-311.
Wood, R. E., Goodman, J. S., Beckmann, N., & Cook, A. (2008). Mediation testing in management research:
A review and proposals. Organizational Research Methods, 11(2), 270-295.
Wooldridge, J. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
Xuan, Y. (2009). Empire-building or bridge-building? Evidence from new CEOs' internal capital allocation decisions. Review of Financial Studies, 22, 4919-4948.
Bio
Mingxiang Li is a doctoral candidate at the Wisconsin School of Business, University of Wisconsin-Madison. In addition to research methods, his current research interests include corporate governance, social networks, and entrepreneurship.