
CHAPTER 21
Notes on Subgroup Analyses and Meta-Regression
Introduction
Computational model
Multiple comparisons
Software
Analyses of subgroups and regression analyses are observational
Statistical power for subgroup analyses and meta-regression
INTRODUCTION
In this chapter we address a number of issues that are relevant both to subgroup
analyses (analysis of variance) and to meta-regression.
COMPUTATIONAL MODEL
The researcher must always choose between a fixed-effect model and a random-effects model. When we are working with a single set of studies the fixed-effect
analysis assumes that all studies share a common effect size. When we are working
with subgroups, it assumes that all studies within a subgroup share a common effect
size. When we are working with meta-regression, it assumes that all studies which
have the same values on the covariates share a common effect size.
These kinds of assumption can sometimes be justified, as in the pharmaceutical
example that we used on pages 83, 161 and 195. In most cases, however, especially
when the studies for the review have been culled from the literature, it is more
plausible to assume that the subgroup membership or the covariates explain some,
but not all, of the dispersion in effect sizes. Therefore, the random-effects model is
more likely to fit the data, and is the model that should be selected.
Mistakes to avoid in selecting a model
When we introduced the idea of a random-effects model in Chapter 12 we noted that
researchers sometimes start with a fixed-effect model and then move to a random-effects model if there is empirical evidence of heterogeneity (a statistically significant p-value). In the case of subgroup analysis this approach would suggest that
we start by using the fixed-effect model within groups, and then move to the
random-effects model only if Q within groups was statistically significant. In the
case of meta-regression it would suggest that we start by using the fixed-effect
model, and then move to the random-effects model only if the Q for the residual
error was statistically significant.
We explained that this approach was problematic when working with a single set
of studies, and it continues to be a bad idea here, for the same reasons. If substantive
considerations suggest that the effect size is likely to vary (within the full set of
studies, within subgroups, or for studies with a common set of covariate values) then
we should be using the corresponding model even if the test for heterogeneity fails
to yield a significant p-value. This lack of significance means only that we have
failed to meet a certain threshold of proof (possibly because of low statistical
power) and does not prove that the studies share a common effect size.
Practical differences related to the model
Researchers often ask about the practical implications of using a random-effects
model rather than a fixed-effect model. The random-effects model will apportion
the study weights more evenly, so that a large study has less impact on the summary
effect (or regression line) than it would have under the fixed-effect model, and a
small study has more impact than it would have under the fixed-effect model. Also,
confidence intervals will tend to be wider under the random-effects model than
under the fixed-effect model. While this tells us what the impact will be of using the
fixed-effect or random-effects model, it says nothing about which model we should
use. The only issue relevant to that decision is the question of which model fits
the data.
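To make this concrete, the following sketch (in Python, with purely illustrative numbers) shows how adding a between-studies variance component to the weights evens out the relative influence of a large and a small study.

```python
# A small numeric illustration of how adding tau-squared to the weights
# evens out the influence of large and small studies (values hypothetical).
import numpy as np

v = np.array([0.01, 0.10])        # within-study variances: large study, small study
tau2 = 0.05                       # assumed between-studies variance

w_fixed = 1 / v                   # fixed-effect weights
w_random = 1 / (v + tau2)         # random-effects weights

print(w_fixed / w_fixed.sum())    # [0.909, 0.091] -> large study dominates
print(w_random / w_random.sum())  # [0.714, 0.286] -> weights more even
```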
The null hypothesis under the different models
Since the meaning of a summary effect size is different for fixed versus random
effects, the null hypothesis being tested also differs under the two models.
Recall that when we are working with a single group, under the fixed-effect
model the summary effect size represents the common effect size for all the studies.
The null hypothesis is that the common effect size is equal to a nil value (0.00 for a
difference, or 1.00 for a ratio). By contrast, under the random-effects model the
summary effect size represents the mean of the distribution of effect sizes. The null
hypothesis is that the mean of all possible studies is equal to the nil value.
In subgroup analyses, under the fixed-effect model the summary effect size for
subgroups A and B each represents the common effect size for a group of studies.
The null hypothesis is that the common effect for the A studies is the same as the
common effect for the B studies. By contrast, under the random-effects model the
effect size for subgroups A and B each represents the mean of a distribution of effect
sizes. The null hypothesis is that the mean of all possible A studies is identical to the
mean of all possible B studies.
In regression, under the fixed-effect model we assume that there is one true effect
size for any given value of the covariate(s). The null hypothesis is that this effect
size is the same for all values of the covariate(s). By contrast, under the random-effects model we assume that there is a distribution of effect sizes for any given
value of the covariate(s). The null hypothesis is that the mean of this distribution is
the same for all values of the covariate(s).
While the distinction between a common effect size and a mean effect size might
sound like a semantic nuance, it actually reflects an important distinction between
the models. In the case of the fixed-effect model, because we assume that we are
dealing with a common effect size, we apply an error model which assumes that the
between-studies variance is zero. In the case of the random-effects model, because
we allow that the effect sizes may vary, we apply an error model which makes
allowance for this additional source of uncertainty. This difference has an impact on
the mean (the summary effect for a single group, the summary effect within
subgroups, and the slope in a meta-regression). It also has an impact on the standard
error, tests of significance and confidence intervals.
Some technical considerations in random-effects meta-regression
As is the case for a standard random-effects meta-analysis in the absence of
covariates, several methods are available for estimating τ² in meta-regression,
including a moment method and maximum likelihood method. In practice, any
differences among methods will usually be small.
The results we presented in this chapter used a moment estimate, which is the
same as the method we used in Chapter 16 to estimate τ² for a single group. If we
were to perform a meta-regression with no covariates, our estimate of τ² would be
the same as the estimate we would obtain using the formulas in that chapter.
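For readers who want to see the computation, here is a minimal sketch of the moment (method-of-moments) estimate of τ² for a single group of studies, corresponding to the formulas in Chapter 16; the array names are illustrative.

```python
# A minimal sketch of the method-of-moments estimate of tau-squared for a
# single group of studies.  `effects` and `variances` are hypothetical arrays
# of study effect sizes and their within-study variances.
import numpy as np

def tau_squared_moments(effects, variances):
    w = 1.0 / np.asarray(variances)              # fixed-effect weights
    y = np.asarray(effects)
    m = np.sum(w * y) / np.sum(w)                # fixed-effect summary
    q = np.sum(w * (y - m) ** 2)                 # Q statistic
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - df) / c)                # truncate negative values at zero
```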
Whichever method is used to estimate τ², the use of a Z-test to assess the
statistical significance of a covariate (or the difference between two subgroups),
while common, is not strictly appropriate. When dealing with simple numerical
data, to compute a confidence interval or to test the significance of a difference (or
variance) we use Z if the sampling distribution is known. By contrast, we use t if the
sampling distribution is being estimated from the dispersion observed in the sample
(as it is, for example, when we compare means using a t-test).
Similarly, in meta-analysis, the Z-distribution is appropriate only for the fixed-effect
model, where the only source of error is within studies. By contrast, when we use a
random-effects model, we are estimating the dispersion across studies, and should
account for this by using a t-distribution. Several methods have been proposed to
address this issue, including one by Knapp and Hartung (2003) which is outlined
below. While these can be applied to any use of the random-effects model (for a single
group of studies, for a subgroup analysis, and for meta-regression), they have to date
only been implemented in software for meta-regression.
The Knapp-Hartung method involves two modifications to the standard error for
the random-effects model. First, the between-studies component of the variance
is multiplied by a factor that makes it correspond to the t-distribution rather than the
Z-distribution. Second, the test statistic is compared against the t-distribution rather
than the Z-distribution. This has the effect of expanding the width of the confidence
intervals and of moving the p-value away from zero.
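As a rough illustration, the sketch below implements the basic form of this adjustment for a meta-regression with a single covariate, assuming that an estimate of τ² is already available. It omits refinements such as truncating the scale factor, and all variable and function names are illustrative rather than taken from any particular software package.

```python
# A hedged sketch of a Knapp-Hartung style adjustment for a random-effects
# meta-regression with one covariate.  `y` are effect sizes, `v` their
# within-study variances, `x` the covariate, and `tau2` a previously obtained
# estimate of tau-squared; all names are illustrative.
import numpy as np
from scipy import stats

def knapp_hartung_slope_test(y, v, x, tau2):
    y, v, x = map(np.asarray, (y, v, x))
    w = 1.0 / (v + tau2)                         # random-effects weights
    X = np.column_stack([np.ones_like(x), x])    # intercept + covariate
    WX = X * w[:, None]
    cov = np.linalg.inv(X.T @ WX)                # model-based covariance of b
    b = cov @ (WX.T @ y)                         # weighted least-squares fit
    k, p = len(y), X.shape[1]
    resid = y - X @ b
    s2 = np.sum(w * resid ** 2) / (k - p)        # scale factor for the variance
    se_slope = np.sqrt(s2 * cov[1, 1])           # adjusted standard error
    t = b[1] / se_slope
    pval = 2 * stats.t.sf(abs(t), df=k - p)      # compare against t, not Z
    return b[1], se_slope, t, pval
```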
Higgins and Thompson (2004) proposed an approach that bypasses the sampling
distributions and instead employs a permutation test to yield a p-value. Using this
approach we compute the Z-score corresponding to the observed covariate. Then,
we randomly redistribute the covariates among studies and see what proportion of
these yield a Z-score exceeding the one that we had obtained. This proportion may
be viewed as an exact p-value.
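A minimal sketch of this procedure follows. It assumes a user-supplied function (here called z_statistic, a hypothetical helper) that returns the Z-score for the covariate from a weighted meta-regression of the effect sizes on the covariate.

```python
# A minimal sketch of the permutation approach described above: compute the
# observed test statistic, then repeatedly shuffle the covariate values across
# studies and count how often the permuted statistic is as extreme.
import numpy as np

def permutation_p_value(y, v, x, z_statistic, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(z_statistic(y, v, x))
    count = 0
    for _ in range(n_perm):
        x_perm = rng.permutation(x)              # redistribute covariate values
        if abs(z_statistic(y, v, x_perm)) >= observed:
            count += 1
    return count / n_perm                        # proportion = permutation p-value
```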
MULTIPLE COMPARISONS
In primary studies researchers often need to address the issue of multiple comparisons. The basic problem is that if we conduct a series of analyses with alpha set at
0.05 for each, then the overall likelihood of a type I error (assuming that the null is
actually true) will exceed 5%. This problem crops up when a study includes more
than two groups and we compare more than one pair of means. It also arises when
we perform an analysis on more than one outcome.
While there is consensus that conducting many comparisons can pose a
problem, there is no consensus about how this problem should be handled.
Some suggest conducting an omnibus test that asks if there are any nonzero
effects, and then proceeding to look at pair-wise comparisons only if the initial
test meets the criterion for significance. Others suggest going straight to the pair-wise tests but using a stricter criterion for significance (say 0.01 rather than 0.05
for five tests). Hedges and Olkin (1985) discuss this and other methods to control
the error rate when using multiple tests. Some suggest that the researcher not
make any formal adjustment, but evaluate the data in context. For example, one
significant p-value in forty tests would not be seen as grounds for rejection of the
null hypothesis.
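As a small illustration of the stricter-criterion approach, the following snippet applies a Bonferroni-style threshold to a set of hypothetical p-values.

```python
# With five tests and an overall alpha of 0.05, each individual comparison is
# judged against 0.05 / 5 = 0.01.  The p-values below are hypothetical.
alpha_overall = 0.05
n_tests = 5
alpha_per_test = alpha_overall / n_tests        # 0.01 for each comparison
p_values = [0.003, 0.04, 0.20, 0.008, 0.60]
significant = [p < alpha_per_test for p in p_values]
print(significant)                               # [True, False, False, True, False]
```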
Essentially the same issue exists in meta-analysis. In the case of subgroup
analyses, if a meta-analysis includes a number of subgroups, the issue of multiple
comparisons arises when we start to compare several pairs of subgroup means. In
the case of meta-regression this issue arises when we include a number of covariates
and want to test each one for significance. As with primary studies, while there is
consensus that conducting many comparisons can pose a problem, there is no
consensus about how this problem should be handled. The approaches generally
used for primary studies can be applied to meta-analysis as well.
SOFTWARE
Some of the programs developed for meta-analysis are able to perform subgroup
analysis as well as meta-regression (see Chapter 44). Note that programs intended
for statistical analysis of primary studies should not be used to perform these
procedures in meta-analysis, for two reasons. First, routines for analysis of variance
or multiple regression intended for primary studies do not weight the studies, as is
needed for meta-analysis. While most programs do allow the user to assign weights,
this becomes a difficult procedure when we move to random-effects weights (which
are usually the ones we want to use). Second, the rules for assigning degrees of
freedom in the analysis of variance and meta-regression are different for meta-analysis than for primary studies, and so using the primary-study routines for a
meta-analysis will yield incorrect standard errors and p-values.
ANALYSES OF SUBGROUPS AND REGRESSION ANALYSES ARE OBSERVATIONAL
In a randomized trial, participants are assigned at random to a condition (such as
treatment versus placebo). Because the participants are assumed to be similar in all
respects except for the treatment condition, differences that do emerge between
conditions can be attributed to the treatment. By contrast, in an observational study
we compare pre-existing groups, such as workers with a college education versus
those who did not attend college. While we can report on differences in wages of the
two groups we cannot attribute this outcome to the amount of schooling because the
groups differ in various ways. For example, we are likely to find that those with a
college education are paid more, but we cannot attribute this to their schooling since
it could be due (at least in part) to other factors associated with higher socioeconomic status.
The issue of randomized versus observational studies as it relates to meta-analysis is discussed in Chapter 40. There, we discuss the fact that randomized
studies and observational studies address different questions, and for this reason it
generally makes sense to include only one or the other in a given meta-analysis.
However, there is one issue that is directly relevant to the present discussion, as
follows. Assume we start with a set of randomized experiments that assess the
impact of an intervention. The effect in any single experiment could serve to
establish causality and the summary effect can also serve to establish causality.
This is because the relationship between treatment and outcome is protected by the
randomization process (it must be due to treatment) in each study, and this protection carries over to the summary effect.
However, even if the individual studies are randomized trials, once we move beyond
the goal of reporting a summary effect and proceed to perform a subgroup analysis or
meta-regression, we have moved out of the domain of randomized experiments, and into
the domain of observational studies. For example, suppose that half the studies used a
low dose of aspirin while half used a high dose, and that the impact of the treatment was
significantly stronger in the high-dose studies. It is possible that the difference is due to
the dose, but it is also possible that the studies that used a higher dose differed in some
systematic way from the other studies. Perhaps these studies used patients who were in
poor health, or older, and therefore more likely to benefit from the treatment. Therefore,
the difference between subgroups, or the relationship between a covariate and effect
size, is observational. The same caveats that apply to any observational study, in
particular the fact that a relationship does not imply causality, apply here too.
That said, in primary observational studies, researchers sometimes use regression analysis to try to remove the impact of potential confounders. In the
aspirin example they might enter covariates in the sequence of health, age, and
dose, to assess the impact of dose with health and age held constant. This is not a
perfect solution since there may be other confounders of which we are not aware,
but this approach can help to isolate the impact of specific factors and generate
hypotheses to be tested in randomized trials. The same holds true for meta-regression. Of course, since covariate values are assigned at the study level,
meta-regression can be used to adjust for potential confounders only for comparisons across studies, and not for potential confounders within studies.
There is one exception to the rule that subgroup analysis and regression cannot
prove causality. This exception is the case where we know that the studies are
identical in all respects except for the one captured by subgroup membership or by
the covariate. The pharmaceutical example is a case in point. Here, we enrolled
1000 patients and assigned some to studies that would test a low dose of the drug vs.
placebo, and others to studies that would test a high dose of the drug vs. placebo.
Here, the assignment to subgroups is random. The same would apply if the patients
were assigned to ten studies where the dose of drug was varied on a continuous
scale, and we used meta-regression to test the relationship between dose and effect
size. This set of circumstances is rarely (if ever) found in practice.
STATISTICAL POWER FOR SUBGROUP ANALYSES AND META-REGRESSION
Statistical power is the likelihood that a test of significance will reject the null
hypothesis. In the case of subgroup analyses it is the likelihood that the Z-test to
compare the effect in two groups, or the Q-test to compare the effects across a series
of groups, will yield a statistically significant p-value. In the case of meta-regression it is the likelihood that the Z-test of a single covariate or the Q-test of a set of
covariates will yield a statistically significant p-value.
Power depends on the size of the effect and the precision with which we measure
the effect. For subgroup analysis this means that power will increase as the
difference between (or among) subgroup means increases, and/or the standard error
within subgroups decreases. For meta-regression this means that power will
increase as the magnitude of the relationship between the covariate and effect size
increases, and/or the precision of the estimate increases. In both cases, a key factor
driving the precision of the estimate will be the total number of individual subjects
across all studies and (for random effects) the total number of studies.
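As an illustration of this point, the following simulation sketch (with purely hypothetical numbers) estimates the power of the Z-test comparing two subgroup means for a given true difference and given standard errors of the subgroup estimates.

```python
# A hedged simulation sketch: the chance that the Z-test comparing two
# subgroup means reaches p < 0.05 depends on the true difference and on the
# precision of each subgroup estimate.  All numbers are illustrative.
import numpy as np
from scipy import stats

def subgroup_power(true_diff, se_a, se_b, n_sim=20000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    m_a = rng.normal(true_diff, se_a, n_sim)     # simulated subgroup A estimates
    m_b = rng.normal(0.0, se_b, n_sim)           # simulated subgroup B estimates
    z = (m_a - m_b) / np.sqrt(se_a**2 + se_b**2)
    p = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p < alpha)                    # proportion of significant results

print(subgroup_power(true_diff=0.2, se_a=0.15, se_b=0.15))  # modest power
print(subgroup_power(true_diff=0.2, se_a=0.05, se_b=0.05))  # much higher power
```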
While there is a general perception that power for testing the main effect is
consistently high in meta-analysis, this perception is not correct (see Chapter 29)
and certainly does not extend to tests of subgroup differences or to meta-regression.
The failure to find a statistically significant p-value when comparing subgroups or
in meta-regression could mean that the effect (if any) is quite small, but could also
mean that the analysis had poor power to detect even a large effect. One should
never use a nonsignificant finding to conclude that the true means in subgroups are
the same, or that a covariate is not related to effect size.
SUMMARY POINTS
• The selection of a computational model (fixed-effect or random-effects) should be based on our understanding of the underlying distribution. In most cases, especially when the studies have been gathered from the published literature, the random-effects model (within subgroups) is more plausible than the fixed-effect model.
• The strategy of starting with the fixed-effect model and then moving to the random-effects (or mixed-effect) model if the test for heterogeneity is significant, is a mistake, and should be strongly discouraged.
• The problem of performing multiple tests (the fear that the actual alpha may exceed the nominal alpha) is similar in meta-analysis to the same problem in primary studies, and similar strategies are suggested for dealing with this problem.
• The relationship between effect size and subgroup membership, or between effect size and covariates, is observational, and cannot be used to prove causality. This holds true even if all studies in the analysis are randomized trials. The protection afforded by the study design carries over to the summary effect across all studies, but not to other analyses.
• Statistical power for detecting a difference among subgroups, or for detecting the relationship between a covariate and effect size, is often low, and the usual caveats apply. To wit, failure to obtain a statistically significant difference among subgroups should never be interpreted as evidence that the effect is the same across subgroups. Similarly, failure to obtain a statistically significant effect for a covariate should never be interpreted as evidence that there is no relationship between the covariate and the effect size.
Further Reading
Cohen, J., West, S.G., Cohen, P., & Aiken, L. (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Higgins, J.P.T., & Thompson, S.G. (2004). Controlling the risk of spurious findings from meta-regression. Statistics in Medicine 23, 1663–1682.
Knapp, G., & Hartung, J. (2003). Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine 22, 2693–2710.