Methodology of the Outcome Evaluation of the VERB™ Campaign
Lance D. Potter, MA, David R. Judkins, MA, Andrea Piesse, PhD, Mary Jo Nolin, PhD, Marian Huhman, PhD
Abstract:
This article summarizes the methods used in the outcome evaluation of the VERB™
campaign. The outcome evaluation was designed to measure the awareness and understanding of VERB among the target audience of children aged 9 –13 years (tweens) and to
determine the effect of VERB awareness on psychosocial and behavioral outcomes. Cohorts
of tweens and parents were interviewed annually via a telephone survey (Youth Media
Campaign Longitudinal Survey). The first cohort (baseline) was surveyed in 2002 prior to
VERB advertising and was repeated annually through 2006. A second cohort was surveyed
in 2004 –2006. A third, cross-sectional sample was surveyed in 2006. Each cohort consisted
of a nationally representative sample of tweens to enable generalizability to the nation as
a whole. Propensity scoring was used to control for confounding influences. The outcomes
were analyzed for dose–response effects (i.e., whether higher levels of awareness led to
stronger effects) and overall awareness effects (i.e., the difference between tweens unaware
of VERB and all tweens in the U.S.). Secular trends in tweens’ physical activity during the
life of the campaign were also examined. This article also discusses weighting and
imputation, alternative analyses used to assess the adequacy of the propensity methods, and
the challenges involved in media campaign evaluations.
(Am J Prev Med 2008;34(6S):S230 –S240) © 2008 American Journal of Preventive Medicine
Introduction
The VERB™ campaign was a media-based health
marketing campaign of the CDC intended to
improve physical activity levels among children
aged 9 –13 years (tweens).1 In 2002, when CDC began
to plan the campaign, the agency was aware that
developing strong evidence of the campaign’s effects
would be essential. This article summarizes the methods used in the outcome evaluation of VERB, noting
techniques that were used to overcome design limitations common to evaluations of health communication
programs.
The primary goal of the outcome evaluation was to
determine whether associations could be detected in
the tween population between awareness of the campaign and psychosocial or behavioral outcomes related
to physical activity. The evaluation began prior to the
launch of the intervention and continued through the
conclusion of the campaign in 2006. The methods used
were statistically complex; to make them accessible to a
wide audience, they will be presented here without
formulae and without detailed statistical discussion.
Referenced articles provide more detailed information
about this evaluation and further discussions of the
statistical methods used.
Evaluation Design and Research Methodology
To accommodate all the elements of the campaign, the
evaluation included a range of process measurement
activities2– 4 in addition to a longitudinal outcome
evaluation. The focus of this article is a description of
the analytic methods used to address three outcome
research questions:
1. What level of awareness of the VERB campaign was
achieved among the target population and to what
extent did they understand its messages?
2. Was there evidence of an association between
tweens’ awareness of VERB and their attitudes and
behaviors related to physical activity?
3. Were there temporal changes in tweens’ attitudes
and behaviors related to physical activity during the
campaign?
Overview of Evaluation Methods
Prior to the launch of the campaign in 2002, a randomly selected longitudinal panel of parent–tween
dyads was interviewed by telephone concerning their
physical activity–related beliefs and behaviors and their
households’ characteristics. Questions about awareness
of VERB were added to the survey in 2003, and the panel was interviewed annually through 2006. An additional panel was added in 2004, and a final cross-sectional sample of parent–tween dyads was added in
2006. Data were fully imputed and weighted to national
population totals for children aged 9 –13 years. The
pre-campaign data were modeled to create propensity
scores that adjusted for confounding influences in a
series of endpoint analyses (2003, 2004, 2005, and
2006). A dose–response analysis sought evidence that as
the reported frequency of seeing or hearing VERB
advertisements increased, VERB-related outcomes improved. The remainder of this article provides additional detail on the design and methods used.
Panel Design
The outcome evaluation data were collected from
parent and tween respondents recruited from households contacted at random. Within each cooperating
household, a dyad (consisting of a child aged 9 –13
years and one of the child’s parents or a guardian) was
interviewed by telephone. The outcomes assessed are
described below.
Three panels consisting of parent–tween dyads were recruited. Figure 1 summarizes the design. Panel 1 respondents (3120 dyads) were first surveyed in spring 2002, before any VERB advertising was launched, and follow-up surveys were conducted in spring 2003, spring 2004, spring 2005, and spring 2006.

Figure 1. Panel design for the Youth Media Campaign survey, 2002–2006. All surveys were conducted April–June.
By 2004, 40% of the tween respondents in Panel 1
were beyond the campaign’s target age range, so a
second panel of 5177 parent–tween dyads was recruited
and interviewed. Panel 2 respondents were interviewed
again in spring 2005 and spring 2006. To ensure
responses for all years in the tween target age range for
the final data collection, a third set of respondent dyads
(Panel 3: 1200 dyads) was recruited and interviewed a
single time, in spring 2006.
Response Rates
For each panel created, a list-assisted, random-digit-dialed method was used to select a sample of households with telephones, and that sample was screened
to identify households with children aged 9 –13 years.
To increase the likelihood of participation, prior to
beginning telephoning, CDC sent letters of invitation
to households with potential participants. Overall
response rates per time period were calculated by
multiplying household screening rate by the parent
interview completion rate by the tween interview
completion rate. The screening rate (i.e., those
households that finished the 1-minute screening
interview to determine whether they had a child aged
9 –13 years) was 62%, a typical screening rate in a
randomized telephone survey. Cooperation rates for
parents and tweens in both panels for all years always
exceeded 80%. Initial screening rates and cooperation rates for both panels in all years are in Table 1.

Table 1. Screening and cooperation rates (%), Panel 1 (2002–2005) and Panel 2 (2004–2005)

                                   Panel 1                       Panel 2
                                   2002   2003   2004   2005     2004   2005
Screener response rate [SRR]        61     —      —      —        60     —
Parent cooperation rate [PCR]       87     75     84     87       85     78
Child cooperation rate [CCR]        81     91     97     98       88     96
Annual response rate [ARR]a         70     71     82     85       75     75
Cumulative response rate [CRR]b     43     38     31     26       45     38

a Product of parent and tween cooperation rates.
b In 2002, the cumulative response rate is the product of the screener response rate and the annual response rate. In 2003, all 2002 cooperating parents were contacted regardless of tween result in 2002. Therefore, 2003 CRR = 2002 SRR * 2002 PCR * 2003 ARR. Thereafter, the cumulative response rate is the product of the previous year's cumulative rate and the current year's annual rate.
Surveys were conducted in English or Spanish, depending on the preference of the respondent. About
8% of the parent interviews and 3% of the tween
interviews were conducted in Spanish. The survey was
conducted April–June of each year to maintain a
consistent proportion of tweens in and out of school
at the time of interview. The aim was to complete as
many interviews as possible before the school year
ended. The proportion of tweens in school when
each panel was surveyed ranged from 82% to 87%,
except in 2004, when 71% of the tweens in Panel 1
were in school when interviewed.
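The response-rate arithmetic summarized above and in the footnotes to Table 1 can be reproduced directly. A minimal sketch in Python (the function names are illustrative, not part of the evaluation's own code):

```python
# Response-rate rules from Table 1 (footnotes a and b). Rates are proportions (0.61 = 61%).

def annual_rate(parent_coop, tween_coop):
    """Annual response rate: parent cooperation rate x tween cooperation rate."""
    return parent_coop * tween_coop

def cumulative_rates(screener_rate, first_year_parent_coop, annual_by_year):
    """Cumulative response rates for a panel surveyed in successive years.

    Year 1: screener rate x year-1 annual rate.
    Year 2: screener rate x year-1 parent cooperation rate x year-2 annual rate
            (all cooperating year-1 parents were recontacted regardless of tween result).
    Later years: previous cumulative rate x current annual rate.
    """
    cumulative = [screener_rate * annual_by_year[0]]
    if len(annual_by_year) > 1:
        cumulative.append(screener_rate * first_year_parent_coop * annual_by_year[1])
    for year in range(2, len(annual_by_year)):
        cumulative.append(cumulative[-1] * annual_by_year[year])
    return cumulative

# Panel 1 (Table 1): 2002 parent and tween cooperation of 87% and 81% give a 70% annual rate.
print(round(annual_rate(0.87, 0.81), 2))                               # 0.7 (i.e., 70%)
# The published annual rates for 2002-2005 reproduce the cumulative rates 43%, 38%, 31%, 26%.
print([round(c, 2) for c in cumulative_rates(0.61, 0.87, [0.70, 0.71, 0.82, 0.85])])
```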
Outcome Measures
During their interviews, tweens were asked about three
domains: awareness and understanding of VERB, level
of participation in physical activity, and attitudes toward
physical activity. An independent study5 validated the
behavioral measures, seeking correlations between survey response data and data from accelerometers and
activity logs. That study found correlations as high as, or higher than, the correlations between widely used
measures and accelerometer data described in scientific literature on physical activity research.6,7
Measurement of Awareness
Four components of awareness were measured: unaided awareness of VERB, aided awareness of VERB, understanding of VERB messages, and recalled frequency of encountering VERB.8 First, at the end of the survey tweens were asked Have you seen, heard, or read any messages or advertising about kids getting active? Those who reported VERB messages without further prompting were categorized as having unaided awareness of the campaign. All other tweens were asked, Have you seen, heard, or read any messages or advertising about VERB? Children who answered in the affirmative were categorized as having aided awareness of VERB. Both groups were asked What is VERB all about? and What does VERB mean to you? Responses to these items were coded, and tweens who correctly described at least one key message of the advertising were credited with understanding the campaign. Understanding was incorporated into the definition of awareness such that tweens who recalled but did not understand the VERB messages were not credited with awareness in the measurement of outcomes. Both unaided and aided recognition of VERB were considered awareness of VERB.

Beginning in 2004, tweens aware of VERB were asked to recall the frequency with which they were aware of the campaign's advertising on television or radio. On the basis of reported frequency of seeing or hearing VERB advertisements, tweens were categorized into one of four levels of frequency: about every day, several times a week, about once a week, and less than once a week. Along with those not aware, this created five levels of frequency of awareness.
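In code, the resulting five-level classification can be sketched as below; the field names and category labels are hypothetical stand-ins for the survey's actual items, and the coding of the open-ended responses was done by coders rather than by rule.

```python
# Hypothetical sketch of the five-level frequency-of-awareness variable described above.

REPORTED_FREQUENCY = {        # levels 1-4 from the frequency item asked of aware tweens
    "less_than_once_a_week": 1,
    "about_once_a_week": 2,
    "several_times_a_week": 3,
    "about_every_day": 4,
}

def aware_of_verb(unaided_mention, aided_recall, understood_message):
    """A tween counts as aware only if VERB was recalled (unaided or aided)
    AND at least one key campaign message was correctly described."""
    return (unaided_mention or aided_recall) and understood_message

def awareness_level(unaided_mention, aided_recall, understood_message, reported_frequency=None):
    """Five-level variable: 0 = unaware; 1-4 = reported frequency among aware tweens."""
    if not aware_of_verb(unaided_mention, aided_recall, understood_message):
        return 0
    return REPORTED_FREQUENCY[reported_frequency]

# Aided recall, understood the message, reports seeing VERB ads several times a week.
print(awareness_level(False, True, True, "several_times_a_week"))   # -> 3
# Recalled VERB but could not describe a key message: not credited with awareness.
print(awareness_level(True, False, False))                          # -> 0
```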
Psychosocial Outcome Measures
Three psychosocial scales were developed from the 15 attitude and belief items contained in the Youth Media Campaign Longitudinal Survey (YMCLS): outcome expectations, self-efficacy, and social influences.
These scales were used in the 2004, 2005, and 2006 analyses. Table 2 contains the items and factor loadings for the scales.

Table 2. Items in the YMCLS psychosocial outcome scales

Item                                                                                                Factor loading
Outcome expectations scale (α 0.73)
  If I did physical activities on most days it would be boring (4-point agreement scale)*               0.604
  If I did physical activities on most days it would be fun (4-point agreement scale)                   0.735
  If I did physical activities on most days it would help me make new friends (4-point agreement scale) 0.649
  If I did physical activities on most days it would help me spend more time with my friends
    (4-point agreement scale)                                                                           0.692
  If I did physical activities on most days it would make me feel good about myself
    (4-point agreement scale)                                                                           0.714
Social influences scale (α 0.70)
  My friends think that doing physical activities is fun (4-point agreement scale)                      0.728
  Kids my age think that doing physical activities is fun (4-point agreement scale)                     0.718
  My friends think that doing physical activities is important (4-point agreement scale)                0.709
  Kids my age think that doing physical activities is important (4-point agreement scale)               0.681
  How many kids your age do physical activities every day? (All, most, some, or none)                   0.497
  How many of your friends do physical activities every day? (All, most, some, or none)                 0.589
Self-efficacy scale (α 0.66)
  I think I can be physically active no matter how busy my day is (4-point agreement scale)             0.747
  I think I can be physically active no matter how tired I might feel (4-point agreement scale)         0.718
  I think I can be physically active even if it is hot or cold outside (4-point agreement scale)        0.704
  I think I have what it takes to be physically active (4-point agreement scale)                        0.650

*Reverse coded
YMCLS, Youth Media Campaign Longitudinal Survey
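Scales of this kind are typically scored by reverse coding the negatively worded item and averaging, with internal consistency summarized by Cronbach's α. The sketch below illustrates that computation on made-up 4-point responses; it is not the evaluation's own scoring code.

```python
import numpy as np

def reverse_code(item, max_value=4, min_value=1):
    """Flip a 4-point agreement item so that higher always means more favorable."""
    return (max_value + min_value) - item

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses to the five outcome-expectations items (1-4 agreement scale);
# the first item ("...it would be boring") is reverse coded before scoring.
responses = np.array([
    [1, 4, 3, 4, 4],
    [2, 3, 3, 3, 4],
    [1, 4, 4, 4, 3],
    [3, 2, 2, 3, 2],
    [2, 3, 3, 2, 3],
])
responses[:, 0] = reverse_code(responses[:, 0])
scale_score = responses.mean(axis=1)          # one score per respondent
print(scale_score)
print(round(cronbach_alpha(responses), 2))
```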
Behavioral Outcome Measures
Behavioral outcomes sought to capture the tweens’
out-of-school physical activity during the 7 days before
the interview. Four behavioral outcomes were assessed
from the tweens’ self-reports of physical activity. Three
outcomes were based on tweens’ reports of (1) any
nonschool physical activity on the day prior to the
interview, (2) all nonschool physical activity in their
free-time during the previous 7 days, and (3) all nonschool organized physical activity in which the tween
participated during the prior 7 days. Free-time physical
activity was defined as things you did on your own, in your
free time, in contrast to organized physical activity, which was described as having been done with a coach, supervisor, or leader. The fourth behavioral measure, the sum
of free-time physical activity and organized physical
activity, was intended to approximate all nonschool
physical activity during the 7 days before the interview.
Demographic and Lifestyle Items
To isolate the relationship between the frequency of
seeing VERB and campaign outcomes, it was essential
to adjust for tween characteristics that might predict
both likelihood of encountering the advertising and
outcome characteristics. Items used as covariates were
obtained from both parent and tween surveys. For
example, tweens reported on time spent watching
television, using computers, or playing video games. A
parent or guardian of each tween respondent was
surveyed on subjects such as awareness of VERB, attitudes toward and beliefs about their tweens’ physical
activity, behaviors related to involvement with their
tweens’ physical activities, and perceived barriers to
their tweens’ participation in physical activities. During
the parent or guardian interview, data were collected
on demographic characteristics such as the tween’s age, race, and ethnicity, the respondent’s marital status and level
of education, number of household members, and
household income.
Data-Preparation Procedures
Imputation of Missing Data: Individual
Questions
As with most surveys, some respondents did not answer
all interview questions: some did not know the answer
and others did not wish to respond for various reasons.
Nonresponse in the YMCLS was low: median item
response rates exceeded 99% for all panels in all years.
Nevertheless, responses to some items were missing
from each survey. The missing responses were imputed
to facilitate weighting and analysis.
In 2002 and 2003, missing parent and tween interview items were imputed by using a short series of
simple hot-deck procedures9 –12 in which both donor
and recipient cases were sorted into cells defined by
characteristics likely to be associated with differences in
their likelihood of responding to each survey item.
Cell-defining variables could be drawn from either
household or individual respondent characteristics. Donors for missing data were randomly selected for recipients from within the same cell. The completed data
item from the donor case was used to fill in the missing
data item. Item imputation was conducted separately
for each question.
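A minimal illustration of cell-based hot-deck imputation follows; the cell-defining variables and record layout are invented for the example and do not reproduce the evaluation's actual specification.

```python
import random

def hot_deck_impute(records, item, cell_vars, seed=0):
    """Fill missing values of `item` by copying the value from a randomly chosen
    donor record in the same cell (same combination of cell-defining variables)."""
    rng = random.Random(seed)
    cells = {}
    for rec in records:                       # group potential donors by cell
        if rec[item] is not None:
            key = tuple(rec[v] for v in cell_vars)
            cells.setdefault(key, []).append(rec[item])
    for rec in records:                       # impute recipients from their own cell
        if rec[item] is None:
            donors = cells.get(tuple(rec[v] for v in cell_vars))
            if donors:
                rec[item] = rng.choice(donors)
    return records

# Example: impute a missing response using age group and sex as cell-defining variables.
data = [
    {"age_group": "9-10", "sex": "F", "days_active": 5},
    {"age_group": "9-10", "sex": "F", "days_active": 3},
    {"age_group": "9-10", "sex": "F", "days_active": None},   # recipient
    {"age_group": "11-13", "sex": "M", "days_active": 7},
]
print(hot_deck_impute(data, "days_active", ["age_group", "sex"]))
```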
In 2004 and 2005, non-ordinal questionnaire items
were again hot-decked. But for ordinal questionnaire
items, a different procedure was used to ensure that
data relationships in each year, and for each case back
to 2002, were meaningful.13 A regression model was fit
using a stepwise ordinary least squares (OLS) selection
procedure. Baseline outcome variables, household sociodemographic and geographic variables, and questionnaire variables from the year being imputed were
used as eligible predictors. The resulting predicted
values were used in a nearest-neighbor imputation
procedure, in which each case with a missing value was
matched to the complete case with the closest predicted values for the variable being imputed. Because
median item response rates exceeded 99%, no sensitivity analysis of the item imputations was conducted.
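The ordinal-item procedure is essentially predictive-mean matching: regress the item on eligible predictors, then fill each missing value from the complete case with the closest predicted value. A simplified sketch, with the stepwise selection step omitted and invented predictors:

```python
import numpy as np

def nearest_neighbor_impute(X, y):
    """Impute missing y values by matching on OLS-predicted values.

    X: (n x p) array of fully observed predictors.
    y: length-n array with np.nan marking missing values of the ordinal item.
    """
    observed = ~np.isnan(y)
    X1 = np.column_stack([np.ones(len(y)), X])            # add intercept
    beta, *_ = np.linalg.lstsq(X1[observed], y[observed], rcond=None)
    predicted = X1 @ beta                                   # predictions for all cases
    y_imputed = y.copy()
    for i in np.where(~observed)[0]:
        donor = np.argmin(np.abs(predicted[observed] - predicted[i]))
        y_imputed[i] = y[observed][donor]                   # copy the donor's observed value
    return y_imputed

# Example with two predictors (e.g., a baseline outcome and hours of TV per day).
X = np.array([[3, 1.0], [5, 2.5], [2, 0.5], [4, 3.0], [6, 1.5]])
y = np.array([2.0, 3.0, 1.0, np.nan, 4.0])
print(nearest_neighbor_impute(X, y))
```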
Imputation of Missing Data: Whole Case
Following the national Panel-1 data collection in 2002
(baseline) there was a group of households in which
the parent interview had been completed but the tween
interview had not. Those cases were fielded again in
2003 (1-year follow-up), and in 278 households (8.9%
of final sample) both the parent and tween interviews
were completed. For this small group of cases, then,
parent interviews were available for both baseline and
first follow-up, but tween interviews at follow-up only.
To make the best use of the data for these households,
the baseline tween interview was imputed. The imputation involved identifying other tweens who had baseline
data and who were as similar as possible to tweens who
had no baseline data. The similarity was assessed by
responses from the parent interview both in 2002 and
2003 and responses from the 2003 tween interview.
The imputation process involved four steps. First, five
key outcome measures were selected:
1. number of days in which the tween took part in
free-time or unstructured physical activities;
2. number of days in which the tween took part in
organized physical activity;
3. whether or not the tween engaged in physical activities the day prior to the interview;
4. attitude toward the statement, I should probably do
more physical activities than I do (really agree, sort of
agree, sort of disagree, or really disagree); and
5. attitude toward the statement, I’d rather watch TV or
play video games than do physical activities (really agree,
sort of agree, sort of disagree, or really disagree).
Second, predictive models were developed for those
five outcome measures. These models were then used
to predict values for the key outcome measures for
tweens without baseline data. Third, clustering algorithms were run on the predicted outcome measure
values to form imputation classes. Finally, hot-deck
imputation was used to find donors and impute for the
missing baseline data within the imputation classes.
When a suitable donor was found, all data on the
donor’s baseline questionnaire were copied to the
recipient.
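A compressed sketch of these four steps is shown below, assuming scikit-learn is available; the model forms, number of clusters, and simulated data are illustrative only, not the evaluation's actual specification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Steps 1-2: predict the five key 2002 outcomes from parent (2002, 2003) and 2003 tween
# responses, fitting on tweens who have complete baseline data (simulated here).
n_complete, n_missing, n_predictors, n_outcomes = 200, 30, 6, 5
X_complete = rng.normal(size=(n_complete, n_predictors))
Y_complete = X_complete @ rng.normal(size=(n_predictors, n_outcomes)) + rng.normal(size=(n_complete, n_outcomes))
X_missing = rng.normal(size=(n_missing, n_predictors))

model = LinearRegression().fit(X_complete, Y_complete)
pred_complete = model.predict(X_complete)
pred_missing = model.predict(X_missing)

# Step 3: cluster the predicted key outcomes to form imputation classes.
classes = KMeans(n_clusters=10, n_init=10, random_state=0).fit(pred_complete)
donor_class = classes.labels_
recipient_class = classes.predict(pred_missing)

# Step 4: hot-deck within class -- copy a baseline record from a random donor.
imputed = []
for i, c in enumerate(recipient_class):
    donors = np.where(donor_class == c)[0]
    donor = rng.choice(donors)
    # In the evaluation, all data from the donor's baseline questionnaire were copied;
    # here only the five key outcomes stand in for the full record.
    imputed.append(Y_complete[donor])
print(np.asarray(imputed).shape)   # (30, 5): one imputed baseline record per recipient
```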
By imputing the entire set of baseline data from one
tween, the covariance structure was preserved in a way
that could not have been achieved if the missing data
had been imputed one variable at a time, and at least
some of the most important features of the covariance
structure across years and across family members were
preserved.
To validate the preservation of the covariance structure, the predictive models were refit with and without the imputed outcome values. The change in item coefficients with and without the imputed values provided a measure of the extent to which the covariance structure changed as a result of the imputation. The difference between item coefficients with and without imputation was generally 10% or less. Standard errors were also compared with and without imputed values; the differences were small, generally less than 6%, and nearly all negative, reflecting the artificial reduction in standard errors due to the increased sample size created by adding cases through imputation.
Weighting
Within each panel for each year, household-level
weights were created first, to adjust for screener nonresponse and for the number of telephones in a
household. Household weights were used as the basis
for parent weights, which adjusted for parent nonresponse. These weights were then further adjusted with
a tween weight, which included adjustments for number of eligible tweens in the household and for tween
nonresponse. Finally, adjustment for under-coverage
was accomplished via raking to census totals. Because
the post-baseline follow-up study was conducted in
2003, 2004, and 2005, weights were reconstructed each
year.
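Raking adjusts the weights iteratively so that weighted totals match census control totals on each raking dimension in turn. A minimal sketch with made-up categories and control totals:

```python
import numpy as np

def rake(weights, categories, targets, iterations=50):
    """Adjust weights so weighted totals match the target total for each category of
    each raking dimension, cycling through the dimensions until convergence.

    categories: list of arrays, one per dimension, giving each case's category.
    targets: list of dicts mapping category -> population control total.
    """
    w = weights.astype(float).copy()
    for _ in range(iterations):
        for cats, target in zip(categories, targets):
            for cat, total in target.items():
                mask = cats == cat
                current = w[mask].sum()
                if current > 0:
                    w[mask] *= total / current
    return w

# Example: rake base weights to (made-up) control totals for sex and age group.
base = np.ones(6) * 100.0
sex = np.array(["F", "F", "M", "M", "F", "M"])
age = np.array(["9-10", "11-13", "9-10", "11-13", "11-13", "9-10"])
raked = rake(base, [sex, age], [{"F": 320, "M": 280}, {"9-10": 250, "11-13": 350}])
print(raked.round(1), raked.sum())   # weighted margins now match the control totals
```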
Outcome Analysis
Overall Analytic Framework
The Panel 1 and Panel 2 cohorts were followed for 5
and 3 years, respectively, and a third cross-section
(Panel 3) was added in 2006. Several outcome analyses
were conducted in each year of the evaluation. Awareness of VERB, which was both a necessary condition for campaign success and the measure of intervention dose, was analyzed as a binary variable (aware versus not aware) and as a frequency-of-awareness variable (described above).
Panel 1, which was begun before the launch of
advertising, was used in the primary outcome analysis to
take advantage of the true baseline. The pre-campaign
data were used for confounder control, and annual
endpoint analyses (described below) were conducted
in 2003 through 2006. Associations between awareness
of VERB and the psychosocial and behavioral outcomes
were controlled for propensity to be aware of VERB,
and each level of awareness was weighted to the national population of children aged 9 to 13 years. Then
the unaware population was compared with the total
population for an estimate of overall VERB effect, and
dose–response relationships were sought between reported frequency of awareness of VERB and the outcomes. T-tests were used to test the significance of the
overall effect, and the gamma statistic (SAS, version
9.1) was used to assess the significance of dose–
response relationships.
Panels 2 and 3 were used to create age-controlled,
partial cross-sections for analysis of longitudinal trends
in 2004 through 2006. Both propensity-controlled and
secular trends were examined.
In the following section, the confounder control
procedures used in the analysis are discussed, including
a description of the analyses on which conclusions were
based about associations between VERB and tween
outcomes.
Propensity Scores
Random assignment of subjects to treatment and control was not possible in a national broadcast media
intervention in which subjects largely controlled their
own likelihood of awareness via media consumption
patterns. As a consequence, the evaluation sought to
measure as many sources of variation in awareness
patterns as possible.
The presumption was that some variables would
influence both the independent variable and the predicted outcomes. A demographic or lifestyle variable
that influenced the likelihood that one would see a
television commercial could, at the same time, be
influential in the amount of physical activity one gets.
For example, a preference for viewing many hours of
television per week would probably increase chances
that a tween would encounter VERB advertising, but at
the same time might decrease the amount of physical
activity in which the tween engaged. Other variables
might have different combinations of effects on the key
association between VERB and physical activity–related
outcomes. Uncontrolled, such variables would prevent
us from identifying true associations between VERB
and physical activity outcomes.
Propensity scoring is one method of controlling for
such nonrandom confounders.14 Propensity scoring is
based on multiple regression, but has several advantages over the standard regression-based analysis.15
Propensity scoring permits large numbers of covariates,
including complex higher-order interaction terms, to
be fit in a single model that can be used for confounder
control on all outcomes. It also avoids threats of
multicollinearity among covariates in the model and is
robust to most skewed data distributions. Propensity
scoring is also more resistant to bias from variables
included in the wrong functional form (for example,
linear rather than quadratic).16 Last, the propensity
score technique permits, in effect, a test of whether
observed covariates are independent of the outcome
variable (balanced).
Propensity scores were used in the VERB evaluation
to statistically adjust respondents within the five levels
of awareness. Once propensity scoring was complete,
within each level of awareness, any association between
awareness of VERB and physical activity–related outcomes was free of observed confounding factors.
Creating Propensity Scores
In each year of outcome analysis, the propensity of
being aware of VERB was modeled for that year’s
respondents. Specifically, an ordinal logit model was fit
for the five-level awareness variable described earlier. In
each year, more than 300 baseline (2002) variables
were permitted to enter the model in a series of forward
stepwise procedures. Variables included tween, parent,
and household demographics; tween and parental attitudes and behaviors; media consumption patterns; and
awareness of other advertising campaigns. Census variables and Claritas Prizm™17 geosocial codes were included. Main effects, first-order interactions, and
squared variables were permitted to enter. Only variables significant at p<0.05 were permitted to remain in
the final model. The final model was balance-tested,
and the behavioral outcome variables were forced into
the model.
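A scaled-down sketch of this step is shown below, assuming statsmodels is available; it fits an ordered-logit model for a simulated five-level awareness variable with a handful of invented covariates. The real model was built through a series of forward stepwise procedures over more than 300 candidate variables, which is not reproduced here.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 1000

# Illustrative baseline covariates (invented names, not the evaluation's variable list).
covariates = pd.DataFrame({
    "tv_hours": rng.normal(3, 1, n),
    "parent_aware": rng.integers(0, 2, n),
    "age": rng.integers(9, 14, n),
})

# Simulated five-level awareness, from "unaware" to "about every day".
latent = 0.8 * covariates["tv_hours"] + 0.5 * covariates["parent_aware"] + rng.logistic(size=n)
labels = ["unaware", "lt_once_week", "about_once_week", "several_week", "every_day"]
awareness = pd.cut(latent, bins=[-np.inf, 1.5, 2.5, 3.2, 4.0, np.inf], labels=labels)

model = OrderedModel(awareness, covariates, distr="logit")
result = model.fit(method="bfgs", disp=False)

# Predicted probability of each awareness level; the propensity to be aware of VERB
# is one minus the predicted probability of the "unaware" level.
level_probs = np.asarray(result.predict(covariates))
propensity_aware = 1 - level_probs[:, 0]
print(propensity_aware[:5].round(3))
```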
Propensity Weights
Once the model predicting awareness was complete, it
was used in a second modeling procedure to create a
propensity score for each respondent. In this study, the
propensity score was a tween’s predicted probability of
being aware of VERB, regardless of the tween’s selfreported awareness level. The propensity scores were
divided into quintiles. An adjustment factor was then
assigned to each combination of the five levels of
self-reported awareness by the five levels of predicted
propensity. As can be seen in Table 3, tweens with a
high propensity to see VERB, based on the propensity
model, and who reported frequent awareness (awareness Level 5), received a low weight, and those with
high propensity but low awareness received a higher
weight. The effect of the propensity weights was to
neutralize the differential propensities to encounter
the campaign.
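One reading of this construction, consistent with the pattern in Table 3, is that each cell's adjustment factor is roughly the size of its propensity stratum divided by the size of the cell, so that every awareness level projects to the full stratum. The sketch below illustrates that reading on simulated data; it is an interpretation of the description above, not the evaluation's published algorithm.

```python
import numpy as np

def cfp_adjustment_factors(propensity_quintile, awareness_level, n_levels=5, n_quintiles=5):
    """Adjustment factor for each (propensity quintile, awareness level) cell, computed as
    quintile size / cell size, so each awareness level projects to its whole quintile."""
    factors = np.zeros((n_quintiles, n_levels))
    for q in range(n_quintiles):
        in_q = propensity_quintile == q
        for a in range(n_levels):
            cell = np.sum(in_q & (awareness_level == a))
            factors[q, a] = in_q.sum() / cell if cell else np.nan
    return factors

# Simulated data: tweens in higher propensity quintiles report higher awareness, so rare
# cells (e.g., high propensity but unaware) receive large factors, as in Table 3.
rng = np.random.default_rng(2)
quintile = rng.integers(0, 5, 5000)
awareness = np.array([rng.binomial(4, 0.2 + 0.15 * q) for q in quintile])
print(cfp_adjustment_factors(quintile, awareness).round(1))
```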
Completing Weights
The completed propensity adjustment factors were
then combined with previously completed sampling
weights. Combining the propensity score with a sampling weight for each household produced a weight
referred to as a counterfactual projection weight (CFP
weight). When the CFP weights were applied to the
outcome responses, the sample of tweens at each level
of VERB awareness approximated a national population of children aged 9 –13 years with that level of
awareness and controlled for all observed covariates
that could confound the relationship between outcomes and frequency of awareness. This permitted us
to test the differences between levels of awareness or to
test for evidence that as frequency of awareness increased, outcomes related to physical activity improved.
Table 3. Counterfactual projection weighting adjustment factors before raking, by awareness frequency and awareness propensity, VERB, 2005

                                               Awareness propensity stratum
                                               1 (lowest)   2      3      4      5 (highest)
Frequency of awareness of the VERB campaign
  Unaware of VERB                                  2.4       4.7    6.8   11.9   14.5
  Less than once per week                          3.3       4.3    3.7    4.5   10.2
  About once per week                              7.4       3.7    4.3    3.8    5.2
  Several times per week                          11.8       5.9    4.6    3.6    3.1
  Every day                                       15.6       8.6    7.6    6.4    3.1
Table 4. Excerpt of balance test table showing total population balance on three outcomes

                                                     Organized PAa            Free-time PAb           PA previous dayc
Column 1: Actual population                          38.81 (36.03, 41.58)     3.90 (3.48, 4.33)       61.04 (58.35, 63.73)
Column 2: No exposure to VERB                        38.81 (36.03, 41.58)     3.70 (2.73, 4.71)       59.21 (52.56, 65.87)
Column 3: Exposed to VERB less than once per week    38.81 (36.03, 41.58)     4.10 (3.26, 4.78)       62.63 (57.65, 67.60)
Column 4: Exposed to VERB about once per week        38.81 (36.03, 41.58)     4.58 (3.82, 5.41)       66.70 (60.82, 72.59)
Column 5: Exposed to VERB several times per week     38.81 (36.03, 41.58)     4.65 (3.91, 5.43)       60.94 (55.17, 66.72)
Column 6: Exposed to VERB every day                  38.81 (36.03, 41.58)     3.14 (2.22, 3.98)       56.05 (48.46, 63.64)
Direct campaign effect: Col 1 (−) Col 2               0.00 (0.00, 0.00)       0.21 (−1.00, 1.41)       1.83 (−4.25, 7.90)
Maximum potential campaign effect: Col 6 (−) Col 2    0.00 (0.00, 0.00)      −0.55 (−2.02, 0.92)      −3.16 (−14.18, 7.86)

a Percent reporting engagement in organized physical activity in the past 7 days
b Median number of free-time physical activity sessions reported for the past 7 days
c Percent reporting physical activity on previous day
PA, physical activity
Balance Testing
To ensure that there was no relationship between
observed baseline covariates and awareness of VERB,
balance tests were conducted prior to the outcome
analysis. In balance testing, an analysis is conducted to
ensure that for any given outcome, the estimate for a
covariate remains the same at each level of awareness
(i.e., within the sampling error of the other levels). For
example, if balance were achieved, then on every
outcome the baseline point estimates (outcome scores)
for a covariate such as sex or age would be the same at
each level of awareness.
To test for balance, all respondents were grouped
into the five awareness levels based on their VERB
awareness reported in 2005. Respondents’ 2002 (baseline) outcome responses were then used to create an
outcome score for each 2005 awareness grouping and
tested for differences between the groups. No differences should be detected, since 2005 VERB awareness
could not have influenced behaviors reported in 2002.
Therefore, when respondents are grouped by 2005
awareness, no pattern should be seen in their 2002
physical activity outcomes; if the control model was
successful, 2002 outcome estimates would be the same
at every level of 2005 awareness. Were a false effect
detected, further adjustment of the covariates would be
required.
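In code, the balance test reduces to computing weighted 2002 estimates of an outcome within each 2005 awareness group and checking that they agree within sampling error. A schematic sketch on simulated data (the weights and grouping variable stand in for the CFP weights and awareness levels described above):

```python
import numpy as np

def weighted_mean(values, weights):
    return np.sum(values * weights) / np.sum(weights)

def balance_check(baseline_outcome, awareness_2005, cfp_weights, n_levels=5):
    """Weighted 2002 baseline estimate of one outcome within each 2005 awareness level.
    If confounding has been controlled, the five estimates should be roughly equal."""
    return [
        weighted_mean(baseline_outcome[awareness_2005 == level],
                      cfp_weights[awareness_2005 == level])
        for level in range(n_levels)
    ]

# Illustrative check: baseline physical activity on the previous day (percent scale).
rng = np.random.default_rng(3)
n = 4000
awareness_2005 = rng.integers(0, 5, n)
pa_prev_day_2002 = rng.binomial(1, 0.61, n) * 100
weights = rng.gamma(2.0, 2.0, n)                 # stand-in for CFP weights
print([round(x, 1) for x in balance_check(pa_prev_day_2002, awareness_2005, weights)])
```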
Table 4 displays an example of the results of that
analysis using the variable physical activity on the day prior
to the interview. As can be seen, the proportion of tweens
reporting physical activity on the day prior to the
interview in 2002 was the same within each group of
tweens when they were categorized by their 2005 awareness. This indicated that characteristics underlying the
propensity to see VERB had been successfully controlled. However, this was not the case for all outcomes.
Some “balance failures” were observed. When this
occurred, additional modeling steps were used (e.g., a
search was made for higher-order interactions that
would produce the desired covariate control). When
this technique was not fully adequate, another adjustment technique called raking was used. Raking reduced
the number of balance failures to below statistical
significance. Thereafter, a series of analyses were
conducted to ensure that the adjustment procedures
had not substantially altered the character of the
data. The results of these simulation studies suggested that raked weights performed as well as unraked weights when the weighting model was correctly specified, and better than unraked weights if there were errors in the model.18
Panel Conditioning
Panel designs, such as YMCLS, are subject to a form of
nonsampling error that can result if participants’ responses are influenced by the number of times they have taken a survey.19 Several kinds of panel conditioning (sometimes called time-in-sample bias) have been identified, and different effects on survey results have been reported.20
It was anticipated that panel conditioning might
introduce error into our estimates after the YMCLS
panels had been surveyed multiple times. Panel 1 was
tested for bias in 2004 and 2005. Testing became
possible in 2004 when Panel 2 was surveyed for the first
time, because those results provided an unbiased
benchmark for assessing potential conditioning in
Panel 1.
Table 5 compares the two panels across eight outcomes in 2004. Three variables showed a significant difference, and therefore, conditioning was suspected. The table also suggests that panel conditioning differed by variable type; there was no significant effect on behavioral variables assessing physical activity, a slight effect on some cognitive variables related to physical activity, and a substantial effect on awareness of the campaign. Of importance, there was no effect on the frequency-of-awareness variable.

Table 5. Summary of panel conditioning on YMCLS variables in Panel 1, 2004, among children aged 11 to 13 years

Characteristic                                                       Panel 1, mean (CI)   Panel 2, mean (CI)   Time-in-sample bias: Panel 1 (−) Panel 2, mean (CI)
Outcome expectationsa                                                10.1 (10.0, 10.1)    9.9 (9.9, 10.0)       0.1* (0.1, 0.2)
Social influencesa                                                   10.1 (10.0, 10.1)    10.0 (10.0, 10.1)     0.1* (0.0, 0.1)
Self-efficacya                                                       10.0 (10.0, 10.1)    10.0 (10.0, 10.0)     0.0 (−0.1, 0.1)
Percent engaged in organized physical activity in previous 7 days    40.8 (37.9, 43.6)    41.6 (39.7, 43.4)    −0.8 (−4.3, 2.8)
Median sessions of free-time physical activity in previous 7 days    4.2 (3.9, 4.5)       4.4 (4.1, 4.6)       −0.2 (−0.5, 0.2)
Percent engaged in any physical activity on previous day             61.7 (58.6, 64.7)    63.6 (61.3, 65.9)    −1.9 (−5.7, 1.9)
Frequency of recalled awareness of VERBb                             2.2 (2.1, 2.3)       2.1 (2.1, 2.2)        0.1 (0.0, 0.2)
Percent reporting awareness of VERB campaign                         84.2 (81.8, 86.6)    75.6 (74.1, 77.2)     8.5* (5.8, 11.3)

a Mean estimate on standardized scale with mean = 10 and standard deviation = 1
b Mean estimate on 0–4 scale
*Significant difference at p<0.05
Source: Youth Media Campaign Longitudinal Survey (YMCLS), 2004 (Panel 1) and 2004 (Panel 2)
There was no fresh panel in 2005 to which existing
panels could be compared. Instead, estimates obtained
from Panel 1 in 2005 (fourth administration) were
compared with counterpart estimates from Panel 2 in
2005 (second administration). To control for age, only
those 12 to 14 years were included. The results suggested that some panel conditioning occurred on physical activity behavior outcomes in 2005, but again, not
on the frequency-of-awareness variable.
The bias detected in both years encourages caution
when using the point estimates (outcome scores) of
those variables in direct comparisons. The analysis used
to gauge VERB outcomes used the dose–response relationship between frequency of awareness and the variables of interest, which minimizes any effect of the
conditioning on the outcome findings.
Assessing Outcomes
Dose–response outcomes. Having controlled for confounders and sampling bias, an association was sought
between levels of tweens’ awareness of VERB and
outcomes within each year. Endpoint analysis was chosen to assess yearly outcomes, as opposed to conducting
a change-score analysis. Although change scores would
have been more conservative than an endpoint analysis,
a change-score analysis was less suited to this evaluation
for several reasons. First, the physical activity outcome
correlations over time were consistent with correlations
reported in scientific literature on physical activity5–7
but were below 0.5, too low to make change scores as
powerful as an endpoint analysis.20,21 Second, the fully
imputed cases were regarded as more appropriate for
use in the control structure of an endpoint analysis
than in a change-score estimate. Third, two campaignappropriate outcomes were added to the instrument
after the baseline data were collected. These outcomes
could not be assessed in a change-score analysis, but
they could be accommodated in an endpoint design.
Within each year of the promotion, a search was
made for 1-year evidence that the campaign had produced outcome effects, controlling for baseline characteristics. Any effects per year were cumulative, of
course, which mirrored the intent of the campaign.
Over the course of the follow-up analyses, two types of
outcome tests were used to determine annual cumulative outcomes. In 2003 overall effects were calculated;
in 2004 –2006, overall effects plus dose–response effects
were used.
Overall effects. In 2003 (Year 1), recalled frequency of
encountering the campaign was not measured. The
effect measure that year calculated the difference between tweens with no reported awareness of VERB and
all the tweens in the sample, for each outcome. Given
an outcome such as total sessions of physical activity in 7
days, the average number of sessions among tweens who
reported no awareness was subtracted from the average
number of sessions in 7 days among all tweens.
Because the data were weighted and controlled for
each level of awareness, this calculation estimated the
average 7-day total sessions of physical activity had
the VERB campaign not occurred (no awareness)
and subtracted that figure from the average number
of sessions among the actual population of children
aged 9 to 13 years in the U.S. Any difference detected
could be assigned to the target population’s awareness of VERB.
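The overall-effect calculation is therefore a weighted difference: the estimate for the actual population minus the estimate for the unaware (no-campaign) group. A schematic sketch on simulated data (the weights and awareness indicator stand in for the evaluation's adjusted variables):

```python
import numpy as np

def overall_effect(outcome, aware, weights):
    """Difference between the weighted mean outcome in the whole population and the
    weighted mean among tweens unaware of VERB (the no-campaign counterfactual)."""
    total = np.average(outcome, weights=weights)
    unaware = np.average(outcome[~aware], weights=weights[~aware])
    return total - unaware

# Illustrative data: weekly physical activity sessions, with a small shift among the aware.
rng = np.random.default_rng(4)
n = 3000
aware = rng.random(n) < 0.74
sessions = rng.poisson(4.0 + 0.3 * aware)
weights = rng.gamma(2.0, 2.0, n)
print(round(overall_effect(sessions, aware, weights), 2))
```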
In 2004 (Year 2), a survey item was added that asked
tweens to recall the frequency with which they encountered VERB; those data were used to create the dose–
response measure of outcomes. The theory of dose–
response, borrowed from epidemiology, is that a
substance can be presumed to produce an effect if an
increasing dose results in an increasing outcome. Applied to a communication intervention, this technique
suggests that, if the campaign is producing an effect
then increasing levels of awareness of the VERB campaign should be associated with increasing levels of
physical activity among the target audience. Reported
frequency of awareness of VERB television and radio
advertisements was used as the measure of dose. For
each of the eight outcomes, point estimates (outcome
scores) were calculated for each of the five levels of
frequency of awareness: aware of the campaign about
every day, aware several times a week, aware about once
a week, aware less than once a week, and not aware of
VERB. Using the fully adjusted outcome estimate at
each level of awareness, the gamma statistic tested for
the presence of a significant ordinal association between an outcome and levels of awareness, in effect that
the outcome changed in the desired direction as frequency of awareness rose.
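The gamma statistic was computed in SAS 9.1; its point estimate is the excess of concordant over discordant pairs, as in the unweighted sketch below (the evaluation's version used the fully adjusted, weighted estimates and an accompanying significance test, which are omitted here).

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant)
    over all pairs of observations; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Example: do outcome scores rise across the five levels of frequency of awareness?
awareness_level = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
outcome_score =   [3, 4, 4, 5, 6, 2, 3, 5, 5, 7]
print(round(goodman_kruskal_gamma(awareness_level, outcome_score), 2))
```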
Identical analyses were conducted for all eight primary outcomes. The full analysis tables displayed effects
by gender, age at baseline, age at survey year, race or
ethnicity, and census region.8,22–24
Alternative Analyses
To ensure that the analysis was sufficiently conservative,
a series of alternative analyses were conducted every
year. For each analysis, a double-difference method was
used. For a given outcome, the difference between
aware and unaware groups was calculated for the
baseline and was recalculated for the year being studied, yielding two difference scores. Then the difference
between those differences was calculated. The double-difference method has similar properties to a change
score.
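A double difference of this kind is (aware − unaware in the follow-up year) minus (aware − unaware at baseline). A schematic sketch using weighted means on simulated data:

```python
import numpy as np

def double_difference(y_2002, y_2005, aware, weights):
    """(Aware minus unaware difference in 2005) minus (the same difference in 2002),
    using the same weights and awareness grouping in both years."""
    def gap(y):
        return (np.average(y[aware], weights=weights[aware])
                - np.average(y[~aware], weights=weights[~aware]))
    return gap(y_2005) - gap(y_2002)

# Illustrative data: a small campaign-era shift among aware tweens only.
rng = np.random.default_rng(5)
n = 2500
aware = rng.random(n) < 0.74
weights = rng.gamma(2.0, 2.0, n)
y_2002 = rng.normal(10.0, 1.0, n)
y_2005 = y_2002 + rng.normal(0.0, 0.5, n) + 0.2 * aware
print(round(double_difference(y_2002, y_2005, aware, weights), 2))
```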
In three sensitivity analyses, the structure of the
calculations was varied to isolate elements of the primary analysis. In Alternative 1, the covariance structure
was removed and double-difference effects were calculated using simple, longitudinally weighted data. This
alternative removed any effects of propensity modeling
from the results. In Alternative 2, only raked covariates
other than the outcome were controlled. This analysis
was free of effects from controlling on outcomes measured with error at baseline.21 In Alternative 3, the full
control structure was used, but without raking for
balance. This alternative was free of any effects that the
raking procedure may have introduced.
Table 6 shows a comparison of the gamma statistics
for the full control analysis and three alternatives, using
2005 data. As can be seen, the pattern of outcomes is
similar across them, even though the double-difference
analyses have considerably less power. Of particular
interest is the pattern of effects between Alternative 3
and the primary analysis. Although Alternative 3 lacks
the statistical power of the primary analysis, and controls on some variables both through the covariate
structure and the double difference, the pattern of
effects is the same. These analyses confirm that endpoint propensity-controlled analysis maintains robust
confounder control while providing the evaluation its
best chance of detecting the effects of the VERB
campaign.

Table 6. Summary of gammas for post-hoc sensitivity analyses (CIs). For each of five outcomes (score on outcome expectations of PA scale; median number of free-time PA sessions in past 7 days; percent reporting engagement in organized PA in the past 7 days; total number of PA sessions reported for past 7 days; percent reporting PA on previous day), the table reports gammas for 2002, 2005, and 2005 (−) 2002 under the primary analysis (raked CFP weights based on the full set of baseline covariates), Alternative 1 (longitudinal weights), Alternative 2 (raked CFP weights based on covariates other than baseline measurements of outcomes), and Alternative 3 (CFP weights based on the full set of baseline covariates but without raking). *p<0.05. CFP, counterfactual projection; PA, physical activity.
Limitations
Throughout this article potential limitations to the
design of media campaign evaluations and the efforts
to overcome them have been explored. The techniques
used cannot, however, remove all threats to validity.
First, like every nonrandomized design, this study could not control for unobserved covariates. Although considerable effort was made to identify the key questions, it remains possible that unmeasured covariates influenced the analysis. Second, the study relies on self-reported data from parents and their tweens. Although the validation study conducted provides considerable assurance, it is possible that subjects’ directly observed behavior would vary from their reported behavior, in particular among the youngest respondents, whose cognitive processes are still developing. Third, panel conditioning was found in 2004 and 2005. Although the dose–response metric used for outcome assessments is somewhat robust to time-in-sample bias, caution should be exercised in using outcome scores that showed potential conditioning, by themselves, as estimates of attitudes or behaviors.
Conclusion
The VERB campaign offered an opportunity to evaluate a large-scale media-based health communication
campaign. An ideal design would have entailed random
assignment of target audience participants to treatment
and control groups. To reach the maximum possible
audience, however, campaign planners designed a national media plan that made assignment of a randomized control impossible. Another strong design would
have been a time series in which periods of exposure to
the campaign were compared to periods when the
campaign was not being aired.25 This design was also
infeasible because the media plan included no periods
without cable television advertising during the 4 years
of the intervention. Absent these options, researchers
sought a naturalistic design that would balance the
need for rigorous control of alternative explanations
against the need for sufficient statistical power to detect
potentially small, but important, effects.
The creation of three panels of randomly selected
respondents, recruited in 2002, 2004, and 2006, provided a true baseline, produced a substantial sample
size, and maintained a study population within the
campaign target range. Adding panels in 2004 and
2006 also permitted us to gauge the effect of panel
conditioning on the 2002 and 2004 panels.
The use of propensity scores was a means of ensuring
that the relationship between awareness of the campaign
and physical activity–related outcomes was isolated from
the effects of any other observed variables.26,27 Propensity
scores also made the measurement of many outcomes
more feasible because only a single parsimonious
model was required. Further, the design was robust to
skewed data distributions that would have violated the
assumptions of more common techniques. Balance
testing provided reassurance that the control process
had not created pseudo-effects. Finally, the findings are
expressed in units of the outcomes themselves (e.g.,
number of sessions of physical activity) rather than in
difficult-to-interpret statistics. The dose–response measurement provided a well-accepted test for whether real
effects had occurred.
The use of an endpoint analysis permitted researchers to maximize sample size and to retain as much
statistical power as possible. As the sensitivity analyses
showed, the use of an endpoint design and propensity
scoring techniques did not appear to produce bias in
the findings.
Not all health communication campaigns could take
advantage of these techniques. Large sample sizes are
required to use propensity scoring, and the analysis is
labor intensive and complex. That said, large-scale
health communication initiatives require substantial
investments to achieve their often ambitious goals. To
support claims of campaign success, outcome studies
must overcome a host of warranted concerns. The
techniques used to evaluate the VERB campaign can
provide campaign planners and funders with credible
evidence about the consequences of their investments.
The findings and conclusions in this paper are those of the
authors and do not necessarily represent the views of the
CDC.
No financial disclosures were reported by the authors of
this paper.
References
1. Wong FL, Greenwell M, Gates S, Berkowitz JM. It’s what you do! Reflections
on the VERB™ campaign. Am J Prev Med 2008;34(6S):S175–S182.
2. Heitzler CD, Asbury LD, Kusner SL. Bringing “play” to life: the use of
experiential marketing in the VERB™ campaign. Am J Prev Med
2008;34(6S):S188 –S193.
3. Bretthauer-Mueller R, Berkowitz JM, Thomas M, et al. Catalyzing community action within a national campaign: VERB™ community and national
partnerships. Am J Prev Med 2008;34(6S):S210 –S221.
4. Berkowitz JM, Huhman M, Nolin MJ. Did augmenting the VERB™
advertising in select communities have an effect on awareness, attitudes,
and physical activity? Am J Prev Med 2008;34(6S):S257–S266.
5. Welk GJ, Wickel E, Peterson M, Heitzler CD, Fulton JE, Potter LD.
Reliability and validity of physical activity questions on the Youth Media
Campaign Longitudinal Survey. Med Sci Sports Exerc 39:612–21.
6. Telford A, Salmon J, Jolley D, Crawford D. Reliability and validity of
physical activity questionnaires for children: the Children’s Leisure Activities Study Survey (CLASS). Pediatr Exerc Sci 2004;16:64 –78.
7. Treuth MS, Sherwood NE, Butte NF, et al. Validity and reliability of activity
measures in African-American girls for GEMS. Med Sci Sports Exer
2003;35:532–9.
8. Potter LD, Nolin MJ, Judkins D, Piesse A, Huhman M. Evaluation of the
CDC VERB™ campaign: findings of the Youth Media Campaign Longitudinal Survey, 2002, 2003, 2004, and 2005. Rockville MD: Westat, 2006.
9. Kalton G. Compensating for missing survey data [Research Report Series].
Ann Arbor MI: Institute for Social Research, 1983.
10. Kalton G, Kasprzyk D. The treatment of missing survey data. Surv Methodol
1986;12:1–16.
11. Little RJA, Rubin DB. Statistical analysis with missing data. New York: John
Wiley, 1987.
12. Marker DA, Judkins D, Winglee M. Large-scale imputation for complex
surveys. In: Groves RM, Dillman DA, Eltinge JL, Little RJA, Eds. Survey
nonresponse. New York: John Wiley, 2001.
13. Piesse A, Judkins D, Fan Z. Item imputation made easy. 2005 proceedings
of the Section on Survey Research Methods of the American Statistical
Association. Alexandria VA: American Statistical Association, 2005:3476 –9.
14. Rosenbaum PR, Rubin DB. The central role of the propensity score in
observational studies for causal effects. Biometrika 1983;70:41–55.
15. Rubin DB, Waterman RP. Estimating the causal effects of marketing
interventions using propensity score methodology. Stat Sci 2006;21:
206 –28.
16. Drake C. Effects of misspecification of the propensity score on estimators of
treatment effect. Biometrics 1993;49:1231– 6.
17. Claritas Corporation. Getting to know the 62 clusters: cluster snapshots.
Arlington VA: Claritas Corporation, 1995.
18. Judkins DR, Morganstein D, Zador P, Piesse A, Barrett B, Mukhopadhyay P.
Variable selection and raking in propensity scoring. Stat Med
2007;26:1022–33.
19. Bailar B. Information needs, surveys, and measurement errors. In: Kasprzyk
D, Duncan G, Kalton G, Singh MP, Eds. Panel surveys. New York: John
Wiley, 1989.
20. Wang K, Cantor D, Safir A. Panel conditioning in a random digit dial
survey. 2000 proceedings of the Section on Survey Research Methods of the
American Statistical Association. Alexandria VA: American Statistical Association, 2000.
21. Cox DR. The use of a concomitant variable in selecting an experimental
design. Biometrika, 1957;44:150 – 8.
22. Cook TD, Campbell DT. Quasi-experimentation: design and analysis issues
for field settings. 2nd ed. Chicago: Rand McNally College Publishing, 1979.
23. Potter LD, Duke JC, Nolin MJ, Judkins DR, Huhman M. Evaluation of the
CDC VERB™ campaign: findings from the Youth Media Campaign Longitudinal Survey, 2002 and 2003. Rockville MD: Westat, 2004.
24. Potter LD, Duke JC, Nolin MJ, Judkins DR, Piesse A, Huhman M.
Evaluation of the CDC VERB™ campaign: findings from the Youth Media
Campaign Longitudinal Survey, 2002, 2003, and 2004. Rockville MD:
Westat, 2005.
25. Huhman M, Potter LD, Wong FL, Banspach SW, Duke J, Heitzler C. Effects
of a mass media campaign to increase physical activity in children: year-1
results of the VERB™ campaign. Pediatrics 2005;116:e277– 84.
26. Palmgreen P, Donohew L, Lorch EP, Hoyle RH, Stephenson T. Television campaigns and sensation seeking targeting of adolescent marijuana use: a controlled time series approach. In: Hornik R, Ed. Public health communication: evidence for behavior change. Mahwah NJ: Lawrence Erlbaum Associates, 2002.
27. Yanovitzky I, Zanutto E, Hornik R. Estimating causal effects of public health
education campaigns using propensity score methodology. Eval Program
Plann 2005;28:209 –20.