Methodology of the Outcome Evaluation of the VERB™ Campaign

Lance D. Potter, MA, David R. Judkins, MA, Andrea Piesse, PhD, Mary Jo Nolin, PhD, Marian Huhman, PhD

Abstract: This article summarizes the methods used in the outcome evaluation of the VERB™ campaign. The outcome evaluation was designed to measure the awareness and understanding of VERB among the target audience of children aged 9–13 years (tweens) and to determine the effect of VERB awareness on psychosocial and behavioral outcomes. Cohorts of tweens and parents were interviewed annually via a telephone survey (Youth Media Campaign Longitudinal Survey). The first cohort (baseline) was surveyed in 2002 prior to VERB advertising and was resurveyed annually through 2006. A second cohort was surveyed in 2004–2006. A third, cross-sectional sample was surveyed in 2006. Each cohort consisted of a nationally representative sample of tweens to enable generalizability to the nation as a whole. Propensity scoring was used to control for confounding influences. The outcomes were analyzed for dose–response effects (i.e., whether higher levels of awareness led to stronger effects) and overall awareness effects (i.e., the difference between tweens unaware of VERB and all tweens in the U.S.). Secular trends in tweens’ physical activity during the life of the campaign were also examined. This article also discusses weighting and imputation, alternative analyses used to assess the adequacy of the propensity methods, and the challenges involved in media campaign evaluations.
(Am J Prev Med 2008;34(6S):S230–S240) © 2008 American Journal of Preventive Medicine

Introduction

The VERB™ campaign was a media-based health marketing campaign of the CDC intended to improve physical activity levels among children aged 9–13 years (tweens).1 In 2002, when CDC began to plan the campaign, the agency was aware that developing strong evidence of the campaign’s effects would be essential.
This article summarizes the methods used in the outcome evaluation of VERB, noting techniques that were used to overcome design limitations common to evaluations of health communication programs. The primary goal of the outcome evaluation was to determine whether associations could be detected in the tween population between awareness of the campaign and psychosocial or behavioral outcomes related to physical activity. The evaluation began prior to the launch of the intervention and continued through the conclusion of the campaign in 2006. The methods used were statistically complex; to make them accessible to a wide audience, they will be presented here without formulae and without detailed statistical discussion. Referenced articles provide more detailed information about this evaluation and further discussions of the statistical methods used.

From Westat (Potter, Judkins, Piesse, Nolin), Rockville, Maryland; and the National Center for Chronic Disease Prevention and Health Promotion, CDC (Huhman), Atlanta, Georgia. Address correspondence and reprint requests to: Lance D. Potter, MA, Westat, 1650 Research Boulevard, Rockville MD 20850. E-mail: [email protected].

Evaluation Design and Research Methodology

To accommodate all the elements of the campaign, the evaluation included a range of process measurement activities2–4 in addition to a longitudinal outcome evaluation. The focus of this article is a description of the analytic methods used to address three outcome research questions:

1. What level of awareness of the VERB campaign was achieved among the target population and to what extent did they understand its messages?
2. Was there evidence of an association between tweens’ awareness of VERB and their attitudes and behaviors related to physical activity?
3. Were there temporal changes in tweens’ attitudes and behaviors related to physical activity during the campaign?
Overview of Evaluation Methods

Prior to the launch of the campaign in 2002, a randomly selected longitudinal panel of parent–tween dyads was interviewed by telephone concerning their physical activity–related beliefs and behaviors and their households’ characteristics. Questions about awareness of VERB were added to the survey in 2003, and the panel was interviewed annually through 2006. An additional panel was added in 2004, and a final cross-sectional sample of parent–tween dyads was added in 2006. Data were fully imputed and weighted to national population totals for children aged 9–13 years. The pre-campaign data were modeled to create propensity scores that adjusted for confounding influences in a series of endpoint analyses (2003, 2004, 2005, and 2006). A dose–response analysis sought evidence that as the reported frequency of seeing or hearing VERB advertisements increased, VERB-related outcomes improved. The remainder of this article provides additional detail on the design and methods used.

Figure 1. Panel design for the Youth Media Campaign survey 2002–2006. All surveys were conducted April–June.

© 2008 American Journal of Preventive Medicine • Published by Elsevier Inc. 0749-3797/08/$–see front matter doi:10.1016/j.amepre.2008.03.007

Panel Design

The outcome evaluation data were collected from parent and tween respondents recruited from households contacted at random. Within each cooperating household, a dyad (consisting of a child aged 9–13 years and one of the child’s parents or a guardian) was interviewed by telephone. The outcomes assessed are described below. Three panels consisting of parent–tween dyads were recruited. Figure 1 summarizes the design. Panel 1 respondents (3120 dyads) were first surveyed in spring 2002 before any VERB advertising was launched, and follow-up surveys were conducted in spring 2003, spring 2004, spring 2005, and spring 2006.
By 2004, 40% of the tween respondents in Panel 1 were beyond the campaign’s target age range, so a second panel of 5177 parent–tween dyads was recruited and interviewed. Panel 2 respondents were interviewed again in spring 2005 and spring 2006. To ensure that every year of age in the tween target range was represented in the final data collection, a third set of respondent dyads (Panel 3: 1200 dyads) was recruited and interviewed a single time, in spring 2006.

Response Rates

For each panel created, a list-assisted, random-digit-dialed method was used to select a sample of households with telephones, and that sample was screened to identify households with children aged 9–13 years. To increase the likelihood of participation, CDC sent letters of invitation to households with potential participants before telephoning began. Overall response rates per time period were calculated by multiplying the household screening rate by the parent interview completion rate and the tween interview completion rate. The screening rate (i.e., those households that finished the 1-minute screening interview to determine whether they had a child aged 9–13 years) was 62%, a typical screening rate in a randomized telephone survey. Cooperation rates for parents and tweens in both panels for all years always exceeded 80%. Initial screening rates and cooperation rates for both panels in all years are in Table 1.

Table 1. Screening and cooperation rates (%), Panel 1 (2002–2005) and Panel 2 (2004–2005)

                                     Panel 1                         Panel 2
                                     2002    2003    2004    2005    2004    2005
Screener response rate [SRR]         61      -       -       -       60      -
Parent cooperation rate [PCR]        87      75      84      87      85      78
Child cooperation rate [CCR]         81      91      97      98      88      96
Annual response rate [ARR] (a)       70      71      82      85      75      75
Cumulative response rate [CRR] (b)   43      38      31      26      45      38

(a) Product of parent and tween cooperation rates
(b) In 2002, the cumulative response rate is the product of the screener response rate and the annual response rate. In 2003, all 2002 cooperating parents were contacted regardless of tween result in 2002. Therefore, 2003 CRR = 2002 SRR × 2002 PCR × 2003 ARR. Thereafter, the cumulative response rate is the product of the previous year’s cumulative rate and the current year’s annual rate.

Surveys were conducted in English or Spanish, depending on the preference of the respondent. About 8% of the parent interviews and 3% of the tween interviews were conducted in Spanish. The survey was conducted April–June of each year to maintain a consistent proportion of tweens in and out of school at the time of interview. The aim was to complete as many interviews as possible before the school year ended. The proportion of tweens in school when each panel was surveyed ranged from 82% to 87%, except in 2004, when 71% of the tweens in Panel 1 were in school when interviewed.

Outcome Measures

During their interviews, tweens were asked about three domains: awareness and understanding of VERB, level of participation in physical activity, and attitudes toward physical activity. An independent study5 validated the behavioral measures, seeking correlations between survey response data and data from accelerometers and activity logs. That study found correlations as high as, or higher than, the correlations between widely used measures and accelerometer data described in scientific literature on physical activity research.6,7

Measurement of Awareness

Four components of awareness were measured: unaided awareness of VERB, aided awareness of VERB, understanding of VERB messages, and recalled frequency of encountering VERB.8 First, at the end of the survey, tweens were asked, “Have you seen, heard, or read any messages or advertising about kids getting active?” Those who reported VERB messages without further prompting were categorized as having unaided awareness of the campaign. All other tweens were asked, “Have you seen, heard, or read any messages or advertising about VERB?” Children who answered in the affirmative were categorized as having aided awareness of VERB. Both groups were asked, “What is VERB all about?” and “What does VERB mean to you?” Responses to these items were coded, and tweens who correctly described at least one key message of the advertising were credited with understanding the campaign. Understanding was incorporated into the definition of awareness such that tweens who recalled but did not understand the VERB messages were not credited with awareness in the measurement of outcomes. Both unaided and aided recognition of VERB were considered awareness of VERB.

Beginning in 2004, tweens aware of VERB were asked to recall how frequently they had encountered the campaign’s advertising on television or radio. On the basis of reported frequency of seeing or hearing VERB advertisements, tweens were categorized into one of four levels of frequency: about every day, several times a week, about once a week, and less than once a week. Along with those not aware, this created five levels of frequency of awareness.

Psychosocial Outcome Measures

Three psychosocial scales were developed from the 15 attitude and belief items contained in the Youth Media Campaign Longitudinal Survey (YMCLS): outcome expectations, self-efficacy, and social influences. These scales were used in the 2004, 2005, and 2006 analyses. Table 2 contains the items and factor loadings for the scales.

Table 2. Items in the YMCLS psychosocial outcome scales

Factor loading   Item (response format)

Outcome expectations scale (α = 0.73)
0.604            If I did physical activities on most days it would be boring (4-point agreement scale)*
0.735            If I did physical activities on most days it would be fun (4-point agreement scale)
0.649            If I did physical activities on most days it would help me make new friends (4-point agreement scale)
0.692            If I did physical activities on most days it would help me spend more time with my friends (4-point agreement scale)
0.714            If I did physical activities on most days it would make me feel good about myself (4-point agreement scale)

Social influences scale (α = 0.70)
0.728            My friends think that doing physical activities is fun (4-point agreement scale)
0.718            Kids my age think that doing physical activities is fun (4-point agreement scale)
0.709            My friends think that doing physical activities is important (4-point agreement scale)
0.681            Kids my age think that doing physical activities is important (4-point agreement scale)
0.497            How many kids your age do physical activities every day? (All, most, some, or none)
0.589            How many of your friends do physical activities every day? (All, most, some, or none)

Self-efficacy scale (α = 0.66)
0.747            I think I can be physically active no matter how busy my day is (4-point agreement scale)
0.718            I think I can be physically active no matter how tired I might feel (4-point agreement scale)
0.704            I think I can be physically active even if it is hot or cold outside (4-point agreement scale)
0.650            I think I have what it takes to be physically active (4-point agreement scale)

*Reverse coded
YMCLS, Youth Media Campaign Longitudinal Survey
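The α values reported with each scale in Table 2 are presumably Cronbach’s alpha coefficients; the article does not show how they were computed. As a minimal sketch under that assumption, the following illustrates the standard calculation for a block of item responses. The function names `cronbach_alpha` and `reverse_code` are illustrative and are not taken from the evaluation; the sketch is unweighted, whereas the survey analysis used weighted data.

```python
import numpy as np


def reverse_code(responses, low=1, high=4):
    """Flip a negatively worded item on a low..high agreement scale."""
    return low + high - np.asarray(responses)


def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items array.

    Items are assumed to be numeric and already reverse coded where needed.
    """
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_variances.sum() / total_variance)


if __name__ == "__main__":
    # Four hypothetical respondents answering three items on a 1-4 scale.
    example = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 4]]
    print(round(cronbach_alpha(example), 3))
```

An item marked as reverse coded in Table 2 would be passed through `reverse_code` before the scale is assembled, so that higher values carry the same meaning on every item.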
Behavioral Outcome Measures

Behavioral outcome measures sought to capture the tweens’ out-of-school physical activity during the 7 days before the interview. Four behavioral outcomes were assessed from the tweens’ self-reports of physical activity. Three outcomes were based on tweens’ reports of (1) any nonschool physical activity on the day prior to the interview, (2) all nonschool physical activity in their free time during the previous 7 days, and (3) all nonschool organized physical activity in which the tween participated during the prior 7 days. Free-time physical activity was defined as “things you did on your own, in your free time,” in contrast to organized physical activity, which was described as having been done with a coach, supervisor, or leader. The fourth behavioral measure, the sum of free-time physical activity and organized physical activity, was intended to approximate all nonschool physical activity during the 7 days before the interview.

Demographic and Lifestyle Items

To isolate the relationship between the frequency of seeing VERB and campaign outcomes, it was essential to adjust for tween characteristics that might predict both the likelihood of encountering the advertising and the outcomes themselves. Items used as covariates were obtained from both parent and tween surveys. For example, tweens reported on time spent watching television, using computers, or playing video games. A parent or guardian of each tween respondent was surveyed on subjects such as awareness of VERB, attitudes toward and beliefs about their tweens’ physical activity, behaviors related to involvement with their tweens’ physical activities, and perceived barriers to their tweens’ participation in physical activities. During the parent or guardian interview, data were collected on demographic characteristics such as the tween’s age, race, and ethnicity; the respondent’s marital status and level of education; the number of household members; and household income.
Data-Preparation Procedures

Imputation of Missing Data: Individual Questions

As with most surveys, some respondents did not answer all interview questions: some did not know the answer and others did not wish to respond for various reasons. Nonresponse in the YMCLS was low: median item response rates exceeded 99% for all panels in all years. Nevertheless, responses to some items were missing from each survey. The missing responses were imputed to facilitate weighting and analysis.

In 2002 and 2003, parent and tween missing interview items were imputed by using a short series of simple hot-deck procedures9–12 in which both donor and recipient cases were sorted into cells defined by characteristics likely to be associated with differences in their likelihood of responding to each survey item. Cell-defining variables could be drawn from either household or individual respondent characteristics. Donors for missing data were randomly selected for recipients from within the same cell. The completed data item from the donor case was used to fill in the missing data item. Item imputation was conducted separately for each question.

In 2004 and 2005, non-ordinal questionnaire items were again hot-decked. For ordinal questionnaire items, however, a different procedure was used to ensure that data relationships in each year, and for each case back to 2002, were meaningful.13 A regression model was fit using a stepwise ordinary least squares (OLS) selection procedure. Baseline outcome variables, household sociodemographic and geographic variables, and questionnaire variables from the year being imputed were used as eligible predictors. The resulting predicted values were used in a nearest-neighbor imputation procedure, in which each case with a missing value was matched to the complete case with the closest predicted value for the variable being imputed. Because median item response rates exceeded 99%, no sensitivity analysis of the item imputations was conducted.
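The two item-level procedures described in this section can be illustrated with a minimal sketch. It is not the production code used for the YMCLS: the function names `hot_deck_impute` and `nearest_neighbor_impute` are hypothetical, the cell definitions and predictors are supplied by the caller, and a plain OLS fit stands in for the stepwise selection procedure the evaluation used.

```python
import numpy as np


def hot_deck_impute(values, cells, rng):
    """Fill missing values (NaN) with a randomly drawn donor value from the same cell."""
    values = np.asarray(values, dtype=float).copy()
    cells = np.asarray(cells)
    for cell in np.unique(cells):
        in_cell = cells == cell
        donors = values[in_cell & ~np.isnan(values)]
        recipients = np.flatnonzero(in_cell & np.isnan(values))
        # A cell with no donors is left unfilled; a real procedure would collapse cells.
        if donors.size and recipients.size:
            values[recipients] = rng.choice(donors, size=recipients.size, replace=True)
    return values


def nearest_neighbor_impute(values, predictors):
    """Predict the item by OLS, then copy the observed value of the complete
    case whose predicted value is closest to the recipient's predicted value."""
    values = np.asarray(values, dtype=float).copy()
    design = np.column_stack([np.ones(len(values)), np.asarray(predictors, dtype=float)])
    observed = ~np.isnan(values)
    beta, *_ = np.linalg.lstsq(design[observed], values[observed], rcond=None)
    predicted = design @ beta
    donor_index = np.flatnonzero(observed)
    for i in np.flatnonzero(~observed):
        distances = np.abs(predicted[donor_index] - predicted[i])
        values[i] = values[donor_index[np.argmin(distances)]]
    return values


if __name__ == "__main__":
    rng = np.random.default_rng(2002)
    print(hot_deck_impute([3, np.nan, 1, 2, np.nan], ["a", "a", "b", "b", "b"], rng))
    print(nearest_neighbor_impute([1, 2, np.nan, 4], [1.0, 2.0, 3.6, 4.0]))
```

Because the nearest-neighbor step copies an observed donor value rather than the regression prediction itself, an imputed ordinal item always takes a value that actually occurs on the response scale.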
Imputation of Missing Data: Whole Case

Following the national Panel 1 data collection in 2002 (baseline), there was a group of households in which the parent interview had been completed but the tween interview had not. Those cases were fielded again in 2003 (1-year follow-up), and in 278 households (8.9% of final sample) both the parent and tween interviews were completed. For this small group of cases, then, parent interviews were available for both baseline and first follow-up, but tween interviews at follow-up only. To make the best use of the data for these households, the baseline tween interview was imputed. The imputation involved identifying other tweens who had baseline data and who were as similar as possible to tweens who had no baseline data. The similarity was assessed by responses from the parent interview both in 2002 and 2003 and responses from the 2003 tween interview.

The imputation process involved four steps. First, five key outcome measures were selected:

1. number of days in which the tween took part in free-time or unstructured physical activities;
2. number of days in which the tween took part in organized physical activity;
3. whether or not the tween engaged in physical activities the day prior to the interview;
4. attitude toward the statement, I should probably do more physical activities than I do (really agree, sort of agree, sort of disagree, or really disagree); and
5. attitude toward the statement, I’d rather watch TV or play video games than do physical activities (really agree, sort of agree, sort of disagree, or really disagree).

Second, predictive models were developed for those five outcome measures. These models were then used to predict values for the key outcome measures for tweens without baseline data. Third, clustering algorithms were run on the predicted outcome measure values to form imputation classes.
Finally, hot-deck imputation was used to find donors and impute for the missing baseline data within the imputation classes. When a suitable donor was found, all data on the donor’s baseline questionnaire were copied to the recipient. By imputing the entire set of baseline data from one tween, the covariance structure was preserved in a way that could not have been achieved if the missing data had been imputed one variable at a time, and at least some of the most important features of the covariance structure across years and across family members were preserved.

To validate the preservation of the covariance structure, the predictive models were refit with and without the imputed outcome values. The change in item coefficients with and without the imputed values provided a measure of the extent to which the covariance structure changed as a result of the imputation. The difference between item coefficients with and without imputation was generally 10% or less. Standard errors were also compared with and without imputed values; the differences were small, generally less than 6%, and nearly all negative, reflecting the artificial reduction in standard errors due to the increased sample size created by adding cases through imputation.

Weighting

Within each panel for each year, household-level weights were created first, to adjust for screener nonresponse and for the number of telephones in a household. Household weights were used as the basis for parent weights, which adjusted for parent nonresponse. These weights were then further adjusted with a tween weight, which included adjustments for the number of eligible tweens in the household and for tween nonresponse. Finally, adjustment for under-coverage was accomplished via raking to census totals. Because the post-baseline follow-up study was conducted in 2003, 2004, and 2005, weights were reconstructed each year.
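Raking, the last step in the weighting sequence above, is commonly implemented as iterative proportional fitting: weights are scaled to match one set of control totals, then another, and the cycle repeats until all margins agree. The following is a minimal sketch of that general technique, not the evaluation’s own weighting code. The function name `rake`, the margin variables, and the control totals in the example are all hypothetical.

```python
import numpy as np


def rake(weights, margins, targets, max_iter=100, tol=1e-8):
    """Iterative proportional fitting of weights to marginal control totals.

    margins: dict mapping a margin name to each respondent's category label
    targets: dict mapping a margin name to {category: control total}
    """
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(max_iter):
        largest_change = 0.0
        for name, labels in margins.items():
            labels = np.asarray(labels)
            for category, total in targets[name].items():
                mask = labels == category
                current = w[mask].sum()
                if current == 0:
                    raise ValueError(f"no respondents in {name}={category}")
                factor = total / current
                w[mask] *= factor
                largest_change = max(largest_change, abs(factor - 1.0))
        # Stop once a full pass leaves every margin essentially unchanged.
        if largest_change < tol:
            break
    return w


if __name__ == "__main__":
    age = ["9-10", "9-10", "11-13", "11-13"]
    sex = ["F", "M", "F", "M"]
    controls = {"age": {"9-10": 40, "11-13": 60}, "sex": {"F": 50, "M": 50}}
    print(rake(np.ones(4), {"age": age, "sex": sex}, controls))
```

The sketch assumes that the control totals for the different margins sum to the same population total; if they do not, the procedure cannot satisfy all margins at once.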
Outcome Analysis

Overall Analytic Framework

The Panel 1 and Panel 2 cohorts were followed for 5 and 3 years, respectively, and a third cross-section (Panel 3) was added in 2006. Several outcome analyses were conducted in each year of the evaluation. Awareness of VERB, which was both a necessary condition for campaign success and the measure of intervention dose, was analyzed as a binary variable (aware versus not aware) and as a frequency-of-awareness variable (described above).

Panel 1, which was begun before the launch of advertising, was used in the primary outcome analysis to take advantage of the true baseline. The pre-campaign data were used for confounder control, and annual endpoint analyses (described below) were conducted in 2003 through 2006. Associations between awareness of VERB and the psychosocial and behavioral outcomes were controlled for propensity to be aware of VERB, and each level of awareness was weighted to the national population of children aged 9 to 13 years. Then the unaware population was compared with the total population for an estimate of overall VERB effect, and dose–response relationships were sought between reported frequency of awareness of VERB and the outcomes. t-tests were used to test the significance of the overall effect, and the gamma statistic (SAS, version 9.1) was used to assess the significance of dose–response relationships.

Panels 2 and 3 were used to create age-controlled, partial cross-sections for analysis of longitudinal trends in 2004 through 2006. Both propensity-controlled and secular trends were examined. In the following section, the confounder control procedures used in the analysis are discussed, including a description of the analyses on which conclusions were based about associations between VERB and tween outcomes.
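The gamma statistic mentioned above is presumably Goodman and Kruskal’s gamma, a measure of ordinal association based on concordant and discordant pairs. For readers unfamiliar with it, the following is a minimal, unweighted sketch of the point estimate. The function name `goodman_kruskal_gamma` is illustrative; the evaluation itself used SAS with survey weights, and the significance test is not reproduced here.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where C and D count concordant and
    discordant pairs; pairs tied on either variable are ignored."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    # Hypothetical data: awareness frequency level (0-4) and weekly activity sessions.
    frequency = [0, 0, 1, 2, 3, 4, 4]
    sessions = [2, 3, 3, 4, 4, 6, 5]
    print(goodman_kruskal_gamma(frequency, sessions))
```

A positive value indicates that outcomes tend to rise with the frequency of awareness, which is the pattern a dose–response analysis looks for.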
Propensity Scores

Random assignment of subjects to treatment and control was not possible in a national broadcast media intervention in which subjects largely controlled their own likelihood of awareness via media consumption patterns. As a consequence, the evaluation sought to measure as many sources of variation in awareness patterns as possible. The presumption was that some variables would influence both the independent variable and the predicted outcomes. A demographic or lifestyle variable that influenced the likelihood that one would see a television commercial could, at the same time, be influential in the amount of physical activity one gets. For example, a preference for viewing many hours of television per week would probably increase chances that a tween would encounter VERB advertising, but at the same time might decrease the amount of physical activity in which the tween engaged. Other variables might have different combinations of effects on the key association between VERB and physical activity–related outcomes. Uncontrolled, such variables would prevent us from identifying true associations between VERB and physical activity outcomes.

Propensity scoring is one method of controlling for such nonrandom confounders.14 Propensity scoring is based on multiple regression, but has several advantages over the standard regression-based analysis.15 Propensity scoring permits large numbers of covariates, including complex higher-order interaction terms, to be fit in a single model that can be used for confounder control on all outcomes. It also avoids threats of multicollinearity among covariates in the model and is robust to most skewed data distributions.
Propensity scoring is also more resistant to bias from variables included in the wrong functional form (for example, linear rather than quadratic).16 Last, the propensity score technique permits, in effect, a test of whether observed covariates are independent of the outcome variable (balanced). Propensity scores were used in the VERB evaluation to statistically adjust respondents within the five levels of awareness. Once propensity scoring was complete, within each level of awareness, any association between awareness of VERB and physical activity–related outcomes was free of observed confounding factors.

Creating Propensity Scores

In each year of outcome analysis, the propensity of being aware of VERB was modeled for that year’s respondents. Specifically, an ordinal logit model was fit for the five-level awareness variable described earlier. In each year, more than 300 baseline (2002) variables were permitted to enter the model in a series of forward stepwise procedures. Variables included tween, parent, and household demographics; tween and parental attitudes and behaviors; media consumption patterns; and awareness of other advertising campaigns. Census variables and Claritas Prizm™17 geosocial codes were included. Main effects, first-order interactions, and squared variables were permitted to enter. Only variables significant at p<0.05 were permitted to remain in the final model. The final model was balance-tested, and the behavioral outcome variables were forced into the model.

Propensity Weights

Once the model predicting awareness was complete, it was used in a second modeling procedure to create a propensity score for each respondent. In this study, the propensity score was a tween’s predicted probability of being aware of VERB, regardless of the tween’s self-reported awareness level. The propensity scores were divided into quintiles.
An adjustment factor was then assigned to each combination of the five levels of self-reported awareness by the five levels of predicted propensity. As can be seen in Table 3, tweens with a high propensity to see VERB, based on the propensity model, and who reported frequent awareness (awareness Level 5), received a low weight, and those with high propensity but low awareness received a higher weight. The effect of the propensity weights was to neutralize the differential propensities to encounter the campaign.

Completing Weights

The completed propensity adjustment factors were then combined with previously completed sampling weights. Combining the propensity score with a sampling weight for each household produced a weight referred to as a counterfactual projection weight (CFP weight). When the CFP weights were applied to the outcome responses, the sample of tweens at each level of VERB awareness approximated a national population of children aged 9–13 years with that level of awareness and controlled for all observed covariates that could confound the relationship between outcomes and frequency of awareness. This permitted us to test the differences between levels of awareness or to test for evidence that as frequency of awareness increased, outcomes related to physical activity improved.

Table 3. Counterfactual projection weighting adjustment factors before raking, by awareness frequency and awareness propensity, VERB, 2005

                                              Awareness propensity stratum
Frequency of awareness of the VERB campaign   1 (lowest propensity)   2      3      4      5 (highest propensity)
Unaware of VERB                               2.4                     4.7    6.8    11.9   14.5
Less than once per week                       3.3                     4.3    3.7    4.5    10.2
About once per week                           7.4                     3.7    4.3    3.8    5.2
Several times per week                        11.8                    5.9    4.6    3.6    3.1
Every day                                     15.6                    8.6    7.6    6.4    3.1

Table 4. Excerpt of balance test table showing total population balance on three outcomes

                                                        Baseline characteristic
                                                        Organized PA (a)       Free-time PA (b)      PA previous day (c)
Column 1: Actual population                             38.81 (36.03, 41.58)   3.90 (3.48, 4.33)     61.04 (58.35, 63.73)
Column 2: No exposure to VERB                           38.81 (36.03, 41.58)   3.70 (2.73, 4.71)     59.21 (52.56, 65.87)
Column 3: Exposed to VERB less than once per week       38.81 (36.03, 41.58)   4.10 (3.26, 4.78)     62.63 (57.65, 67.60)
Column 4: Exposed to VERB about once per week           38.81 (36.03, 41.58)   4.58 (3.82, 5.41)     66.70 (60.82, 72.59)
Column 5: Exposed to VERB several times per week        38.81 (36.03, 41.58)   4.65 (3.91, 5.43)     60.94 (55.17, 66.72)
Column 6: Exposed to VERB every day                     38.81 (36.03, 41.58)   3.14 (2.22, 3.98)     56.05 (48.46, 63.64)
Direct campaign effect (Col 1 minus Col 2)              0.00 (0.00, 0.00)      0.21 (-1.00, 1.41)    1.83 (-4.25, 7.90)
Maximum potential campaign effect (Col 6 minus Col 2)   0.00 (0.00, 0.00)      -0.55 (-2.02, 0.92)   -3.16 (-14.18, 7.86)

(a) Percent reporting engagement in organized physical activity in the past 7 days
(b) Median number of free-time physical activity sessions reported for the past 7 days
(c) Percent reporting physical activity on previous day
PA, physical activity

Balance Testing

To ensure that there was no relationship between observed baseline covariates and awareness of VERB, balance tests were conducted prior to the outcome analysis.
In balance testing, an analysis is conducted to ensure that for any given outcome, the estimate for a covariate remains the same at each level of awareness (i.e., within the sampling error of the other levels). For example, if balance were achieved, then on every outcome the baseline point estimates (outcome scores) for a covariate such as sex or age would be the same at each level of awareness. To test for balance, all respondents were grouped into the five awareness levels based on their VERB awareness reported in 2005. Respondents’ 2002 (baseline) outcome responses were then used to create an outcome score for each 2005 awareness grouping and tested for differences between the groups. No differences should be detected, since 2005 VERB awareness could not have influenced behaviors reported in 2002. Therefore, when respondents are grouped by 2005 awareness, no pattern should be seen in their 2002 physical activity outcomes; if the control model was successful, 2002 outcome estimates would be the same at every level of 2005 awareness. Were a false effect detected, further adjustment of the covariates would be required. Table 4 displays an example of the results of that analysis using the variable physical activity on the day prior to the interview. As can be seen, the proportion of tweens reporting physical activity on the day prior to the interview in 2002 was the same within each group of tweens when they were categorized by their 2005 awareness. This indicated that characteristics underlying the propensity to see VERB had been successfully controlled. However, this was not the case for all outcomes. Some “balance failures” were observed. When this occurred, additional modeling steps were used (e.g., a search was made for higher-order interactions that would produce the desired covariate control). When this technique was not fully adequate, another adjustment technique called raking was used. 
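The cell adjustment described under Propensity Weights and the balance check described above can be sketched together. The sketch assumes that each adjustment factor is the propensity stratum’s total sampling weight divided by the sampling weight of the awareness-by-stratum cell; this is one reading that is consistent with Table 3, where the reciprocals of the five factors within a stratum sum to roughly 1, but it is not confirmed as the evaluation’s exact procedure. The function names, the use of unweighted quantiles as stratum cut points, and the example data are all hypothetical.

```python
import numpy as np


def cfp_weights(sampling_w, awareness, propensity, n_strata=5):
    """Sampling weight times a cell adjustment factor, so that each awareness
    level is projected to the full population within every propensity stratum."""
    sampling_w = np.asarray(sampling_w, dtype=float)
    awareness = np.asarray(awareness)
    propensity = np.asarray(propensity, dtype=float)
    # Assumption: strata are cut at unweighted quantiles of the propensity score.
    cuts = np.quantile(propensity, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.searchsorted(cuts, propensity, side="right")
    weights = np.zeros_like(sampling_w)
    for s in np.unique(stratum):
        in_stratum = stratum == s
        stratum_total = sampling_w[in_stratum].sum()
        for level in np.unique(awareness):
            cell = in_stratum & (awareness == level)
            cell_total = sampling_w[cell].sum()
            if cell_total == 0:
                raise ValueError("empty awareness-by-stratum cell; collapse strata")
            weights[cell] = sampling_w[cell] * stratum_total / cell_total
    return weights


def baseline_means_by_awareness(baseline_outcome, awareness, weights):
    """Weighted baseline mean within each awareness level (the balance check)."""
    baseline_outcome = np.asarray(baseline_outcome, dtype=float)
    awareness = np.asarray(awareness)
    weights = np.asarray(weights, dtype=float)
    return {
        level: float(np.average(baseline_outcome[awareness == level],
                                weights=weights[awareness == level]))
        for level in np.unique(awareness).tolist()
    }


if __name__ == "__main__":
    w = np.ones(8)
    score = np.array([0.1, 0.1, 0.1, 0.2, 0.8, 0.9, 0.9, 0.9])
    aware = np.array([0, 0, 0, 1, 0, 1, 1, 1])
    baseline = np.array([1, 1, 1, 1, 3, 3, 3, 3])
    cfp = cfp_weights(w, aware, score, n_strata=2)
    print(baseline_means_by_awareness(baseline, aware, w))    # unadjusted means differ
    print(baseline_means_by_awareness(baseline, aware, cfp))  # adjusted means agree
```

In the toy data the baseline outcome depends only on the propensity stratum, so the unadjusted group means differ while the adjusted means coincide. Real balance testing would also compare the adjusted estimates against their sampling error rather than checking for exact equality.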
Raking reduced the number of balance failures to below statistical significance. Thereafter, a series of analyses was conducted to ensure that the adjustment procedures had not substantially altered the character of the data. The results of these simulation studies suggested that raked weights performed as well as unraked weights when the weighting model was correctly specified and better than an unraked model if there were errors in the model.18

Table 5. Summary of panel conditioning on YMCLS variables in Panel 1, 2004, among children aged 11 to 13 years

Characteristic                                                        Panel 1 mean (CI)    Panel 2 mean (CI)    Time-in-sample bias: Panel 1 minus Panel 2, mean (CI)
Outcome expectations (a)                                              10.1 (10.0, 10.1)    9.9 (9.9, 10.0)      0.1* (0.1, 0.2)
Social influences (a)                                                 10.1 (10.0, 10.1)    10.0 (10.0, 10.1)    0.1* (0.0, 0.1)
Self-efficacy (a)                                                     10.0 (10.0, 10.1)    10.0 (10.0, 10.0)    0.0 (-0.1, 0.1)
Percent engaged in organized physical activity in previous 7 days     40.8 (37.9, 43.6)    41.6 (39.7, 43.4)    -0.8 (-4.3, 2.8)
Median sessions of free-time physical activity in previous 7 days     4.2 (3.9, 4.5)       4.4 (4.1, 4.6)       -0.2 (-0.5, 0.2)
Percent engaged in any physical activity on previous day              61.7 (58.6, 64.7)    63.6 (61.3, 65.9)    -1.9 (-5.7, 1.9)
Frequency of recalled awareness of VERB (b)                           2.2 (2.1, 2.3)       2.1 (2.1, 2.2)       0.1 (0.0, 0.2)
Percent reporting awareness of VERB campaign                          84.2 (81.8, 86.6)    75.6 (74.1, 77.2)    8.5* (5.8, 11.3)

(a) Mean estimate on standardized scale with mean = 10 and standard deviation = 1
(b) Mean estimate on 0–4 scale
*Significant difference at p<0.05
Source: Youth Media Campaign Longitudinal Survey (YMCLS), 2004 (Panel 1) and 2004 (Panel 2)

Panel Conditioning

Panel designs, such as YMCLS, are subject to a form of nonsampling error that can result if participants’ responses are influenced by the number of times they have taken a survey.19 Several kinds of panel conditioning (sometimes called time-in-sample bias)
have been identified, and different effects on survey results have been reported.20 It was anticipated that panel conditioning might introduce error into our estimates after the YMCLS panels had been surveyed multiple times. Panel 1 was tested for bias in 2004 and 2005. Testing became possible in 2004 when Panel 2 was surveyed for the first time, because those results provided an unbiased benchmark for assessing potential conditioning in Panel 1. Table 5 compares the two panels across eight outcomes in 2004. Three variables showed a significant difference, and therefore, conditioning was suspected. The table also suggests that panel conditioning differed by variable type; there was no significant effect on behavioral variables assessing physical activity, a slight effect on some cognitive variables related to physical activity, and a substantial effect on awareness of the campaign. Of importance, there was no effect on the frequency of awareness variable. There was no fresh panel in 2005 to which existing panels could be compared. Instead, estimates obtained from Panel 1 in 2005 (fourth administration) were compared with counterpart estimates from Panel 2 in 2005 (second administration). To control for age, only those 12 to 14 years were included. The results suggested that some panel conditioning occurred on physical activity behavior outcomes in 2005, but again, not on the frequency-of-awareness variable. The bias detected in both years encourages caution when using the point estimates (outcome scores) of those variables in direct comparisons. The analysis used to gauge VERB outcomes used the dose–response relationship between frequency of awareness and the variables of interest, which minimizes any effect of the conditioning in the outcome findings. June 2008 Assessing Outcomes Dose–response outcomes. Having controlled for confounders and sampling bias, an association was sought between levels of tweens’ awareness of VERB and outcomes within each year. 
Endpoint analysis was chosen to assess yearly outcomes, as opposed to conducting a change-score analysis. Although change scores would have been more conservative than an endpoint analysis, a change-score analysis was less suited to this evaluation for several reasons. First, the physical activity outcome correlations over time were consistent with correlations reported in scientific literature on physical activity5–7 but were below 0.5, too low to make change scores as powerful as an endpoint analysis.20,21 Second, the fully imputed cases were regarded as more appropriate for use in the control structure of an endpoint analysis than in a change-score estimate. Third, two campaign-appropriate outcomes were added to the instrument after the baseline data were collected. These outcomes could not be assessed in a change-score analysis, but they could be accommodated in an endpoint design.

Within each year of the promotion, a search was made for 1-year evidence that the campaign had produced outcome effects, controlling for baseline characteristics. Any effects per year were cumulative, of course, which mirrored the intent of the campaign. Over the course of the follow-up analyses, two types of outcome tests were used to determine annual cumulative outcomes. In 2003 overall effects were calculated; in 2004–2006, overall effects plus dose–response effects were used.

Overall effects. In 2003 (Year 1), recalled frequency of encountering the campaign was not measured. The effect measure that year calculated the difference between tweens with no reported awareness of VERB and all the tweens in the sample, for each outcome. Given an outcome such as total sessions of physical activity in 7 days, the average number of sessions among tweens who reported no awareness was subtracted from the average number of sessions in 7 days among all tweens.
Because the data were weighted and controlled for each level of awareness, this calculation estimated the average 7-day total sessions of physical activity had the VERB campaign not occurred (no awareness) and subtracted that figure from the average number of sessions among the actual population of children aged 9 to 13 years in the U.S. Any difference detected could be assigned to the target population’s awareness of VERB.

In 2004 (Year 2), a survey item was added that asked tweens to recall the frequency with which they encountered VERB; those data were used to create the dose–response measure of outcomes. The theory of dose–response, borrowed from epidemiology, is that a substance can be presumed to produce an effect if an increasing dose results in an increasing outcome. Applied to a communication intervention, this technique suggests that, if the campaign is producing an effect, then increasing levels of awareness of the VERB campaign should be associated with increasing levels of physical activity among the target audience. Reported frequency of awareness of VERB television and radio advertisements was used as the measure of dose. For each of the eight outcomes, point estimates (outcome scores) were calculated for each of the five levels of frequency of awareness: aware of the campaign about every day, aware several times a week, aware about once a week, aware less than once a week, and not aware of VERB. Using the fully adjusted outcome estimate at each level of awareness, the gamma statistic tested for the presence of a significant ordinal association between an outcome and levels of awareness; in effect, it tested whether the outcome changed in the desired direction as frequency of awareness rose. Identical analyses were conducted for all eight primary outcomes.
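The two effect measures described above can be sketched in a few lines of Python. The sketch rests on several assumptions: the overall effect is taken as the weighted mean among all tweens minus a counterfactually weighted mean among unaware tweens, and gamma is the textbook Goodman–Kruskal form, (C − D)/(C + D), applied to a table of weighted counts. The article does not give the estimator's computational details or its variance method, so no confidence intervals are computed here. All function names and numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def overall_effect(all_values, all_weights, unaware_values, unaware_cf_weights):
    """Mean among all tweens minus the projected mean had no one been aware."""
    return (weighted_mean(all_values, all_weights)
            - weighted_mean(unaware_values, unaware_cf_weights))


def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a table of (weighted) counts.

    Rows and columns must both be listed in ascending order.
    """
    concordant = discordant = 0.0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if m > j:
                        concordant += pairs   # higher dose, higher outcome
                    elif m < j:
                        discordant += pairs   # higher dose, lower outcome
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when there are no untied pairs")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical weekly session counts and weights.
effect = overall_effect(
    all_values=[4, 6, 5, 7], all_weights=[1, 1, 1, 1],
    unaware_values=[4, 5], unaware_cf_weights=[1, 3],
)

# Hypothetical weighted counts: rows run from "not aware" to "about every day";
# columns are (not active on previous day, active on previous day).
dose_table = [[40, 60], [38, 62], [36, 64], [33, 67], [30, 70]]
gamma = goodman_kruskal_gamma(dose_table)
print(effect, gamma)
```

A positive gamma in this layout means that the outcome tends to rise with frequency of awareness, which is the direction the campaign intended.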
The full analysis tables displayed effects by gender, age at baseline, age at survey year, race or ethnicity, and census region.8,22–24

Alternative Analyses

To ensure that the analysis was sufficiently conservative, a series of alternative analyses were conducted every year. For each analysis, a double-difference method was used. For a given outcome, the difference between aware and unaware groups was calculated for the baseline and was recalculated for the year being studied, yielding two difference scores. Then the difference between those differences was calculated. The double-difference method has similar properties to a change score.

In three sensitivity analyses, the structure of the calculations was varied to isolate elements of the primary analysis. In Alternative 1, the covariance structure was removed and double-difference effects were calculated using simple, longitudinally weighted data. This alternative removed any effects of propensity modeling from the results. In Alternative 2, only raked covariates other than the outcome were controlled. This analysis was free of effects from controlling on outcomes measured with error at baseline.21 In Alternative 3, the full control structure was used, but without raking for balance. This alternative was free of any effects that the raking procedure may have introduced.

Table 6 shows a comparison of the gamma statistics for the full control analysis and three alternatives, using 2005 data. As can be seen, the pattern of outcomes is similar across them, even though the double-difference analyses have considerably less power. Of particular interest is the pattern of effects between Alternative 3 and the primary analysis. Although Alternative 3 lacks the statistical power of the primary analysis, and controls on some variables both through the covariate structure and the double difference, the pattern of effects is the same.
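The double-difference calculation described above can be sketched as follows. This is a simplified two-group version (aware versus unaware) using weighted means. The article reports gammas across five awareness levels, so the sketch shows only the arithmetic of the method, not the authors' analysis. The record fields, weights, and values are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def group_gap(records, outcome_key):
    """Weighted mean among aware respondents minus that among unaware ones."""
    aware = [r for r in records if r["aware"]]
    unaware = [r for r in records if not r["aware"]]
    mean_aware = weighted_mean([r[outcome_key] for r in aware],
                               [r["weight"] for r in aware])
    mean_unaware = weighted_mean([r[outcome_key] for r in unaware],
                                 [r["weight"] for r in unaware])
    return mean_aware - mean_unaware


def double_difference(records):
    """Follow-up gap minus baseline gap between aware and unaware groups."""
    return group_gap(records, "followup") - group_gap(records, "baseline")


# Hypothetical respondents: weekly physical activity sessions at baseline
# and at follow-up, grouped by awareness reported at follow-up.
records = [
    {"aware": True, "baseline": 3, "followup": 5, "weight": 2.0},
    {"aware": True, "baseline": 5, "followup": 7, "weight": 2.0},
    {"aware": False, "baseline": 4, "followup": 4, "weight": 1.0},
    {"aware": False, "baseline": 2, "followup": 4, "weight": 1.0},
]
dd = double_difference(records)
print(dd)
```

In this example the aware group already led by one session at baseline, so only the additional gap that opened by follow-up counts toward the double difference.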
These analyses confirm that endpoint propensity-controlled analysis maintains robust confounder control while providing the evaluation its best chance of detecting the effects of the VERB campaign.

Table 6. Summary of gammas for post-hoc sensitivity analyses (CIs)
Analyses: Primary analysis: raked CFP weights based on full set of baseline covariates; Alternative 3: CFP weights based on full set of baseline covariates but without raking; Alternative 1: longitudinal weights; Alternative 2: raked CFP weights based on covariates other than baseline measurements of outcomes.
Columns within each analysis: 2005; 2002; 2005 − 2002.
Outcomes: Score on outcome expectations of PA scale; Median number of free-time PA sessions in past 7 days; Percent reporting engagement in organized PA in the past 7 days; Total number of PA sessions reported for past 7 days; Percent reporting PA on previous day.
Cell values, in the order extracted (the row and column layout was not preserved):
0.1 (−0.1, 0.2); 0.2* (0.0, 0.3); 0.0 (−0.1, 0.2); 0.1* (0.0, 0.2); −0.0 (−0.1, 0.1); −0.0 (−0.01, 0.1); 0.0 (−0.1, 0.1); 0.1* (0.0, 0.2)
0.1* (0.0, 0.2); 0.2* (0.1, 0.2); 0.1 (−0.1, 0.2); 0.0 (−0.1, 0.1); 0.1 (−0.1, 0.2); 0.1* (0.0, 0.3); 0.0 (−0.1, 0.2); −0.0 (−0.1, 0.1); 0.1 (−0.0, 0.2); 0.1* (0.0, 0.2); 0.0 (−0.1, 0.2); 0.1* (0.0, 0.3); −0.0 (−0.1, 0.1); −0.0 (−0.1, 0.1); 0.0 (−0.1, 0.2); 0.2* (0.0, 0.3)
0.17* (0.10, 0.24); 0.07 (−0.01, 0.16); 0.0 (−0.1, 0.1)
0.16* (0.10, 0.22); 0.06* (0.00, 0.12); 0.0 (−0.1, 0.1); −0.01 (−0.07, 0.04); −0.01 (−0.07, 0.04); 0.0 (−0.0, 0.0); 0.18* (0.10, 0.25); 0.07 (−0.01, 0.16); −0.0 (−0.1, 0.1)
0.15* (0.09, 0.21); 0.07* (0.00, 0.13); 0.0 (−0.1, 0.1); −0.03 (−0.09, 0.03); −0.01 (−0.06, 0.05); 0.0 (−0.1, 0.1); 0.13* (0.07, 0.19); 0.06 (−0.01, 0.13); 0.1 (−0.1, 0.2)
0.17* (0.12, 0.23); 0.07* (0.02, 0.12); 0.0 (−0.0, 0.1); 0.04 (−0.01, 0.09); 0.01 (−0.04, 0.06); −0.0 (−0.1, 0.1); 0.15* (0.10, 0.20); 0.05 (−0.01, 0.12); 0.0 (−0.1, 0.1)
0.21* (0.16, 0.25); 0.07* (0.03, 0.12); 0.1 (−0.0, 0.1)
0.06* (0.01, 0.10); 0.02 (−0.03, 0.07); 0.0 (−0.1, 0.1)
*p<0.05
CFP, counterfactual projection; PA, physical activity.

Limitations

Throughout this article, potential limitations to the design of media campaign evaluations and the efforts to overcome them have been explored. The techniques used cannot, however, remove all threats to validity. First, like every nonrandomized design, this study could not control for unobserved covariates. Although considerable effort was made to identify the key questions, it remains possible that unmeasured covariates influenced the analysis. Second, the study relies on self-reported data from parents and their tweens. Although the validation study conducted provides considerable assurance, it is possible that subjects’ directly observed behavior would vary from their reported behavior, in particular among the youngest respondents, whose cognitive processes are still developing. Third, it is noted that panel conditioning was found in 2004 and 2005. Although the dose–response metric used for outcome assessments is somewhat robust to time-in-sample bias, caution should be exercised when outcome scores that showed potential conditioning are used by themselves as estimates of attitudes or behaviors.

Conclusion

The VERB campaign offered an opportunity to evaluate a large-scale media-based health communication campaign. An ideal design would have entailed random assignment of target audience participants to treatment and control groups. To reach the maximum possible audience, however, campaign planners designed a national media plan that made assignment of a randomized control impossible. Another strong design would have been a time series in which periods of exposure to the campaign were compared to periods when the campaign was not being aired.25 This design was also infeasible because the media plan included no periods without cable television advertising during the 4 years of the intervention.
Absent these options, researchers sought a naturalistic design that would balance the need for rigorous control of alternative explanations against the need for sufficient statistical power to detect potentially small, but important, effects. The creation of three panels of randomly selected respondents, recruited in 2002, 2004, and 2006, provided a true baseline, produced a substantial sample size, and maintained a study population within the campaign target range. Adding panels in 2004 and 2006 also permitted us to gauge the effect of panel conditioning on the 2002 and 2004 panels.

The use of propensity scores was a means of ensuring that the relationship between awareness of the campaign and physical activity–related outcomes was isolated from the effects of any other observed variables.26,27 Propensity scores also made the measurement of many outcomes more feasible because only a single parsimonious model was required. Further, the design was robust to skewed data distributions that would have violated the assumptions of more common techniques. Balance testing provided reassurance that the control process had not created pseudo-effects. Finally, the findings are expressed in units of the outcomes themselves (e.g., number of sessions of physical activity) rather than in difficult-to-interpret statistics.

The dose–response measurement provided a well-accepted test for whether real effects had occurred. The use of an endpoint analysis permitted researchers to maximize sample size and to retain as much statistical power as possible. As the sensitivity analyses showed, the use of an endpoint design and propensity scoring techniques did not appear to produce bias in the findings.

Not all health communication campaigns could take advantage of these techniques. Large sample sizes are required to use propensity scoring, and the analysis is labor intensive and complex.
That said, large-scale health communication initiatives require substantial investments to achieve their often ambitious goals. To support claims of campaign success, outcome studies must overcome a host of warranted concerns. The techniques used to evaluate the VERB campaign can provide campaign planners and funders with credible evidence about the consequences of their investments.

The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the CDC.

No financial disclosures were reported by the authors of this paper.

References

1. Wong FL, Greenwell M, Gates S, Berkowitz JM. It’s what you do! Reflections on the VERB™ campaign. Am J Prev Med 2008;34(6S):S175–S182.
2. Heitzler CD, Asbury LD, Kusner SL. Bringing “play” to life: the use of experiential marketing in the VERB™ campaign. Am J Prev Med 2008;34(6S):S188–S193.
3. Bretthauer-Mueller R, Berkowitz JM, Thomas M, et al. Catalyzing community action within a national campaign: VERB™ community and national partnerships. Am J Prev Med 2008;34(6S):S210–S221.
4. Berkowitz JM, Huhman M, Nolin MJ. Did augmenting the VERB™ advertising in select communities have an effect on awareness, attitudes, and physical activity? Am J Prev Med 2008;34(6S):S257–S266.
5. Welk GJ, Wickel E, Peterson M, Heitzler CD, Fulton JE, Potter LD. Reliability and validity of physical activity questions on the Youth Media Campaign Longitudinal Survey. Med Sci Sports Exerc 39:612–21.
6. Telford A, Salmon J, Jolley D, Crawford D. Reliability and validity of physical activity questionnaires for children: the Children’s Leisure Activities Study Survey (CLASS). Pediatr Exerc Sci 2004;16:64–78.
7. Treuth MS, Sherwood NE, Butte NF, et al. Validity and reliability of activity measures in African-American girls for GEMS. Med Sci Sports Exerc 2003;35:532–9.
8. Potter LD, Nolin MJ, Judkins D, Piesse A, Huhman M. Evaluation of the CDC VERB™ campaign: findings of the Youth Media Campaign Longitudinal Survey, 2002, 2003, 2004, and 2005. Rockville MD: Westat, 2006.
9. Kalton G. Compensating for missing survey data [Research Report Series]. Ann Arbor MI: Institute for Social Research, 1983.
10. Kalton G, Kasprzyk D. The treatment of missing survey data. Surv Methodol 1986;12:1–16.
11. Little RJA, Rubin DB. Statistical analysis with missing data. New York: John Wiley, 1987.
12. Marker DA, Judkins D, Winglee M. Large-scale imputation for complex surveys. In: Groves RM, Dillman DA, Eltinge JL, Little RJA, Eds. Survey nonresponse. New York: John Wiley, 2001.
13. Piesse A, Judkins D, Fan Z. Item imputation made easy. 2005 proceedings of the Section on Survey Research Methods of the American Statistical Association. Alexandria VA: American Statistical Association, 2005:3476–9.
14. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55.
15. Rubin DB, Waterman RP. Estimating the causal effects of marketing interventions using propensity score methodology. Stat Sci 2006;21:206–28.
16. Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics 1993;49:1231–6.
17. Claritas Corporation. Getting to know the 62 clusters: cluster snapshots. Arlington VA: Claritas Corporation, 1995.
18. Judkins DR, Morganstein D, Zador P, Piesse A, Barrett B, Mukhopadhyay P. Variable selection and raking in propensity scoring. Stat Med 2007;26:1022–33.
19. Bailar B. Information needs, surveys, and measurement errors. In: Kasprzyk D, Duncan G, Kalton G, Singh MP, Eds. Panel surveys. New York: John Wiley, 1989.
20. Wang K, Cantor D, Safir A. Panel conditioning in a random digit dial survey. 2000 proceedings of the Section on Survey Research Methods of the American Statistical Association. Alexandria VA: American Statistical Association, 2000.
21. Cox DR. The use of a concomitant variable in selecting an experimental design. Biometrika 1957;44:150–8.
22. Cook TD, Campbell DT. Quasi-experimentation: design and analysis issues for field settings. 2nd ed. Chicago: Rand McNally College Publishing, 1979.
23. Potter LD, Duke JC, Nolin MJ, Judkins DR, Huhman M. Evaluation of the CDC VERB™ campaign: findings from the Youth Media Campaign Longitudinal Survey, 2002 and 2003. Rockville MD: Westat, 2004.
24. Potter LD, Duke JC, Nolin MJ, Judkins DR, Piesse A, Huhman M. Evaluation of the CDC VERB™ campaign: findings from the Youth Media Campaign Longitudinal Survey, 2002, 2003, and 2004. Rockville MD: Westat, 2005.
25. Huhman M, Potter LD, Wong FL, Banspach SW, Duke J, Heitzler C. Effects of a mass media campaign to increase physical activity in children: year-1 results of the VERB™ campaign. Pediatrics 2005;116:e277–84.
26. Palmgreen P, Donohew L, Lorch EP, Hoyle RH, Stephenson T. Television campaigns and sensation seeking. Targeting of adolescent marijuana use: a controlled time series approach. In: Hornik R, Ed. Public health communication: evidence for behavior change. Mahwah NJ: Lawrence Erlbaum Associates, 2002.
27. Yanovitzky I, Zanutto E, Hornik R. Estimating causal effects of public health education campaigns using propensity score methodology. Eval Program Plann 2005;28:209–20.