Supplemental Material
Sperrin et al.
Slowing down of adult body mass index trend increases in England: a latent class analysis of
cross-sectional surveys (1992-2010)
Figure S1: Flow diagram of the steps taken to create the HSE obesity dataset (1991-2009)
Download Health Survey for England datasets
    1991/2, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
    2007, 2008, 2009, 2010
Drop population subgroups
    Drop children (ages 0-19) and those aged 75 and older
Drop boost samples
    Ethnic boost: 1999 and 2004
    Older people: 2000 and 2005
Extract core variables of interest
    Age, sex, weight (kg), height (cm), BMI, waist and hip measure, waist/hip ratio, variables
    indicating that there is a valid measure of height/weight/BMI/hip/waist, smoking status,
    smoking frequency, tenure, limiting long term illness, equivalised household income,
    ethnicity, social class, alcohol consumption, cotinine (from blood sample), sampling
    weights, person identifier, qualifications, systolic and diastolic blood pressure
Recode and derive new variables
    Recode variables to ensure consistency over time and to meet the aims of the analysis
Append datasets
    Append the datasets for each year of HSE to create a single dataset
Drop redundant variables
Final data set includes the following variables:
Table S1: Variables in Health Survey for England dataset (1991/2-2010)
* For information on variable used with syntax see the HSE documentation available at http://www.esds.ac.uk/findingData/hseTitles.asp
Variable name: Pserial
Variable categories: n/a
Description: Unique identifier of survey respondent within each year
Years available: All years (1991/2-2010)
Syntax (derived variables)*: 1991/2 and 1993:
    rename hserno pserial

Variable name: Year
Variable categories: 1992, 1993, ..., 2010
Description: Health Survey for England year
Years available: All years (1991/2-2010)
Syntax (derived variables)*: n/a

Variable name: Age
Variable categories: Continuous
Description: Age
Years available: All years (1991/2-2010)
Syntax (derived variables)*: n/a

Variable name: Agegp
Variable categories: 16-34; 35-54; 55-74
Description: Grouped age variable
Years available: All years (1991/2-2010)
Syntax (derived variables)*:
    gen agegp=.
    replace agegp=1 if age>=16 & age <=34
    replace agegp=2 if age>=35 & age <=54
    replace agegp=3 if age>=55
    label define agegp 1 "16-34" 2 "35-54" 3 "55-74"
    label values agegp agegp

Variable name: Sex
Variable categories: Male; Female
Description: Sex
Years available: All years (1991/2-2010)
Syntax (derived variables)*: n/a

Variable name: Ht_valid
Variable categories: Continuous (cm)
Description: All valid heights (cm)
Years available: All years (1991/2-2010)
Syntax (derived variables)*:
    gen htvalid=htval if year>=1997
    replace htvalid=height if htok==1 & year<=1996
    codebook htvalid
    mvdecode htvalid, mv (-8=.\-1=.)

Variable name: BMI_valid
Variable categories: Continuous
Description: All valid BMIs
Years available: All years (1991/2-2010)
Syntax (derived variables)*:
    gen bmivalid=bmival if year>=1997
    replace bmivalid=bmi if bmiok==1 & year<=1996
    codebook bmivalid
    mvdecode bmivalid, mv (-1=.)
    codebook bmivalid

Variable name: smoke
Variable categories: Current; Ex-regular; Never-regular
Description: Smoking status
Years available: All years (1991/2-2010)
Syntax (derived variables)*:
    gen smokeVH=newsmok if year==1994
    recode smokeVH 7/11=1 2/6=2 1=3 *=.
    gen smoke=cigsta3 if year>=1999
    replace smoke=cigsmk2 if year==1996 | year==1995 | year==1993 | year==1992
    gen smokeAM=cigst1 if year==1997|year==1998
    recode smokeAM 4=1 2/3=2 1=3
    replace smoke=smokeAM if year==1997|year==1998
    replace smoke=smokeVH if year==1994
    mvdecode smoke, mv (-9=.\-8=.\-7=.\-6=.\-2=.\-1=.)
    label define smoke 1 "current" 2 "ex-regular" 3 "never regular"
    label values smoke smoke
Variable name: smkquant
Variable categories: Light <10; Moderate 10-19; Heavy 20+
Description: Number of cigarettes smoked per day
Years available: All years (1991/2-2010)
Syntax (derived variables)*:
    *recoding into temp variables to match value labels across years
    gen smkqt96=newsmok2 if year==1996
    recode smkqt96 4=1 5=2 6=3 *=.
    gen smkqt9495=newsmok if year==1994|year==1995
    recode smkqt9495 8=1 9=2 10=3 *=.
    gen smkqt9293=cigsmk1 if year==1992 | year==1993
    recode smkqt9293 1=30 2=20 3=10 *=.
    recode smkqt9293 30=3 20=2 10=1 *=.
    gen smkqt9709=cigst2 if year>=1997
    recode smkqt9709 1=1 2=2 3=3 *=.
    *generating a consistent smoking quantity variable
    gen smkquant=smkqt9709 if year>=1997
    replace smkquant=smkqt96 if year==1996
    replace smkquant=smkqt9495 if year==1994 | year==1995
    replace smkquant=smkqt92 if year==1992 | year==1993
    mvdecode smkquant, mv (-9=.\-8=.\-1=.)
    ta smkquant
    label define smkquant 1 "light <10" 2 "moderate 10-19" 3 "heavy 20+"
    label values smkquant smkquant

Variable name: Topqual2
Variable categories: NVQ4/5 or degree; Higher education below degree; NVQ3/GCE, A-level;
    NVQ2/GCE, O-level; NVQ1/CSE other grade; No qualification; Full time student
Description: Highest qualification
Years available: All years except 1995 and 1996
Syntax (derived variables)*:
    mvdecode topqual2, mv (-1=.\-9=.\-8=.\-7=.)

Variable name: sclass
Variable categories: i – professional; ii – managerial/technical; iiin – skilled non-manual;
    iiim – skilled manual; iv – semi-skilled manual; v – unskilled manual
Description: Social class (occupational)
Years available: All years from 1996 onwards
Syntax (derived variables)*:
    mvdecode sclass, mv (-1=.)
    **recoding armed forces and 'not fully described' as missing
    **(not clear where to put armed forces and only a handful of cases)
    mvdecode sclass, mv (7=.\8=.)

Variable name: Eqv5
Variable categories: <=£10,665.74; £10,665.74 to £16,900.00; £16,900.00 to £26,787.88;
    £26,787.88 to £41,864.41; >£41,864.41
Description: Quintiles based on distribution of equivalised household income
Years available: All years from 1997 onwards excluding 1999 and 2004
Syntax (derived variables)*:
    mvdecode eqv5, mv (-1=.\-90=.)

Variable name: Ethnic
Variable categories: White; Mixed; Asian or Asian British; Black or Black British; Other
Description: Ethnic group
Years available: All years from 1999 onwards
Syntax (derived variables)*:
    gen ethcindVH=ethcind
    recode ethcindVH 3=44 4=33
    recode ethcindVH 44=4 33=3
    label define ethcindVH 1 "white" 2 "mixed ethnic group" 4 "black or Black british" 3 "asian or asian british" 5 "any other group"
    label values ethcindVH ethcindVH
    mvdecode ethcindVH, mv (-9=.\-8=.)
    gen ethniciVH=ethnici
    recode ethniciVH 3=44 4=44 5=33 6=33
    recode ethniciVH 44=4 33=3 7=5
    label define ethniciVH 1 "white" 2 "mixed ethnic group" 4 "black or Black british" 3 "asian or asian british" 5 "any other group"
    label values ethniciVH ethniciVH
    mvdecode ethniciVH, mv (-9=.\-8=.)
    gen originVH=origin
    recode originVH 1/3=1 4/7=2 8/11=3 12/14=4 15/16=5
    label define originVH 1 "white" 2 "mixed ethnic group" 4 "black or Black british" 3 "asian or asian british" 5 "any other group"
    label values originVH originVH
    mvdecode originVH, mv (-9=.\-8=.)
    *****
    gen ethnicgp=ethniciVH if year>=1999 & year<=2003
    replace ethnicgp=ethcindVH if year==2004
    replace ethnicgp=ethinda if year>=2005 & year<=2007
    replace ethnicgp=originVH if year>=2008 & year<=2009
    label define ethnicgp 1 "white" 2 "mixed ethnic group" 4 "black or Black british" 3 "asian or asian british" 5 "any other group"
    label values ethnicgp ethnicgp
    mvdecode ethnicgp, mv (-9=.\-8=.)
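The year-specific recodes for smoking status in Table S1 can be read as a lookup from (survey year, source variable) to one harmonised code. The following is a minimal Python sketch of that logic, not the authors' code: the function and constant names are hypothetical, and only the code mappings, source variable names and missing-value codes are taken from the Stata syntax for `smoke`.

```python
# Harmonised codes, as in the Stata value label for `smoke`.
LABELS = {1: "current", 2: "ex-regular", 3: "never regular"}

# Negative codes that the Stata `mvdecode smoke` line sets to missing.
MISSING_CODES = {-9, -8, -7, -6, -2, -1}


def recode_newsmok(code):
    """1994 (newsmok): 7-11 -> 1, 2-6 -> 2, 1 -> 3, anything else -> missing."""
    if code is None:
        return None
    if 7 <= code <= 11:
        return 1
    if 2 <= code <= 6:
        return 2
    if code == 1:
        return 3
    return None


def recode_cigst1(code):
    """1997-1998 (cigst1): 4 -> 1, 2-3 -> 2, 1 -> 3; other values are left unchanged."""
    if code is None:
        return None
    return {4: 1, 2: 2, 3: 2, 1: 3}.get(code, code)


def harmonise_smoke(year, row):
    """Return the harmonised smoking status (1, 2, 3) or None if missing.

    `row` is a dict holding the raw survey variables for one respondent.
    """
    if year >= 1999:
        value = row.get("cigsta3")
    elif year in (1997, 1998):
        value = recode_cigst1(row.get("cigst1"))
    elif year == 1994:
        value = recode_newsmok(row.get("newsmok"))
    else:  # 1992, 1993, 1995 and 1996 use cigsmk2 directly
        value = row.get("cigsmk2")
    if value is None or value in MISSING_CODES:
        return None
    return value


if __name__ == "__main__":
    status = harmonise_smoke(1997, {"cigst1": 3})
    print(status, LABELS[status])
```

Keeping the per-year rules in small functions makes it easier to check each one against the survey documentation before the yearly files are appended.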
Latent Class Analysis or mixture modelling
Latent class analysis, or mixture modelling, is a tool to model population heterogeneity, as
well as a form of semi-parametric modelling, for some measured variable(s) (McLachlan &
Peel, 2000). For the population heterogeneity interpretation, we are assuming that the
population we are examining consists of $k$ sub-populations. Within each sub-population $j$
($j = 1, \dots, k$) the measured variable(s) are assumed to be independent and identically
distributed with some distribution (probability density function) $f_j(y)$. Let $\pi_j$ denote
the proportion of individuals belonging to sub-population $j$, so that
$\sum_{j=1}^{k} \pi_j = 1$. Then, we can write the overall mixture density as
$$f(y) = \sum_{j=1}^{k} \pi_j f_j(y).$$
In the context of this paper, we take the (univariate) response to be BMI, in the data obtained
from the Health Survey for England (HSE). Figure S2 shows empirical density plots of the
distributions of BMI, separated by gender, for three years: 1993, 2001 and 2008. The
densities are right skewed; this is particularly the case for women. We hypothesize that
these skewed densities can be well approximated by a mixture of at least two normal
densities, with one larger component capturing the main body of the density, and a second
smaller component, with a larger mean, accounting for the right skew. To illustrate the idea,
we focus on the BMI distribution for women in 2001. Figure S3 compares a single normal
distribution (one component) with a two-component normal mixture, both fitted to these
data. We see that the two-component version captures the skew, and appears to give a better
representation of the data (we shall use more formal model fit statistics at the modelling
stage).
Clearly, one can argue that, in the sense of fitting a semi-parametric model that describes
the data, a two-component normal model does a good job. A strength of the mixture
modelling approach is that one can moreover (with some caution) interpret the mixture
components in terms of their possible correspondence to subpopulations.
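To make the skewness argument concrete, the sketch below evaluates a two-component normal mixture density in Python and computes its mean and skewness numerically. The parameter values are hypothetical and chosen only for illustration; they are not estimates from the HSE data.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, weights, means, sds):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))


# Hypothetical parameters: a larger main component and a smaller
# component with a larger mean (illustrative values only).
weights = [0.7, 0.3]
means = [24.0, 30.0]
sds = [3.0, 5.0]

# Numerical moments on a fine grid from 5 to 75 kg/m^2.
step = 0.01
grid = [5.0 + i * step for i in range(7001)]
dens = [mixture_pdf(y, weights, means, sds) for y in grid]

total_mass = sum(dens) * step
mix_mean = sum(y * d for y, d in zip(grid, dens)) * step
mix_var = sum((y - mix_mean) ** 2 * d for y, d in zip(grid, dens)) * step
third = sum((y - mix_mean) ** 3 * d for y, d in zip(grid, dens)) * step
skewness = third / mix_var ** 1.5

if __name__ == "__main__":
    print(f"mass={total_mass:.4f} mean={mix_mean:.2f} skewness={skewness:.2f}")
```

Although each component is symmetric, the mixture has positive skewness because the smaller component sits to the right of the main body of the density.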
Figure S2: Density plots of BMI distribution in 1993, 2001 and 2008; men and women. The vertical red line
corresponds to 25 kg/m2.
Figure S3: BMI distribution in 2001. Dashed line: normal distribution fit; dotted line: 2-component normal
distribution fit.
Mixture Regression
Mixture models, or latent class analysis, can be extended into a regression context (Grun &
Leisch, 2008). In this paper, we use age, sex, and calendar year, as explanatory variables
for BMI. In particular, we are interested in how the distribution of BMI (conditional on age
and sex) changes over calendar time.
The simplest example of mixture models used in the regression context is where they
describe different relationships between a single predictor and a response, implicitly
assuming that the relationship between the two variables is moderated by a third, latent,
discrete variable.
We propose modelling BMI in this fashion. In the main model, we model the BMI variable as a
2-component mixture model, correcting for the variables year, age and sex (all taken as
factors rather than continuous variables). We tentatively label the two components as an
‘average’ or normal group, making up the bulk of the distribution, and a ‘susceptible’ group,
accounting for the right skew present in the distribution. Importantly, it is only the mean that
is being modelled as dependent on the explanatory variables, so the variances and the
proportions are invariant with respect to the explanatory variables. To mitigate potential
limitations of this, we consider sub-models for different ages and genders, with proportions
and variances then allowed to vary between the sub-models. Formally, we are considering
models of the form
$$f(y_i \mid x_i) = \sum_{j=1}^{2} \pi_j \, \phi\left(y_i;\, \mu_j(x_i),\, \sigma_j^2\right),$$
where $\phi(\cdot\,;\, \mu, \sigma^2)$ is the normal density with mean $\mu$ and variance
$\sigma^2$, $\pi_j$ and $\sigma_j^2$ are the proportion and variance of component $j$,
$y_i$ represents the BMI of the $i$th subject and $x_i$ represents the explanatory
variables of the $i$th subject: gender, year of measurement and age. The means of the
distributions, $\mu_1(x_i)$ and $\mu_2(x_i)$, depend on the explanatory variables in a
regression:
$$\mu_j(x_i) = x_i^{\top} \beta_j,$$
where $\beta_j$ is a vector of coefficients to be estimated for $j = 1, 2$. Because all
explanatory variables are viewed as factors, multiple coefficients are needed for each
variable (e.g. a separate coefficient is needed for each year).
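The role of factor coding can be sketched in a few lines of Python. The example below builds an indicator (dummy) design row with one reference level per factor and evaluates a component mean as a linear predictor. The factor levels and coefficient values are hypothetical; with the full set of survey years treated as a factor, each component would carry one coefficient per non-reference year.

```python
def design_row(year, age_group, sex, years, age_groups):
    """Intercept plus one 0/1 indicator for each non-reference factor level.

    The first entry of `years` and `age_groups`, and "male", act as reference levels.
    """
    row = [1.0]
    row += [1.0 if year == level else 0.0 for level in years[1:]]
    row += [1.0 if age_group == level else 0.0 for level in age_groups[1:]]
    row.append(1.0 if sex == "female" else 0.0)
    return row


def component_mean(x, beta):
    """Linear predictor: inner product of the design row and a coefficient vector."""
    return sum(xi * bi for xi, bi in zip(x, beta))


# Hypothetical levels and coefficients, for illustration only.
years = [1992, 2001, 2010]
age_groups = ["20-39", "40-59", "60-74"]
beta_1 = [22.0, 0.3, 0.6, 1.0, 1.5, -0.5]  # first component
beta_2 = [27.0, 0.8, 1.6, 1.5, 2.0, 0.4]   # second component

x = design_row(2010, "40-59", "female", years, age_groups)
mu_1 = component_mean(x, beta_1)
mu_2 = component_mean(x, beta_2)

if __name__ == "__main__":
    print(x, round(mu_1, 2), round(mu_2, 2))
```

Each component has its own coefficient vector, while the design row for a given subject is shared across components.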
Testing mixture model fit
All 2-component models are tested against their 1-component analogues, and all sub-model
combinations are tested against their parent models. AIC is used for model comparison;
this is commonly recommended in mixture models, as opposed to other approaches such as
the likelihood ratio test, which rely on various assumptions that are not satisfied by mixture
models (McLachlan & Peel, 2000).
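As a small worked illustration, the Python sketch below ranks models by AIC using the AIC values and degrees of freedom reported in Table S2. It relies on the standard definition AIC = 2 df - 2 log-likelihood; the function names are hypothetical, and the ranking reflects AIC alone, not the stability considerations discussed alongside Figure S4.

```python
# (AIC, df) by number of components, copied from Table S2.
TABLE_S2 = {
    "men": {1: (416014.9, 29), 2: (409530.6, 59)},
    "women": {
        1: (518826.0, 29),
        2: (501773.0, 59),
        3: (499795.1, 89),
        4: (499288.3, 119),
    },
}


def aic(loglik, df):
    """Akaike's information criterion from a log-likelihood and parameter count."""
    return 2.0 * df - 2.0 * loglik


def loglik_from_aic(aic_value, df):
    """Invert the AIC definition to recover the maximised log-likelihood."""
    return df - aic_value / 2.0


def delta_aic(models):
    """AIC of each model minus the smallest AIC (0 marks the preferred model)."""
    best = min(value for value, _ in models.values())
    return {k: value - best for k, (value, _) in models.items()}


if __name__ == "__main__":
    for group, models in TABLE_S2.items():
        deltas = delta_aic(models)
        print(group, {k: round(d, 1) for k, d in deltas.items()})
```

In Table S2 the degrees of freedom equal 29k + (k - 1) for k components, which is what one would expect if each component has 29 parameters and there are k - 1 free mixing proportions.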
Of particular interest is the behaviour of the component means, $\mu_1$ and $\mu_2$, and how
they vary with age and calendar year.
Assignment of individuals to clusters
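The parameters for the mixture model are estimated by maximum likelihood using the EM
algorithm (Dempster et al, 1977), using the flexmix package in R (Leisch, 2004). Note that
individuals are not explicitly assigned to these clusters (hence this is soft clustering), but a
posterior probability of an individual's membership of component $j$ can then be calculated via
$$P(j \mid y_i, x_i) = \frac{\pi_j \, \phi\left(y_i;\, \mu_j(x_i),\, \sigma_j^2\right)}{\sum_{l} \pi_l \, \phi\left(y_i;\, \mu_l(x_i),\, \sigma_l^2\right)},$$
where $\pi_j$, $\mu_j(x_i)$ and $\sigma_j^2$ are the estimated proportion, mean and variance of
component $j$, and $\phi$ is the normal density.
When a definite latent class membership is required (such as for sensitivity analysis), we
assign each individual to the class for which their posterior probability of membership is
largest.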
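The estimation and assignment steps can be sketched in plain Python. This is a simplified illustration, not the flexmix implementation used for the analysis: it fits a two-component normal mixture without covariates by EM on simulated data, then computes posterior membership probabilities and a hard assignment. The simulated parameters, starting values and function names are all assumptions made for the example.

```python
import math
import random


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(y, weights, means, sds):
    """Posterior probability of each component for one observation (E-step)."""
    joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]


def fit_em(data, n_iter=200):
    """EM for a two-component normal mixture with component-specific variances."""
    n = len(data)
    ordered = sorted(data)
    # Crude starting values: quartiles as means, overall sd for both components.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    centre = sum(data) / n
    sd0 = math.sqrt(sum((y - centre) ** 2 for y in data) / n)
    sds = [sd0, sd0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        resp = [posteriors(y, weights, means, sds) for y in data]  # E-step
        for j in range(2):  # M-step: weighted proportion, mean and variance
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * y for r, y in zip(resp, data)) / nj
            var = sum(r[j] * (y - means[j]) ** 2 for r, y in zip(resp, data)) / nj
            sds[j] = math.sqrt(var)
    return weights, means, sds


# Simulated data (hypothetical parameters, not HSE estimates).
random.seed(1)
data = [random.gauss(23.0, 2.0) for _ in range(1400)]
data += [random.gauss(31.0, 3.0) for _ in range(600)]

weights, means, sds = fit_em(data)

# Soft clustering: posterior probabilities for a BMI of 35 kg/m^2.
probs = posteriors(35.0, weights, means, sds)
# Hard assignment: the class with the largest posterior probability.
hard_class = max(range(2), key=lambda j: probs[j])

if __name__ == "__main__":
    print([round(w, 3) for w in weights], [round(m, 2) for m in means])
    print([round(p, 4) for p in probs], hard_class)
```

In the mixture regression setting the same E-step applies, with each component mean replaced by its linear predictor for the subject's covariates.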
Sensitivity Analysis
Table S2 – exploring component models (k = 1, 2, 3, and 4)
Model          Component   Proportion   Intercept   AIC        BIC        df
MEN
1 component    C1          1.000        23.62       416014.9   416281.9   29
2 component    C1          0.765        22.86       409530.6   410073.9   59
               C2          0.235        26.89
WOMEN
1 component    C1          1.000        23.46       518826.0   519097.0   29
2 component    C1          0.663        21.90       501773.0   502324.0   59
               C2          0.337        27.39
3 component    C1          0.220        28.99       499795.1   500626.7   89
               C2          0.373        23.04
               C3          0.407        21.35
4 component    C1          0.305        24.52       499288.3   500400.3   119
               C2          0.156        29.93
               C3          0.250        21.06
               C4          0.289        21.46
C: component or latent class. df: degrees of freedom (the same for AIC and BIC).
AIC: Akaike's information criterion. BIC: Bayesian information criterion.
Figure S4 One to four component models
In men, models of 3 and 4 components were unstable (i.e. wide variation year on year) and
not considered further. The cause of the instability is a combination of identifiability issues in
the model fitting (common to mixture models), and the components fitting idiosyncrasies in
the data rather than biologically plausible subgroups.
[Figure S4, women. Panels: 1 component, 2 component, 3 component and 4 component. Axes: BMI
(kg/m2) against Year (1990-2010) and Age (20-70). Proportions marked on the panels:
2 component, 66.3% and 33.7%; 3 component, 40.7%, 37.3% and 22.0%; 4 component, 30.5%,
28.9%, 25.0% and 15.6%.]
One to four component mixture models for BMI distributions per year and age in women.
Only means for each component are shown. The percentages indicate the subpopulation
proportion.
Potential Response Bias
Analyses were conducted on the 183,259 cases in the HSE dataset (1992-2010; ages 20-75
and with gender data) for which BMI measurements were available. A further 19,093 cases
had agreed to interview but had no BMI measurement available,
i.e. 89.6% of those interviewed had BMI measurements available. We can consider whether
bias exists, amongst those who agreed to interview, between those with and without BMI
measurements. However, it is untestable from these data whether response bias exists
between those who do and do not agree to interview for HSE altogether (approximately 70%
agreed to interview).
Henceforth we consider missingness as a proportion of those interviewed. First, separating
by gender, the missingness rate was 8.9% for males, and 11.7% for females. The extra
missingness in females could be because weight measurements were not taken from
pregnant women.
Missingness also changed slightly by survey year, with more missingness generally present
in later survey years:
Year    % missing
1992    7.6
1993    10.7
1994    6.5
1995    8.9
1996    7.2
1997    6.5
1998    8.5
1999    9.8
2000    11.7
2001    10.7
2002    10.9
2003    10.0
2004    14.7
2005    15.1
2006    12.9
2007    12.2
2008    13.6
2009    13.3
2010    15.3
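The pattern across survey years can be summarised with a short calculation. The Python sketch below uses the yearly percentages listed above and compares the unweighted mean for 1992-2000 with that for 2001-2010. The split point is an arbitrary choice made for illustration, and the means are unweighted because yearly sample sizes are not given here.

```python
# Percentage of interviewed cases with missing BMI, by survey year (from the list above).
PCT_MISSING = {
    1992: 7.6, 1993: 10.7, 1994: 6.5, 1995: 8.9, 1996: 7.2,
    1997: 6.5, 1998: 8.5, 1999: 9.8, 2000: 11.7, 2001: 10.7,
    2002: 10.9, 2003: 10.0, 2004: 14.7, 2005: 15.1, 2006: 12.9,
    2007: 12.2, 2008: 13.6, 2009: 13.3, 2010: 15.3,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Arbitrary split into earlier and later survey years.
mean_early = mean(v for year, v in PCT_MISSING.items() if year <= 2000)
mean_late = mean(v for year, v in PCT_MISSING.items() if year >= 2001)

if __name__ == "__main__":
    print(f"1992-2000: {mean_early:.2f}%  2001-2010: {mean_late:.2f}%")
```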
We also considered differences in age, social class, educational status, smoking status and
income; here there were no discernible differences between missing and non-missing cases.
References for supplemental material
Allman-Farinelli MA, Chey T, Bauman AE, Gill T, James WP (2008) Age, period and birth
cohort effects on prevalence of overweight and obesity in Australian adults from 1990 to
2000. Eur J Clin Nutr 62(7): 898-907
Carstensen B (2007) Age-period-cohort models for the Lexis diagram. Stat Med 26(15):
3018-45
Dempster A, Laird N, Rubin D (1977) Maximum Likelihood from Incomplete Data via the EM
Algorithm. Journal of the Royal Statistical Society, Series B 39(1): 1-38
Grun B, Leisch F (2008) FlexMix Version 2: Finite Mixtures with Concomitant Variables and
Varying and Constant Parameters. Journal of Statistical Software 28(4): 1-35. URL
http://www.jstatsoft.org/v28/i04/
Holford TR (1983) The estimation of age, period and cohort effects for vital rates. Biometrics
39(2): 311-24
Holford TR (1991) Understanding the effects of age, period, and cohort on incidence and
mortality rates. Annu Rev Public Health 12: 425-57
Howel D (2011) Trends in the prevalence of obesity and overweight in English adults by age
and birth cohort, 1991-2006. Public Health Nutr 14(1): 27-33
Leisch F (2004) FlexMix: A general framework for finite mixture models and latent class
regression in R. Journal of Statistical Software 11(8). URL http://www.jstatsoft.org/v11/i08/
McLachlan GJ, Peel D (eds) (2000) Finite Mixture Models. New York: Wiley
Rutherford MJ, Lambert PC, Thompson JR (2011) Age-Period-Cohort Modelling. The Stata
Journal 10(4)