Temporal Breadth

Affirmation 3_29_17
This week, I've worked on more formal adaptations of the models from last week.
Specifically, while last week, I looked at how temporal breadth was associated with
quantities derived from other models, this week, I've incorporated temporal breadth
directly into the models.
Additionally, I've switched all these models over to Bayesian estimation methods rather
than frequentist estimation. At a superficial level, this won't make much of a difference for
what we're looking at. We're interested in the degree to which temporal breadth is
associated with academic outcomes. Bayesian and frequentist statistics each have a set of
methods of estimating this relationship. However, the frequentist approaches carry a set of
assumptions (often left unstated) that I think are undesirable. Bayesian methods also
carry some assumptions, but those assumptions are made explicit in the model definition
(through the specification of priors and model structure), and the statements one can
logically make about Bayesian estimates are more in line with what I believe we would like
to say about these data.
Just to unpack this idea a little further, I've included a short section on Bayesian estimation.
Bayes
In Bayesian estimation, the posterior distribution is a description of what parameter values
are credible in light of the data and our prior beliefs. More succinctly, the posterior tells us
what we can believe after seeing some data. In this setup, our prior beliefs can be quite
vague, and they describe the range of values that we think are probable before seeing any
data. For example, our prior beliefs about temperature on the surface of planet Earth might
run something like -150 to 150 degrees Fahrenheit. It's essentially impossible that we
would see a temperature value of ten million degrees, and so our prior would not include
that value. Using a more appropriate comparison, for GPA data, our prior beliefs should
include scores that range from 0 to 4.3 or so. However, we would never expect to see a GPA
of -50, and so including this value in our prior belief is nonsensical. Practically speaking, the
prior should be chosen to balance the demands of effective computations and convincing a
skeptical critic. The former demand is flexible and thanks to computational advances, is
becoming increasingly so. The latter demand is obviously less flexible.
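To make the prior-to-posterior step concrete, here is a minimal sketch using a grid approximation. It is not the estimation code used for this report. The GPA values, the known standard deviation `sigma`, and the flat prior over 0 to 4.3 are all hypothetical choices made only for illustration.

```python
import math

def grid_posterior(data, grid, prior, sigma):
    """Posterior over candidate means in `grid`, for normal data with known sigma."""
    unnormalized = []
    for mu, prior_mass in zip(grid, prior):
        # Log-likelihood of the data if the true mean were `mu`
        log_lik = sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)
        unnormalized.append(prior_mass * math.exp(log_lik))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Candidate mean-GPA values from 0 to 4.3 in steps of 0.01.
# Values such as -50 are simply absent from the grid, so the prior excludes them.
grid = [i / 100 for i in range(0, 431)]
prior = [1.0 / len(grid)] * len(grid)   # vague (flat) prior, hypothetical
data = [2.1, 2.8, 3.0, 2.5, 2.6]        # hypothetical GPAs, not from the study
sigma = 0.8                             # hypothetical known standard deviation

posterior = grid_posterior(data, grid, prior, sigma)
posterior_mean = sum(mu * p for mu, p in zip(grid, posterior))
```

With a prior this vague, the posterior mean sits essentially at the sample mean; a more informative prior would pull it toward the values the prior favors.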
In contrast, frequentist estimation is based on a set of assumptions that are typically
overlooked. I will not belabor the point that has been made numerous times elsewhere
(Kruschke, 2014; Berger & Wolpert, 1988; Jeffreys, 1961; Wagenmakers, 2007), but
conclusions drawn from p-values and confidence intervals estimated using standard
procedures are misleading at best, and many scientists and statisticians interpret them as
though they were quantities derived from Bayesian posterior distributions. Some of the logic
behind these problems is quite technical, but superficially, the problems boil down to three
points, popularly highlighted by Wagenmakers (2007). Specifically, p-values:
1. Depend on data that were never observed. That is, the p-value is dependent upon the
null distribution, which includes hypothetical data expected under the null.
2. Depend on possibly unknown subjective intentions of the experimenter. For instance, the
sampling plan of the researcher may be to collect 𝑁 participants, or it may be to collect data
until 𝑇 time passes, or it may be to collect as much data as possible in a given cohort of
students in a given school. These differences in sampling plan will lead to different forms of
the null distribution, leading to different p-values depending on which null distribution the
analyst specifies.
3. Do not quantify statistical evidence. For instance, identical p-values do NOT reflect
identical statistical evidence. A p-value of .03 when computed with a sample of 5
individuals is not perceived to offer the same strength of evidence as the same p-value
when computed with a sample of 500 individuals (e.g. Rosenthal & Gaito, 1963; Nelson,
Rosenthal & Rosnow, 1986).
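The second point can be illustrated with a standard textbook coin-flip example (not drawn from these data). Suppose we observe 9 heads and 3 tails and test whether the coin is fair. The sketch below computes the one-sided p-value under two hypothetical sampling plans: stopping after 12 flips, or stopping at the third tail. The function names are mine and purely illustrative.

```python
from math import comb

def p_fixed_n(heads, n):
    """One-sided p-value if the plan was to stop after exactly n flips."""
    # P(at least `heads` heads in n fair flips)
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

def p_fixed_tails(heads, tails):
    """One-sided p-value if the plan was to stop at the `tails`-th tail.

    Seeing at least `heads` heads before that tail is the same event as
    having at most tails - 1 tails among the first heads + tails - 1 flips.
    """
    m = heads + tails - 1
    return sum(comb(m, k) for k in range(tails)) / 2 ** m

p_n = p_fixed_n(9, 12)      # about 0.073
p_t = p_fixed_tails(9, 3)   # about 0.033
```

The observed data are identical in both cases, yet only the second plan yields p < .05, because the two plans imply different null distributions.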
Much more could be said (and indeed, has been said) about the problems of this statistical
framework as it is typically applied in modern science. If this is of interest, I will be happy
to provide a fuller treatment in the future. For the present, we can rest easy knowing that
the parameters and intervals I present can be interpreted in the way that we want, and not
in the convoluted manner of frequentist p-values, which necessitate invoking an imaginary
sampling distribution.
Temporal Breadth
Recall once again that temporal breadth is computed using the Herfindahl-Hirschman
index (H-H index). This index is based on three proportions: the proportion of words from
the LIWC future category, the LIWC present category, and the LIWC past category. The
denominator for each of these proportions is the total number of words across the three (i.e.
for a given person, the values of the three proportions should sum to 1). As before, I've
reversed this metric such that higher scores indicate more breadth.
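As a minimal sketch of this computation: the H-H index is the sum of the squared proportions, and I assume here that the reversal is done by subtracting the index from 1. The report does not state the exact reversal used, so treat that step, the function name, and the word counts as illustrative assumptions.

```python
def temporal_breadth(past, present, future):
    """Reversed Herfindahl-Hirschman index over three tense word counts.

    Assumption: the index is reversed as 1 - HHI, so that higher values
    mean the words are spread more evenly across the three categories.
    """
    total = past + present + future
    if total == 0:
        raise ValueError("no temporal words to compute proportions from")
    shares = [past / total, present / total, future / total]
    hhi = sum(share * share for share in shares)
    return 1.0 - hhi

narrow = temporal_breadth(30, 0, 0)     # all words in one category
broad = temporal_breadth(10, 10, 10)    # words spread evenly across all three
```

Under this convention the score runs from 0 (all temporal words in a single category) to 2/3 (an even split across past, present, and future).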
A longitudinal model of academic outcomes.
In my view, the best approach to modeling academic outcomes that I previously described
was the one incorporating a slope and intercept for each student. To this model, I will now
add condition and writing information. Specifically, I will be interested in the extent to
which the intercept varies as a function of temporal breadth, and how this variation differs
by condition. Information about the intercept corresponds to our knowledge of where the
student is estimated to be at the first measurement occurrence after the intervention.
A second interest in this model is how the slope varies as a function of temporal breadth
and condition. This information corresponds to our knowledge of a student's trajectory of
performance over time.
Finally, we can obtain the model's estimates for academic outcomes at time points along
the trajectory. These estimates combine the intercept and slope into a single
prediction of whether variation in temporal breadth leads to boosts in academic outcomes
at any point along the longitudinal measurement of students (e.g. a small bump in the
beginning combined with a small deflection in their trajectory could yield large effects after
2 years).
More formally, the model estimated is of the form:
π‘Œπ‘–π‘— = 𝛽0𝑗 + 𝛽1𝑗 𝑄𝑖𝑗 + 𝛽2 𝐢𝑗 + 𝛽3 𝐡𝑗 + 𝛽4 𝑄𝑖𝑗 𝐢𝑗 +
𝛽5 𝑄𝑖𝑗 𝐡𝑗 + 𝛽6 𝐢𝑗 𝐡𝑗 + 𝛽7 𝑄𝑖𝑗 𝐡𝑗 𝐢𝑗 + 𝛽8 𝑃𝑗 + πœ–π‘–π‘—
Where
𝛽0𝑗 = 𝛾00 + πœ‡0𝑗
𝛽1𝑗 = 𝛾10 + πœ‡1𝑗
In less formal english, this model reads that the observed grade at time 𝑖 for person 𝑗 is a
function of quarter (𝑄), condition (𝐢), temporal breadth (𝐡), all interactions between these
terms, pre-intervention gpa (𝑃), and a random error (πœ–). Additionally, the parameters for
the intercept (𝛽0) and slope (𝛽1) vary by person. For each of these parameters, there is an
overall coefficient (𝛾00 and 𝛾10 for intercept and slope respectively), and individual
deviations from these coefficients (πœ‡0𝑗 and πœ‡1𝑗 ).
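The model's linear predictor can be written out as a short function. This is only a sketch of the equation above, not the fitting code. The coefficient values are made up for illustration and are not the estimates from this report. I also assume that condition is coded 0/1, that temporal breadth is in standard deviation units, and that pre-intervention GPA is centered.

```python
def predicted_grade(quarter, condition, breadth, pre_gpa, coef, u0=0.0, u1=0.0):
    """Expected grade for one student-quarter (error term omitted).

    u0 and u1 are the student's deviations from the overall intercept
    and slope; both default to 0, i.e. a typical student.
    """
    intercept = coef["g00"] + u0
    slope = coef["g10"] + u1
    return (intercept
            + slope * quarter
            + coef["b2"] * condition
            + coef["b3"] * breadth
            + coef["b4"] * quarter * condition
            + coef["b5"] * quarter * breadth
            + coef["b6"] * condition * breadth
            + coef["b7"] * quarter * breadth * condition
            + coef["b8"] * pre_gpa)

# Hypothetical coefficients, chosen only to show how the terms combine.
coef = {"g00": 2.5, "g10": -0.02, "b2": 0.0, "b3": 0.01, "b4": 0.0,
        "b5": 0.0, "b6": 0.02, "b7": 0.01, "b8": 0.6}

def breadth_gap(quarter):
    """Treatment-group gap between +1 SD and -1 SD temporal breadth."""
    high = predicted_grade(quarter, 1, 1.0, 0.0, coef)
    low = predicted_grade(quarter, 1, -1.0, 0.0, coef)
    return high - low

gap_start = breadth_gap(0)   # gap from the intercept-related terms only
gap_end = breadth_gap(8)     # gap after the slope-related terms accumulate
```

With these made-up values the gap grows from 0.06 at the first quarter to 0.22 after 8 quarters, which is the sense in which a small initial bump plus a small change in slope can add up over time.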
The figure below plots the coefficient estimates from this model. Though the direction of
the effect is consistent with the hypothesized relationship (bottom two coefficients),
neither one is robust enough for us to place much confidence in the effect described by these
parameters.
Figure 3: Model fit to the data. Line represents the mean of the posterior distribution. Bands
represent the 95% uncertainty interval. Plot is faceted by the approximate mean and +/-1
standard deviation of temporal breadth
I next project this model out over the full timeline of participation. As indicated in Figure
3, the direction of the effect is consistent with our hypothesis, though the effect is small.
After 8 quarters, there is no clear separation between the control and treatment groups for
this specification. However, focusing within the treatment group, there is modest evidence for
modulation. Specifically, after 8 quarters, a student in the treatment group whose language
is 1 standard deviation above the mean in terms of temporal breadth is estimated to have a
GPA of 2.46 (95% uncertainty interval = [2.3, 2.61]). In contrast, the comparable treatment
student whose language is 1 standard deviation below the mean in temporal breadth is
estimated to have a GPA of approximately 2.31 [2.21, 2.41]. Concretely, the probability that
a student in the treatment whose language is 1 standard deviation above the mean in terms
of temporal breadth will have a higher GPA after 8 quarters than the equivalent student
who is 1 standard deviation below the mean is approximately 0.94 (the complementary
probability is 0.06). Figure 4 plots this
relationship more directly.
Figure 4: Posterior distribution of the GPA difference between treatment students who write
with temporal breadth 1 standard deviation below and 1 standard deviation above the mean.
Positive values on the y axis represent better outcomes for high temporal breadth. Blue
shading represents 95% uncertainty intervals
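A probability of this kind is simply the share of posterior draws in which the high-breadth prediction exceeds the low-breadth prediction. The sketch below shows the calculation on a handful of made-up draws; the function name and the numbers are hypothetical and are not taken from the fitted model.

```python
def prob_higher(draws_high, draws_low):
    """Share of paired posterior draws in which the high value exceeds the low value."""
    if not draws_high or len(draws_high) != len(draws_low):
        raise ValueError("need two non-empty, equally long sets of draws")
    wins = sum(high > low for high, low in zip(draws_high, draws_low))
    return wins / len(draws_high)

# Made-up paired draws of predicted GPA (same posterior draw, two breadth levels)
high_breadth = [2.5, 2.4, 2.6, 2.3]
low_breadth = [2.3, 2.5, 2.2, 2.2]

p_high = prob_higher(high_breadth, low_breadth)   # 3 of 4 draws
p_low = 1.0 - p_high                              # the complementary share
```

In practice the draws would number in the thousands and come from the fitted model, and the two shares would be reported as the probability of the effect and its complement.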
Incorporating identity threat
Values affirmations are meant to benefit those who suffer academically due to identity
threat, yet we've so far been ignoring this crucial variable. I now incorporate this
information into the analysis. I'm not sure that there's a clear prediction with respect to
whether we would expect temporal breadth to most help stigmatized versus
non-stigmatized students. As such, this is pretty exploratory.
Figure 5: Coefficient estimates. Points represent the mean of the posterior distribution. Inner
bars represent 50% uncertainty intervals. Outer bars represent 95% uncertainty intervals.
Estimate for the intercept is excluded
There are 16 coefficients plotted above, but not all of them are of substantive interest. The
first five (from the top) are main effects and are not especially interesting for this analysis.
The coefficient plot highlights two effects of particular interest. First, the
quarter-condition-temporal breadth interaction (5 from bottom) indicates that temporal
breadth modulates longitudinal trajectory for non-stigmatized students in the treatment
condition. Second, the quarter-condition-breadth-stigmatized interaction (bottom)
indicates that temporal breadth does not seem to modulate longitudinal
trajectory for stigmatized students in the treatment group. Indeed, if it does at all, it's likely
in the opposite direction (more breadth leads to worse performance over time).
Figure 6: Model fit to the data. Line represents the mean of the posterior distribution. Bands
represent the 95% uncertainty interval. Plot is faceted by the approximate mean and +/-1
standard deviation of temporal breadth, along with group (stigmatized and not stigmatized)
More specifically, the mean of the posterior distribution estimating this interaction
between temporal breadth and time in the non-stigmatized group is 0.15, [0, 0.3]. The
probability that this effect is greater than zero is 0.98 (the complementary probability is 0.02).
Concretely, after 8 quarters, for a non-stigmatized student in the treatment group whose
writing was 1 standard deviation below the mean in terms of temporal breadth, the model
predicts that their GPA would be 2.64 [2.5, 2.78]. For the equivalent student whose writing
is 1 standard deviation above the mean, the model predicts that their GPA would be 2.89
[2.68, 3.11]. The probability that a non-stigmatized student in the treatment whose
language is 1 standard deviation above the mean in terms of temporal breadth will have a
higher GPA after 8 quarters than the equivalent student who is 1 standard deviation below
the mean is approximately 0.97.
College Attendance
We can also perform a similar analysis with respect to college attendance. There is, of
course, less data in such an analysis, but we can explore any support for the idea that
students who write with temporal breadth are more likely to attend college. Figure 7 plots
the coefficient estimates from this model. They suggest a story that is similar to the one we
saw in the model of middle-school grades: students in the treatment condition who write
with breadth appear to go on to greater achievement.
Figure 7: Coefficient estimates. Points represent the mean of the posterior distribution. Inner
bars represent 50% uncertainty intervals. Outer bars represent 95% uncertainty intervals.
Coefficient estimate values are in logit units
Figure 8: Model fit to the data representing the probability of attending college (y axis) as a
function of pre-performance index (x axis), condition (colored lines) and mean +/- 1 standard
deviation of temporal breadth (facet).
Figure 8 plots this model in a more intuitive way. As seen in the figure, the effect here appears a
bit larger than in the previous analyses. For instance, a student in the treatment condition
who has the average pre-performance score and writes with temporal breadth of 1
standard deviation below the mean is estimated to have a probability of attending college
of 0.59, [0.5, 0.68]. This estimate is close to that of a comparable student in the control
condition (probability = 0.66, [0.48, 0.68]). However, a student in the treatment condition
with the average pre-performance score who writes with temporal breadth of 1 standard
deviation above the mean fares considerably better: they are estimated to have a
probability of attending college of 0.81, [0.66, 0.92]. A similar control student is actually
estimated to have a somewhat lower probability of attending college (with respect
to a control student writing 1 SD below the mean). The probability of attending college for
this student is estimated at 0.55, [0.45, 0.64].
A full 99% of the posterior distribution is consistent with an effect of this direction.
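Because the coefficients in Figure 7 are in logit units while the estimates above are probabilities, it may help to see the conversion between the two scales. This is a minimal sketch of the standard logit and inverse-logit transformations, not the code used to produce the figures. The two probabilities passed in are the treatment-group estimates quoted above.

```python
import math

def logit(p):
    """Convert a probability to log-odds (logit units)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Convert log-odds (logit units) back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

low_breadth_logit = logit(0.59)    # about 0.36
high_breadth_logit = logit(0.81)   # about 1.45
gap_in_logits = high_breadth_logit - low_breadth_logit
```

A logit of 0 corresponds to a probability of 0.5, and equal steps on the logit scale correspond to smaller and smaller probability changes as the probability approaches 0 or 1.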