
Modeling Spearman's hypothesis using MGCFA: The Woodcock-Johnson data
Abstract
There has been a good deal of research on Spearman's hypothesis with regard to Black-White
differences in tests of cognitive ability. Most of the research has relied on Jensen's Method of
Correlated Vectors (MCV; Jensen, 1998). This method, however, is incapable of rigorously testing
competing models that do not involve group differences in g (Dolan, 2000; Dolan & Hamaker,
2001). The purpose of the present paper is to test Spearman's hypothesis using Multi-Group
Confirmatory Factor Analysis (MGCFA) applied to three waves of Woodcock-Johnson standardization data.
First, using Jensen's MCV, it is found, for all three standardization waves, that there is a Jensen
effect, i.e., a positive correlation between the subtests' g-loadings and the Black-White differences.
Second, while measurement invariance (MI) was generally found to hold, results from MGCFA
using either the higher-order factor or the bi-factor approach were unclear: although the
present data may indicate that the Spearman model fits better than the contra hypothesis, the data
were less than ideal.
Keywords: Spearman's hypothesis; MGCFA; IQ; Woodcock-Johnson
1. Introduction
Differences in cognitive abilities between U.S. self-identified racial/ethnic (SIRE) groups, e.g.,
Blacks, Whites, and Hispanics, are beyond dispute (Jensen, 1998; Rushton & Jensen, 2010). Jensen
(1998) proposed that the magnitude of the racial differences in IQ, at least between Black and
White Americans, as well as differences in cognitive-related socio-economic outcomes are a
function of the g-loadings (i.e., the correlation between the tests or outcomes and the general factor
of intelligence) of the respective cognitive tests and outcomes, a situation which makes the g factor
an essential element in the study of socio-economic inequalities. More specifically, Jensen (1998)
proposed that SIRE group differences on cognitive tests are largely due to latent differences in
general mental ability. This is known as Spearman's hypothesis (SH), which exists in two forms:
the strong and the weak form, the latter of which was endorsed by Jensen (1998). The strong form
asserts that the differences are solely due to g factor differences, while the weak form asserts that the
differences are mainly due to differences in g. The alternative contra hypothesis states that group
differences reside entirely or mainly in the tests' group (broad) factors and/or test specificity, and
that g differences contribute little or nothing to the overall ones.
Regarding tests of SH, many studies have employed Jensen’s (1985, 1998) Method of Correlated
Vectors (MCV) to investigate the nature of group differences (e.g., te Nijenhuis et al., 2007;
Rushton & Jensen, 2010, pp. 15-16; Dragt, 2010). This method consists of correlating the subtests'
g-loadings with the variable of interest, after correcting for the subtests' reliability. The results from
these studies are consistent with Spearman’s hypothesis in that they are as one would expect were
SH correct (Rushton & Jensen, 2010).
As Dolan (2000) and Dolan & Hamaker (2001) noted, however, the MCV fails to explicitly test
the contrary hypothesis. While it can demonstrate that the pattern of group differences on subtests is
consistent with SH, it cannot rule out non-SH models and thus establish SH as a fact. Indeed, with
the MCV, SH is either confirmed or rejected based on the strength of the Jensen effect, i.e., the
correlation between the magnitude of subtest differences and g-loadings. Yet a correlation is a mere
descriptive statistic. As Carroll (1997, pp. 131-132) suggested, a hypothesis can only be tested
by evaluating and comparing the ability of competing models (e.g., SH versus non-SH models)
to reproduce the data, a comparison which the MCV does not allow.
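The kind of model comparison Carroll calls for can be illustrated with information criteria. The sketch below uses made-up chi-square values and parameter counts (not figures from this paper), and computes AIC as χ² + 2q, one common SEM convention:

```python
def aic(chisq, n_params):
    """AIC under one common SEM convention: chi-square plus 2 x free parameters."""
    return chisq + 2 * n_params

# Hypothetical competing models fitted to the same data (illustrative values only).
models = {
    "weak SH (g + Gc free)": aic(chisq=142.3, n_params=40),
    "contra-SH (Gc only)":   aic(chisq=150.9, n_params=38),
}
best = min(models, key=models.get)
print(best)  # the model with the smallest AIC is preferred
```

Unlike a vector correlation, this kind of comparison directly pits an SH model against a non-SH model on the same data.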
To satisfy this requirement, Dolan (2000) proposed the use of Multi-Group Confirmatory Factor
Analysis (MGCFA). Applying MGCFA to Jensen & Reynolds’ (1982) data, Dolan (2000)
demonstrated that some non-SH models fitted the data just as well as SH models, despite the MCV
finding a strong Jensen effect. Based on his results, Dolan (2000, Table 4) concluded that it was
impossible to tell which hypothesis is to be preferred. While Dolan (2000), Dolan & Hamaker
(2001), and Dolan et al. (2004) evaluated SH by way of higher-order factor (HOF) models, Frisby &
Beaujean (2015) argued that a bi-factor (BF) model would have been more appropriate. These latter
authors found support for SH, in the case of Black and White Americans, using a BF model,
although they did not compare the results of the BF approach with ones generated using the HOF
approach.
In this paper, Spearman's hypothesis is tested by means of MGCFA, using both the higher-order
factor and bi-factor approaches. A major difference between HOF and BF models is that HOF
models can accommodate factors measured by only two indicator variables with no cross-loadings,
a configuration present in the Woodcock-Johnson data, while BF models are not identified under
such a specification. The difference between these methods will be discussed further below.
2. Method
2.1 Data
The Woodcock-Johnson cognitive test battery was described as an operational representation of
Horn's Gf-Gc theory (Horn, 1991), measuring seven broad cognitive abilities: comprehension-knowledge (Gc), long-term retrieval (Glr), visual processing (Gv), auditory processing (Ga), fluid
reasoning (Gf), processing speed (Gs), and short-term memory (Gsm).
The Woodcock-Johnson test has been used in Murray's (2007) paper on the Black-White
cognitive difference over time. The WJ standardization consists of three waves. The initial version
of the test, WJ1 (WJ-I), was standardized using a sample of 4732 subjects aged 2 to 84, tested over
the period from April 1976 to May 1977. WJ2 (or WJ-R) was standardized with a sample of 6359
subjects aged 2 to 95, tested from September 1986 to August 1988. WJ3 (WJ-III) was standardized
with a sample of 8818 subjects aged 2 to 98, tested from September 1996 to August 1999. Participants
comprise four groups: Whites, Blacks, Hispanics, and Asians. For this paper, the between-group
comparison is analyzed in a within-wave fashion and only the first two groups are considered. The
entire analysis is done with R; the syntax is supplied in Supplementary file 1. The subtests
available in this combined data set are: Picture Vocabulary (PF), Spatial Relations (SR), Memory
for Sentences (MS), Visual Auditory Learning (VAL), Sound Blending (SB), Verbal
Comprehension (VCm), Visual Matching (VM), Antonyms-Synonyms (ASn), Analysis-Synthesis
(ASt), Numbers Reversed (NR), Concept Formation (CF), Analogies (A), Picture Recognition (PR),
Memory for Words (MW), Visual Closure (VCl), Cross-out (C), General Information (GI),
Retrieval Fluency (RF), Auditory Attention (AA), Decision Speed (DS).
The WJ-I data involves 3764 subjects (3328 Whites with a mean age of 13.15 and 436 Blacks
with a mean age of 14.23) with 11 subtests: PF, SR, MS, VAL, SB, VM, ASn, ASt, NR, CF, A.[1]
The WJ-II data involves 4379 subjects (3573 Whites and 806 Blacks) with 14 subtests: PF, MS,
VAL, SB, VM, ASn, ASt, NR, CF, PR, MN, IW, VCl, C. The WJ-III data involves 3018 subjects
(2592 Whites with a mean age of 20.86 and 426 Blacks with a mean age of 15.70) with 14 subtests:
SR, VAL, SB, VCm, VM, ASt, NR, CF, PR, MW, GI, RF, AA, DS.
[1] The Sound Blending test has been removed from analyses pertaining to Woodcock-Johnson I in the current
paper, as it violates measurement invariance. See below for further details.
2.2. Statistical Analyses
Before testing SH models against non-SH models, one needs to ensure that measurement
invariance (or equivalence) holds across groups. If it does not, the subtest scores do not have the
same meaning in each group, which obscures the interpretation of group differences. In principle,
MGCFA proceeds by adding constraints to an initial, free model, so that the model parameters are
constrained to be the same across groups. The following steps are taken: first, constrain the factor
structure (configural invariance); second, add a constraint on the factor loadings (weak invariance);
third, add a constraint on the intercepts (strong invariance); fourth, add a constraint on the residual
variances (strict invariance). Since the last equality constraint is sometimes dropped because
measurement error is confounded with specific variance (Dolan & Hamaker, 2001, p. 15), strict
invariance is not treated here as a necessary condition for establishing measurement invariance. If
the model fit shows a meaningful decrement at any of these steps, invariance is rejected.
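The stepwise logic can be sketched as follows. The CFI values below are hypothetical (not the actual Woodcock-Johnson results), and the ΔCFI cutoff follows Cheung & Rensvold (2002), discussed later in this section:

```python
# Hypothetical CFI values for the four nested invariance models,
# from least to most constrained (illustrative only).
fits = [
    ("configural", 0.968),  # same factor structure in both groups
    ("weak",       0.966),  # + equal factor loadings
    ("strong",     0.963),  # + equal intercepts
    ("strict",     0.949),  # + equal residual variances
]

def invariance_ladder(fits, delta_cfi_cutoff=-0.010):
    """Walk the constraint ladder; flag any step whose CFI drop
    exceeds the cutoff (Cheung & Rensvold, 2002)."""
    verdicts = []
    for (_, prev_cfi), (name, cfi) in zip(fits, fits[1:]):
        delta = round(cfi - prev_cfi, 3)
        verdicts.append((name, delta, delta >= delta_cfi_cutoff))
    return verdicts

for step, delta, holds in invariance_ladder(fits):
    print(f"{step:10s} dCFI={delta:+.3f} invariance {'holds' if holds else 'violated'}")
```

With these made-up values, the ladder would hold through strong invariance and fail only at the strict step, mirroring the pattern reported for WJ-II below.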
With regard to testing SH models using the higher-order factor (HOF) approach, one first builds
a baseline model in which all loadings as well as the residuals of the first-order latent
factors are constrained to be equal across groups, while the intercepts, as well as the mean and
variance of the second-order latent factor, are left free. The strong and weak SH models are
nested under this baseline model. In the strong SH model, the added constraints are on the intercepts
of the subtests (i.e., the subtest means) and of the first-order latent factors (i.e., the latent means);
since the means of all of the first-order factors are equal across groups, the difference in subtest
means is entirely explained by the second-order factor. In the weak SH model, the intercepts of the
subtests are still equal across groups, but the means and variances of some of the first-order factors
(as well as the mean and variance of the second-order factor) are not; in this case, the difference in
subtest means is due both to the difference in the second-order factor and to differences in some of
the first-order factors. Finally, the contra-SH model, referred to as the common (correlated) factor
model in Dolan (2000), requires the subtests' intercepts and the latent factors' covariances to be
constrained to be equal across groups while allowing unconstrained means and variances for some
of the latent factors. In this situation, the groups differ only with respect to some of these correlated
first-order latent factors.
As per Frisby & Beaujean's (2015) recommendation, a bi-factor (BF) approach is also employed
to test the relevant hypotheses. BF models may require additional restrictions. Unlike HOF, the BF
model cannot reliably estimate latent factors having only two indicators unless the
loadings of these two indicators are constrained to be equal to each other (Kenny, March 18, 2012;
Beaujean, 2014, p. 150). Otherwise, identification problems occur.
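The identification problem can be seen by counting parameters. For a standalone factor with two indicators (factor variance fixed to 1), the observed covariance matrix supplies three unique moments, while the model needs four free parameters; equating the two loadings removes one parameter and restores identification. A small counting sketch, illustrative only:

```python
def df_single_factor(n_indicators, equal_loadings=False):
    """Degrees of freedom of a one-factor model with the factor variance
    fixed to 1: unique (co)variance moments minus free parameters."""
    moments = n_indicators * (n_indicators + 1) // 2  # unique elements of S
    loadings = 1 if equal_loadings else n_indicators
    residuals = n_indicators                           # one residual variance each
    return moments - (loadings + residuals)

print(df_single_factor(2))                       # 3 - 4 = -1: under-identified
print(df_single_factor(2, equal_loadings=True))  # 3 - 3 =  0: just identified
print(df_single_factor(3))                       # 6 - 6 =  0: just identified
```

In a full BF or HOF model the factor borrows information from the rest of the system, which is why the HOF specification tolerates two-indicator factors more readily than the BF one.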
Because the purpose is to identify latent constructs underlying the measured variables, Fabrigar
et al. (1999, pp. 275-276) recommend using Exploratory Factor Analysis (EFA) rather than
Principal Component Analysis (PCA). EFA is used here to display the pattern of the factor
loadings, but it can also be used to determine how many factors should be retained. For instance, a
solution showing a residual variance near zero or negative (Bollen, 1989, p. 282; Jöreskog,
1999), or a pattern of loadings with no clear theoretical interpretation, should be rejected. As
Costello & Osborne (2005, p. 3) noted, a solution showing the cleanest factor structure, i.e., no low
loadings, no or few cross-loadings, and no factors with fewer than three indicators, has the best fit
to the data. Common methods for deciding how many factors to extract are Kaiser's
eigenvalue-greater-than-one rule, the scree plot, parallel analysis, and the minimum average partial.
Ledesma & Valero-Mora (2007) recommend parallel analysis. In this paper, EFA is performed
using promax as the oblique rotation. The best model is chosen based on EFA: if the
Woodcock-Johnson's theory-based 7-factor model doesn't fit the data well, it is not retained.
Model selection among CFA models in the HOF approach is done by examining the
baseline model to which the intercepts are added (in the BF approach, the baseline already
incorporates the intercepts). To determine the best-fitting model, the intercepts of the factors
displaying weak score differences between groups are constrained, and the best model is selected
based on the remaining factors.
To assess invariance and to choose among latent models, the following model fit indices are
used: Chi-Square, Comparative Fit Index (CFI), Root Mean Square Error of Approximation
(RMSEA), McDonald's noncentrality index (Mc), Akaike's information criterion (AIC),
Bayesian information criterion (BIC), and the expected cross-validation index (ECVI). Based on
their simulation study, Hu & Bentler (1999) recommend the following cutoff values for judging
adequate model fit: CFI=.95, Mc=.90, and RMSEA=.06. CFI estimates the discrepancy between
the proposed model and the null model; larger values indicate better fit. RMSEA estimates the
discrepancy related to the approximation, i.e., the amount of unexplained (residual) variance, or the
lack of fit compared to the saturated model; smaller values indicate better fit. AIC and BIC are both
comparative measures of fit used in the comparison of two or more models; they evaluate the
difference between observed and expected covariances, and smaller values indicate better fit.
ECVI, which is similar to AIC, measures the difference between the fitted covariance matrix in the
analyzed sample and the expected covariance matrix in another sample of the same size (Byrne,
2013, p. 82); it is used for comparing models, hence the absence of threshold cutoff values for an
acceptable model, and smaller values indicate better fit. Meade et al. (2008) compared CFI, Mc,
RMSEA, and SRMR in the context of measurement invariance testing and concluded that CFI and
Mc are the most appropriate indices for detecting non-invariance with accuracy, while SRMR is the
most sensitive to sample size.
regarding recommended criteria for determining non-invariance in MGCFA studies, Cheung &
Rensvold (2002) argued that a decrease in CFI larger than .010 and/or a decrease in Mc larger than
.020 should indicate that measurement invariance is violated. Chen's (2007) simulations lead to the
conclusion that, for testing factor loading, intercept, or residual invariance, a change of ≤ -.005 in
CFI supplemented by a change of ≥ .010 in RMSEA in the case of small and unequal sample sizes
(or a change of ≤ -.010 in CFI supplemented by a change of ≥ .015 in RMSEA in the case of large
and equal sample sizes) would indicate non-invariance.
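As an illustration, the two sets of criteria can be combined into a simple decision rule. This is a sketch, not a published procedure; fit-index changes are taken as constrained-model value minus unconstrained-model value, with the cutoffs cited above:

```python
def non_invariant(delta_cfi, delta_rmsea=None, delta_mc=None,
                  large_equal_samples=False):
    """Flag non-invariance using Cheung & Rensvold (2002) and, when an
    RMSEA change is available, Chen (2007). Deltas are constrained minus
    unconstrained model values."""
    # Cheung & Rensvold: CFI drop larger than .010, or Mc drop larger than .020.
    if delta_cfi <= -0.010 or (delta_mc is not None and delta_mc <= -0.020):
        return True
    # Chen: pair a CFI drop with an RMSEA increase; stricter cutoffs
    # apply to small/unequal samples.
    if delta_rmsea is not None:
        cfi_cut, rmsea_cut = (-0.010, 0.015) if large_equal_samples else (-0.005, 0.010)
        if delta_cfi <= cfi_cut and delta_rmsea >= rmsea_cut:
            return True
    return False

print(non_invariant(-0.002))                     # small CFI drop: invariance holds
print(non_invariant(-0.006, delta_rmsea=0.012))  # Chen criterion triggered
```

The first call prints False and the second True, matching the verbal rules above.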
All analyses were conducted using the R statistical software, using the psych (Revelle, 2012) and
lavaan (Rosseel, 2012) packages. Full results are displayed in Supplementary file 2.
3. Results
3.1. Method of Correlated Vectors
The method requires one to estimate the factor loadings of the WJ subtests for each group, to
determine the standardized mean group difference and g-loading (i.e., the loading of the subtest on
the first factor) for each subtest, to correct both of them for subtest reliability, and finally to
correlate the g-loadings with the group differences.
In the first step, the original variables are corrected for age and gender effects. In the second
step, factor analysis (without rotation) is performed; it is used to determine the g-loadings of the
subtests. Principal axis factoring was the factoring method of choice.[2] When selecting the number
of factors to extract for an unrotated solution, Kaiser's eigenvalue-greater-than-one rule and a scree
plot are used.
Correlations between g-loadings and Black-White differences are computed using the White
loadings, the Black loadings, and the averaged loadings.[3]
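The procedure just described can be sketched as follows. All numbers are made up for illustration (they are not the WJ estimates); the averaged loading uses the Nyborg & Jensen (2001) formula cited in footnote 3, and the disattenuation divides each vector element by the square root of the subtest's reliability:

```python
import math

# Hypothetical values for five subtests (not the actual WJ estimates).
white_g = [0.75, 0.68, 0.60, 0.55, 0.48]   # g-loadings, White sample
black_g = [0.72, 0.65, 0.63, 0.50, 0.46]   # g-loadings, Black sample
d       = [0.95, 0.88, 0.74, 0.66, 0.52]   # standardized B-W differences
rel     = [0.90, 0.85, 0.88, 0.80, 0.78]   # subtest reliabilities

# Average loadings within groups, not in the combined sample
# (Nyborg & Jensen, 2001): sqrt((w^2 + b^2) / 2).
avg_g = [math.sqrt((w**2 + b**2) / 2) for w, b in zip(white_g, black_g)]

# Correct both vectors for subtest unreliability.
corr_g = [g / math.sqrt(r) for g, r in zip(avg_g, rel)]
corr_d = [x / math.sqrt(r) for x, r in zip(d, rel)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

print(round(pearson(corr_g, corr_d), 3))  # the MCV correlation after correction
```

A positive correlation of this kind is what the paper calls a Jensen effect; the next paragraphs report the actual values for the three waves.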
Tables 1, 2 and 3 display the g-loadings for the White group, the Black group, and the average of
the two. Mean subtest differences after correction for age and sex effects (but not for unreliability)
are provided in the same tables, along with the subtests' reliabilities, which were taken from
Woodcock (1978, 1990) and Shrank et al. (2001).
For WJ1, the correlation between the subtests' g-loadings and the group differences is very high
using the White group's loadings (r=.778), the Black group's loadings (r=.864) and the average
loadings (r=.836) before correction for reliability. After correction, the correlations are r=.625,
r=.807 and r=.732, respectively.[4] For WJ2, the correlation is moderate using the White group's
loadings (r=.520), the Black group's loadings (r=.564) and the average loadings (r=.558) before
correction for reliability. After correction, the correlations are r=.411, r=.333 and r=.402,
respectively. For WJ3, the correlation is very high using the White group's loadings (r=.766), the
Black group's loadings (r=.743) and the average loadings (r=.761) before correction for reliability.
After correction, the correlations are r=.759, r=.735 and r=.759, respectively. Overall, these results
are consistent with those of earlier studies using Jensen's MCV.
3.2.0. Testing Assumptions
Prior to performing latent variable regressions, such as CFA/MGCFA or SEM, one needs to
ensure that the data does not violate the normality assumption (Kline, 2011, p. 74). Several indices
are used : univariate skewness (which relates to the asymmetrical distribution around the mean),
univariate kurtosis (which relates to the peakedness of a distribution) and Mardia’s multivariate
skew and kurtosis. A normal distribution has a skewness less than 2 and a kurtosis less than 7. The
entire sample, Black sample and White sample of all three waves show skewness and kurtosis
values substantially below the recommended criteria. But at the same time, Mardia’s tests of
multivariate skew and kurtosis were substantially larger than the recommended cutoff values of 2
for skewness and 7 for kurtosis5. Additionally, the Q-Q plots of the multivariate distribution also
showed some departure from multivariate normal distribution at both the lowest and highest values
in WJ1 and WJ2 samples. This discrepancy however seems not to be large.
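The univariate screening checks can be sketched with simulated scores (the data below are randomly generated, not the WJ samples; skewness and kurtosis are computed from standardized moments, with the < 2 and < 7 cutoffs cited above applied to their absolute values):

```python
import math, random

def skew_kurtosis(xs):
    """Return sample skewness (m3 / m2^1.5) and excess kurtosis (m4 / m2^2 - 3)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

random.seed(1)
scores = [random.gauss(100, 15) for _ in range(1000)]  # roughly normal IQ-like scores
skew, kurt = skew_kurtosis(scores)
# Screening rule used in the text: |skewness| < 2 and |kurtosis| < 7.
print(abs(skew) < 2 and abs(kurt) < 7)  # prints True for this near-normal sample
```

Mardia's multivariate coefficients generalize these moments to the full subtest battery, which is why they can flag departures that every univariate check misses.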
3.2.1. MGCFA: Woodcock-Johnson Wave 1
EFAs suggest that a 4-factor model, which also appears to have the cleanest factor structure,
clearly makes the most interpretive sense. The 3-factor model has many variables with large
loadings in the Black group but loadings close to zero in the White group, and vice versa. In the
5-factor model, there are issues with the presence of cross-loadings. Parallel analysis also indicates
that four factors should be retained. Thus, a 4-factor model is chosen for subsequent analyses (Table
1). There are not enough variables to consider a 7-factor model, as per WJ's theory-based model.
Using all 11 subtests, measurement invariance did not hold (at the intercept level).
Modification indices reveal that Sound Blending is the cause of the misfit. The analysis was re-run
after removing the Sound Blending subtest. EFA still found that a 4-factor model fits the data best,
while 3- and 5-factor models did not make much sense.
[2] In WJ1, the ML method produced an impossible solution in the Black sample.
[3] According to Nyborg & Jensen (2001), "It would be incorrect to use the loadings in the combined samples,
because these would also reflect the between-groups variance in addition to the within-groups variance". These
authors suggested the use of the following formula: sqrt((white_loading^2 + black_loading^2)/2), and that is the
method employed in the current paper.
[4] The respective uncorrected correlations when Sound Blending is included are r=.308, r=.588, and r=.468. The
respective corrected correlations are r=.133, r=.473, and r=.310.
[5] Statistics for multivariate skew are: b1p=2.02 for the entire WJ1 sample, b1p=5.13 for the entire WJ2 sample,
and b1p=3.08 for the entire WJ3 sample. Statistics for multivariate kurtosis are: b2p=137.64 for the entire WJ1
sample, b2p=259.38 for the entire WJ2 sample, and b2p=246.13 for the entire WJ3 sample.
Table 6, which displays MGCFA outcomes without the Sound Blending subtest, reveals that
measurement invariance now holds at all steps. There is no decrement in fit.
With regard to testing SH using the HOF approach to MGCFA, one first needs to decide which
latent factor means have to be constrained, by examining the baseline model (in which all the latent
means as well as the second-order factor variances are free across groups but the intercepts are equal
across groups). It is observed that the Gf, Gsm and Gs factor mean differences are close to zero;
thus, they are constrained to be equal across groups. Only g and Gc display sizeable group
differences. Thus, the g+Gc model is chosen as the best-fitting weak SH model. Looking at
the model fit, shown in Table 6, the baseline model fits no better than the weak SH model,
which means that the weak SH model is a good approximation to the data. On the other hand, the
strong SH, weak SH and contra-SH models fit the data equally well.[6]
Next, a BF model is run, and the baseline model (in the BF baseline, the intercepts are modeled)
shows that all factor means exhibit very minor differences; only the Gc factor mean difference is
statistically significant. For this reason, the best-fitting model needed to be identified through
exploratory analysis.[7] It was found that the g+Gc model was the best among the weak SH
models and that the Gc model was the best among the contra-SH models. With this determined, the
model fits could be compared. As displayed in Table 6, the baseline, strong SH, weak SH and
contra-SH models fit equally well. Overall, based on both the HOF and BF results for WJ-1, while
neither strong nor weak SH can be rejected, neither can be confirmed.
3.2.2. MGCFA: Woodcock-Johnson Wave 2
EFAs suggest that the 7-factor model fits the data best. Other models do not display a clean
structure; for instance, Visual Closure has multiple and very small loadings on several factors
except in the 7-factor model. Parallel analysis indicates that seven factors should be retained for
the Black sample and six factors for the White sample. Thus, a 7-factor model, shown in Table 7,
is chosen for subsequent analyses.[8]
As can be seen in Table 8, when going from configural to scalar invariance, there is no
decrement in the fit indices. Only with respect to strict invariance, given the criteria of Cheung &
Rensvold (2002), is measurement invariance violated. But as mentioned before, strict invariance is
not a necessary assumption.
With regard to SH using the HOF approach to MGCFA, the baseline model is examined in order
to decide which latent factor means have to be constrained. It is observed that the Gf, Gsm and Gs
factor mean differences are close to zero; thus, a weak SH model is run that includes Gc, Glr, Ga,
Gv and g. There is no difference between the baseline and weak SH models in terms of fit. The
strong SH model fits worse than the weak SH model, although the difference is very small. On the
other hand, the contra-SH model fits better than the weak SH model.
[6] Refer to the notes under Table 6.
[7] The best-fitting model can be chosen based on differences in Chi-Square values or on a Chi-Square difference
test (a model with a lower Chi-Square is preferred over a less complex model only if the difference is significant).
[8] To avoid cross-loadings, some of the variables which load on several factors are forced to load onto a single
factor. This data set is not ideal for the present analysis (refer to the Discussion section).
Next, a BF model is run, and the baseline model (in which the intercepts are modeled) shows that
all factor means display mild differences, with the Gc, Gs and Ga factor means displaying only very
small differences. Thus, the weak SH model that seems to fit the data best includes Gf, Gsm, Glr,
Gv and g. As seen in Table 8, this weak SH model fits the data well, and slightly better than
the strong SH model; unlike what is observed for the HOF approach, the contra-SH model fits
worse than the weak SH model. If the BF approach is more reliable for testing Spearman's
hypothesis, more weight should be given to results obtained from this approach, in which case the
WJ-2 results could be seen as supporting weak SH slightly more than contra-SH.
3.2.3. MGCFA: Woodcock-Johnson Wave 3
Parallel analysis indicates that four factors should be retained for the Black sample while five
factors should be retained for the White sample. EFAs suggest that a 4-factor model is by far the
cleanest factor structure. In the 3-, 5-, 6- and 7-factor models, the patterns are so different between
groups that they are not interpretable. As the WJ theory-based 7-factor model is not
interpretable, it is not retained. Thus, a 4-factor model, shown in Table 9, is chosen for subsequent
analyses.
As seen in Table 10, the drop in fit, going from configural to strict invariance, is minimal and so
it can be concluded that measurement invariance holds.
As far as SH is concerned, the HOF approach shows again that there is no difference between the
baseline and strong SH model. In order to select which weak SH model to run, the intercept is
added to the baseline model, and the outputted model shows that Gs factor’s mean difference is
relatively weak. A weak SH model based on g+Gc+Gf+Gsm is fitted. This model fits well
compared to the baseline. Although this weak SH model fits better than the strong SH model, the
difference is very small. The corresponding contra-SH model (Gc+Gf+Gsm) fits a little bit better
than the weak SH model, but the difference is so small that one can say the models are equivalent.
As for the BF method, the baseline model was poorly identified, with some observed variables
having negative residual variances (in particular, one variable, Memory for Words, had an anomalously
large negative variance). As such, no model comparisons for the BF method are reported here.
4. Discussion
As measurement non-invariance makes the interpretation of group differences difficult, testing
for measurement invariance is important. As MCV is incapable of dealing with this issue, MGCFA
was conducted, first to test the assumption of measurement invariance and second to test Spearman's
hypothesis. Although measurement invariance generally holds (in the WJ-I data, only after the
removal of one biased subtest), nothing definitive can be inferred about the
veracity of Spearman's hypothesis. In WJ-I and WJ-III, the strong SH, weak SH and contra-SH
models fit equally well. In the WJ-II data, the BF method shows that the weak SH model fits better
than the contra-SH model, while with the HOF method the opposite was true. As Murray & Johnson
(2013, p. 420) and Frisby & Beaujean (2015, p. 94) explained, the BF approach is best suited for
testing theories related to the g factor. This is because, with the bi-factor approach, the variances of
the first-order latent factors are not confounded (i.e., correlated) with the variance of the general
factor, which allows g to be estimated purely. Hence, in general, results from BF should be given
more weight than results from HOF. On the other hand, the WJ-II data set, in particular, was not
ideal for our purposes, as each factor has only two indicators and the pattern loadings of the 7-factor
model from the EFAs showed some departure from WJ's theory-based 7-factor model.
In this paper, Frisby & Beaujean's (2015) method for identifying comparison weak and contra
SH models has been applied, though it is not clear that it is the best method for evaluating Spearman's
hypothesis. While Dolan et al. (2000, 2001, 2004) used an exhaustive method in which each
Spearman model and each contra-Spearman model is tested directly in order to find out which
model exhibits the best fit,[9] Frisby & Beaujean (2015) examined the baseline model in the BF
method to determine which factors exhibit weak group differences and, based on this examination,
decided which model fits the data best. Although Dolan's method suffers from comparability
issues,[10] Frisby & Beaujean's method does not explore all models in detail.
Finally, the difficulty in distinguishing between models should be pointed out. Very different
models, which one would expect to be distinguishable, often showed similar fits. Possible
explanations are insufficient sample sizes and an insufficient number of subtests (Dolan, 2000;
Dolan & Hamaker, 2001). For instance, Frisby & Beaujean (2015), who were able to designate one
model as the best fitting, used data with a much larger number of subtests and of subtests per
factor. While in the WJ-II data there are enough variables and the 7-factor model has adequate fit,
in WJ-I there are not enough variables to run a 7-factor model and in WJ-III the 7-factor model
doesn't even fit the data. Finding appropriate data for an MGCFA test of SH seems to be a
challenging task. In summary, the conclusion that the Spearman model fits better than the
contra-Spearman model is tentative, as the data may not be appropriate.
5. References
Beaujean, A. A. (2014). Latent variable modeling using R: A step-by-step guide. Routledge.
Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons.
Byrne, B. M. (2013). Structural equation modeling with AMOS: Basic concepts, applications, and
programming. Routledge.
Carroll, J. B. (1997). Theoretical and technical issues in identifying a factor of general intelligence.
Intelligence, genes, and success: Scientists respond to the bell curve, 125-156.
Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance.
Structural equation modeling, 14(3), 464-504.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural equation modeling, 9(2), 233-255.
Costello, A. B., & Osborne, J. W. (2005). Best Practices in Exploratory Factor Analysis: Four
Recommendations for Getting the Most From Your Analysis. Practical Assessment Research &
Evaluation, 10(7), 1-9.
Dolan, C. V. (2000). Investigating Spearman’s hypothesis by means of multi-group confirmatory
factor analysis. Multivariate Behavioral Research, 35, 21–50.
Dolan, C. V., & Hamaker, E. L. (2001). Investigating Black–White differences in psychometric IQ:
Multi-group confirmatory factor analyses of the WISC-R and K-ABC and a critique of the method
of corrected factors. In F. Columbus (Ed.), Advances in Psychological Research, vol. 6 (pp. 31–
59). Huntington: Nova Science.
[9] One problem with this latter model selection approach is that the number of models increases exponentially
with the number of subtests. While the fit values of a number of other models for WJ-I and WJ-III have been
examined (shown in Supplementary File 2), WJ-II was not analyzed, owing to time constraints.
[10] The best model among the weak SH models can be g+Gc, while the best model among the contra-SH models
could have been Gc+Gsm+Gs, yet g+Gc is being compared to the Gc+Gsm+Gs model.
Dolan, C. V., Roorda, W., & Wicherts, J. M. (2004). Two failures of Spearman’s hypothesis: The
GATB in Holland and the JAT in South Africa. Intelligence, 32(2), 155-173.
Dragt, J. (2010). Causes of group differences studied with the method of correlated vectors: A
psychometric meta-analysis of Spearman’s hypothesis.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of
exploratory factor analysis in psychological research. Psychological methods, 4(3), 272.
Frisby, C. L., & Beaujean, A. A. (2015). Testing Spearman’s hypotheses using a bi-factor model
with WAIS-IV/WMS-IV standardization data. Intelligence, 51, 79-97.
Horn, J. L. (1991). Measurement of intellectual capabilities: A review of theory. In K. S. McGrew,
J. K. Werder, & R. W. Woodcock (Eds.), WJ-R technical manual (pp. 197–232). Rolling Meadows,
IL: Riverside.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary
Journal, 6(1), 1-55.
Jensen, A. R. (1985). The nature of the black-white difference on various psychometric tests:
Spearman’s hypothesis. Behavioral and Brain Sciences, 8, 193-219.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger
Publishers/Greenwood.
Jensen, A. R. & Reynolds, C. R. (1982). Race, Social Class and Ability Patterns on the WISC-R.
Personality and Individual Differences, 3, 423-438.
Jöreskog, K. G. (1999). How large can a standardized coefficient be? Unpublished technical
report.
Kenny, D. A. (2012, March 18). Identification. Retrieved from
http://davidakenny.net/cm/identify_formal.htm.
Kline, R. B. (2015). Principles and practice of structural equation modeling. Guilford Publications.
Ledesma, R. D., & Valero-Mora, P. (2007). Determining the number of factors to retain in EFA: An
easy-to-use computer program for carrying out parallel analysis. Practical Assessment, Research &
Evaluation, 12(2), 1-11.
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit
indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592.
Murray, C. (2007). The magnitude and components of change in the black–white IQ difference
from 1920 to 1991: A birth cohort analysis of the Woodcock–Johnson standardizations.
Intelligence, 35(4), 305-318.
Murray, A. L., & Johnson, W. (2013). The limitations of model fit in comparing the bi-factor versus
higher-order models of human cognitive ability structure. Intelligence, 41, 407–422.
Revelle, W. (2012). psych: Procedures for psychological, psychometric, and personality research.
(Version 1.4.8) [Computer Program]. Evanston, IL: Northwestern University.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical
Software, 48, 1–36.
Rushton, J. P., & Jensen, A. R. (2010). Race and IQ: A theory-based review of the research in
Richard Nisbett’s Intelligence and How to Get It. The Open Psychology Journal, 3(1), 9-35.
Schrank, F. A., & McGrew, K. S. (2001). Woodcock-Johnson III: Assessment service bulletin
number 2.
Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis:
An illustration using IQ test performance of minorities. Educational Measurement: Issues and
Practice, 29(3), 39-47.
Woodcock, R. W. (1978). Development and standardization of the Woodcock-Johnson psychoeducational battery. Teaching Resources.
Woodcock, R. W. (1990). Theoretical foundations of the WJ-R measures of cognitive ability.
Journal of Psychoeducational Assessment, 8(3), 231-258.
Table 1. Unrotated factor loadings and reliabilities (WJ-1)
[The White, Black, and average g-loading columns of this table could not be recovered from the
flattened extraction; the recoverable columns follow.]

Subtests                  Black-White gap (adjusted for age/sex)  Reliabilities
Picture Vocabulary        16.543                                  .82
Spatial Relations         10.484                                  .86
Memory for Sentences      8.855                                   .80
Visual Auditory Learning  10.111                                  .95
Visual Matching           7.525                                   .65
Antonyms-Synonyms         16.218                                  .90
Analysis-Synthesis        9.212                                   .84
Numbers Reversed          9.701                                   .82
Concept Formation         10.659                                  .90
Analogies                 13.038                                  .84
Table 2. Unrotated factor loadings and reliabilities (WJ-2)
[The White, Black, and average g-loading columns of this table could not be recovered from the
flattened extraction; the recoverable columns follow.]

Subtests                  Black-White gap (adjusted for age/sex)  Reliabilities
Picture Vocabulary        14.194                                  .86
Memory for Sentences      7.600                                   .90
Visual Auditory Learning  8.142                                   .92
Sound Blending            13.663                                  .87
Visual Matching           4.647                                   .78
Antonyms-Synonyms         12.120                                  .87
Analysis-Synthesis        8.821                                   .90
Numbers Reversed          7.556                                   .87
Concept Formation         9.124                                   .93
Picture Recognition       4.981                                   .82
Memory for Names          6.497                                   .91
Incomplete Words          8.616                                   .82
Visual Closure            3.769                                   .69
Cross-out                 8.307                                   .75
Table 3. Unrotated factor loadings and reliabilities (WJ-3)

Subtests                  White        Black        Average      Black-White gap     Reliabilities
                          g-loadings   g-loadings   g-loadings   (adj. age/sex)
Spatial Relations         .432         .371         .403         7.045               .81
Visual Auditory Learning  .616         .626         .621         8.333               .86
Sound Blending            .515         .506         .511         12.860              .89
Verbal Comprehension      .729         .764         .747         16.258              .92
Visual Matching           .551         .568         .560         3.579               .91
Analysis-Synthesis        .580         .631         .606         9.433               .90
Numbers Reversed          .517         .473         .495         6.744               .87
Concept Formation         .674         .730         .703         10.374              .94
Picture Recognition       .334         .359         .347         2.574               .76
Memory for Words          .506         .453         .480         5.938               .80
General Information       .652         .714         .684         17.582              .89
Retrieval Fluency         .442         .473         .458         1.993               .85
Auditory Attention        .451         .451         .451         5.098               .88
Decision Speed            .515         .475         .495         4.971               .87
Table 4. Fit indices for CFA models

Fit indices        Chi-square  df   p-value  CFI    RMSEA  Mc     AIC      BIC
W-J Wave 1
Black FOF model    72.716      28   .000     .970   .061   .950   34888    35039
Black HOF model    74.231      30   .000     .970   .058   .950   34886    35028
White FOF model    356.591     28   .000     .967   .059   .952   261048   261273
White HOF model    366.747     30   .000     .967   .058   .951   261054   261268
W-J Wave 2
Black FOF model    216.227     56   .000     .965   .060   .905   90633    90929
Black HOF model    286.582     70   .000     .953   .062   .874   90676    90906
White FOF model    533.818     56   .000     .971   .049   .935   399480   399869
White HOF model    796.724     70   .000     .955   .054   .903   399715   400017
W-J Wave 3
Black FOF model    107.993     71   .003     .978   .035   .957   46171    46365
Black HOF model    115.888     73   .001     .974   .037   .951   46175    46361
White FOF model    712.777     71   .000     .935   .059   .883   282591   282872
White HOF model    741.018     73   .000     .933   .059   .879   282615   282885
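The CFI and RMSEA values above can be read against the conventional cutoffs of Hu & Bentler (1999), cited earlier. This is a minimal Python sketch of that screening; the cutoffs are heuristics, not a substitute for the nested-model comparisons the paper relies on, and the example rows are the White HOF models from the three waves.

```python
# Conventional Hu & Bentler (1999) cutoffs: CFI >= .95 and RMSEA <= .06.
def acceptable_fit(cfi, rmsea, cfi_cut=0.95, rmsea_cut=0.06):
    return cfi >= cfi_cut and rmsea <= rmsea_cut

# White HOF rows (CFI, RMSEA) from the three standardization waves:
white_hof = {"WJ-I": (.967, .058), "WJ-II": (.955, .054), "WJ-III": (.933, .059)}
for wave, (cfi, rmsea) in white_hof.items():
    print(wave, "acceptable" if acceptable_fit(cfi, rmsea) else "marginal")
```

On this reading, only the WJ-III White HOF model falls short of the CFI cutoff, consistent with its noticeably lower CFI in the table.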
Table 5. ML estimates of promax rotated factor loadings from a 4-factor model (WJ-1)
[Loading pattern matrix: an asterisk marks each subtest's salient promax loading. Rows (subtests):
Picture Vocabulary, Spatial Relations, Memory for Sentences, Visual Auditory Learning, Visual
Matching, Antonyms-Synonyms, Analysis-Synthesis, Numbers Reversed, Concept Formation,
Analogies. Columns (factors): Gc, Gf, Gsm, Gs. The asterisk placements could not be recovered
from the flattened extraction.]
Table 6. Fit indices for MGCFA models from W-J Wave 1

Fit indices     Chi-square  df   p-value  CFI    RMSEA  Mc     AIC      BIC
FOF model
Configural MI   426.455     56   .000     .969   .059   .952   327917   328628
Metric MI       433.581     63   .000     .969   .056   .952   327910   328577
Scalar MI       436.131     69   .000     .969   .053   .952   327901   328530
Strict MI       461.565     79   .000     .968   .051   .950   327906   328473
HOF model
Baseline        516.673     84   .000     .964   .052   .944   327951   328487
Strong SH       534.215     93   .000     .963   .050   .943   327951   328431
Weak SH*        502.711     91   .000     .966   .049   .947   327923   328416
contra-SH**     507.942     91   .000     .965   .049   .946   327928   328421
BF model
Baseline        510.688     85   .000     .964   .052   .945   327943   328473
Strong SH       526.683     89   .000     .963   .051   .943   327951   328456
Weak SH*        511.451     88   .000     .965   .051   .945   327938   328449
contra-SH**     517.848     89   .000     .964   .051   .945   327942   328447

* This weak-SH model corresponds to the following model: g+Gc factors for both HOF and BF modeling.
** This contra-SH model corresponds to the following model: Gc factor for both HOF and BF modeling.
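The Wave 1 higher-order-factor comparison can be sketched directly from the BIC column of Table 6 (lower BIC indicates a better trade-off of fit and parsimony); the row labels below follow the table's order, and this is an illustration of the selection rule, not a new analysis.

```python
# BIC values for the HOF models of Table 6 (W-J Wave 1); lower is better.
hof_bic = {
    "Baseline": 328487,
    "Strong SH": 328431,
    "Weak SH (g+Gc)": 328416,
    "contra-SH (Gc)": 328421,
}
best = min(hof_bic, key=hof_bic.get)
print(best)  # Weak SH (g+Gc)
```

By this criterion the weak-SH model edges out the contra-SH model, though the margin (5 BIC points) is modest, in line with the abstract's cautious conclusion.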
Table 7. ML estimates of promax rotated factor loadings from a 7-factor model (WJ-2)
[Loading pattern matrix: an asterisk marks each subtest's salient promax loading. Rows (subtests):
Picture Vocabulary, Memory for Sentences, Visual Auditory Learning, Sound Blending, Visual
Matching, Antonyms-Synonyms, Analysis-Synthesis, Numbers Reversed, Concept Formation,
Picture Recognition, Memory for Names, Incomplete Words, Visual Closure, Cross-out. Columns
(factors): Gc, Gf, Gsm, Gs, Glr, Ga, Gv. The asterisk placements could not be recovered from the
flattened extraction.]
Table 8. Fit indices for MGCFA models from W-J Wave 2

Fit indices     Chi-square  df    p-value  CFI    RMSEA  Mc     AIC      BIC
FOF model
Configural MI   766.721     112   .000     .969   .052   .928   530991   532153
Metric MI       794.594     119   .000     .968   .051   .926   531005   532122
Scalar MI       877.793     126   .000     .965   .052   .918   531074   532147
Strict MI       1077.007    140   .000     .956   .055   .898   531245   532228
HOF model
Baseline        1375.340    175   .000     .944   .056   .872   532108   532868
Strong SH       1574.952    187   .000     .935   .058   .853   531649   532332
Weak SH*        1449.531    179   .000     .940   .057   .865   531540   532274
contra-SH**     1224.943    161   .000     .950   .055   .873   531351   532200
BF model
Baseline        1463.136    181   .000     .940   .057   .864   531549   532271
Strong SH       1568.375    187   .000     .935   .058   .854   531643   532326
Weak SH*        1464.896    182   .000     .940   .057   .864   531549   532264
contra-SH**     1649.256    187   .000     .931   .060   .846   531723   532407

* This weak-SH model corresponds to the following model: g+Gc+Glr+Ga+Gv factors for HOF modeling
and g+Gf+Gsm+Glr+Gv factors for BF modeling.
** This contra-SH model corresponds to the following model: Gc+Glr+Ga+Gv factors for HOF modeling
and Gf+Gsm+Glr+Gv factors for BF modeling.
Table 9. ML estimates of promax rotated factor loadings from a 4-factor model (WJ-3)
[Loading pattern matrix: an asterisk marks each subtest's salient promax loading. Rows (subtests):
Spatial Relations, Visual Auditory Learning, Sound Blending, Verbal Comprehension, Visual
Matching, Analysis-Synthesis, Numbers Reversed, Concept Formation, Picture Recognition,
Memory for Words, General Information, Retrieval Fluency, Auditory Attention, Decision Speed.
Columns (factors): Gc, Gf, Gsm, Gs. The asterisk placements could not be recovered from the
flattened extraction.]
Table 10. Fit indices for MGCFA models from W-J Wave 3

Fit indices     Chi-square  df    p-value  CFI    RMSEA  Mc     AIC      BIC
FOF model
Configural MI   830.485     142   .000     .946   .057   .892   357413   358327
Metric MI       854.754     152   .000     .945   .055   .890   357417   358271
Scalar MI       896.904     162   .000     .942   .055   .885   357440   358233
Strict MI       944.825     176   .000     .940   .054   .880   357460   358169
HOF model
Baseline        950.530     177   .000     .939   .054   .880   357463   358167
Strong SH       1051.717    190   .000     .932   .055   .867   357538   358164
Weak SH*        987.387     184   .000     .937   .054   .875   357486   358147
contra-SH**     954.711     181   .000     .939   .053   .880   357459   358139

* This weak-SH model corresponds to the following model: g+Gc+Gf+Gsm factors for HOF modeling.
** This contra-SH model corresponds to the following model: Gc+Gf+Gsm factors for HOF modeling.
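The Mc column in the fit tables is McDonald's noncentrality index, Mc = exp(−½(χ² − df)/(N − 1)). A minimal Python sketch of the formula follows; the sample size N = 4000 is hypothetical, chosen only to illustrate the computation, and the chi-square and df are taken from the HOF baseline row of Table 10.

```python
import math

# McDonald's noncentrality index, the "Mc" column of the fit tables:
# Mc = exp(-0.5 * (chi2 - df) / (N - 1)).
def mcdonald_nci(chi2, df, n):
    # n is the sample size; values near 1 indicate good fit
    return math.exp(-0.5 * (chi2 - df) / (n - 1))

# HOF baseline row of Table 10, with a hypothetical N of 4000:
print(round(mcdonald_nci(950.530, 177, 4000), 3))  # 0.908
```

Because the index depends on N, the same chi-square excess over df yields a higher Mc in larger samples, which is why Mc is reported alongside sample-size-free indices such as CFI and RMSEA.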