Research Note: Differential Item Functioning and Unidimensionality in the Pearson Test of English Academic

Hye Pae, Ph.D.
University of Cincinnati, Ohio, USA
January 2011

1. Introduction

Since the Pearson Test of English Academic (PTE Academic) was designed to assess skill differences among test-takers at all points along the ability continuum, rather than to determine cutoff scores, it is important to examine the extent to which the instrument assesses what it is intended to measure (validity) as well as the extent to which the test is consistent (reliability) in measuring ELLs' academic English skills. A Rasch measurement model approach permits joint scaling of person abilities and item difficulties, mapping the relationship between latent traits and responses to test items (Linacre, 2010a).

Test items should not behave differently for particular subgroups of test-takers. If an item functions differently for certain groups, the item reduces the validity of the measure for that construct, and test fairness is threatened. The Rasch measurement model enables the detection of test items that are biased toward different subgroups according to construct-irrelevant factors, such as ability, gender, and ethnicity, by calculating differential test functioning (DTF) and differential item functioning (DIF) measures.

The assumptions of the Rasch model include unidimensionality (i.e., whether the items form a unitary latent trait) and local independence (i.e., the probability of a person correctly responding to an item does not depend on the other items in the test; Green, 1996; Lee, 1997). Unidimensionality and local independence are assessed using fit statistics, which report the extent to which the pattern of observed responses matches the model's expectations, in terms of both item fit and person fit to the Rasch model.

In order to evaluate the psychometric properties of Form 1 of PTE Academic, two research questions were addressed in this study.

1. How does PTE Academic function in terms of the interaction of person-gender and person-language environment (i.e., whether or not the test-taker has lived in an English-speaking country) with each item?
2. To what extent do the item responses of PTE Academic form a unidimensional construct according to the Rasch measurement model?

The first research question dealt with DTF and DIF so as to explore evidence for statistical between-group (i.e., gender and language environment) differences in the measurement properties at the item level. The second research question concerned the unitary latent trait underlying person ability and item difficulty produced in PTE Academic.

2. Methods

2.1 Participants

One hundred forty ELLs took part in the study. The participants' mean age was 26.45 (SD = 5.82), ranging from 17 to 46 years of age. Females accounted for 53.6% (75 examinees) and males for 46.4% (65 examinees). Sixty-four percent of the participants had lived in English-speaking countries.

2.2 Measures

The dataset used for this study was part of the larger field test administered by Pearson in 2007. Of the 42 forms used for the two field tests, Form 1 of PTE Academic was analyzed in this study. It contained 86 items, including dichotomous (i.e., single-point) items and polytomous (i.e., multi-point scale) items. Person reliability and item reliability for Form 1 were excellent (person r = .96 and item r = .99).

2.3 Procedure

Examinees who wanted to demonstrate their academic English skills for various reasons were recruited from 33 countries worldwide in 2007. Testing took place individually at test centers designated by Pearson.

2.4 Analysis plan

The psychometric properties of the items were analyzed using the Winsteps software (Linacre, 2010b). Since the dataset comprised both dichotomous and polytomous items, the responses were analyzed using the Partial Credit Model (Masters, 1982). Good-fitting and misfitting items were identified using infit and outfit mean-square values, and DTF and DIF were examined to test measurement invariance. The first DTF/DIF analysis examined whether the test items functioned differently by gender, and the second examined whether the items favored examinees from English-speaking settings over those from non-English-speaking countries. Unidimensionality was checked through principal component analysis (PCA), along with infit/outfit statistics and DTF/DIF.
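Because Form 1 mixes single-point and multi-point items, the dichotomous Rasch model and its polytomous extension, the Partial Credit Model named in the analysis plan above, both contribute to scaling the responses. The following minimal Python sketch is illustrative only; it is not the estimation routine used by Winsteps, and its function names and example values are assumptions introduced here rather than material from the study. It simply shows the response probabilities the two models assign for given person and item parameters.

    import numpy as np

    def rasch_prob_correct(theta, b):
        # Dichotomous Rasch model: probability of a correct response for a
        # person of ability theta on an item of difficulty b (both in logits).
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    def pcm_category_probs(theta, step_difficulties):
        # Partial Credit Model (Masters, 1982): probabilities of scoring
        # 0, 1, ..., m on a polytomous item with m step difficulties (logits).
        steps = np.asarray(step_difficulties, dtype=float)
        # Numerators are exp of the cumulative sums of (theta - step_k),
        # with an empty sum (0) for the score-of-zero category.
        cumulative = np.concatenate(([0.0], np.cumsum(theta - steps)))
        numerators = np.exp(cumulative - cumulative.max())  # stabilized
        return numerators / numerators.sum()

    # A person 1 logit above an item's difficulty is expected to succeed
    # about 73% of the time; with a single step difficulty the PCM yields
    # the same two probabilities as the dichotomous model.
    print(rasch_prob_correct(theta=1.0, b=0.0))                     # ~0.731
    print(pcm_category_probs(theta=1.0, step_difficulties=[0.0]))   # ~[0.269, 0.731]
    print(pcm_category_probs(theta=0.5, step_difficulties=[-0.5, 1.0]))

Because the Partial Credit Model reduces to the dichotomous Rasch model when an item has only one step, both item types can be calibrated on a common logit scale, which is what makes the joint person-item analysis reported below possible.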
3. Results

3.1 Differential Test Functioning and Differential Item Functioning

DTF investigates whether the test functions in the same way for different groups of test-takers, through a comparison of how the test items function. Figure 1 displays a scatterplot of the males' and females' item difficulties. Items 20 and 68 appeared to be more difficult for females, while items 39, 78, and 86 were more difficult for males. Although there were two outliers (items 38 and 78), it is hard to assume test bias because the source of their outlying position is unknown. The Student's t-statistics for items 38 and 78 were 4.91 (male measure = 2.36, female measure = 0.45; p < .05) and 3.72 (male measure = 1.07, female measure = -0.35; p < .05), respectively. The misfit-inflated standard errors were small. The correlation coefficient between the males' and females' item difficulties was 0.84, and the disattenuated correlation, with measurement error removed, was 0.98.
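The between-group comparisons above rest on two standard quantities: a DIF contrast t (the difference between an item's difficulty estimated separately in each group, divided by the joint standard error of the two estimates) and a correlation disattenuated for measurement error. The short Python sketch below illustrates how such quantities are conventionally computed; it is a sketch under stated assumptions rather than the Winsteps procedure, and the standard errors and reliabilities in the example are hypothetical placeholders, not values from the study.

    import math

    def dif_contrast_t(b_group1, se_group1, b_group2, se_group2):
        # Approximate t statistic for a between-group difference in one item's
        # difficulty: the contrast divided by the joint standard error of the
        # two independent difficulty estimates.
        contrast = b_group1 - b_group2
        joint_se = math.sqrt(se_group1 ** 2 + se_group2 ** 2)
        return contrast / joint_se

    def disattenuated_correlation(r_observed, reliability_x, reliability_y):
        # Correlation corrected for measurement error in both sets of estimates.
        return r_observed / math.sqrt(reliability_x * reliability_y)

    # Item 38's reported group measures (2.36 vs 0.45 logits) combined with
    # purely hypothetical standard errors of 0.30 logits per group:
    print(dif_contrast_t(2.36, 0.30, 0.45, 0.30))       # ~4.5
    # An observed item-difficulty correlation of .84 rises to roughly .98 if
    # the reliability of each set of item estimates is assumed to be about .86:
    print(disattenuated_correlation(0.84, 0.86, 0.86))   # ~0.977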
Because PTE Academic is expected to affect all ability levels in the same way across subgroups, a uniform DIF analysis was run to investigate the interaction of the person-groups with each item, controlling for all other item and person measures. In order to determine whether the performance of a test item was significantly different for either gender subgroup, the difficulty of each item was estimated separately for the male and female subgroups. Figure 2 shows person-item interactions, with the absolute logit difficulty of each item for the male and female subgroups in the same frame of reference. Although the female and male groups differed in size and mean measure, the DIF computations adjusted for these differences (Linacre, 2010a). The difficulty of each item was remarkably similar for both subgroups, with a few discrepancies, indicating that the test items functioned similarly for different subgroups of examinees. Given no notable DIF within the Rasch model framework, overall, PTE Academic appeared to produce DIF-free person estimates.

Figure 1. Differential Test Functioning by gender
Figure 2. Differential Item Functioning by gender

In order to cross-validate the findings of the gender DTF and DIF, another DTF/DIF analysis was performed for learning-context subgroups (i.e., those who had lived in English-speaking countries vs. those who had not). Surprisingly, items 64, 5, 20, and 39 were more difficult for the examinees from English-speaking countries. Except for these four items, the test items behaved well for the measurement model across the spoken-language environments. Interestingly, item 78, which showed evidence of bias in both the DTF and DIF analyses for the gender subgroups, did not show bias for the language-setting subgroups (see Figure 3). Figure 4 displays a plot of DIF for the language subgroups. Item 21 appeared to be more difficult for individuals from non-English-speaking countries than for their counterparts. Item 21, one of the multiple-choice listening items requiring a single answer, might be an item that poses a potential threat to test fairness. Except for item 21, the data supported the hypothesis that the difficulties of each pair of items in the two analyses were not statistically different.

Figure 3. Differential Test Functioning by language environment
Figure 4. Differential Item Functioning by language environment

3.2 Unidimensionality

A one-dimensional construct, equal item discrimination, and low susceptibility to guessing are the fundamental requirements of Rasch measurement (Linacre, 2010a; Green, 1996; Sick, 2010). Given that unidimensionality is closely related to a factor structure comprising one major latent trait, principal components analysis (PCA) offers a means to evaluate the suitability of data for Rasch analysis (Camilli, Wang & Fesq, 1995). In order to assess measurement dimensionality, a PCA of the Rasch residuals was performed. The amount of variance in the data explained by the measures was 76.1%, with 20.7% explained by persons and 55.4% explained by items. This indicated that a dominant first factor was present. According to Reckase (1979), the variance explained by the first factor should be greater than 20% to be indicative of unidimensionality. The variance explained in this study exceeded this criterion, supporting a unidimensional trait in the data. It also indicated that the items fit the model well, with relatively good item-person targeting and wide dispersion of items and persons. The unexplained variance in the residual components was small (1.5%, 1.3%, 0.8%, and 0.7% for the first, second, third, and fourth contrasts, respectively). Residual factor loadings suggested that the data closely approximated the Rasch model and that there were no meaningful components beyond the primary dimension of measurement. PCA demonstrated, by and large, that there were no extraneous dimensions or sub-dimensions related to sub-skills. Since unidimensionality is achieved when a single underlying pattern forms in the data matrix, it was assumed that local independence was met. Local independence is achieved by controlling for ability so that responses to items are independent of one another (Hambleton, Swaminathan & Rogers, 1991).
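The dimensionality check described above follows a common recipe: remove the Rasch model's expected scores from the observed responses, standardize the residuals, and inspect a PCA of the inter-item residual correlations for structure beyond the Rasch dimension. The following Python sketch illustrates that recipe for dichotomous items only, using simulated data; it is a simplified illustration rather than the Winsteps implementation, and every name and value in it is an assumption introduced for illustration.

    import numpy as np

    def rasch_expected(theta, b):
        # Expected score matrix (persons x items) under the dichotomous Rasch model.
        return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

    def residual_contrast_eigenvalues(responses, theta, b):
        # PCA of standardized Rasch residuals: subtract the model's expected
        # scores, standardize, and return the eigenvalues ("contrasts") of the
        # inter-item correlation matrix of what remains. A first contrast near
        # 1-2 suggests little structure beyond the Rasch dimension.
        expected = rasch_expected(theta, b)
        std_resid = (responses - expected) / np.sqrt(expected * (1.0 - expected))
        item_corr = np.corrcoef(std_resid, rowvar=False)   # item x item
        return np.linalg.eigvalsh(item_corr)[::-1]          # descending order

    # Illustrative run with simulated Rasch-fitting data (not the study data):
    rng = np.random.default_rng(0)
    theta = rng.normal(size=200)                             # person abilities
    b = rng.normal(size=30)                                  # item difficulties
    responses = (rng.random((200, 30)) < rasch_expected(theta, b)).astype(float)
    print(residual_contrast_eigenvalues(responses, theta, b)[:4])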
4. Discussion

This study examined the interaction of person abilities and item difficulties in PTE Academic. No conspicuously aberrant responses were observed. Hence, the hypothesis that the Form 1 PTE Academic data fit the Rasch model was supported. The infit and outfit MNSQs did not depart excessively from the acceptable range, which indicated no unexpected responses and supported the measurement. The small departures suggested that misfitting items may have arisen from chance or the randomness predicted by the Rasch model.

DIF analysis supported a similar probability of endorsing each item category across the gender subgroups as well as the language-context subgroups. The seemingly biased items in the gender DIF did not overlap with those in the language-context DIF. Therefore, the probabilities of correct responses appeared comparable, without significantly favoring one subgroup over another. Given the low degree of DIF, the hypothesis that the difficulties of each pair of items in the two subgroups were not significantly different was supported. This suggests that PTE Academic scores are relatively free of construct-irrelevant variance, which supports the argument for construct validity. A concern about the multiplicities of subgroup assignment (Sick, 2010) suggests that keeping moderately biased items may do no harm, because eliminating items with DIF by gender or language context does not guarantee that the test will be free of item bias for other subgroup classifications.

This study contributes to the field of language testing with empirical evidence for the use and interpretation of PTE Academic scores through a detailed evaluation of response patterns, item fit, dimensionality, and the detection of item bias. Appraising the constructs and test fairness of PTE Academic is important because the test is currently in use for ELLs' academic English measurement worldwide. The appropriateness, meaningfulness, and usefulness of PTE Academic will allow decision-makers in education and business settings to make useful predictions or inferences based on PTE Academic scores.

5. References

Camilli, G., Wang, M. & Fesq, J. (1995). The effects of dimensionality on equating the Law School Admission Test. Journal of Educational Measurement, 32(1), 79-96.
Green, K. E. (1996). Applications of the Rasch model to evaluation of survey data quality. New Directions for Evaluation, 70, 81-92.
Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Lee, J. (1997). State activism in education reform: Applying the Rasch model to measure trends and examine policy coherence. Educational Evaluation and Policy Analysis, 19(1), 29-43.
Linacre, J. M. (2010a). A user's guide to Winsteps. Retrieved December 1, 2010, from http://www.winsteps.com/winman/index.htm?guide.htm
Linacre, J. M. (2010b). Winsteps (Version 3.70.02) [Computer software]. Chicago: Winsteps.com.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Pearson (2009). Pearson Test of English Academic. London: Pearson. http://www.pearsonpte.com/PTEACADEMIC/Pages/home.aspx
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.
Sick, J. (2010). Assumptions and requirements of Rasch measurement. SHIKEN: JALT Testing & Evaluation SIG Newsletter, 14(2), 23-29.