A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND IRTLRDIF V.2

by

MEI LING ONG

B.A., Fu-Jen Catholic University, Taiwan, 1999

(Under the Direction of Seock-Ho Kim)

ABSTRACT

This paper addresses statistical issues of differential item functioning (DIF). The first purpose of this study is to present an empirical comparison of the IRTPRO, BILOG-MG 3, and IRTLRDIF programs in detecting DIF across two samples with three IRT models: the 1PL, 2PL, and 3PL. The second purpose is to examine the effectiveness of IRTPRO in detecting DIF. Finally, the study considers whether DIF exists in the Social Studies section of the GHSGPT for different ethnicities. The GHSGPT predicts 11th-grade students' future performance on the Georgia High School Graduation Test and consists of 79 dichotomously scored items. The results show that several DIF items exist in the GHSGPT; for instance, all three programs consistently indicate that Item 13 is beneficial to Whites. In addition, IRTPRO is effective in detecting DIF because its results parallel those of IRTLRDIF and BILOG-MG 3.

INDEX WORDS: Differential item functioning (DIF), IRTPRO, BILOG-MG 3, IRTLRDIF, IRT, 1PL, 2PL, 3PL

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree MASTER OF ARTS

ATHENS, GEORGIA
2012

© 2012 MEI LING ONG. All Rights Reserved.

Electronic Version Approved: Maureen Grasso, Dean of the Graduate School, The University of Georgia, August 2012

Major Professor: Seock-Ho Kim
Committee: Allan S. Cohen, Stephen E. Cramer

ACKNOWLEDGEMENTS

I sincerely appreciate those who supported and encouraged me throughout this process. I would like to thank my advisor, Dr. Seock-Ho Kim, for his guidance and technical support throughout this study, without which I would not have completed this thesis. In addition, I would like to thank the members of my committee, Dr. Allan S. Cohen and Dr. Stephen E. Cramer, for their comments and helpful suggestions while completing this thesis. Furthermore, I want to thank my friends, Yoonsun, Youn-Jeng, Sunbok, Stephanie Short, Mary Edmond, and many other friends, who offered their opinions on this thesis. Last and most important, I wish to express my deepest appreciation to my parents, my elder brother, my younger sister, and my younger auntie for their support and encouragement. To my lovely husband, Man Kit Lei: thank you for cooking lunch and dinner for me while I was researching, writing, and revising this study. Because of your unending encouragement and full support, I have had the opportunity to obtain my Master's degree. Thank you very much.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Overview
  1.2 Item Bias, Differential Item Functioning (DIF), and Impact
  1.3 The Purpose of the Study
CHAPTER 2 LITERATURE REVIEW
  2.1 Classical Test Theory
  2.2 Modern Test Theory
  2.3 Estimation of Item Parameters
  2.4 Dichotomously Scored Items
  2.5 The DIF Detection Method
  2.6 Current Research
CHAPTER 3 METHOD
  3.1 Research Structure
  3.2 Instrumentation
  3.3 Sample
  3.4 Computer Programs
CHAPTER 4 RESULTS
  4.1 Item Analysis
  4.2 Racial Differential Item Functioning (DIF) Analysis
CHAPTER 5 SUMMARY AND DISCUSSION
  5.1 Summary
  5.2 Discussion
REFERENCES
APPENDICES
  A IRTPRO Input File for DIF Detection for Two Groups with 3PL
  B BILOG-MG 3 Input File for DIF Detection for Two Groups with 3PL
  C IRTLRDIF Input File for DIF Detection for Two Groups with 3PL

LIST OF TABLES

Table 1: The Development of Item Response Models and Computer Programs
Table 2: The 2-by-2 Contingency Table
Table 3: The DIF Detection for Ethnicity
Table 4: Raw Score Summary Statistics for the GHSGPT
Table 5: Item Statistics Based on Classical Test Theory
Table 6: Item Statistics Based on Item Response Theory
Table 7: The Summary of Goodness of Fit Using BILOG-MG 3
Table 8: The Summary of Goodness of Fit Using IRTPRO
Table 9: The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups with 1PL
Table 10: The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Table 11: The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Table 12: The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Table 13: The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL
Table 14: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL
Table 15: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL
Table 16: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL

LIST OF FIGURES

Figure 1: No DIF between two groups
Figure 2: DIF exists in two groups, called uniform DIF
Figure 3: Non-uniform DIF
Figure 4: The research structure
Figure 5: Item 13 between Whites and Blacks
Figure 6: Item 14 between Whites and Blacks
Figure 7: Item 15 between Whites and Blacks
Figure 8: Item 32 between Whites and Blacks
Figure 9: Item 44 between Whites and Blacks
Figure 10: Item 45 between Whites and Blacks
Figure 11: Item 56 between Whites and Blacks
Figure 12: Item 57 between Whites and Blacks
Figure 13: Item 78 between Whites and Blacks
Figure 14: Item 13 between Whites and Hispanics
Figure 15: Item 19 between Whites and Hispanics
Figure 16: Item 51 between Whites and Hispanics
Figure 17: Item 44 between Whites and the Multi-Racial Group

CHAPTER 1
INTRODUCTION

1.1 Overview

A well-constructed test is the best way to evaluate a student's mastery of a particular field. Gronlund (1993) stated that tests not only aid teachers in making various instructional decisions by having a direct influence on students' learning, but they also assist in a number of other ways; for instance, tests can increase students' motivation. The purpose of a test is to obtain an accurate and fair assessment of a student's abilities. Nevertheless, a test cannot properly evaluate skills or knowledge if it is affected by irrelevant factors that could bias the results. These potentially biasing factors include gender, ethnic, and cultural differences.
Without properly accounting for these confounding factors, the results of the test will be an unfair representation of students' abilities (Gronlund, 1993). In other words, if a test is unfair to examinees because of gender, ethnic origin, or cultural bias, then its results are essentially meaningless. For instance, Freedle and Kostin (1988) investigated whether GRE verbal item types functioned differently across races; they found that most of the GRE verbal items advantaged Whites. Thus, test fairness is an important issue with which researchers must be concerned.

There are several ways to measure students' cognitive abilities in standardized testing. Currently, multiple-choice tests are commonly used for measuring students' cognitive abilities (Ling & Lau, 2005). Most schools use standardized scores to evaluate educational quality and student performance (Brescia & Fortune, 1988). If test scores are an important factor in evaluating students' performance, test developers should make tests as fair as possible for examinees of different races, genders, or handicapping conditions (APA, 1988). In order to ensure that all items are as free as possible from irrelevant sources of variance, all items should be reviewed, because the presence of bias may unfairly affect examinees' scores (Hambleton & Swaminathan, 1985). Hence, detecting differential item functioning (DIF) is a critical step in identifying biased items.

1.2 Item Bias, Differential Item Functioning (DIF), and Impact

Research on item bias first appeared in the literature in the 1960s. Angoff (1993) characterized bias as follows: "An item is biased if equally able (or proficient) individuals, from different groups, do not have equal probabilities of answering the item correctly" (p. 4).
Lord (1980) also noted that a test is unbiased if each item has exactly the same item response function in each group, so that examinees at any given level of ability, θ, have exactly the same probability of answering the item correctly. However, if an item has different item response functions in the reference and focal groups, the item is biased. Furthermore, Shealy and Stout (1993) indicated that "if the matching criterion is judged to be construct-valid in the sense that it is matching examinees on the basis of the latent trait (target ability) the test is designed to measure without contamination from other unintended to be measured abilities then the DIF item is said to be biased" (p. 197). For example, a verbal aptitude item containing the word commodious advantages Hispanic examinees; the word was considered biased because it has a similar form and meaning in Spanish (Zieky, 1993). While researchers have determined the need to identify such bias in testing, the very word "bias" is sometimes confusing and evokes negative emotional reactions similar to those produced by the words "discrimination" and "racism" (Berk, 1982). Eventually, researchers proposed DIF to replace the term bias (Angoff, 1993).

DIF arises when examinees from different populations share the same ability but differ in their probabilities of giving correct responses on test items (Crocker & Algina, 2008). For example, a mathematics test requires skills in both computation and reading. Suppose all examinees have the same computational ability, but one group is proficient in reading English while another group consists of English as a second language (ESL) individuals, so that the two groups do not have equal English proficiency. In this situation, even though all examinees are matched on computational ability, they will respond differently to the mathematics items because of their differential English proficiency. Thus, DIF exists on the mathematics test.
On the other hand, if two groups perform differently on a mathematics test because they do not share the same ability, then this situation displays impact rather than DIF. Impact refers to a difference in performance on an item between two groups and is what Holland and Thayer (1988) called "differential item performance." If DIF exists for a focal group relative to some reference group, then the item characteristic curves (ICCs) differ for the two groups (Cai et al., 2011). In other words, there is no DIF if the ICCs are equal, as shown in Figure 1; DIF exists when the ICCs differ, as shown in Figure 2. Thus, Lord (1980) argued that DIF detection could be approached by comparing estimates of the item parameters between groups, because the ICC for an item is determined by its item parameters. When DIF exists in a test, the flagged items reflect construct-irrelevant factors, and this affects the validity of the items. If a test is found to contain a biased item, that item should be omitted in order to achieve a fairer test. Thus, detecting DIF is an important step in maintaining items' effectiveness and fairness as well as in enhancing the validity of a test.

Figure 1. No DIF between two groups.

Figure 2. DIF exists in two groups, called uniform DIF.

Two types of DIF, uniform and non-uniform, have been defined by Mellenbergh (1982). Uniform DIF means that the difference between two groups' probabilities of answering an item correctly is in the same direction across all ability levels; that is, there is no interaction between group membership and ability, as shown in Figure 2.
Non-uniform DIF, also called crossing DIF (CDIF), refers to an item that discriminates differently across ability levels for separate groups: the difference between the groups' probabilities of answering correctly is not the same at all ability levels, as shown in Figure 3. Thus, there is an interaction between ability level and group membership when non-uniform DIF exists (Swaminathan & Rogers, 1990). Overall, bias is not a simple synonym for DIF. The differentiation between bias and DIF depends on "the extent to which a convincing construct validity argument has been given for the matching criterion" (Shealy & Stout, 1993, p. 197). Therefore, most analyses of test data examine DIF rather than item bias.

Figure 3. Non-uniform DIF.

1.3 The Purpose of the Study

In order to provide a fair and equitable test, the detection of DIF is necessary. Traditionally, classical test theory was widely used because of its computational simplicity. However, several computer programs, such as BILOG-MG 3 (Zimowski et al., 2003), flexMIRT (Cai, 2012), and IRTPRO (Cai et al., 2011), have recently been developed that can handle complex mathematical computations. As a result, item response theory has grown in popularity. The current study analyzes data from the Georgia High School Graduation Predictor Test (GHSGPT) to investigate DIF across multiple groups, using several computer programs with three popular IRT models for dichotomously scored items. This study has three main objectives. The first is to present an empirical comparison of three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, in detecting DIF across majority and minority groups with the one-parameter logistic (1PL), two-parameter logistic (2PL), and three-parameter logistic (3PL) models. The second is to examine the effectiveness of IRTPRO in detecting DIF. Finally, this study considers whether the GHSGPT exhibits DIF for different ethnicities.
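The uniform/non-uniform distinction can be made concrete numerically. The following Python sketch uses hypothetical 2PL item parameters (illustrative values only, not estimates from the GHSGPT): a difficulty shift alone favors one group at every ability level (uniform DIF), while a discrimination difference makes the two ICCs cross (non-uniform DIF).

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: equal discrimination, different difficulty, so the
# reference-group ICC lies above the focal-group ICC at every theta.
uniform_gap = [p_2pl(t, a=1.0, b=-0.5) - p_2pl(t, a=1.0, b=0.5)
               for t in abilities]

# Non-uniform (crossing) DIF: equal difficulty, different discrimination,
# so the sign of the group difference flips across the ability scale.
crossing_gap = [p_2pl(t, a=0.5, b=0.0) - p_2pl(t, a=1.5, b=0.0)
                for t in abilities]

assert all(g > 0 for g in uniform_gap)               # one group favored everywhere
assert crossing_gap[0] > 0 and crossing_gap[-1] < 0  # the ICCs cross
```

The assertions mirror the figures: a constant-direction gap corresponds to Figure 2, and a sign change in the gap corresponds to Figure 3.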
CHAPTER 2
LITERATURE REVIEW

Currently, classical test theory (CTT) and item response theory (IRT) are popular statistical frameworks for addressing measurement problems such as test development, test-score equating, and the identification of biased test items. Forty years ago, Frederic Lord indicated that examinees' observed scores and true scores are not the same as their ability scores, because ability scores are test independent (Hambleton & Jones, 1993), whereas examinees' observed and true scores are test dependent (Lord, 1953). Thus, CTT and IRT are widely perceived as representing two measurement frameworks.

2.1 Classical Test Theory

Classical test theory (CTT), or traditional measurement theory, is referred to as the "classical test model" and is regarded as "true score theory." It includes three concepts: (1) the observed score (test score), (2) the true score, and (3) the error score. Each observed score is made up of two components, the true score (T) and the error score (E) (Hambleton & Jones, 1993). The model of CTT is defined as:

X = T + E, (1)

where X is the test score (observed score), T is the true score, and E is the error score. Observed scores are simply the scores individuals obtain on the measuring instrument. The true score is the one each observer desires to obtain. However, the true score is an unknown value that cannot be directly observed; it is inferred from the observed scores and can merely be estimated. For an individual, the theoretical value of the true score represents a real psychological operation or level of academic performance. The true score for examinee j is given as:

Tj = E(Xj) = μXj. (2)

Errors include systematic errors, random errors, and measurement errors (Spector, 1992). CTT assumes that if there were no errors of measurement, each examinee's observed score would equal the true score, that is, X = T.
If the expected value of X is T, the expectation of E is zero (Lord, 1980):

μ_E|T ≡ μ_(X−T)|T ≡ μ_X|T − μ_T|T = T − T = 0, (3)

where μ is the mean and the subscripts indicate that T is fixed. Equation 3 indicates that the error of measurement is unbiased. If T and E are independent, the observed-score variance is defined as:

σ²_X = σ²_T + σ²_E, (4)

where σ²_X is the variance of the observed score (total score), σ²_T is the variance of the true score, and σ²_E is the variance of the errors.

Reliability refers to the stability and consistency of assessment results. The index of reliability can be stated as the ratio of the standard deviation of true scores to the standard deviation of observed scores (Lord & Novick, 1968) and is defined as:

ρ_XT = σ_T / σ_X, (5)

where ρ_XT is the correlation between true and observed scores, σ_T is the standard deviation of the true score, and σ_X is the standard deviation of the observed score. Nevertheless, the true score is unknown, so the reliability coefficient is estimated from the Pearson correlation between observed scores on parallel tests. The reliability coefficient is given by:

ρ_XX′ = σ²_T / σ²_X, (6)

where ρ_XX′ is the correlation between observed scores on two parallel tests, X and X′ being the parallel measurements.

The assumptions of CTT are that (1) true and error scores are independent, (2) the average error score in the population of test takers is zero, and (3) error scores on parallel tests are independent. The important advantage of CTT is its weak theoretical assumptions, which make it easy to employ in many testing situations. However, CTT's major limitations are that (1) the person statistics are item dependent and (2) the item statistics, such as item difficulty and item discrimination, are sample dependent (Hambleton & Jones, 1993). Although CTT is easy to compute and to understand, it rests on weak assumptions, such as the sample-dependent indices.
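Equations 1 through 6 can be illustrated with a small simulation. The sketch below (illustrative values only) generates true scores T with variance 100 and two parallel forms, X = T + E and X′ = T + E′, with error variance 25, so the theoretical reliability σ²_T/σ²_X = 100/125 = .8; the parallel-forms correlation of Equation 6 recovers approximately that value.

```python
import random

random.seed(42)

N = 20000
true_scores = [random.gauss(50, 10) for _ in range(N)]   # T, var(T) = 100
form_x = [t + random.gauss(0, 5) for t in true_scores]   # X  = T + E,  var(E) = 25
form_xp = [t + random.gauss(0, 5) for t in true_scores]  # X' = T + E'

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    vu = sum((a - mu) ** 2 for a in u) / n
    vv = sum((b - mv) ** 2 for b in v) / n
    return cov / (vu * vv) ** 0.5

rho = corr(form_x, form_xp)       # estimate of rho_XX' (Equation 6)
assert abs(rho - 0.8) < 0.02      # close to var(T)/var(X) = 100/125 = 0.8
```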
Thus, it is difficult to obtain consistent estimates of difficulty, discrimination, and reliability for the same test across samples. In order to overcome these disadvantages of CTT, modern test theory, which is based on the item response theory framework, was developed.

2.2 Modern Test Theory

The theoretical structure of modern test theory (or modern measurement theory) is item response theory (IRT). IRT, also known as "latent trait theory," is a general statistical theory concerning an examinee's item and test performance and how that performance relates to the abilities measured by the items in the test (Hambleton & Jones, 1993). In other words, IRT mainly focuses on item-level information. The essential elements of an IRT model are ability or proficiency, an unobservable (latent) variable, usually denoted by θ, that varies within the population of examinees, and the item characteristic curve (ICC) (Thissen et al., 1993). The ICC is the curve that describes the functional relationship between the probability of a correct response to an item and the ability scale. The ICC is denoted as follows (Baker & Kim, 2004):

P(β_i, α_i, θ_j) ≡ P_i(θ_j), (7)

where P_i(θ_j) is the probability of a correct response at any point θ_j on the ability scale (j = 1, 2, 3, ..., N), i indexes the items (i = 1, 2, 3, ..., n), β_i is the difficulty parameter, and α_i is the discrimination parameter (Baker & Kim, 2004).

Item responses can be discrete or continuous and can be dichotomously or polytomously scored. Item score categories can be ordered or unordered. The assumptions of IRT are: (1) dimensionality, which may be unidimensional or multidimensional, and (2) local independence, also called conditional independence, which means that every person has a certain probability of giving a predefined response to each item, and this probability is independent of the answers given to the preceding items (Crocker & Algina, 2008).
The characteristics of IRT include parameter invariance and the information function, neither of which CTT possesses. The major limitation of IRT is that its computations tend to be complex.

2.3 Estimation of Item Parameters

This study applies three computer programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, to analyze DIF. These three programs implement marginal maximum likelihood estimation (MMLE) and maximum likelihood estimation (MLE) for item parameter estimation. Hence, this study discusses only MMLE and MLE.

2.3.1 Marginal Maximum Likelihood Estimation (MMLE)

The method of marginal maximum likelihood estimation (MMLE) was proposed by Bock and Lieberman (1970). However, their approach was practical only for very short tests; the computation was complicated, and the estimation was slow. Thus, in order to solve these problems, Bock and Aitkin (1981) developed the expectation-maximization (EM) algorithm to improve the effectiveness of MMLE. Baker and Kim (2004) indicated that MMLE assumes that examinees represent a random sample from a population in which ability is distributed according to a density function g(θ|τ), where τ is the vector containing the parameters of the examinee population's ability distribution. This situation corresponds to a mixed-effects ANOVA model in which items are considered a fixed effect and abilities a random effect. The essential feature of the Bock and Lieberman solution is that it integrates over the ability distribution and thereby removes the random nuisance parameters from the likelihood function (Baker & Kim, 2004). Therefore, item parameters are estimated in the marginal distribution; item parameter estimation is freed from its dependency on the estimation of each examinee's ability, although not from its dependency on the ability distribution.
The abilities are estimated together with the item parameters if the ability distribution is correctly identified (Baker & Kim, 2004). Because increasing the sample size does not require the estimation of additional examinee parameters, MMLE produces consistent estimates of item parameters for samples of any size (Harwell et al., 1988). The marginal likelihood function to be maximized in order to obtain the item parameters is (Baker & Kim, 2004):

L = ∏_{j=1}^{N} ∫ ∏_{i=1}^{n} P_i(θ_j)^{u_ij} Q_i(θ_j)^{1 − u_ij} g(θ_j|τ) dθ_j, (8)

where u_ij is the dichotomous item response, 0 or 1, Q_i(θ_j) = 1 − P_i(θ_j), and g(θ_j|τ) is the probability density function of ability in the population of examinees (Baker & Kim, 2004).

2.3.2 Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation (MLE) begins with a mathematical expression known as the likelihood function: the probability of obtaining the observed data, given the chosen probability distribution model with its unknown parameters (Czepiel, 2002). The parameter values that maximize the sample likelihood are known as the maximum likelihood estimates. The MLE procedure can be presented for the two-parameter logistic model (Baker & Kim, 2004), which is given by:

P_j = Ψ(Z_j) = 1 / (1 + e^{−(ζ + λθ_j)}), (9)

where Z_j = ζ + λθ_j is the logit, ζ is the intercept, and λ is the slope. The likelihood function is defined by:

Prob(R) = ∏_{j=1}^{k} [f_j! / (r_j!(f_j − r_j)!)] P_j^{r_j} (1 − P_j)^{f_j − r_j}, (10)

where r_j is the number of correct responses, f_j − r_j is the number of incorrect responses, and P_j is the true probability of a correct response. There are (f_j choose r_j) different ways to arrange r_j successes among f_j trials for each population; the probability of success on any one of the f_j trials is P_j, so the probability of r_j successes is P_j^{r_j} (Czepiel, 2002). Similarly, the probability of f_j − r_j failures is (1 − P_j)^{f_j − r_j}.
The maximum likelihood estimates are the parameter values that maximize the likelihood function in Equation 10 (Czepiel, 2002).

2.4 Dichotomously Scored Items

For psychological and educational testing, dichotomous scoring, polytomous scoring, and continuous scoring are commonly used in the scoring of item responses. Earlier DIF research primarily focused on dichotomously scored items (Embretson & Reise, 2000); more recently, several studies have addressed polytomously scored items (Raju et al., 1995). Because this study focuses on unidimensional dichotomously scored items, it discusses only those. For dichotomously scored items, responses are scored either correct or incorrect; such items make up the majority of the multiple-choice test items analyzed, even though a multiple-choice item typically has four options (Potenza & Dorans, 1995). Van der Linden and Hambleton (1996) noted that if examinee j responds to item i, the response can be denoted by a random variable U_ij, with the two scores coded as U_ij = 1 (correct) and U_ij = 0 (incorrect). The ability of the examinee is represented by the parameter θ ∈ (−∞, ∞). The properties of item i that affect the probability of success are its difficulty, b_i ∈ (−∞, ∞), and its discriminating power, a_i ∈ (−∞, ∞). The probability of success on item i is usually denoted by P_i(θ), a function of θ specific to item i, known as the item response function (IRF), item characteristic curve (ICC), or trace line. The IRF is not linear in θ; it is usually assumed to be monotonically increasing as θ rises, and it gives a different probability of a correct response across the ability continuum (Thissen et al., 1993).

2.4.1 Item Response Models

Dimensionality, one of the assumptions under IRT, includes unidimensionality and multidimensionality.
Both the unidimensional item response theory (UIRT) model and the multidimensional item response theory (MIRT) model include dichotomously and polytomously scored items. Based on the different scoring and dimensionality, researchers have developed different item response models. Table 1 briefly displays the dimensionality, scoring, models, their originators, and the computer programs appropriate for each.

Table 1
The Development of Item Response Models and Computer Programs

Dimensionality    Scoring      Model                                                       Presented by                 Computer Programs
Unidimensional    Dichotomous  One-Parameter Logistic Model or Rasch Model (1PLM)          Rasch (1960)                 Winsteps, BILOG-MG, IRTPRO, flexMIRT, TESTFACT
                               Two-Parameter Logistic Model (2PLM)                         Birnbaum (1968)
                               Three-Parameter Logistic Model (3PLM)                       Birnbaum (1968)
                  Polytomous   Nominal Response Model                                      Bock (1972)                  MULTILOG, PARSCALE, IRTPRO, flexMIRT, ConQuest
                               Rating Scale Model                                          Andrich (1978)
                               Graded Response Model                                       Samejima (1969)
                               Partial Credit Model                                        Masters (1982)
                               Generalized Partial Credit Model                            Muraki (1991)
Multidimensional  Dichotomous  Multidimensional Extension of the Rasch Model (M1PL)        Adams, Wilson & Wang (1997)  TESTFACT, NOHARM, ConQuest, BMIRT, IRTPRO, flexMIRT
                               Multidimensional Extension of the Two-Parameter
                               Logistic Model                                              McKinley & Reckase (1991)
                               Multidimensional Extension of the Three-Parameter
                               Logistic Model                                              Reckase (1985)
                  Polytomous   Multidimensional Extension of the Graded Response
                               (MGR) Model                                                 Muraki & Carlson (1993)      POLYFACT, BMIRT, IRTPRO, flexMIRT
                               Multidimensional Extension of the Partial Credit
                               (MPC) Model                                                 Kelderman & Rijkes (1994)
                               Multidimensional Extension of the Generalized Partial
                               Credit (MGPC) Model                                         Yao & Schwarz (2006)

Note. Adapted from Multidimensional Item Response Theory, by M. D. Reckase, 2009. Copyright 2009 by Springer.
The one-parameter logistic (1PL) model, the so-called Rasch model, the two-parameter logistic (2PL) model, and the three-parameter logistic (3PL) model are the three popular unidimensional IRT models for dichotomous tests. Because this study focuses on UIRT, it discusses only these three models.

2.4.1.1 The One-Parameter Logistic (1PL) Model or the Rasch Model

In the 1950s, Georg Rasch (1960) developed his Poisson models for reading tests and a model for intelligence and achievement tests, now called the Rasch model. Under the Rasch model, both guessing and discrimination are negligible or constant. The main motivation of the Rasch model was to remove references to populations of examinees in analyses of tests: test analysis would only be worthwhile if it were individual centered, with separate parameters for the items and the examinees. The Rasch model was derived from the initial Poisson model, defined as (Van der Linden & Hambleton, 1996):

ξ = δ/θ, (11)

where ξ is a function of parameters describing the ability of an examinee and the difficulty of the test, θ is the ability of the examinee, and δ is the difficulty of the test, estimated from the summation of errors on a test. The model was later refined to state that the probability that a student answers a question correctly is a logistic function of the difference between the student's ability θ and the question's difficulty b. The Rasch model is now specified as:

P(θ) = e^(θ - b) / [1 + e^(θ - b)], (12)

where P(θ) depends upon the particular ICC model used, e is the constant 2.718, θ is the ability, and b is the item difficulty parameter. The difficulty parameter, b, describes where the item functions on the ability scale; it is defined as the point on the ability scale at which the probability of a correct response to the item is .5 (Baker & Kim, 2004). Under the Rasch model, the discriminations of all items are assumed to be equal to one.
The Rasch model is appropriate for dichotomous responses and models the probability of an individual's correct response on a dichotomous item.

2.4.1.2 The Two-Parameter Logistic (2PL) Model

Unlike Rasch, Birnbaum aimed to finish the work begun by Lord (1952) on the normal-ogive model; his contribution was to replace the normal-ogive model with the logistic model. Thus, Birnbaum (1968) proposed the two-parameter logistic (2PL) model, which extends the 1PL by estimating an item discrimination parameter (a) in addition to an item difficulty parameter (b). The 2PL model is given as:

P(θ) = e^(a(θ - b)) / [1 + e^(a(θ - b))], (13)

where a is the discrimination parameter without the scaling constant D = 1.702. The discrimination parameter, a, describes how well an item can differentiate between examinees with abilities below or above the item location. It also reflects the steepness of the ICC in its middle section: the steeper the curve, the higher the value of a and the better the item discriminates; the flatter the curve, the lower the value of a and the less the item differentiates (Baker, 2001).

2.4.1.3 The Three-Parameter Logistic (3PL) Model

Beyond the 2PL, Birnbaum (1968) proposed a third parameter for inclusion in the model to account for the nonzero performance of low-ability examinees on multiple-choice items, that is, the probability of guessing correct answers. The three-parameter logistic (3PL) model is defined as:

P(θ) = c + (1 - c) e^(a(θ - b)) / [1 + e^(a(θ - b))], (14)

where c is the lower asymptote of an ICC. The lower asymptote, c, which is commonly referred to as the "pseudo-chance level" parameter, represents the probability that examinees with low ability answer an item correctly. In general, the c parameter assumes values smaller than the value that would result if examinees of low ability were to guess randomly on the item.
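To make Equations 12 through 14 concrete, the three response functions can be sketched in Python; the item parameters below are hypothetical and chosen only for illustration:

```python
import math

def p_1pl(theta, b):
    """Rasch/1PL response function (Equation 12)."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def p_2pl(theta, a, b):
    """2PL response function with discrimination a (Equation 13)."""
    return math.exp(a * (theta - b)) / (1.0 + math.exp(a * (theta - b)))

def p_3pl(theta, a, b, c):
    """3PL response function with lower asymptote c (Equation 14)."""
    return c + (1.0 - c) * p_2pl(theta, a, b)

# Hypothetical item with a = 1.2, b = 0.5, c = 0.2, evaluated at theta = b:
print(round(p_1pl(0.5, 0.5), 3))            # 0.5
print(round(p_2pl(0.5, 1.2, 0.5), 3))       # 0.5
print(round(p_3pl(0.5, 1.2, 0.5, 0.2), 3))  # 0.6
```

At θ = b, the 1PL and 2PL give a probability of .5, whereas the 3PL gives (1 + c)/2 = .6, illustrating how the lower asymptote raises the entire curve above c.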
Thus, Lord (1974) noted that c should no longer be called the "guessing parameter," because this phenomenon can probably be attributed to item writers developing "attractive" but incorrect choices. A side effect of including the parameter c is that the definition of the difficulty parameter changes: the lower limit of the ICC becomes c rather than zero, and the difficulty parameter b is the point on the ability scale at which

P(θ) = (1 + c)/2, (15)

and the discrimination parameter is proportional to the slope of the item characteristic curve at θ = b, that is:

a(1 - c)/4, (16)

(Baker, 2001). McDonald (1999) stated that the 3PL was designed specifically for multiple-choice cognitive items, so in discussing this model it is appropriate to refer to the latent trait as the ability common to the m items in the test. With the introduction of the pseudo-guessing parameter, there is no quantity calculated from the response pattern that serves as a sufficient statistic for ability.

2.5 The DIF Detection Method

Two frameworks, CTT and IRT, are most often used to detect DIF. Accordingly, there are two families of methods for detecting DIF: non-item response theory (non-IRT) based methods (or observed score methods), such as the Mantel-Haenszel (MH) procedure, standardization, SIBTEST, and logistic regression (Dorans & Holland, 1993), and item response theory (IRT) based methods, such as Lord's chi-square test, area measures, and the likelihood function (Hambleton et al., 1991). This study applies the IRT based approach to detect DIF and presents a comparison of three programs, IRTLRDIF v.2, BILOG-MG 3, and IRTPRO 2.1, using the Georgia High School Graduation Predictor Test data with the three IRT models.

2.5.1 The Non-Item Response Theory (Non-IRT) Based Method

Several non-IRT methods exist to detect DIF, including the Mantel-Haenszel (MH) procedure, standardization, SIBTEST, and logistic regression.
2.5.1.1 Mantel-Haenszel Method

The Mantel-Haenszel method was proposed by Mantel and Haenszel (1959). This method is attractive because it is easy to implement, has an associated test of significance, and can be used with small sample sizes. It is therefore the most commonly used non-IRT based method; it is widely applied through the two-by-two contingency table shown in Table 2 and has been the object of considerable evaluation since it was first recommended by Holland and Thayer (Dorans & Holland, 1993).

Table 2
The 2-by-2 Contingency Table (at score level k)

Group                  Right   Wrong   Total
Reference group (R)    A_k     B_k     n_Rk
Focal group (F)        C_k     D_k     n_Fk
Total group (T)        m_1k    m_0k    T_k

Note: k = 1, 2, ..., j.

A chi-square test is associated with the MH approach, namely a test of the null hypothesis:

H0: αMH = 1; H1: αMH ≠ 1, (17)

where αMH is the common odds ratio (Dorans & Holland, 1993). The estimate of the constant odds ratio, α̂MH, is given as:

α̂MH = [Σ_k (A_k D_k / T_k)] / [Σ_k (B_k C_k / T_k)]. (18)

The MH chi-square statistic is given as:

MHχ² = [ |Σ_k A_k - Σ_k E(A_k)| - .5 ]² / Σ_k Var(A_k), (19)

where

E(A_k) = n_Rk m_1k / T_k,
Var(A_k) = n_Rk n_Fk m_1k m_0k / [T_k² (T_k - 1)],

and the -.5 in the expression for MHχ² serves as a continuity correction to improve the accuracy of the chi-square percentage points as approximations to observed significance levels. MHχ² approximately follows the chi-square distribution with one degree of freedom when the null hypothesis is true.

The estimate α̂MH measures the DIF effect size in a metric that ranges from 0 to ∞, with a value of 1 indicating null DIF (Clauser & Mazor, 1998). Because this metric is difficult to interpret, it is transformed into:

MH D-DIF (ΔMH) = -2.35 ln(αMH). (20)

Based on ΔMH, three categories were developed at ETS for use in test development (Dorans & Holland, 1993): (1) Negligible DIF (A). Items are classified as A either if MH D-DIF is not significantly different from zero or if |ΔMH| < 1.
(2) Intermediate DIF (B). Items in level B are those that do not meet either of the other criteria. (3) Large DIF (C). Items are classified as C if |ΔMH| both exceeds 1.5 and is significantly greater than 1. The limitation of the MH procedure is that it can detect only uniform DIF, even though it is the most widely used non-IRT based method.

2.5.1.2 Standardization

The standardization approach was developed for use at the Educational Testing Service (ETS) by Dorans and Kulick (1986) for use with the Scholastic Assessment Test (SAT). DIF exists when the expected performance on an item, which can be operationalized by nonparametric item test regressions, differs for examinees of equal ability from different groups. Dorans and Holland (1993) stated that one of the main purposes of the standardization approach is to use all available appropriate data to estimate the conditional item performance of each group at each level of the matching variable. The matching does not require stratified sampling procedures to produce equal numbers of examinees at a given score level across group memberships. In addition, the standardization approach makes it straightforward to obtain standardized response rates for distractors, omits, and not-reached items (Schmitt & Dorans, 1990). Dorans and Kulick (1986) indicated that an item exhibits DIF when the probability of a correct response to it is lower for examinees from one group than for examinees of equal ability from another group. Therefore, DIF does not exist in an item when the item satisfies:

Pg(X = 1|S) = Pg'(X = 1|S), (21)

where S refers to developed ability as measured by the total score on a test, X is an item score (X = 1 for a correct answer and X = 0 for an incorrect answer), and Pg(X = 1|S) is the probability that a candidate from subpopulation g who has a total test score equal to S will provide the correct answer.
In the standardization approach, the DIF measure is the observed difference in proportion correct on an item between the two groups at the kth level of the matching variable. The measure is given as:

D_k = P_fk - P_rk, (22)

where P_fk is the proportion correct on the studied item for the focal group and P_rk that for the reference group at the kth level of the matching variable. The standardized p-difference, DSTD, is one of the important DIF indices used in this approach, and it can range from -1 to 1 (Dorans & Schmitt, 1991). DSTD is given as:

DSTD = [Σ_{s=1}^{S} K_s (P_fs - P_rs)] / Σ_{s=1}^{S} K_s, (23)

where K_s / ΣK_s represents the weighting factor at score level s supplied by the standardization group to weight differences in performance between P_fs and P_rs.

2.5.1.3 SIBTEST

SIBTEST is a nonparametric procedure. It estimates the amount of DIF in an item and statistically tests whether that amount differs from zero. In addition, it assesses differences in item performance between two groups conditional on their ability levels. The main characteristic of SIBTEST is that it employs a regression correction method to match examinees from the reference and focal groups at the same latent ability levels in order to compare their performances on the studied items. This correction controls the inflation of Type I error that would otherwise result from measurement error in the test and from differences in the ability distributions that may exist across groups (Bolt, 2000). SIBTEST requires two non-overlapping subsets of items in the test. One is a valid subtest, whose items are assumed to measure the intended ability. The other is a suspect subtest, which contains the items to be tested for DIF. Scores on the valid subtest are used to match examinees having the same ability levels across group memberships so as to test the items of the suspect subtest for DIF (Bolt, 2000).

2.5.1.4 Logistic Regression

The logistic regression procedure was proposed by Swaminathan and Rogers (1990).
This model can be used to model DIF by identifying separate equations for the two groups of interest. The equation is given by:

P(Uij = 1|θij) = e^(β0j + β1j θij) / [1 + e^(β0j + β1j θij)], (24)

where Uij is the response of person i in group j to an item, β0j is the intercept for group j, β1j is the slope for group j, and θij is the ability of examinee i in group j. If DIF does not exist, the logistic regression curves for the two groups must be equal; that is, β01 equals β02, and β11 equals β12. Uniform DIF may be inferred if β01 is not equal to β02 and the curves are parallel but not equivalent. The presence of non-uniform DIF may be inferred if β01 is equal to β02 but β11 is not equal to β12, so that the curves are not parallel (Swaminathan & Rogers, 1990).

2.5.2 The Item Response Theory (IRT) Based Method

IRT based methods include the comparison of item parameters, area measures, and likelihood functions.

2.5.2.1 The Comparison of Item Parameters

This method was proposed by Lord (1980). It performs a statistical test of the equality of item parameters, and it can investigate either the joint difference of the a, b, and c parameters or merely the difference of the a and b parameters (Lord, 1980). Lord proposed two tests for evaluating the statistical significance of DIF.

2.5.2.1.1 The Test of b Difference

This test compares the difficulty parameters, b, for the reference and focal groups and is defined as (Thissen et al., 1993):

d_i = (b̂_Fi - b̂_Ri) / sqrt[Var(b̂_Fi) + Var(b̂_Ri)], (25)

where b̂_Fi and b̂_Ri are the maximum likelihood estimates of the item difficulty parameter for the focal and reference groups, and Var(b̂_Fi) and Var(b̂_Ri) are the variances of the b estimates for the focal and reference groups. Under the null hypothesis H0: d_i = 0, d_i follows the standard normal distribution. If d_i is greater than 1.96 or smaller than -1.96 (two-tailed p ≤ .05), the null hypothesis is rejected and DIF exists.
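As a minimal sketch of Equation 25 (the difficulty estimates and variances below are hypothetical and assumed to be on a common scale), the b-difference test reduces to a z-test:

```python
import math

def b_difference_test(b_focal, b_ref, var_focal, var_ref, critical=1.96):
    """Lord's b-difference statistic (Equation 25) and a two-tailed
    decision at alpha = .05."""
    d = (b_focal - b_ref) / math.sqrt(var_focal + var_ref)
    return d, abs(d) > critical

# Hypothetical item: the focal group finds it harder (b = 0.9 vs 0.4),
# with a sampling variance of .02 for each difficulty estimate.
d, flagged = b_difference_test(0.9, 0.4, 0.02, 0.02)
print(round(d, 2), flagged)  # 2.5 True
```

Here the hypothetical focal-group difficulty exceeds the reference-group difficulty by half a logit, giving d = 2.5 > 1.96, so the item would be flagged as exhibiting DIF.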
Besides this test, Lord proposed a test of the joint difference between ai and bi for the two groups (Thissen et al., 1993), known as Lord's chi-square.

2.5.2.1.2 Lord's Chi-Square

Lord (1980) employed the chi-square method to test whether the item parameters of the focal and reference groups differ significantly. Lord's chi-square, which examines the hypothesis that each of the parameters of the item response function is consistent across groups (Cohen & Kim, 1993), is the difference between the two vectors of item parameter estimates weighted by the inverse of the variance-covariance matrix, that is, the Wald statistic. The item parameter estimates must, however, be placed on the same scale before the parameters estimated in the two groups of examinees are compared. The equation is defined as:

χ² = (b̂_Fi - b̂_Ri)' Σ⁻¹ (b̂_Fi - b̂_Ri), (26)

where Σ⁻¹ is the inverse of the estimated sampling variance-covariance matrix of the differences between the item parameter estimates, and χ² has two degrees of freedom for large samples. Lord's chi-square has been shown to be efficient for detecting DIF under several assumptions, including asymptotic behavior, known θ, and maximum likelihood estimation (Kim et al., 1995).

2.5.2.2 Area Measure

Before the area between two item characteristic curves (ICCs) is computed, the estimates obtained from the reference and focal groups must be transformed to the same scale. The areas between the two ICCs of the same item should then equal 0; if they do not, DIF exists (Rudner et al., 1980). Raju (1988) stated that "the area between two ICCs is only estimated either by integrating the appropriate function between two finite points or by adding successive rectangles of width 0.005 between two finite points" (p. 495). In addition, he proposed signed and unsigned area formulas for calculating the exact area between two ICCs for the 1PL, 2PL, and 3PL.
The signed area (SA) refers to the difference between two curves, and it is defined as:

SA = ∫_{-∞}^{∞} (F1 - F2) dθ. (27)

The unsigned area (UA) refers to the distance between the curves, and it is given as:

UA = ∫_{-∞}^{∞} |F1 - F2| dθ. (28)

For the 3PL, if F1 and F2 stand for two ICCs with the stipulation a1 ≠ a2 and c = c1 = c2, then:

SA = (1 - c)(b2 - b1), (29)

UA = (1 - c) | [2(a2 - a1) / (D a1 a2)] ln(1 + e^(D a1 a2 (b2 - b1)/(a2 - a1))) - (b2 - b1) |. (30)

The area between two ICCs is finite when the lower asymptotes, c, are equal. When the c parameters are unequal, the area between two ICCs is infinite, which yields misleading results. In other words, for the area measure to be meaningful and valid, the area between two ICCs must be finite, and its estimate must be fairly accurate (Raju, 1988).

2.5.2.3 The Likelihood Function

The likelihood function approach uses the likelihood ratio (LR) test, proposed by Thissen, Steinberg, and Gerrard (1986) and Thissen, Steinberg, and Wainer (1993), to evaluate differences between the item responses of two groups (Cohen et al., 1996). In this approach, the null hypothesis that the item parameters of the two groups are equal is tested. The approach can test both uniform and non-uniform DIF: uniform DIF concerns the difference in the item difficulty parameters between the reference and focal groups, whereas non-uniform DIF concerns the difference in the item discrimination parameters (Cohen et al., 1996). The LR procedure involves a compact model (C) and an augmented model (A). Thissen et al. (1993) stated that in the compact model, the parameters of the item to be tested and of the anchor items are constrained to be equal across the two groups. Cohen et al. (1996) stated that "in the augmented model, item parameters for all items except the studied item(s) were constrained, which are referred to as the common or anchor set, to be equal in both the reference and focal groups (p.
19)." Because the augmented model includes all the parameters of the compact model plus additional parameters, the compact model is hierarchically nested within the augmented model (Cohen et al., 1996). The LR statistic is the difference between the values of -2 log likelihood for the compact model (LC) and for the augmented model (LA) (Cohen et al., 1996). It is defined as:

G²(d.f.) = -2 log LC - (-2 log LA), (31)

where L is the likelihood of the data given the maximum likelihood estimates of the model parameters, d.f. is the difference between the number of parameters in the augmented and compact models, and G²(d.f.) is distributed as χ²(d.f.) under the null hypothesis. Therefore, if the value of G²(d.f.) is large, the null hypothesis is rejected (Thissen et al., 1993). In other words, if the test result is statistically significant, DIF exists in the studied item.

2.6 Current Research

The aim of this study is to employ the IRT framework to detect DIF across ethnicity/race using three computer programs with three popular dichotomous models. Several studies, such as Kim et al. (1995) and Raju and Drasgow (1993), adopted BILOG-MG 3 to detect DIF. In addition, many studies, such as Woods (2009), employed IRTLRDIF in detecting DIF. To my knowledge, few studies have employed IRTPRO for DIF detection because it is a new computer program. The first hypothesis of this study concerns a comparison of the testing results of IRTPRO, BILOG-MG 3, and IRTLRDIF; this study expects that the three programs will exhibit consistent results. The second hypothesis concerns the effectiveness of IRTPRO in detecting DIF: the present study expects IRTPRO to be effective if its results are consistent with those of BILOG-MG 3 and IRTLRDIF. Hypothesis three examines goodness of model fit in detecting DIF in the Georgia High School Graduation Predictor Test (GHSGPT) with the three models, 1PL, 2PL, and 3PL.
The paper argues that the 3PL is the best-fitting model because it was designed specifically for multiple-choice cognitive items, so in discussing this model it is appropriate to refer to the latent trait as the ability common to the m items in the test (McDonald, 1999). The fourth hypothesis examines the differences between the ethnic groups taking the GHSGPT in Social Studies. Because of differences in culture, socioeconomic status (SES), and neighborhood characteristics, this study argues that Whites will perform better than other groups. Hypothesis five investigates whether DIF exists in the GHSGPT between ethnic groups' item responses. The current research anticipates that DIF will exist in several items.

CHAPTER 3
METHOD

3.1 Research Structure

This study utilizes the fall 2010 empirical data of the GHSGPT, which measures high school achievement in the fields of Social Studies and Science, from the Georgia Center for Assessment. It detects DIF across races using the three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, and compares whether these three programs are consistent and, thus, appropriate for investigating DIF. Figure 4 shows the research structure.

[Figure 4 here: a flow diagram showing the empirical data (the GHSGPT; 79 dichotomously scored items), the examination of DIF by race/ethnicity with IRTPRO, BILOG-MG, and IRTLRDIF under the 1PL, 2PL, and 3PL models, and the comparison of the DIF results for different ethnicities across the three programs.]

Figure 4. The research structure.

3.2 Instrumentation

An empirical comparison of the three programs is presented using the fall 2010 data of the GHSGPT. Although the GHSGPT measures high school achievement in the fields of Social Studies and Science, this study detects DIF for different ethnicities only in Social Studies, which consists of 79 dichotomously scored items.
Note that the original test had 80 items; however, Item 26 was considered problematic because its biserial correlation was -.052, so Item 26 was removed and the remaining items were renumbered to maintain consecutive numbering. The GHSGPT contains multiple-choice questions, and each multiple-choice item has four response options. This standardized test follows the blueprint of the Georgia High School Graduation Tests (GHSGT), including the same strands and objectives. There are six strands for Social Studies: World Studies (18-20%), U.S. History to 1865 (18-20%), U.S. History since 1865 (18-20%), Citizenship/Government (12-14%), Map and Globe Skills (15%), and Information Processing Skills (15%). Because both the GHSGPT and the GHSGT are built on the same content, the GHSGPT is able to predict 11th grade students' future performance on the GHSGT (Georgia Department of Education, 2010).

3.3 Sample

The data for the 11th grade GHSGPT in Social Studies consist of 2,654 respondents after deletion of the non-response data. Respondents were 11th grade students attending 18 different high schools in 17 different counties in Georgia. Table 3 shows the groups used for the DIF detection by ethnicity: Whites are treated as the reference group, and Blacks, Hispanics, and a Multi-Racial group are treated as the focal groups.

Table 3
The DIF Detection for Ethnicity

Race           Sample Size
Whites               1,536
Blacks                 872
Hispanics              114
Multi-Racial           132
Total                2,654

3.4 Computer Programs

Three computer programs, IRTLRDIF, BILOG-MG 3, and IRTPRO, are used in this study.

3.4.1 IRTLRDIF

IRTLRDIF refers to likelihood-ratio testing for differential item functioning, and this program employs IRT (Woods, 2009). It was developed to implement a version of IRT-LR DIF analysis for large-scale testing applications (Thissen, 2001). In previous studies, IRT-LR DIF detection has been used in disparate research contexts. For example, Wainer et al.
(1991) used this procedure to study testlets for DIF. In addition, Wang et al. (1995) used it to investigate the consequences of item choice in an experimental section. Furthermore, Steinberg (1994) used this procedure to answer questions about item serial-position and context effects with experimental data. These studies showed that "IRT-LR DIF analysis tests precisely specified and straightforwardly interpretable hypotheses about the parameters of item response models" (Thissen, 2001, p. 3). IRTLRDIF employs the likelihood ratio test and implements the method of marginal maximum likelihood (MML) for item parameter estimation.

3.4.2 BILOG-MG 3

BILOG-MG 3 is an extension of the BILOG 3 program. Zimowski et al. (2003) stated that it is designed for the effective analysis of binary items, that it is capable of large-scale production applications without limits on the numbers of items or respondents, and that it can perform item analysis and the scoring of any number of subtests or subscales. In addition, it can analyze DIF and DRIFT (item parameter drift) associated with multiple groups, and it can perform the equating of test scores. The response models include the one-, two-, and three-parameter models (Zimowski et al., 2003). BILOG-MG 3 applies the likelihood ratio chi-square and executes the method of marginal maximum likelihood estimation (MMLE) for item parameter estimation.

3.4.3 IRTPRO

IRTPRO (Item Response Theory for Patient-Reported Outcomes) is a new IRT program for item calibration and test scoring (Cai et al., 2011). Item calibration and scoring are implemented for IRT models based on unidimensional responses, such as multiple-choice or short-answer items scored correctly or incorrectly, and for multidimensional models, including confirmatory factor analysis (CFA) and exploratory factor analysis (EFA). In addition, it is capable of handling large-scale production applications with unrestricted numbers of items or respondents.
The response functions of IRTPRO include the 1PL, 2PL, 3PL, graded, generalized partial credit, and nominal response models. These item response models may be mixed in any combination within a test or scale, and users may specify equality constraints among parameters or fixed values for parameters (Cai et al., 2011, p. 4). IRTPRO applies the Wald test, an approach proposed by Lord (1980). It implements the methods of marginal maximum likelihood (MML) and maximum likelihood estimation (MLE) for item parameter estimation. However, if prior distributions are specified for the item parameters, IRTPRO calculates maximum a posteriori (MAP) estimates (Cai et al., 2011).

CHAPTER 4
RESULTS

4.1 Item Analysis

To analyze the items of the GHSGPT and to search for problematic items, the item parameters were estimated by marginal maximum likelihood using BILOG-MG 3. The original data set consists of 80 items from the GHSGPT, which was administered to 2,654 11th grade high school students from different counties and different high schools. All Pearson and biserial correlations were positive except for those of Item 26, which were -.40 and -.053, respectively, and some items fell below .30. Hence, Item 26 was considered a problematic item and was omitted from the calibration, and the remaining items were renumbered to maintain consecutive numbering. Thus, 79 items in total were used in this study. Table 4 presents the raw score summary statistics for each ethnicity/race.

Table 4
Raw Score Summary Statistics for the GHSGPT

Statistic             Blacks   Hispanics   Whites   Multi-Racial
Number of Items           79          79       79             79
Mean                   43.24       36.63    40.46          42.67
Standard Deviation     12.59       11.46    10.73         11.752
Coefficient Alpha       .902        .877     .859           .884

4.1.1 Classical Test Theory

Table 5 presents the classical item statistics for the 79 items of the GHSGPT for the multiple groups.
It displays, for each item, the number of examinees answering correctly (item right), the discrimination, the difficulty (p-value), which is the correct-response rate, and the Pearson and biserial correlations. First, this study analyzes the probability of a correct answer for the multiple groups using SPSS (Statistical Package for the Social Sciences). When an item is dichotomously scored, the mean item score corresponds to the proportion of examinees who answer the item correctly. This proportion for item i is denoted pi and is called the item difficulty or p-value (Crocker & Algina, 2008). The p-value is defined as:

pi = (number of examinees getting the item right) / (total number of examinees). (32)

The value of pi may range from .00 to 1.00 and expresses the proportion of examinees who answered the item correctly. For example, the p-value of Item 1 is .492, which means that only 49.2% of examinees responded to Item 1 correctly, as shown in Table 5. Items with difficulties near zero are difficult, whereas items with difficulties near one are easy. To avoid very difficult and very easy items, the acceptable range of difficulties is .3 to .7 (Allen & Yen, 2002). The difficulty, or p-value, ranges from .23 to .92. Items 52, 59, 74, 77, and 79 are considered difficult because their p-values are lower than .3, and Item 52 is the hardest item (p = .231). In addition, Items 17, 41, 53, 61, 62, 63, 65, 66, 67, 70, and 73 are considered easy because their p-values are higher than .7, and Item 67 is the easiest item (p = .922). There are 62 items (78%) between .3 and .7, the mean of the correct-response rates is .518, and the degree of difficulty is moderate to easy. Second, the item discrimination index addresses how well an item differentiates between people who do well on the test and those who do not. The discrimination index can range between -1.00 and +1.00, and the acceptable range of item discrimination is from .30 to .70 (Allen & Yen, 2002).
The item-discrimination index for item i, di, is defined as (Allen & Yen, 2002):

di = Ui/niU - Li/niL, (33)

where Ui is the number of examinees in the upper range of total test scores who answered item i correctly, Li is the number of examinees in the lower range of total test scores who answered item i correctly, niU is the number of examinees with total test scores in the upper range, and niL is the number of examinees with total test scores in the lower range. Table 5 shows that 43 items fall below the criterion of .3, which means that these items have low discrimination; Item 11 (.004) has the lowest discrimination. The item discriminations range from .004 to .427, with an average of .262. The Pearson correlations (i.e., point-biserial correlations) range from .025 to .456, with an average of .304. The biserial correlations range from .034 to .644, with an average of .399. The reliability is .899.

Table 5
Item Statistics Based on Classical Test Theory

Item  Item Right  Discrimination  Difficulty (p-value)  Pearson Correlation  Biserial Correlation
 1    1306   .133   .492   .142   .178
 2    1762   .268   .664   .293   .379
 3    1264   .390   .476   .408   .512
 4    1190   .324   .448   .344   .433
 5     906   .173   .341   .187   .242
 6    1413   .231   .532   .256   .322
 7     915   .175   .345   .199   .257
 8    1408   .187   .531   .180   .226
 9    1510   .367   .569   .379   .478
10    1070   .315   .403   .344   .436
11    1221   .004   .460   .041   .052
12    1260   .335   .475   .371   .465
13    1279   .377   .482   .393   .493
14    1587   .347   .598   .380   .482
15    1912   .362   .720   .420   .560
16    1289   .213   .486   .232   .290
17    2153   .291   .811   .429   .621
18    1447   .388   .545   .417   .524
19    1538   .311   .580   .343   .432
20    1030   .356   .388   .398   .507
21    1392   .201   .524   .222   .279
22    1099   .134   .414   .160   .202
23    1659   .427   .625   .456   .582
24    1254   .281   .472   .300   .377
25    1738   .389   .655   .437   .563
26     812   .153   .306   .139   .182
27    1387   .191   .523   .219   .274
28     827   .144   .312   .175   .229
29    1061   .096   .400   .109   .139
30    1281   .397   .483   .413   .517
31    1663   .403   .627   .425   .543
32    1215   .284   .458   .297   .373
33    1395   .359   .526   .368   .461
34    1140   .174   .430   .187   .236
35     878   .159   .331   .187   .242
36    1166   .295   .439   .352   .443
37     805   .208   .303   .227   .299
38    1175   .341   .443   .365   .459
39    1159   .323   .437   .359   .452
40    1341   .414   .505   .436   .547
41    1971   .310   .743   .398   .539
42    1110   .148   .418   .167   .211
43    1578   .385   .595   .438   .555
44    1160   .170   .437   .179   .226
45    1639   .323   .618   .364   .464
46    1226   .389   .462   .427   .536
47    1360   .344   .512   .350   .438
48    1150   .348   .433   .392   .494
49    1354   .357   .510   .378   .474
50    1026   .291   .387   .330   .420
51     985   .260   .371   .305   .389
52     613   .065   .231   .064   .088
53    2011   .344   .758   .436   .598
54    1246   .330   .469   .338   .424
55    1586   .379   .598   .389   .493
56    1335   .335   .503   .361   .452
57    1008   .330   .380   .348   .444
58    1491   .308   .562   .329   .414
59     667   .065   .251   .084   .114
60    1608   .352   .606   .398   .506
61    1931   .338   .728   .411   .551
62    2119   .245   .798   .355   .506
63    2357   .190   .888   .379   .628
64    1306   .269   .492   .293   .367
65    2381   .166   .897   .362   .613
66    2264   .212   .853   .380   .586
67    2447   .137   .922   .351   .644
68     975   .241   .367   .251   .322
69    1376   .199   .518   .211   .265
70    2329   .158   .878   .303   .489
71    1756   .292   .662   .350   .453
72    1623   .344   .612   .376   .479
73    2335   .179   .880   .363   .590
74     659   .023   .248   .025   .034
75     907   .080   .342   .091   .117
76    1033   .206   .389   .224   .285
77     705   .109   .266   .139   .188
78    1303   .367   .491   .399   .500
79     766   .217   .289   .252   .335

4.1.2 Item Response Theory

This study employs BILOG-MG 3 to compute p-values with the 1PL, 2PL, and 3PL models. The total sample contains 2,654 examinees (Whites = 1,536, Blacks = 872, Hispanics = 114, and Multi-Racial = 132), and α = .05. If the p-value is less than .05, the result is statistically significant.
Table 6 shows that, with the 1PL, item difficulty ranges from -3.679 to 1.832, the reliability is .898 (Zimowski et al., 2003), the root-mean square (RMS) is .3261, the mean item difficulty is -.173, and six items (8%) have p-values greater than .05. For the 2PL, item discrimination ranges from .102 to 1.767 and item difficulty from -1.819 to 6.458; the means of item discrimination and item difficulty are .519 and .266, respectively; the reliability is .916; the RMS is .293; and 38 items (53%) show acceptable goodness of fit. For the 3PL, the item discrimination parameter ranges from .390 to 2.347, the item difficulty parameter from -1.788 to 3.213, and the pseudo-guessing parameter from .054 to .435. The means of the item discrimination, item difficulty, and pseudo-guessing parameters are .968, .650, and .224, respectively. The reliability of the 3PL is .923, the RMS is .2844, and 60 items (76%) show acceptable goodness of fit.

Table 6
Item Statistics Based on Item Response Theory

            1PL                2PL                          3PL
Item    b        p        a       b        p        a       b        c       p
1       0.042    .00*     0.194   0.096    .79      0.651   1.854    0.396   .05
2       -1.049   .49      0.426   -1.061   .01**    0.809   0.332    0.413   .00*
3       0.140    .00*     0.591   0.094    .00*     1.308   0.669    0.241   .22
4       0.313    .00*     0.477   0.278    .00*     1.046   0.858    0.239   .06
5       1.004    .00*     0.255   1.584    .18      0.753   1.822    0.236   .92
6       -0.206   .03**    0.347   -0.247   .62      0.629   0.837    0.301   .12
7       0.981    .00*     0.270   1.464    .68      0.608   1.817    0.208   .82
8       -0.194   .00*     0.240   -0.317   .96      0.476   1.268    0.337   .76
9       -0.433   .00*     0.540   -0.376   .09      0.757   0.218    0.209   .00*
10      0.597    .00*     0.485   0.534    .13      1.114   0.980    0.215   .91
11      0.240    .00*     0.106   0.894    .00*     1.872   1.902    0.435   .00*
12      0.149    .01**    0.523   0.116    .47      0.788   0.584    0.174   .36
13      0.105    .00*     0.571   0.070    .63      1.004   0.608    0.212   .42
14      -0.616   .00*     0.565   -0.514   .76      0.858   0.216    0.264   .85
15      -1.451   .00*     0.778   -0.971   .01**    1.149   -0.239   0.326   .03**
16      0.082    .04**    0.306   0.110    .60      0.426   0.839    0.185   .87
17      -2.214   .00*     1.012   -1.264   .07      1.098   -0.834   0.285   .08
18      -0.285   .00*     0.625   -0.235   .06      1.283   0.485    0.283   .91
19      -0.499   .05      0.490   -0.460   .03**    0.878   0.438    0.304   .64
20      0.694    .00*     0.576   0.543    .00*     1.547   0.891    0.199   .08
21      -0.157   .00*     0.299   -0.212   .01**    0.390   0.553    0.181   .27
22      0.528    .00*     0.223   0.944    .17      0.559   1.869    0.284   .11
23      -0.791   .00*     0.753   -0.554   .00*     1.040   -0.033   0.216   .00*
24      0.163    .70      0.407   0.165    .91      0.682   0.830    0.221   .49
25      -0.988   .00*     0.725   -0.699   .02**    0.952   -0.149   0.230   .01**
26      1.251    .00*     0.197   2.512    .07      0.724   2.253    0.233   .01**
27      -0.146   .00*     0.293   -0.199   .09      0.986   1.174    0.395   .83
28      1.211    .00*     0.251   1.935    .42      0.991   1.843    0.236   .52
29      0.619    .00*     0.163   1.495    .27      0.769   2.219    0.339   .89
30      0.100    .00*     0.597   0.062    .06      1.074   0.586    0.211   .42
31      -0.801   .00*     0.678   -0.593   .18      1.080   0.116    0.282   .19
32      0.254    .01**    0.405   0.261    .00*     1.000   0.977    0.276   .96
33      -0.164   .00*     0.524   -0.153   .02**    0.939   0.532    0.250   .03
34      0.430    .00*     0.253   0.685    .34      0.698   1.581    0.300   .47
35      1.076    .00*     0.267   1.626    .00*     1.223   1.648    0.255   .22
36      0.369    .00*     0.492   0.322    .00*     1.376   0.935    0.269   .68
37      1.270    .03**    0.319   1.633    .02**    0.820   1.659    0.183   .01
38      0.348    .04**    0.513   0.292    .96      0.846   0.746    0.180   .41
39      0.385    .00*     0.507   0.328    .02**    1.068   0.846    0.222   .13
40      -0.039   .00*     0.653   -0.048   .01**    0.754   0.172    0.075   .36
41      -1.621   .00*     0.763   -1.091   .23      0.810   -0.733   0.188   .48
42      0.502    .00*     0.227   0.883    .94      0.452   1.836    0.252   .99
43      -0.595   .00*     0.699   -0.440   .79      0.993   0.107    0.218   .46
44      0.383    .00*     0.244   0.631    .08      0.891   1.578    0.338   .73
45      -0.742   .00*     0.552   -0.625   .02**    0.679   -0.109   0.185   .01**
46      0.228    .00*     0.653   0.148    .00*     2.033   0.697    0.252   .81
47      -0.083   .02**    0.492   -0.085   .25      0.807   0.545    0.223   .30
48      0.407    .00*     0.579   0.309    .00*     1.574   0.820    0.239   .21
49      -0.069   .00*     0.554   -0.071   .18      0.877   0.482    0.205   .22
50      0.704    .00*     0.463   0.655    .00*     1.678   1.033    0.240   .09
51      0.805    .35      0.419   0.816    .60      0.679   1.136    0.147   .73
52      1.832    .00*     0.133   5.383    .75      0.767   3.213    0.208   .62
53      -1.742   .00*     0.913   -1.062   .00*     0.992   -0.738   0.183   .00*
54      0.182    .20      0.476   0.158    .87      0.826   0.734    0.213   .42
55      -0.614   .00*     0.592   -0.498   .00*     0.705   -0.119   0.137   .00*
56      -0.025   .00*     0.502   -0.033   .01**    0.791   0.549    0.208   .16
57      0.748    .00*     0.493   0.662    .00*     1.014   0.996    0.179   .00*
58      -0.388   .20      0.474   -0.369   .88      0.772   0.438    0.268   .64
59      1.665    .00*     0.161   4.046    .08      2.347   2.013    0.232   .25
60      -0.667   .00*     0.637   -0.516   .00*     0.687   -0.321   0.067   .00*
61      -1.505   .00*     0.773   -1.009   .02**    0.817   -0.712   0.151   .13
62      -2.093   .00*     0.776   -1.377   .01**    0.754   -1.227   0.122   .52
63      -3.108   .00*     1.406   -1.497   .00*     1.259   -1.550   0.085   .01
64      0.042    .00*     0.394   0.041    .00*     1.142   0.966    0.330   .87
65      -3.244   .00*     1.375   -1.566   .00*     1.212   -1.643   0.089   .05
66      -2.655   .00*     1.125   -1.422   .00*     1.033   -1.450   0.054   .00*
67      -3.679   .00*     1.767   -1.605   .00*     1.581   -1.739   0.075   .00*
68      0.830    .00*     0.335   1.021    .00*     0.656   1.403    0.188   .01**
69      -0.120   .00*     0.290   -0.165   .06      0.445   0.906    0.254   .33
70      -2.961   .00*     0.842   -1.819   .00*     0.791   -1.788   0.122   .00*
71      -1.034   .00*     0.563   -0.852   .00*     0.596   -0.580   0.106   .10
72      -0.703   .00*     0.580   -0.575   .90      0.833   0.108    0.250   .96
73      -2.991   .00*     1.161   -1.560   .00*     1.043   -1.630   0.065   .00*
74      1.689    .00*     0.102   6.458    .00*     1.954   2.247    0.235   .01**
75      1.001    .00*     0.155   2.528    .33      0.609   2.664    0.279   .33
76      0.687    .01**    0.304   0.922    .71      0.728   1.488    0.244   .97
77      1.552    .00*     0.221   2.791    .00*     1.736   1.745    0.218   .03
78      0.049    .00*     0.584   0.023    .38      0.948   0.532    0.197   .21
79      1.377    .02**    0.357   1.605    .02**    0.916   1.582    0.169   .10

Note. * p < .001, ** p < .05

4.2 Racial Differential Item Functioning (DIF) Analysis

DIF analyses were employed to determine whether items advantage or disadvantage examinees across ethnicities/races. Whites were identified as the reference group, and Blacks, Hispanics, and the Multi-Racial group were regarded as the focal groups.
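The parameters in Table 6 define each item's characteristic curve. A minimal sketch of the 3PL model follows; the scaling constant D = 1.7 is a common convention and an assumption here (whether a given program folds it into a is an estimation-software detail):

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL: probability of a correct response at ability theta.
    a = discrimination, b = difficulty, c = pseudo-guessing lower asymptote.
    The 2PL is the special case c = 0; the 1PL further constrains
    a to be equal across items, leaving only b to calibrate."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

For example, with Item 11's 3PL estimates from Table 6 (a = 1.872, b = 1.902, c = 0.435), p_3pl(0.0, 1.872, 1.902, 0.435) is close to the chance level c, reflecting the heavy guessing this model attributes to that item.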
Thissen (2001), in the manual for IRTLRDIF, noted that "IRTLRDIF has implemented two of the most commonly-used IRT models: the three-parameter logistic (3PL) model and Samejima's graded model. Both of those models include the two-parameter logistic (2PL) model as a special case" (p. 5). Thus, this study adopts BILOG-MG 3 and IRTPRO to examine the 79 Social Studies items with the 1PL and employs IRTPRO, BILOG-MG 3, and IRTLRDIF with the 2PL and 3PL. If both BILOG-MG 3 and IRTPRO identify an item as a DIF item, it is considered a DIF item under the 1PL. In addition, when all three programs identically detect DIF with the 2PL and 3PL, those items are counted as DIF items. This study determines whether any race is favored on each item based on the outcomes from IRTLRDIF, BILOG-MG 3, and IRTPRO. Because the results of IRTPRO and IRTLRDIF are similar, this study also employs two programs, IRTPRO and BILOG-MG 3, to compare multiple groups (Whites vs. Blacks, Hispanics, and the Multi-Racial group) and to investigate which items exhibit DIF for specific ethnicities with the 1PL, 2PL, and 3PL.

The 1PL, 2PL, and 3PL are the three main models used to estimate item parameters for dichotomous items (Hambleton et al., 1991). The 1PL assumes that all discriminations, a, are equal, so it calibrates only item difficulty, b. The 2PL calibrates item difficulty and discrimination, and the 3PL calibrates item difficulty, discrimination, and the lower asymptote, c.

This study adopts -2loglikelihood (-2logL) for each race to determine goodness of fit. The item-fit statistics provided by both IRTPRO and BILOG-MG 3 indicated that the 3PL model provided the best fit to the data, as shown in Table 7 and Table 8.

Table 7
The Summary of Goodness of Fit Using BILOG-MG 3

Model (-2loglikelihood)   Whites vs. Blacks   Whites vs. Hispanics   Whites vs. Multi-Racial   Whites vs. All Races
1PL                       225310.41           152592.65              154221.83                 248561.46
2PL                       221722.24           150104.10              151702.80                 244621.80
3PL                       220417.99           149214.33              149166.39                 243439.83

Table 8
The Summary of Goodness of Fit Using IRTPRO

Model (-2loglikelihood)   Whites vs. Blacks   Whites vs. Hispanics   Whites vs. Multi-Racial   Whites vs. All Races
1PL                       225352.18           152604.19              154235.26                 248614.63
2PL                       221704.61           150057.77              151661.38                 244426.20
3PL                       220702.07           149472.04              151046.77                 243451.64

4.2.1 Three Comparison Groups Using BILOG-MG 3 and IRTPRO with 1PL

To investigate DIF items, this study employs Lord's (1980) technique: the difference in item parameter estimates between two groups is divided by the standard error of that difference, with the ability parameters treated as known. This is done using BILOG-MG 3 and IRTPRO. The item parameters determine the ICC for an item, and "Lord (1980) noted that the question of DIF detection could be approached by computing estimates of the item parameters within each group" (Thissen et al., 1993, p. 68). The statistic is defined as (Thissen et al., 1993):

Zi = Δb / SE(bF − bR), (34)

where Δb = bF − bR; bF and bR are the item difficulty parameters for the focal group and the reference group; SE(bF − bR) is the standard error of the difference between the focal-group and reference-group estimates; and Zi is approximately standard normal. If the absolute value of Zi is greater than 1.96 (a two-tailed test, p ≤ .05), DIF exists.

Table 9 presents the outcomes of the three comparison groups using the two computer programs with the 1PL. For Whites vs. Blacks, both programs, BILOG-MG 3 and IRTPRO, indicate that Items 1, 2, 7, 8, 11, 13, 14, 15, 17, 20, 22, 23, 25, 26, 27, 28, 29, 30, 31, 34, 44, 49, 52, 56, 57, 59, 60, 61, 62, 66, 69, 71, 72, 74, and 78 are DIF.
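The DIF decisions summarized in Table 9 rest on Lord's statistic in Equation (34). A minimal sketch follows; taking the standard error of the difference as the quadrature of the two groups' standard errors is an assumption about how SE(bF − bR) is computed, and the parameter values below are hypothetical:

```python
import math

def lord_z(b_focal, b_ref, se_focal, se_ref):
    """Lord's (1980) z statistic for DIF in item difficulty, Eq. (34).
    The SE of the difference is assumed to be sqrt(SE_F^2 + SE_R^2)."""
    return (b_focal - b_ref) / math.sqrt(se_focal ** 2 + se_ref ** 2)

def is_dif(z, critical=1.96):
    """Two-tailed decision at alpha = .05: flag DIF when |z| > 1.96."""
    return abs(z) > critical
```

A positive z means the item is harder for the focal group (favoring the reference group), and a negative z means the reverse; the 1.96 cutoff is the two-tailed standard normal critical value at α = .05.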
Items 1, 7, 8, 11, 22, 26, 27, 28, 29, 34, 44, 52, 59, 69, and 74 advantaged Blacks, and Items 2, 13, 14, 15, 17, 20, 23, 25, 30, 31, 49, 56, 57, 60, 61, 62, 66, 71, 72, and 78 advantaged Whites. In addition, Items 2, 13, 19, 51, and 74 are DIF for the Whites vs. Hispanics comparison: Items 2, 13, and 19 favor Whites, and Items 51 and 74 favor Hispanics. Moreover, Items 8, 44, and 56 are DIF for Whites vs. the Multi-Racial group: Items 8 and 44 advantaged the Multi-Racial group, and Item 56 advantaged Whites, as shown in Table 9. In sum, there are 35 DIF items for Whites vs. Blacks, and only a few DIF items exist for Whites vs. Hispanics (five items) and Whites vs. the Multi-Racial group (three items).

Table 9
The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups with 1PL

        Whites vs. Blacks        Whites vs. Hispanics     Whites vs. Multi-Racial
        BILOG-MG 3   IRTPRO      BILOG-MG 3   IRTPRO      BILOG-MG 3   IRTPRO
Item    (d)          (χ²)        (d)          (χ²)        (d)          (χ²)
1       -5.216*      24.1*       1.057        .8          -.077        .0
2       2.926*       10.7*       2.919*       8.7*        .137         .1
3       1.886        5.7         1.877        3.5         -2.385       5.5
4       .442         1.6         1.097        1.1         -1.979       3.4
5       -1.792       3.6         -.486        .3          -1.000       .9
6       -1.252       2.3         -.471        .3          .365         .2
7       -3.008*      8.8*        -1.435       2.2         -.191        .0
8       -5.442*      27.1*       -1.339       1.9         -2.724*      6.4*
9       -1.331       2.6         -.865        1.0         -1.542       2.3
10      -.447        1.3         1.004        .9          -.242        .0
11      -7.861*      51.3*       -1.931       3.6         -2.266       4.2
12      1.867        5.5         .737         .5          -1.074       .9
13      6.350*       45.4*       3.412*       12.0*       1.093        1.6
14      4.554*       24.4*       -.148        .1          .938         1.1
15      4.106*       21.0*       1.723        2.9         .140         .1
16      .270         1.4         .672         .3          -.135        .0
17      2.820*       10.7*       1.698        3.0         -1.236       1.6
18      1.132        3.0         .942         .9          .031         .0
19      1.445        3.9         3.069*       8.9*        .275         .2
20      3.445*       15.0*       .763         .5          -.742        .5
21      -1.496       2.8         .288         .1          -.021        .0
22      -4.257*      16.3*       .988         .8          .000         .0
23      2.968*       12.1*       .618         .3          .601         .6
24      -1.641       3.3         -.583        .5          -1.029       .9
25      2.661*       10.0*       1.010        1.0         -.385        .1
26      -4.238*      16.6*       -1.556       2.5         -.221        .0
27      -3.442*      11.2*       .556         .3          .377         .3
28      -3.602*      12.4*       -.623        .5          -.138        .0
29      -5.848*      29.5*       -1.627       2.7         -1.738       2.5
30      2.256*       7.4*        1.737        2.8         1.241        2.0
31      2.000*       6.2*        -.256        .1          1.532        3.0
32      -2.261       5.6         1.307        1.5         -2.194       4.2
33      .361         1.5         .476         .2          -2.030       4.1
34      -3.518*      11.5*       -.469        .3          -.706        .4
35      -1.017       1.8         -2.348       5.5         .528         .4
36      -2.025       4.6         -.041        .0          -.929        .8
37      1.054        2.6         -.760        .8          -.057        .0
38      -1.893       4.3         .184         .0          -2.099       4.2
39      .529         1.7         .580         .3          .680         .7
40      -.623        1.4         .152         .0          -.176        .0
41      1.963        5.9         -1.296       2.1         1.461        2.6
42      -2.307       5.3         -.734        .6          .332         .2
43      .837         2.3         -1.208       1.8         .340         .2
44      -8.263*      62.8*       -1.724       3.2         -2.756*      6.6*
45      -2.197       5.5         .296         .1          -1.007       .9
46      .260         1.4         -.086        .0          .321         .2
47      .559         1.7         .119         .0          -.170        .0
48      1.828        5.4         1.498        2.1         1.092        1.5
49      2.132*       6.8*        .749         .5          .449         .3
50      -1.492       2.9         -.794        .8          -1.256       1.5
51      -2.314       5.7         -3.383*      12.2*       1.707        3.1
52      -4.123*      15.3*       -.997        1.1         -.189        .0
53      .151         1.3         -.856        1.0         -.133        .0
54      -1.890       4.1         .040         .0          -1.775       3.1
55      -.189        1.2         -.774        .8          -.293        .1
56      4.770*       26.9*       -.278        .1          2.510*       6.9*
57      4.016*       19.1*       1.340        1.8         1.377        2.3
58      .778         2.1         .944         .8          .715         .8
59      -2.706*      6.8*        .107         .0          1.528        2.5
60      2.179*       7.1*        .036         .0          .532         .5
61      3.504*       15.7*       1.026        1.0         2.244        5.9
62      2.632*       9.2*        .678         .5          1.811        3.9
63      .604         1.7         -.499        .4          1.076        1.6
64      .845         2.2         -1.153       1.6         -1.683       2.4
65      2.000        5.9         -.317        .1          .888         1.0
66      2.859*       10.8*       .234         .0          .893         1.1
67      1.505        3.9         -1.657       3.1         1.102        1.6
68      1.270        3.2         .815         .6          .415         .3
69      -2.752*      7.4*        1.163        1.1         -.325        .1
70      1.814        5.0         .565         .3          1.019        1.3
71      2.886*       10.7*       -.756        .7          1.244        2.0
72      3.446*       14.7*       -.228        .1          .500         .4
73      .795         2.0         -1.058       1.3         .540         .4
74      -4.706*      19.3*       -3.412*      10.5*       -.316        .1
75      -2.259       5.0         1.216        1.2         -.004        .0
76      -.487        1.3         .698         .4          -.264        .0
77      -1.921       4.0         .503         .2          .154         .1
78      3.746*       17.4*       .341         .1          .374         .3
79      -1.481       2.9         -1.695       3.3         .518         .4
Note.
* DIF Items

4.2.2 Three Comparison Groups Using Three Computer Programs with 2PL

This study employs three computer programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, with the 2PL. IRTPRO and BILOG-MG 3 use the same methods as for the 1PL. For IRTLRDIF, Thissen (2001) stated that "if the value of G2(d.f.) exceeds 3.84 at α = .05 critical value of the chi-square distribution for one degree of freedom, df, fit additional models to compute single d.f., likelihood ratio tests appropriate for the item response model" (p. 8).

Table 10 displays the uniform and non-uniform DIF among the three comparison groups with the 2PL. For Whites vs. Blacks, 39 items were identified as statistically significant DIF items, including 15 uniform and 24 non-uniform DIF items. For Whites vs. Hispanics, a total of 16 items are DIF items, including two uniform and 14 non-uniform DIF items. For Whites vs. the Multi-Racial group, 24 items were identified as statistically significant DIF items, including five uniform and 19 non-uniform DIF items.

Table 11 shows the outcomes of the three computer programs for the three comparison groups with the 2PL. First, the three programs show that Items 1, 2, 8, 11, 13, 14, 15, 36, 38, 44, 45, 56, 57, 68, 72, and 78 are DIF items for Whites vs. Blacks; Items 2, 13, 14, 15, 56, 57, 68, 72, and 78 advantaged Whites, and Items 1, 8, 11, 36, 38, 44, and 45 favored Blacks. Second, three DIF items (Items 2, 13, and 51) exist for Whites vs. Hispanics; Items 2 and 13 favored Whites, and Item 51 advantaged Hispanics. Third, Items 3, 4, 8, and 44 are DIF items for Whites vs. the Multi-Racial group, and all of these items disadvantaged Whites. Overall, as with the 1PL, many DIF items (16) exist for Whites vs. Blacks, and only a few DIF items exist for Whites vs. Hispanics (three items) and Whites vs. the Multi-Racial group (four items).

Table 10
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs.
Blacks Whites vs. Hispanics 2 2 Item H0 : all equal G H0 : a H0 : b equal equal 1 6.1 0.0 2 3 12.1 0.7 2.7 4 5 6 7 2.1 0.2 0.4 2.7 8 11.2 6.1 Uniform Non9.5 uniform H0 : all equal G H0 : a H0 : b equal equal 7.5 5.9 9.6 2.6 1.4 Non1.7 uniform Non8.2 uniform 2.2 2.5 1.3 1.4 0.1 9 8.7 3.3 10 11 4.4 10.3 3.5 0.5 12 13 14 1.6 32.0 14.3 15 16 9.1 3.6 17 0.3 0.4 0.2 3.5 11.2 Uniform Non5.5 uniform Non1.0 uniform 9.8 Uniform 31.6 Uniform 14.0 Uniform Non5.6 uniform Whites vs.Multi_Racial 2 G H0 : all H0 : a H0 : b equal equal equal 0.7 0.0 8.2 11.6 3.0 0.2 0.9 2.4 7.5 5.0 3.5 Non1.5 uniform 3.1 3.3 3.0 11.1 0.5 6.1 1.0 1.6 11.1 Uniform 3.1 2.6 7.1 7.4 Uniform Non4.5 uniform 1.2 Non6.2 uniform 4.6 Non1.5 uniform 2.4 Non2.9 uniform 3.2 1.1 1.6 0.1 0.7 0.5 0.4 4.8 2.0 47 Non2.8 uniform 5.3 Table 10 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 2PL Whites vs. Blacks Whites vs. Hispanics 2 Item 18 H0 : all equal 0.6 G H0 : a H0 : b equal equal 19 20 21 22 1.0 6.1 2.4 3.3 23 24 3.4 2.1 25 26 4.7 3.5 3.8 27 9.3 5.9 28 4.1 1.4 29 10.2 4.7 0.3 5.8 Uniform Non0.9 uniform 30 31 4.0 0.2 3.2 Non3.4 uniform Non2.7 uniform Non5.5 uniform Non0.7 uniform 32 33 13.5 1.0 8.6 Non4.9 uniform 2 H0 : all equal 1.3 G H0 : a H0 : b equal equal 9.1 0.9 2.1 3.5 0.9 Non8.3 uniform Whites vs.Multi_Racial 2 G H0 : all H0 : a H0 : b equal equal equal 0.1 1.3 1.2 0.2 0.3 0.2 1.9 0.5 1.8 1.7 1.3 0.5 2.3 2.0 0.2 0.8 0.4 1.7 2.1 5.9 0.5 4.0 1.8 2.0 Non1.8 uniform 1.4 2.3 5.3 5.5 48 0.7 0.2 Non4.6 uniform 5.3 Uniform Table 10 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 2PL Whites vs. Blacks Whites vs. 
Hispanics 2 Item 34 H0 : all equal 2.1 G H0 : a H0 : b equal equal 35 5.8 5.4 36 10.3 2.6 37 4.6 0.0 38 9.4 1.6 39 40 41 42 43 0.1 7.2 0.1 0.6 1.8 44 Non0.4 uniform Non7.7 uniform 4.5 Uniform Non7.8 uniform 0.1 7.1 Uniform 36.6 0.4 45 17.0 6.2 36.3 Uniform Non10.9 uniform 46 2.9 47 1.8 48 8.7 8.6 Non0.2 uniform 2 H0 : all equal 0.4 G H0 : a H0 : b equal equal 5.4 1.4 Non4.0 uniform 5.8 Non0.4 uniform 0.5 Whites vs.Multi_Racial G2 H0 : all H0 : a H0 : b equal equal equal 0.7 3.4 1.5 Non0.0 uniform 4.0 4.0 3.1 5.6 0.2 1.6 0.5 3.8 0.7 3.7 4.7 0.3 3.4 1.0 1.9 4.4 5.4 Uniform Non0.4 uniform 2.4 7.4 1.1 Non6.3 uniform 3.0 2.0 2.2 7.5 7.5 0.1 4.3 4.2 3.4 6.3 5.4 6.2 49 Non0.0 uniform Non0.1 uniform Non0.9 uniform Table 10 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 2PL Whites vs. Blacks Whites vs. Hispanics 2 Item H0 : all equal G H0 : a H0 : b equal equal 49 1.0 50 6.6 3.4 51 52 12.7 0.7 6.6 Non3.2 uniform Non6.0 uniform 53 8.6 1.2 Non7.4 uniform 54 5.5 0.1 55 7.6 5.2 56 57 58 22.0 13.6 1.8 3.0 0.4 59 3.2 60 61 62 63 4.8 3.5 1.1 5.1 4.2 0.2 5.4 Uniform Non2.4 uniform Non19.0 uniform 13.3 Uniform Non0.6 uniform 5.0 Uniform 2 H0 : all equal G H0 : a H0 : b equal equal Whites vs.Multi_Racial 2 G H0 : all H0 : a H0 : b equal equal equal 1.4 6.1 1.6 3.8 13.9 0.3 0.2 13.8 Uniform 7.0 0.0 6.0 Non0.1 uniform 4.4 Non2.6 uniform 0.6 Non4.0 uniform 2.5 1.5 0.6 4.7 1.3 0.6 0.8 2.2 0.8 6.4 2.1 2.1 0.5 5.9 Uniform 2.4 4.0 1.2 Non2.8 uniform 0.1 0.9 2.2 0.3 5.2 3.5 0.1 5.1 Uniform 9.7 9.4 50 Non0.3 uniform 3.6 Table 10 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 2PL Whites vs. Blacks Whites vs. 
Hispanics 2 Item H0 : all equal 64 2.3 65 0.8 66 7.9 67 1.0 68 69 6.1 1.4 70 0.3 G H0 : a H0 : b equal equal 7.6 0.5 0.7 0.3 Non0.3 uniform 5.6 Uniform Non4.1 uniform 6.6 Uniform 2 H0 : all equal G H0 : a H0 : b equal equal 4.3 1.5 1.0 4.1 3.1 Non2.9 uniform Non1.0 uniform 1.2 0.8 2.4 Non2.2 uniform 4.1 Non0.1 uniform 6.3 2.4 4.7 1.7 3.7 Non3.9 uniform Non0.9 uniform 7.1 6.8 Non0.2 uniform 4.8 6.9 3.0 74 3.4 9.1 4.3 75 76 2.3 1.3 4.2 2.2 1.3 77 78 79 8.5 8.1 1.0 Non0.1 uniform 7.9 Uniform 2 G H0 : all H0 : a H0 : b equal equal equal 2.4 71 72 73 8.4 0.2 Whites vs.Multi_Racial 2.3 2.3 1.7 4.6 0.6 1.9 1.0 2.0 0.5 0.5 0.7 1.0 3.7 Non4.8 uniform Non2.9 uniform 0.5 0.1 0.5 4.1 1.6 0.3 51 Table 11 The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Group with 2PL Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Whites vs. Blacks IRTLRDIF BILOG-MG 3 IRTPRO 6.1 * -2.368 * 6.1 * 12.1 * 2.919 * 11.0 * 0.7 1.258 0.5 2.1 0.464 2.1 0.2 0.386 0.1 0.4 -0.035 0.4 2.7 -0.721 206.0 11.2 * -3.135 * 11.0 * 8.7 -1.619 7.8 4.4 -0.322 4.0 10.3 * -3.083 * 10.5 * 1.6 1.752 1.5 32.0 * 5.888 * 29.7 * 14.3 * 4.198 * 13.4 * 9.1 * 2.768 * 8.2 * 3.6 1.815 3.4 0.3 1.026 0.2 0.6 0.412 0.7 1.0 1.545 0.9 6.1 2.730 5.7 2.4 0.386 2.3 3.3 -1.351 3.3 3.4 1.769 3.2 2.1 -0.888 2.0 4.7 1.598 3.9 3.5 -0.720 3.5 Whites vs. Hispanics IRTLRDIF BILOG-MG 3 IRTPRO 7.5 1.313 6.7 9.6 * 2.711 * 8.6 * 2.6 1.739 2.4 2.2 0.995 2.3 2.5 -0.095 1.9 1.3 -0.335 1.2 1.4 -0.942 1.2 2.4 -0.975 3.0 5.0 -0.992 3.9 1.1 0.970 1.0 1.6 -1.148 1.8 3.0 0.599 1.9 11.1 * 3.382 * 10.6 * 0.5 -0.300 0.4 3.1 1.650 2.9 2.6 0.880 2.3 4.8 1.691 5.1 1.3 0.733 0.7 9.1 2.992 8.8 0.9 0.553 0.7 2.1 0.425 1.7 3.5 1.165 3.4 0.2 0.411 0.1 1.9 -0.567 1.7 1.7 0.834 1.2 1.3 -0.711 1.4 Note. 
* DIF Items 52 Whites vs.Multi-Racial IRTLRDIF BILOG-MG 3 IRTPRO 0.7 0.121 1.3 0.0 0.198 0.0 8.2 * -2.455 * 7.5 * 11.6 * -1.976 * 12.3 * 3.0 -0.682 3.3 0.2 0.470 0.2 0.9 0.017 1.6 7.5 * -2.277 * 7.4 * 3.2 -1.471 3.3 3.1 -0.171 2.7 3.3 -1.718 3.3 6.1 -1.029 7.4 1.0 1.216 0.9 1.6 1.049 1.7 0.5 0.335 0.4 0.4 0.030 0.1 5.3 -0.989 3.0 0.1 0.102 0.1 1.3 0.362 1.0 1.2 -0.740 1.2 0.2 0.116 0.2 0.3 0.184 0.4 0.5 0.756 0.3 1.8 -0.903 1.9 0.5 -0.284 0.4 2.3 0.023 2.3 Table 11 (Continued) The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Group with 2PL Whites vs. Blacks Whites vs. Hispanics Item IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 IRTPRO 27 9.3 -1.570 9.5 2.0 -0.565 1.9 28 4.1 -0.903 4.1 0.8 -0.164 1.0 29 10.2 -1.732 10.6 1.7 -0.985 2.0 30 4.0 1.633 3.9 5.9 1.593 5.4 31 0.2 1.049 0.2 0.5 -0.495 0.4 32 13.5 -1.634 12.3 1.8 1.345 1.8 33 1.0 0.287 0.9 2.0 0.374 1.4 34 2.1 -1.018 2.1 0.4 -0.119 0.4 35 5.8 0.603 5.8 5.4 -1.856 5.6 36 10.3 * -2.047 * 9.7 * 0.5 -0.223 0.4 37 4.6 1.643 4.5 6.2 -0.490 4.4 38 9.4 * -2.068 * 8.5 * 3.1 0.058 2.2 39 0.1 0.495 0.1 1.6 0.485 1.0 40 7.2 -1.800 6.2 0.5 -0.148 0.6 41 0.1 0.659 0.1 3.8 -1.443 2.9 42 0.6 0.238 0.5 0.7 -0.263 0.9 43 1.8 -0.421 1.7 3.7 -1.586 3.4 44 36.6 * -5.495 * 35.8 * 2.4 -1.367 2.1 45 17.0 * -2.485 * 14.7 * 3.0 0.175 2.5 46 2.9 -0.859 2.8 2.2 -0.443 2.1 47 1.8 0.771 19.0 0.1 0.009 0.1 48 8.7 1.094 9.1 3.4 1.264 3.2 49 1.0 1.720 0.9 1.4 0.595 1.5 50 6.6 -1.092 6.5 1.6 -0.925 1.2 51 12.7 -1.649 12.8 13.9 * -3.547 * 13.5 * 52 0.7 -0.212 0.6 0.3 -0.177 0.2 Note. 
* DIF Items 53 Whites vs.Multi-Racial BILOG-MG 3 IRTPRO IRTLRDIF 0.2 0.494 0.2 0.4 0.070 0.1 2.1 -1.299 2.1 1.4 1.374 1.2 2.3 1.734 1.9 5.3 -1.989 4.8 5.5 -1.956 4.9 0.7 -0.472 0.9 3.4 0.592 3.4 1.5 -0.925 1.4 4.0 0.116 4.2 5.6 -2.092 5.5 4.7 0.752 3.6 0.3 -0.150 0.2 3.4 1.676 3.0 1.0 0.493 1.0 1.9 0.465 2.0 7.4 * -2.368 * 7.7 * 2.0 -0.917 1.8 7.5 0.354 4.5 4.3 -0.099 3.9 6.3 1.204 5.9 6.1 0.545 7.1 3.8 -1.218 3.4 7.0 1.733 7.4 0.0 0.057 0.0 Table 11 (Continued) The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Group with 2PL Item 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 Whites vs. Hispanics Whites vs. Blacks IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 8.6 -1.616 6.4 2.5 -1.074 5.5 -1.704 5.2 0.6 -0.054 7.6 -0.772 6.8 1.3 -0.916 22.0 * 4.687 * 21.0 * 0.8 -0.304 13.6 * 3.638 * 13.0 * 2.2 1.254 1.8 1.106 1.8 0.8 0.862 3.2 0.153 3.2 2.4 0.404 4.8 1.456 4.1 0.1 -0.130 3.5 2.136 3.1 0.9 0.866 1.1 1.279 1.0 2.2 0.528 5.1 -1.096 3.9 9.7 -0.457 2.3 1.652 2.3 2.4 -1.129 0.8 0.236 0.5 1.0 -0.264 7.9 1.037 5.5 1.2 0.182 1.0 -0.088 0.8 6.3 -1.648 6.1 * 2.155 * 6.1 * 4.7 0.972 1.4 -0.814 1.4 1.7 1.308 0.3 0.573 0.1 7.1 0.453 4.8 2.387 4.8 2.3 -0.803 6.9 * 3.040 * 6.7 * 2.3 -0.358 3.0 -0.735 2.8 1.7 -1.108 3.4 -0.284 3.8 9.1 -1.456 2.3 0.559 2.1 4.2 1.254 1.3 0.933 1.3 2.2 0.870 8.5 0.217 8.6 0.7 0.639 8.1 * 3.430 * 7.8 * 1.0 0.162 1.0 -0.187 1.0 3.7 -1.438 Note. 
* DIF Items

IRTPRO 1.7 0.4 1.2 0.7 2.1 0.6 2.5 0.2 0.7 1.7 1.8 1.9 0.4 0.4 8.6 4.0 1.6 9.4 2.4 1.7 1.4 10.0 4.1 1.9 0.6 1.1 3.3 Whites vs. Multi-Racial IRTLRDIF BILOG-MG 3 IRTPRO 1.5 0.070 1.4 4.7 -1.729 4.2 0.6 -0.210 0.6 6.4 2.580 6.0 2.1 1.436 2.0 2.1 0.807 1.4 4.0 1.122 3.6 0.3 0.629 0.2 5.2 2.451 4.9 3.5 1.888 3.0 3.6 1.612 2.7 4.3 -1.528 3.7 4.1 1.377 3.7 0.8 1.165 0.8 4.6 1.920 2.5 0.6 0.533 0.4 1.9 -0.173 1.5 1.0 1.218 1.2 2.0 1.301 2.5 0.5 0.588 0.7 0.5 0.945 0.4 0.5 -0.003 0.3 0.1 0.199 0.1 0.5 -0.074 0.4 4.1 0.270 3.8 1.6 0.447 1.6 0.3 0.566 0.3

4.2.3 Three Comparison Groups Using Three Computer Programs with 3PL

Table 12 displays the uniform and non-uniform DIF among the three comparison groups with the 3PL. For Whites vs. Blacks, 43 items were identified as statistically significant DIF items, including 17 uniform and 26 non-uniform DIF items. For Whites vs. Hispanics, a total of 24 items are DIF items, including four uniform and 20 non-uniform DIF items. For Whites vs. the Multi-Racial group, 25 items were identified as statistically significant DIF items, including six uniform and 19 non-uniform DIF items.

Table 13 presents the outcomes of the three comparison groups using the three computer programs. For Whites vs. Blacks, the programs indicated that Items 13, 14, 15, 32, 44, 45, 56, 57, and 78 are DIF items; Items 32, 44, and 45 advantaged Blacks, and Items 13, 14, 15, 56, 57, and 78 favored Whites. Additionally, Items 13, 19, and 51 are DIF items for Whites vs. Hispanics; Items 13 and 19 favored Whites, and Item 51 favored Hispanics. Furthermore, only one DIF item (Item 44) is detected for Whites vs. the Multi-Racial group, and this item advantaged the Multi-Racial group.

Table 12
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks Whites vs.
Hispanics 2 2 G H0 : all H0 : a H0 : b H0 : c Item equal equal equal equal H0 : all H0 : a equal equal G H0 : b equal 1 5.1 0.4 0.0 4.8 Uniform 7.7 0.6 2.5 2 3 11.8 3.8 2.5 0.7 8.5 Uniform 9.5 2.3 6.5 2.9 4 0.5 H0 : c equal Non4.6 Uniform Non0.1 Uniform 2.7 5 6 2.2 3.1 7 5.8 0.0 2.3 3.7 8 11.4 0.5 0.3 10.6 9 10.6 0.2 1.7 8.7 10 7.1 0.5 4.1 2.4 11 11.3 9.6 0.0 1.7 12 2.2 13 34.4 0.9 6.4 27.1 NonUniform Uniform NonUniform NonUniform NonUniform Uniform 4.0 2.3 3.3 0.6 Non0.1 Uniform 4.1 0.7 0.2 Non3.3 Uniform 2.4 Whites vs.Multi-Racial G H0 : all H0 : a H0 : b equal equal equal 3.8 1.7 4.1 2.4 1.0 4.2 2.2 1.0 6.0 6.0 0.0 10.1 0.1 8.5 56 Non1.3 Uniform Non0.8 Uniform Non1.0 Uniform Non0.1 Uniform Non1.5 Uniform H0 : c equal 0.6 0.0 9.9 0.0 3.8 6.0 12.0 11.9 0.0 0.0 4.1 0.0 1.9 2.1 0.1 Uniform NonUniform NonUniform 5.1 1.9 0.2 NonUniform 3.9 3.8 0.0 0.2 7.6 7.5 0.1 0.0 1.7 7.2 6.9 2 3.8 3.1 1.0 NonUniform NonUniform Table 12 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 3PL Whites vs. Blacks G2 H0 : all H0 : a H0 : b H0 : c Item equal equal equal equal 14 17.1 0.0 2.3 15.1 Uniform Non15 9.8 0.2 8.6 1.0 Uniform 16 2.8 Whites vs. Hispanics G2 H0 : all H0 : a H0 : b H0 : c equal equal equal equal 0.5 3.4 2.2 2.5 8.8 1.0 0.0 7.9 5.3 0.6 5.0 0.0 NonUniform NonUniform 4.7 0.0 0.2 NonUniform 17 0.8 4.7 18 1.1 1.0 19 20 1.7 7.9 0.3 0.0 7.6 8.5 0.4 21 4.5 4.8 0.1 0.0 22 3.0 23 24 6.0 2.0 5.8 0.0 0.2 25 4.6 4.4 0.0 0.2 26 6.9 1.6 5.3 0.0 27 6.4 4.6 0.0 1.8 28 8.0 1.0 6.7 0.3 Uniform NonUniform NonUniform NonUniform NonUniform NonUniform NonUniform NonUniform Whites vs.Multi-Racial G2 H0 : all H0 : a H0 : b H0 : c equal equal equal equal 1.3 1.0 4.1 Non0.0 Uniform 0.0 0.5 8.7 0.0 Uniform 4.9 1.1 1.5 0.5 3.2 0.2 0.0 1.7 3.4 3.1 1.2 1.2 1.4 1.9 1.6 5.7 0.4 4.9 0.3 57 Non0.5 Uniform 0.6 Table 12 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 3PL Whites vs. Blacks G Whites vs. 
Hispanics 2 G H0 : all H0 : a H0 : b H0 : c Item equal equal equal equal H0 : all H0 : a equal equal NonUniform NonUniform 29 7.4 4.4 0.0 3.0 30 31 8.3 2.9 4.3 2.1 1.9 33 34 3.1 2.3 1.4 0.6 35 2.2 5.6 36 6.1 4.7 0.3 1.1 NonUniform 37 38 39 6.9 7.6 0.1 0.1 0.0 0.2 7.8 6.7 0.0 Uniform Uniform 40 41 42 5.0 1.4 1.5 1.0 6.5 0.0 NonUniform 4.9 2.8 3.4 43 6.3 2.5 1.4 2.3 NonUniform 44 45 34.2 12.4 6.2 1.2 12.7 11.4 15.3 0.0 Uniform Uniform H0 : b equal G H0 : c equal 2.3 6.0 0.6 2 H0 : all H0 : a H0 : b equal equal equal H0 : c equal 2.1 0.1 4.4 Non1.6 Uniform 1.5 2.1 6.1 1.0 5.6 0.1 Non0.0 Uniform 0.6 6.8 2.0 1.4 Whites vs.Multi-Racial 2 0.7 5.3 0.1 NonUniform 3.9 0.0 0.7 3.8 0.0 3.0 NonUniform Uniform 2.9 3.3 0.8 3.6 Non2.4 Uniform 2.5 3.6 0.9 4.7 6.8 3.5 0.0 3.3 2.2 0.0 4.8 58 0.1 Uniform 4.7 0.6 0.6 3.4 7.7 2.8 4.4 3.3 0.0 NonUniform NonUniform Table 12 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 3PL Whites vs. Blacks 2 G H0 : all H0 : a H0 : b H0 : c Item equal equal equal equal 46 7.0 0.0 6.4 0.6 47 7.7 0.0 0.4 7.3 48 6.8 3.8 2.2 0.8 49 50 1.5 4.4 0.0 3.8 0.6 51 52 53 54 15.2 1.2 8.8 5.0 16.0 0.2 0.0 0.0 0.9 8.9 1.0 0.0 3.2 55 4.0 1.2 3.1 0.0 56 32.5 31.5 0.3 0.7 57 58 17.9 3.0 15.5 2.1 0.3 59 60 61 62 2.5 0.0 4.0 0.2 0.0 3.7 0.3 Whites vs. 
Hispanics 2 G H0 : all H0 : a H0 : b H0 : c equal equal equal equal NonUniform NonUniform NonUniform Uniform NonUniform Uniform Uniform NonUniform NonUniform NonUniform Uniform 4.7 0.0 0.0 Non4.7 Uniform Whites vs.Multi-Racial G2 H0 : all H0 : a H0 : b H0 : c equal equal equal equal 6.6 1.5 2.2 2.9 NonUniform 0.0 2.7 2.6 7.1 2.5 4.5 0.1 1.5 3.0 7.2 5.0 6.2 0.3 0.9 4.7 0.1 0.0 7.6 0.1 2.2 4.6 3.4 4.9 0.0 NonUniform NonUniform Uniform NonUniform 0.0 4.9 0.0 Uniform 0.0 7.1 3.2 NonUniform 1.2 0.0 3.2 NonUniform 0.1 4.8 0.0 Uniform 14.5 0.6 2.2 1.4 1.1 12.1 1.3 Uniform 1.0 0.5 1.0 10.3 1.7 0.7 2.5 1.5 0.7 0.0 0.9 1.5 4.4 0.0 4.0 3.3 59 Table 12 (Continued) The Summary of IRTLRDIF for Three Comparison Groups with 3PL Whites vs. Blacks G2 H0 : all H0 : a H0 : b H0 : c Item equal equal equal equal Whites vs. Hispanics 2 G H0 : all H0 : a H0 : b H0 : c equal equal equal equal 9.8 8.5 0.3 2.8 0.0 0.1 4.5 0.5 1.1 1.1 3.0 Non1.0 Uniform Non0.4 Uniform 67 0.0 6.4 2.2 3.9 Non0.4 Uniform 68 69 12.2 1.7 63 4.8 64 65 66 0.0 11.2 7.1 0.2 0.0 0.8 Uniform NonUniform 2.9 2.2 70 71 0.3 4.2 72 73 11.0 1.2 74 75 76 3.7 2.8 2.3 6.7 3.1 1.5 77 78 79 0.9 9.2 0.3 0.8 1.0 4.0 0.0 5.3 0.0 8.2 1.5 1.4 0.9 6.9 1.5 Uniform NonUniform Uniform 7.4 1.9 Whites vs.Multi-Racial 2 G H0 : all H0 : a H0 : b H0 : c equal equal equal equal 1.4 5.0 2.6 0.0 2.0 0.0 3.1 Uniform 1.6 3.7 0.0 NonUniform 1.7 2.4 1.6 6.8 1.0 Non0.0 Uniform 1.9 1.1 0.8 1.3 0.6 0.0 9.3 0.1 0.0 4.4 60 Non0.0 Uniform 0.0 Uniform 1.3 2.9 1.1 4.8 0.8 0.9 Table 13 The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL Whites vs. Blacks Whites vs. 
Hispanics Whites vs.Multi-Racial IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 IRTPRO 2 2 2 2 2 2 (d ) (d ) (d ) (χ ) (χ ) (χ ) (χ ) (χ ) (χ ) Item 1 5.1 -1.475 4.9 7.7 1.194 1.5 0.6 0.367 0.5 2 11.8 1.239 7.8 9.5 1.354 8.4 0.0 -0.081 0.6 3 3.8 -0.100 2.0 2.3 1.013 5.7 9.9 -2.144 7.9 4 0.5 0.443 1.7 2.7 1.000 9.0 12.0 -0.816 20.9 5 2.2 -0.063 0.6 4.0 -0.873 1.2 4.1 -1.633 2.4 6 3.1 -0.328 2.5 2.3 -0.025 3.9 0.0 0.381 2.2 7 5.8 -1.191 2.7 4.1 -0.528 1.7 1.7 0.487 3.0 8 11.4 -2.299 6.4 2.4 -0.684 1.7 7.2 -2.100 6.9 9 10.6 -1.628 5.2 6.9 -1.299 2.8 3.8 -1.288 5.9 10 7.1 -1.738 3.8 4.1 1.264 9.5 3.1 -1.413 1.3 11 11.3 -1.137 11.0 4.2 -0.663 6.8 3.9 -0.797 10.0 12 2.2 1.413 4.1 6.0 0.038 0.9 7.6 -0.338 13.4 13 34.4 * 4.256 * 35.0 * 10.1 * 2.150 * 12.8 * 1.0 0.585 2.8 14 17.1 * 2.596 * 14.3 * 0.5 -0.764 1.2 1.3 0.962 5.1 15 9.8 * 2.287 * 10.6 * 3.4 1.503 6.8 2.2 0.439 6.5 16 2.8 1.023 2.0 2.5 0.799 2.0 8.8 0.240 1.0 17 0.8 0.319 1.5 4.7 1.678 10.2 5.3 -1.687 6.0 18 1.1 -0.032 2.2 1.0 -0.260 2.5 0.0 -0.446 5.5 19 1.7 0.673 2.0 8.5 * 2.026 * 10.9 * 4.9 -0.245 0.3 20 7.9 1.273 8.9 0.4 -0.295 4.3 1.1 -1.176 7.5 21 4.5 0.601 4.6 1.5 -0.112 0.4 0.5 0.107 0.8 22 3.0 -0.844 2.8 3.2 1.039 1.4 0.2 -0.228 0.0 23 6.0 0.444 3.2 0.0 0.036 1.0 3.4 0.403 0.4 24 2.0 -0.900 1.7 1.7 -1.077 0.9 3.1 -0.847 5.0 25 4.6 0.253 1.9 1.2 0.116 0.7 1.2 -0.724 2.7 26 6.9 -1.898 5.9 1.4 -0.523 1.0 1.9 -0.906 1.1 Note. * DIF Items 61 Table 13 (Continued) The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL Whites vs. Blacks Whites vs. 
Hispanics Whites vs.Multi-Racial IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 IRTPRO 2 2 2 2 2 2 (d ) (d ) (d ) (χ ) (χ ) (χ ) (χ ) (χ ) (χ ) Item 27 6.4 0.042 9.9 1.6 -0.102 1.3 0.4 -0.033 0.6 28 8.0 -2.322 4.8 5.7 -0.480 0.1 0.6 0.274 1.9 29 7.4 -0.425 7.2 2.3 -0.857 1.7 2.1 -1.313 1.8 30 8.3 1.441 11.4 6.0 1.707 9.3 1.5 0.778 3.7 31 2.9 0.024 2.4 0.6 -0.736 2.7 2.1 0.829 3.7 32 23.5 * -4.104 * 13.0 * 2.4 1.055 6.2 3.9 -1.619 8.3 33 3.1 -0.692 1.0 1.4 -0.415 0.5 6.1 -2.262 7.9 34 2.3 -1.095 1.5 0.6 -0.591 0.5 1.0 -1.028 0.6 35 2.2 1.144 4.4 5.6 -1.098 8.3 2.9 1.251 7.1 36 6.1 -1.033 9.0 * 0.6 -0.434 6.4 3.3 -1.103 3.7 37 6.9 1.176 5.9 6.8 -2.158 3.5 4.7 1.032 7.1 38 7.6 -2.289 4.7 2.0 -0.770 0.2 6.8 -1.960 6.2 39 0.1 0.027 0.4 1.4 0.692 5.9 3.5 -0.658 1.1 40 5.0 -1.716 6.1 2.5 -0.301 2.0 0.0 -0.554 3.9 41 1.4 0.183 1.8 3.6 -1.566 4.5 3.3 1.367 4.6 42 1.5 -0.288 0.5 0.9 0.076 0.3 2.2 -0.333 0.9 43 6.3 -0.649 5.7 4.9 -2.134 5.9 4.7 0.658 7.5 44 34.2 * -4.245 * 28.8 * 2.8 -1.758 2.8 7.7 * -2.134 * 8.2 * 45 12.4 * -2.366 * 9.5 * 3.4 -0.296 0.4 2.8 -0.978 2.1 46 7.0 -2.119 2.7 4.7 -0.277 14.1 6.6 -1.900 5.6 47 7.7 0.935 5.0 0.0 -0.188 0.9 2.7 -0.938 0.9 48 6.8 1.404 13.3 2.6 1.192 14.4 7.1 1.398 20.1 49 1.5 0.913 2.4 1.5 0.567 5.5 7.2 1.270 13.3 50 4.4 -1.941 1.5 3.0 -1.912 5.6 5.0 -2.617 4.9 51 15.2 -0.585 15.6 14.5 * -3.438 * 15.8 * 7.6 1.953 10.9 52 1.2 -0.577 1.0 0.6 -0.540 0.7 0.1 -0.033 0.1 Note. * DIF Items 62 Table 13 (Continued) The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL Whites vs. Hispanics Whites vs. 
Blacks IRTLRDIF BILOG-MG 3 IRTPRO IRTLRDIF BILOG-MG 3 IRTPRO 2 2 2 2 (d ) (d ) (χ ) (χ ) (χ ) (χ ) Item 53 8.8 -1.820 7.3 2.2 -1.210 4.3 54 5.0 -1.321 3.7 1.4 -0.467 0.2 55 4.0 -1.330 2.6 1.0 -1.115 3.4 56 32.5 * 3.093 * 11.4 * 1.0 -0.086 1.9 57 17.9 * 2.856 * 12.3 * 1.7 0.607 1.1 58 3.0 1.207 5.5 0.7 0.509 1.0 59 2.5 0.780 6.1 0.7 0.481 5.5 60 0.0 0.752 0.9 0.0 -0.220 1.4 61 4.0 1.481 3.6 0.9 0.685 1.1 62 0.2 0.790 1.9 1.5 0.373 0.3 63 4.8 -0.751 7.5 9.8 -0.419 2.1 64 2.8 0.976 2.4 4.5 -1.701 2.4 65 0.0 0.216 2.3 0.5 -0.152 8.5 66 0.1 0.738 0.8 1.1 0.376 1.8 67 0.0 -0.053 4.3 6.4 -1.221 11.7 68 12.2 1.538 4.7 2.9 0.252 1.3 69 1.7 -0.468 1.6 2.2 1.063 1.6 70 0.3 0.394 3.1 7.4 0.778 15.6 71 4.2 1.585 4.5 1.9 -0.505 4.3 72 11.0 2.009 6.6 1.9 -0.682 0.5 73 1.2 -0.617 6.1 1.1 -0.799 6.6 74 3.7 -0.204 3.1 6.7 0.169 7.9 75 2.8 0.692 1.6 3.1 0.515 2.8 76 2.3 0.184 1.5 1.5 0.909 1.5 77 0.9 0.196 0.9 0.8 0.468 5.6 78 9.2 * 2.463 * 12.0 * 1.0 -0.630 1.3 79 0.3 -0.171 0.6 4.0 -2.288 3.9 Note. * DIF Items 63 Whites vs.Multi-Racial IRTLRDIF BILOG-MG 3 IRTPRO 2 2 (d ) (χ ) (χ ) 2.2 4.6 0.5 10.3 2.5 1.5 4.4 0.0 4.0 3.3 1.4 5.0 2.6 0.0 1.7 2.4 1.6 0.8 1.3 0.6 0.0 1.3 2.9 1.1 4.8 0.8 0.9 0.024 -2.184 -0.669 2.049 1.108 0.011 0.138 0.348 1.657 1.303 0.647 -0.821 0.620 0.759 0.989 0.212 -0.699 0.835 0.750 0.206 0.482 0.365 -0.127 0.004 1.340 -0.438 0.223 6.8 5.4 0.8 4.7 3.4 0.3 3.7 1.4 5.9 5.1 0.8 6.6 11.4 3.7 2.7 0.5 1.0 2.1 1.0 0.2 4.4 3.4 0.5 1.5 9.0 0.4 0.9 4.2.4 Multiple Groups Using two Programs with Three Models The results of the BILOG-MG 3 and IRTPRO for Whites vs. Blacks, Hispanics, and Multi-Racial group with 1PL are given in Table 14. BILOG-MG 3 was detected in 36 items for Whites vs. Blacks, in six items for Whites vs. Hispanics, and in 10 items for Whites vs. the Multi-Racial group. Items 2, 13, and 74 are DIF in both Whites vs. Blacks and Whites vs. Hispanics. In addition, Items 8, 11, 32, 44, 56, and 61are DIF in both Whites vs. Blacks and Whites vs. 
the Multi-Racial group. On the other hands, IRTPRO detected less DIF items among the three comparison groups. There are 12 items for Whites vs. Blacks, three items for Whites vs. Hispanics, and two items for Whites vs. the Multi-Racial group. Based on the results, both BILOG-MG 3 and IRTPRO consistently detect DIF, which include Items 2, 8, 11, 13, 29, 30, 44, 56, 61, and 74 for Whites vs. Blacks and Item 3 for Whites vs. the Multi-Racial group. There is no the consistent DIF detection for Whites vs. Hispanics. 64 Table 14 The Summary of BILOG-MG 3 and IRTPRO for All Ethicities/Races with 1PL BILOG-MG 3 Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 W vs. B -5.216 * 2.902 * 1.878 0.450 -1.792 -1.252 -3.008 * -5.442 * -1.331 -0.455 -7.861 * 1.867 6.350 * 4.554 * 4.106 * 0.270 2.820 * 1.140 1.445 3.445 * -1.487 -3.817 * 3.197 * -1.512 2.770 * -4.575 * -3.163 * -3.946 * -5.413 * 2.194 * 2.084 * -2.360 * 0.355 -3.370 * -0.953 -1.992 * 1.124 -1.877 0.184 -0.302 W vs. H 1.057 2.919 * 1.877 1.101 0.486 -0.471 -1.435 -1.333 -0.865 1.004 -1.931 0.734 3.412 * -0.145 1.723 0.676 1.698 0.942 3.069 * 0.763 0.292 0.988 0.618 -0.579 1.010 -1.560 0.556 -0.623 -1.627 1.737 -0.256 1.303 0.480 -0.466 -2.344 * -0.037 -0.764 0.184 0.584 -0.152 IRTPRO W vs. MR -0.073 0.137 -2.385 * -1.979 * -1.004 0.365 -0.191 -2.717 * -1.542 -0.242 -2.262 * -1.074 1.093 0.942 0.140 -0.130 -1.239 -0.031 0.275 -0.742 -0.021 0.000 0.601 -1.029 -0.382 -0.221 0.377 -0.137 -1.738 1.238 1.532 -2.190 * -2.030 * -0.703 0.528 -0.929 -0.057 -2.099 * 0.680 0.161 Note. *DIF Items 65 W vs. B 0.7 7.8 * 0.3 0.3 2.0 0.3 3.3 14.5 * 3.6 0.4 19.4 * 0.4 23.5 * 3.5 5.8 0.4 1.0 1.3 6.7 * 1.5 0.2 0.4 3.2 2.1 1.7 4.9 0.3 2.1 12.2 * 6.9 * 2.2 1.2 0.8 2.8 1.9 1.4 0.2 2.8 1.2 0.2 W vs. H 12.2 * 0.3 2.4 0.9 0.2 0.5 0.6 0.5 1.1 0.7 3.0 2.1 1.1 5.7 2.0 0.2 2.8 0.2 1.8 5.1 1.1 8.8 * 1.4 0.2 1.9 1.8 6.5 * 2.5 1.5 0.5 0.5 0.9 2.0 1.6 0.6 0.5 1.9 0.2 0.4 0.3 W vs. 
MR 0.4 3.9 9.5 * 4.3 0.0 0.5 1.1 0.6 0.1 0.7 0.0 1.5 3.1 0.9 1.2 0.2 4.7 0.4 3.9 1.0 0.0 0.3 0.0 0.0 0.9 1.2 0.0 0.2 0.0 0.1 2.2 5.6 3.2 0.0 4.7 0.3 0.5 2.4 0.1 0.0 Table 14 (Continued) The Summary of BILOG-MG 3 and IRTPRO for All Ethicities/Races with 1PL Item 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 BILOG-MG 3 W vs. B W vs. H 0.756 -1.296 -1.040 -0.734 0.356 -1.204 -3.613 * -1.724 -0.954 0.300 0.117 -0.086 0.246 0.119 0.823 1.498 0.981 0.749 -0.657 -0.794 -1.061 -3.383 * -1.805 -1.000 0.059 -0.856 -0.820 0.040 -0.082 -0.774 2.213 * -0.278 1.789 1.340 0.778 0.944 -2.698 * 0.110 2.171 * 0.036 3.511 * 1.026 2.625 * 0.678 0.604 -0.501 0.836 -1.153 2.000 * -0.319 2.859 * 0.234 1.500 -1.656 1.270 0.815 -2.752 * 1.163 1.803 0.596 2.886 * -0.756 3.446 * -0.228 0.795 -1.058 -4.706 * -3.412 * -2.267 * 1.216 -0.487 0.698 -1.929 0.503 3.746 * 0.341 -1.481 -1.693 W vs. MR 1.461 0.332 0.340 -2.756 * -1.004 0.321 -0.170 -1.092 0.449 -1.256 1.707 -0.193 -0.133 -1.775 -0.293 2.506 * 1.377 0.715 1.528 0.532 2.244 * 1.811 1.076 -1.683 0.888 0.893 1.102 0.411 -0.325 1.019 1.244 0.500 0.540 -0.320 -0.004 -0.264 0.154 0.374 0.514 Note. *DIF Items 66 W vs. B 0.4 0.8 0.3 23.6 * 1.2 0.3 0.2 5.0 2.2 3.0 2.9 3.4 0.5 2.6 0.7 7.9 * 8.9 * 2.0 0.3 1.3 9.7 * 5.6 0.4 2.2 0.9 2.5 0.4 1.6 0.2 2.5 1.4 1.6 0.2 11.1 * 0.3 0.2 0.2 2.8 1.5 IRTPRO W vs. H 1.9 1.1 1.8 3.8 0.8 0.2 0.4 0.4 0.6 0.5 0.2 2.4 1.0 0.2 0.7 2.9 0.8 0.5 6.0 1.3 0.2 0.2 0.2 6.0 1.0 1.2 2.9 0.2 4.4 0.2 2.5 4.2 1.2 0.4 4.0 0.5 2.2 4.0 0.2 W vs. MR 4.9 0.8 1.8 0.2 0.8 0.2 0.0 0.1 0.0 0.0 14.9 * 0.5 0.5 1.5 0.3 4.6 0.0 0.0 1.3 0.3 0.9 0.8 1.7 0.0 0.9 0.3 4.9 0.1 0.9 0.2 2.6 0.5 1.8 4.9 0.7 0.3 0.0 0.0 3.2 Table 15 shows the results of the BILOG-MG 3 and IRTPRO for Whites vs. Blacks, Hispanics, and the Multi-Racial group with 2PL. There are 20 items to be detected using BILOG-MG 3 for Whites vs. Blacks, four items for Whites vs. 
Hispanics, and 10 items for Whites vs. Multi-Racial. Items 2 and 13 are detected DIF for both Whites vs. Blacks and Whites vs. Hispanics. In addition, Items 8, 11, 33, 44, 56, and 61are identified DIF for both Whites vs. Blacks and Whites vs. the Multi-Racial group, On the other hand, IRTPRO detected fewer DIF items. There are 15 items for Whites vs. Blacks, four items for Whites vs. Hispanics, and six items for Whites vs. the Multi-Racial group. Based on the results, Items 2, 8, 11, 13, 44, and 45 are detected by BILOG-MG 3 and IRTPRO for Whites vs. Blacks and Item 3 and 67 for Whites vs. the Multi-Racial group. There is no consistent DIF detection for Whites vs. Hispanics using BILOG-MG 3 and IRTPRO. 67 Table 15 The Summary of BILOG-MG 3 and IRTPRO for all Ethicities/Races with 2PL BILOG-MG 3 Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 W vs. B -2.217 * 2.949 * 1.298 0.626 0.317 0.028 -0.667 -3.015 * -1.684 -0.353 -3.078 * 1.783 6.267 * 5.220 * 1.289 4.442 * 0.964 0.275 1.487 1.797 0.225 -4.377 * 1.105 -1.383 0.267 -0.758 -1.673 -0.859 -1.735 0.934 -2.366 * 0.167 -2.343 * 1.005 -0.696 4.500 * -0.762 0.490 -1.330 0.763 W vs. H 1.335 2.799 * 1.775 1.066 -0.097 -0.329 -0.964 -0.965 -1.013 0.950 -1.118 0.654 3.412 * -0.260 1.669 0.870 1.801 0.777 3.039 * 0.603 0.490 1.210 0.417 -0.543 0.848 -0.814 0.728 -0.192 -0.910 1.642 -0.484 1.320 0.374 -0.123 -1.749 -0.144 -0.453 0.060 0.491 -0.121 IRTPTO W vs. MR 0.124 0.203 -2.492 * -1.945 -0.707 0.467 0.009 -4.802 * -1.339 -0.049 -7.717 * -1.159 1.215 1.230 0.145 0.043 -1.033 0.089 0.349 -0.761 0.118 0.191 0.755 -0.921 -0.303 0.020 0.491 0.064 -1.271 1.351 1.737 -2.060 * -2.005 * -0.481 0.610 -0.907 0.107 -2.141 * 1.000 -0.170 Note. *DIF Items 68 W vs. B 5.0 7.4 * 0.3 8.2 * 4.0 0.7 1.9 11.8 * 8.5 * 1.4 7.1 * 0.1 18.0 * 1.3 4.2 1.2 0.3 0.1 4.2 0.1 0.1 0.8 1.1 2.7 1.4 1.2 0.3 0.7 5.8 6.5 * 0.2 1.8 3.9 1.7 6.5 * 3.6 0.4 6.4 * 0.9 1.8 W vs. 
H 9.6 * 0.1 1.6 3.6 3.8 1.3 2.0 2.9 0.2 1.5 0.0 1.7 0.5 4.7 0.5 3.3 0.6 1.0 1.8 3.7 2.3 3.8 0.2 0.2 0.2 0.4 8.2 * 3.1 0.5 0.7 0.1 6.1 1.4 0.6 3.4 0.4 3.6 0.2 1.5 1.2 W vs. MR 2.6 4.8 9.8 * 5.5 0.1 0.6 0.7 1.4 3.6 3.1 0.6 8.1 * 3.2 1.7 1.2 1.2 4.1 0.9 5.4 2.0 1.9 3.0 0.0 2.4 1.3 3.1 0.7 0.3 0.4 2.1 2.3 6.1 3.5 0.0 2.8 0.3 8.4 * 5.2 4.7 0.6 Table 15 (Continued) The Summary of BILOG-MG 3 and IRTPRO for all Ethicities/Races with 2PL Item 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 BILOG-MG 3 W vs. H W vs. B 0.693 -1.461 0.254 -0.277 -0.395 -1.533 -5.503 * -1.329 -2.604 * 0.154 -0.928 -0.433 0.670 0.051 1.245 1.354 1.887 0.641 -1.200 -0.858 -1.543 -3.321 * -0.209 -0.183 -1.603 -1.083 -1.783 -0.029 -0.769 -0.971 4.729 * -0.332 3.662 * 1.295 1.009 0.899 0.160 0.445 1.467 -0.147 2.161 * 0.904 1.284 0.561 -1.188 -0.466 1.667 -1.064 0.288 -0.267 1.025 0.223 -0.106 -1.705 2.157 * 0.985 -0.864 1.320 0.608 0.500 2.427 * -0.842 3.020 * -0.359 -0.735 -1.137 -0.251 -1.402 0.559 1.308 1.028 0.888 0.255 0.723 3.371 * 0.193 -0.232 -1.417 W vs. MR 1.710 0.493 0.446 -2.380 * -0.961 0.341 -0.102 1.182 0.532 -1.200 1.726 0.061 0.061 -1.732 -0.236 2.588 * 1.430 0.796 1.108 0.628 2.503 * 2.000 1.709 -1.524 1.511 1.299 2.008 * 0.533 -0.177 1.216 1.315 0.585 0.981 0.007 0.207 -0.085 0.287 0.436 0.582 Note. *DIF Items 69 W vs. B 0.3 0.0 2.5 17.6 * 7.4 0.9 1.3 11.5 * 5.1 3.8 9.4 * 0.2 3.5 4.4 3.0 5.3 6.1 1.4 4.3 0.9 4.4 2.6 11.4 * 2.3 0.6 1.6 0.8 5.0 0.3 2.3 0.4 2.4 0.9 6.8 * 3.1 1.0 3.5 2.7 1.0 IRTPTO W vs. H 0.4 0.3 0.4 0.9 2.5 2.1 4.7 0.5 5.4 4.1 0.2 0.1 0.4 1.3 1.0 5.4 1.0 2.7 1.8 0.5 0.9 0.9 8.9 * 7.6 * 2.7 1.2 1.4 1.6 2.3 1.5 1.4 3.3 0.2 2.4 2.6 2.5 0.5 4.8 1.1 W vs. MR 5.7 2.1 3.8 1.5 1.5 7.0 * 1.8 0.2 0.6 0.4 13.0 * 0.2 1.9 1.4 0.6 3.9 1.5 0.3 4.2 0.3 1.3 2.3 5.6 1.5 0.9 0.5 6.7 * 1.5 2.3 4.6 5.2 0.9 1.9 3.8 1.8 1.0 2.2 0.1 3.0 Table 16 shows the 3PL by using BILOG-MG 3 and IRTPRO for Whites vs. all focal groups. 
For BILOG-MG 3, there are 12 items detected DIF for Whites vs. Blacks, six items for Whites vs. Hispanics, and five items for Whites vs. the Multi-Racial group. Items 13 and 15 are detected DIF for both Whites vs. Blacks and Whites vs. Hispanics, Item 51 for both Whites vs. Hispanics and Whites vs. the Multi-Racial group, and Items 56 is identified DIF for both Whites vs. Blacks and Whites vs. the Multi-Racial group. On the other hand, IRTPRO detected more DIF items than BILOG-MG 3 for Whites vs. Blacks, which are 16 items, with 3PL, four items for Whites vs. Hispanics, and two items for Whites vs. the Multi-Racial group. Items 49 and 65 are investigated DIF for both Whites vs. Blacks and Whites vs. Hispanics and Item 51 for Whites vs. Blacks and Whites vs. the Multi-Racial group. Moreover, the results indicated that Items 13, 15, and 44 are consistently detected by BILOG-MG 3 and IRTPRO for Whites vs. Blacks and for Whites vs. the Multi-Racial group. Overall, DIF exists in the GHSGPT in Social Studies when employing the three computer programs for the three comparison groups for the dichotomously scored items using three models. Figure 5 to Figure 13 display the DIF items between Whites vs. Blacks, Figure 14 and 16 demonstrate DIF items between Whites and Hispanics, and Figure 17 shows that DIF exists between Whites vs. the Multi-Racial group, with 3PL because 3PL shows a good fit to the data. 70 Table 16 The summary of BILOG-MG 3 and IRTPRO for all Ethicities/Races with 3PL Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 BILOG-MG 3 W vs.B W vs. 
H -1.384 1.186 1.459 1.804 0.319 1.554 0.923 1.583 0.026 -0.718 -0.104 0.180 -0.945 -0.471 -1.960 -0.430 -1.316 -0.985 -1.329 1.457 -1.357 -0.993 1.655 0.542 4.563 * 2.852 * 2.852 * -0.202 -2.482 * 2.075 * 1.183 1.019 -0.500 2.144 * 0.316 0.369 0.881 2.548 * 1.776 0.341 0.510 0.265 -0.796 1.190 0.832 0.487 -0.637 -0.645 0.633 0.593 -1.664 -0.393 0.118 0.123 -2.113 * -0.147 -0.461 -0.421 1.802 -2.231 * 0.313 -0.186 -3.524 * 1.386 -0.305 -0.012 -0.989 -0.326 -1.333 -0.772 -0.649 0.222 1.420 -1.588 -2.011 * -0.339 0.337 1.144 -1.479 0.240 W vs. MR 0.449 0.160 -1.821 -0.582 -1.461 0.524 0.443 -1.932 -1.151 -0.896 -1.545 -0.213 1.033 1.265 0.717 0.305 -1.380 0.056 0.183 -0.697 0.281 -0.082 0.778 -0.620 -0.480 -0.775 0.193 0.493 -1.065 1.214 1.381 -1.486 -2.101 * -0.788 1.403 -0.642 1.024 -1.631 -0.189 -0.133 Note. * DIF Items 71 W vs.B 1.3 7.9 1.5 12.0 * 2.4 4.3 2.3 7.5 6.1 3.1 8.4 * 1.3 22.6 * 5.5 11.2 * 2.2 4.0 5.6 5.6 5.9 0.7 1.0 0.6 3.1 0.3 1.8 1.6 0.3 3.1 15.7 * 4.3 3.0 2.0 0.9 7.7 9.1 * 2.6 4.2 2.6 3.2 IRTPTO W vs. H 3.4 1.2 5.7 6.3 2.1 0.8 2.7 1.2 1.1 2.2 3.0 2.3 2.1 6.5 3.6 2.1 2.5 0.8 1.2 8.4 * 1.9 2.8 3.4 0.5 4.2 2.1 7.3 3.3 1.8 0.9 0.7 12.0 * 3.9 0.6 2.6 0.2 2.7 0.4 0.9 1.5 W vs. MR 0.8 4.6 8.9 * 4.8 0.1 0.3 0.7 1.2 2.4 3.9 0.5 5.1 3.5 1.5 1.8 1.2 7.5 0.6 6.1 1.2 1.0 1.3 0.4 1.5 1.3 1.2 0.3 0.5 0.3 1.4 1.7 5.7 3.5 0.1 2.7 0.7 5.2 2.8 3.6 0.7 Table 16 (Continued) The summary of BILOG-MG 3 and IRTPRO for all Ethicities/Races with 3PL Item 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 BILOG-MG 3 W vs.B W vs. 
H 0.353 -1.243 0.000 0.031 -0.323 -1.404 -4.028 * -1.201 -2.162 * 0.047 -1.620 0.713 1.117 0.217 1.906 1.768 1.280 1.091 -1.459 -1.228 -0.419 -2.699 * -0.219 -0.531 -1.513 -0.846 -1.089 0.005 -1.000 -0.813 3.325 * 0.170 3.181 * 0.989 1.302 0.907 0.876 0.491 0.953 0.094 1.523 1.027 0.797 0.601 -0.900 -0.391 1.288 -1.339 0.220 -0.147 0.845 0.494 -0.082 -1.400 1.758 0.462 0.438 1.301 0.451 0.895 1.560 -0.333 2.124 * -0.443 -0.657 -0.876 -0.274 -1.886 0.696 0.595 0.471 1.132 0.481 0.586 2.740 * -0.140 0.054 -1.911 W vs. MR 1.673 -0.036 0.919 -1.748 -0.721 -1.294 -0.690 1.930 1.450 -2.071 * 2.097 * -0.234 0.313 -1.875 -0.277 2.498 * 1.503 -0.356 0.366 0.634 2.090 * 1.682 1.187 -0.633 1.174 1.258 1.686 0.400 -0.384 1.161 1.111 0.518 0.969 -0.802 0.027 0.149 1.656 0.027 0.502 Note. * DIF Items 72 W vs.B 3.3 0.6 8.2 * 12.4 3.4 6.5 0.7 28.0 * 11.8 * 4.1 16.1 * 0.7 9.9 * 3.5 2.9 6.9 7.1 2.2 8.3 * 1.2 5.9 2.9 2.7 1.1 11.9 * 1.7 8.6 * 4.1 0.3 10.2 * 1.8 0.8 8.8 * 5.7 3.0 1.4 7.2 2.3 1.3 IRTPTO W vs. H 1.5 0.1 2.0 1.6 1.2 2.6 4.2 2.6 8.3 * 3.1 0.7 0.2 2.7 0.3 2.6 7.7 2.9 1.2 1.2 3.5 0.9 0.4 1.1 6.9 8.5 * 6.0 4.1 1.0 2.3 2.8 2.6 5.2 3.3 1.2 1.5 2.6 3.1 5.1 0.5 W vs. MR 4.9 0.8 2.6 1.1 1.1 5.4 1.4 0.2 0.4 0.4 11.0 * 0.3 0.9 1.9 0.6 3.0 0.7 0.5 1.6 0.1 0.8 1.4 1.1 1.3 0.4 0.3 4.2 0.6 2.3 1.9 3.0 0.5 1.1 2.2 1.0 0.8 0.9 0.1 2.3 Figure 5. Item 13 between Whites and Blacks. Figure 6. Item 14 between Whites and Blacks. 73 Figure 7. Item 15 between Whites and Blacks. Figure 8. Item 32 between Whites and Blacks. 74 Figure 9. Item 44 between Whites and Blacks. Figure 10. Item 45 between Whites and Blacks. 75 Figure 11. Item 56 between Whites and Blacks. Figure 12. Item 57 between Whites and Blacks. 76 Figure 13. Item 78 between Whites and Blacks. Figure 14. Item 13 between Whites and Hispanics. 77 Figure 15. Item 19 between Whites and Hispanics. Figure 16. Item 51 between Whites and Hispanics. 78 Figure 17. Item 44 between Whites and the Multi-Racial Group. 
CHAPTER 5

SUMMARY AND DISCUSSION

The purpose of this study, which employed data from the Georgia High School Graduation Predictor Test (GHSGPT) for Social Studies, was to analyze academic performance by ethnicity/race. IRTPRO, BILOG-MG 3, and IRTLRDIF were used to investigate DIF across the reference and focal groups with the 1PL, 2PL, and 3PL models. With 1PL, the two programs IRTPRO and BILOG-MG 3 identically detected 35 DIF items for Whites vs. Blacks, five DIF items for Whites vs. Hispanics, and three DIF items for Whites vs. the Multi-Racial group. With 2PL, the three programs IRTPRO, BILOG-MG 3, and IRTLRDIF consistently detected DIF: 16 DIF items for Whites vs. Blacks, three for Whites vs. Hispanics, and four for Whites vs. the Multi-Racial group. Likewise, with 3PL the three programs identically detected DIF: nine DIF items for Whites vs. Blacks, three for Whites vs. Hispanics, and one for Whites vs. the Multi-Racial group. Based on the results of both BILOG-MG 3 and IRTPRO, 3PL provided a good fit for the data.

5.1 Summary

This study employed GHSGPT data to consider whether DIF for different ethnicities/races exists in the GHSGPT for Social Studies. This thesis analyzed only 79 of the 80 GHSGPT Social Studies items because the Pearson and biserial correlations of Item 26 were negative (-.40 and -.053, respectively). Hence Item 26 was omitted from the calibration, and the remaining items were renumbered to maintain consecutive numbering. The results are summarized below:

1. The Results Based on Classical Test Theory (CTT)

The average p-value (the rate of correct responses) is .518, and 62 items (77%) fall between .3 and .7, so the difficulty is moderate and tends toward easy. The average discrimination is .304, with 30 items (38%) below .3, so the items are moderately rather than highly discriminating.
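The classical item statistics summarized above (p-values and item discrimination) can be computed directly from a 0/1 response matrix. This is a minimal sketch, not the thesis's actual computation; the `item_statistics` helper and the data layout are illustrative assumptions:

```python
import numpy as np

def item_statistics(responses):
    """Classical item statistics for a 0/1 response matrix
    (rows = examinees, columns = items). Illustrative sketch only."""
    responses = np.asarray(responses, dtype=float)
    p_values = responses.mean(axis=0)  # proportion answering each item correctly
    totals = responses.sum(axis=1)
    # Corrected item-total correlation: each item against the total
    # score of the remaining items.
    discrimination = np.array([
        np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
        for j in range(responses.shape[1])
    ])
    return p_values, discrimination
```

With real GHSGPT data, the mean of `p_values` would reproduce the .518 average difficulty reported above.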
In addition, the Pearson and biserial correlations are positive.

2. The Results Based on Item Response Theory (IRT)

a. Item Discrimination Parameter

The average item discrimination is .519 with 2PL and .968 with 3PL; thus, the degrees of discrimination for both 2PL and 3PL are acceptable.

b. Item Difficulty Parameter

The average item difficulty is -.173 with 1PL, .266 with 2PL, and .650 with 3PL. The degrees of difficulty for the three models are moderate; however, 1PL and 2PL tend toward easy, and 3PL tends toward difficult.

c. The Lower Asymptote (Pseudo-Guessing Parameter)

The mean of the pseudo-guessing parameter for 3PL is .224; therefore, it is not high.

3. Detecting DIF Using the Three Computer Programs

IRTPRO, BILOG-MG 3, and IRTLRDIF were used to assess the 79 items to detect whether DIF for ethnicities/races exists on the GHSGPT for Social Studies at α = .05. Whites were regarded as the reference group, and Blacks, Hispanics, and the Multi-Racial group were considered the focal groups. For 1PL, items are considered DIF when BILOG-MG 3 and IRTPRO consistently detected DIF. For 2PL and 3PL, items are counted as DIF when all three programs identically detected them.

a. The One-Parameter Logistic Model

There were 35 DIF items for Whites vs. Blacks; 15 items advantaged Blacks, and 20 advantaged Whites. In addition, five DIF items existed for Whites vs. Hispanics; three favored Whites, and two favored Hispanics. Moreover, three DIF items existed for Whites vs. the Multi-Racial group, all of which advantaged Whites.

b. The Two-Parameter Logistic Model

There were 16 DIF items for Whites vs. Blacks; nine items advantaged Whites, and seven favored Blacks. Three items showed DIF for Whites vs. Hispanics; two favored Whites, and one advantaged Hispanics. Four DIF items existed for Whites vs. the Multi-Racial group, and all advantaged the Multi-Racial group.

c.
The Three-Parameter Logistic Model

There were nine DIF items found for Whites vs. Blacks; three items advantaged Whites, and six favored Blacks. Furthermore, three DIF items were found for Whites vs. Hispanics; two advantaged Whites and one Hispanics. Additionally, only one DIF item was found for Whites vs. the Multi-Racial group, and it advantaged the Multi-Racial group.

4. Using IRTPRO and BILOG-MG 3 to Investigate DIF in Multiple Groups

DIF was also examined in the multiple-group analyses, parallel to the three pairwise comparisons. If both IRTPRO and BILOG-MG 3 identically detected DIF, those items are counted as DIF.

a. The One-Parameter Logistic Model

There were ten DIF items for Whites vs. Blacks; five items favored Whites, and five favored Blacks. There was one DIF item for Whites vs. the Multi-Racial group, and it advantaged the Multi-Racial group. IRTPRO and BILOG-MG 3 did not identically detect DIF for Whites vs. Hispanics.

b. The Two-Parameter Logistic Model

BILOG-MG 3 and IRTPRO both identified seven DIF items for Whites vs. Blacks; four items advantaged Whites and three Blacks. Two items were detected for Whites vs. the Multi-Racial group; one favored Whites, and one favored the Multi-Racial group.

c. The Three-Parameter Logistic Model

Three DIF items were consistently detected by the two programs for Whites vs. Blacks; two items advantaged Whites, and one advantaged Blacks. Only one DIF item was detected for Whites vs. the Multi-Racial group, and it favored Whites. There was no consistent DIF item for Whites vs. Hispanics with 2PL or 3PL.

5.2 Discussion

Currently, DIF detection procedures have been developed exclusively for comparisons between a reference (majority) group and a focal (minority) group, such as between Whites and Blacks, or males and females. Some previous social science studies consider all minorities as a homogeneous group.
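The consistency rule used for the multiple-group analyses (an item counts as DIF only when both IRTPRO and BILOG-MG 3 flag it) amounts to a set intersection. In this sketch, the resulting item numbers match the consistent 1PL Whites vs. Blacks results reported earlier, but the two per-program flag sets themselves are hypothetical, abbreviated stand-ins for each program's full output:

```python
# Hypothetical flag sets for one comparison (Whites vs. Blacks, 1PL);
# in the actual study each program flagged more items than shown here.
bilog_flags = {2, 8, 11, 13, 29, 30, 44, 56, 61, 74}
irtpro_flags = {2, 8, 11, 13, 19, 29, 30, 44, 56, 57, 61, 74}

# An item is reported as DIF only when both programs flag it.
consistent_dif = sorted(bilog_flags & irtpro_flags)
```

The same intersection rule extends to three programs for the 2PL and 3PL pairwise comparisons by intersecting a third set.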
For instance, several studies noted that racial differences in assessment have primarily been examined through comparisons between Whites and minority groups, including Blacks, Asians, Hispanics, and Native Americans. However, there is no evidence that Blacks and Hispanics are similar in this regard (Logan et al., 2012). Thus, this study shows that DIF detection differs by ethnicity. In addition, previous studies (Freedle & Kostin, 1988; Coffman & Belue, 2009) investigated the scores of either Whites and Blacks, Whites and Hispanics, or other single comparison groups. However, numerous focal groups, for example Asians, African Americans, Hispanics, Native Americans, females, and examinees with disabilities, are available for study (Zieky, 1993). Thus, this thesis extends the line of prior research by using three comparison groups, (1) Whites vs. Blacks, (2) Whites vs. Hispanics, and (3) Whites vs. a Multi-Racial group, to determine which items contain bias for a specific race/ethnicity. IRTPRO, BILOG-MG 3, and IRTLRDIF with three popular IRT models were used to detect DIF.

This study met with some problems when calibrating the 3PL model using BILOG-MG 3. These problems may have resulted from the small sample sizes of two focal groups, the Hispanic and Multi-Racial groups, numbering 114 and 132, respectively. The default prior (GPRIOR) in BILOG-MG 3 could not be employed because estimation stopped when calibrating Item 59 for two comparison groups, Whites vs. Hispanics and Whites vs. the Multi-Racial group. Therefore, this study changed the prior to TPRIOR in BILOG-MG 3 and employed a beta (4, 16) prior in IRTPRO. In addition, when calibrating these two comparison groups, Whites vs. Hispanics and Whites vs. the Multi-Racial group, with 3PL using IRTLRDIF, several discrimination values appeared very large, such as Item 74 (186.82) for Whites vs. Hispanics and Item 16 (78.68) for Whites vs. the Multi-Racial group.
Nevertheless, because these might be estimation errors and the purpose of the present study is to detect DIF in the GHSGPT, the estimates were not changed. The discussion below follows the order of the five hypotheses in presenting the study's findings.

Hypotheses one and two: The three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, will exhibit consistent results when testing for DIF, and IRTPRO will prove effective in detecting DIF.

Based on the results for the detection of DIF, the three programs are consistent across the three comparison groups. The rate of consistency between IRTLRDIF and IRTPRO was the highest; the rates between IRTLRDIF and BILOG-MG 3 and between BILOG-MG 3 and IRTPRO were high. The rate of consistency between BILOG-MG 3 and IRTPRO for the multiple-group analyses was moderate. Overall, the three computer programs displayed high consistency in detecting DIF in this study. Furthermore, because IRTPRO displayed results identical to those of IRTLRDIF and BILOG-MG 3 for the three comparison groups, it is effective in detecting DIF.

Hypothesis three: Which model best fits the data for detecting DIF?

According to Tables 7 and 8, for both BILOG-MG 3 and IRTPRO, the -2 loglikelihood of 3PL for each comparison group is smaller than the -2 loglikelihoods of 2PL and 1PL. Thus, 3PL is the best-fitting model for detecting DIF in the GHSGPT.

Hypothesis four: Were there differences between the ethnic groups?

The total score for each group is computed as

Total score = (Total correct responses) / (Number of examinees in each race × Total number of items) × 100%   (35)

According to the total scores for each race, Whites scored 55%, Blacks 46%, Hispanics 51%, and the Multi-Racial group 54%. In general, Whites performed better than the other groups. Perhaps because of differences in cultural background and community region, Blacks performed worse than the other groups.

Hypothesis five: DIF exists between ethnic groups on the GHSGPT.
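The total-score computation in Equation (35) can be sketched as follows; the function name and the example figures are illustrative, not the thesis's actual counts:

```python
def group_percent_score(total_correct, n_examinees, n_items):
    """Equation (35): a group's percent score is its total number of
    correct responses divided by (examinees x items), times 100."""
    return total_correct / (n_examinees * n_items) * 100

# Illustrative only: 50 examinees on 79 items with 1975 correct
# responses would score 50%.
```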
The three computer programs consistently showed that DIF exists between ethnic groups. In addition, the findings indicated that several items advantaged specific races. Although the results supported all of the hypotheses, there are several limitations. First, this study does not control for gender, individual socioeconomic status (SES), or school region. Second, because the present study was unable to obtain the item content, the distractors could not be analyzed; thus, it could not further investigate items with lower response rates or why Blacks performed worse than the other groups. Third, this study evaluates IRTPRO with empirical data only; to evaluate IRTPRO more accurately, researchers should employ both simulated and empirical data in detecting DIF in future studies. Additionally, researchers may consider that school region might affect the probability of answering an item correctly. For example, if a school has enough funding to hire additional teachers for tutoring, students might perform better because of this additional help. Thus, researchers can adopt multilevel IRT, for example with the HLM program or flexMIRT, to better understand school-level variables that may influence the relationships observed here.

In sum, DIF is an important tool for helping test developers recognize questions that may be unfair to test-takers because of their gender, ethnicity/race, or cultural background (Zieky, 1993). In other words, DIF is a particularly useful instrument for test developers. This study presents DIF detection results from empirical tests and provides important DIF information for the test developers of the Georgia High School Graduation Predictor Test. They can consider eliminating or revising several items, such as Items 52, 59, 74, 77, and 79, that are beneficial or adverse for particular races.
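The item-level flags reported throughout rest on likelihood-ratio logic: a compact model constrains the studied item's parameters across groups, an augmented model frees them, and the drop in -2 loglikelihood (G²) is compared with a chi-square critical value at α = .05. A minimal sketch of that decision rule, with hypothetical -2 loglikelihood values and df = 1 assumed:

```python
def lr_dif_flag(neg2ll_compact, neg2ll_augmented, critical_value=3.84):
    """Likelihood-ratio DIF test sketch. G^2 is the improvement in
    -2 loglikelihood when the studied item's parameters are freed
    across groups; the item is flagged when G^2 exceeds the chi-square
    critical value (3.84 for df = 1 at alpha = .05)."""
    g2 = neg2ll_compact - neg2ll_augmented
    return g2, g2 > critical_value
```

For 2PL and 3PL items, more parameters are freed, so the degrees of freedom and the critical value change accordingly.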
Furthermore, it examines the new program, IRTPRO, to demonstrate and determine its effectiveness in detecting DIF.

REFERENCES

Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL: Waveland Press.

American Psychological Association, c/o Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC: Author.

Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-24). Hillsdale, NJ: Lawrence Erlbaum Associates.

Baker, F. B. (2001). The basics of item response theory. New York, NY: ERIC Clearinghouse on Assessment and Evaluation.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques. Boca Raton, FL: Taylor & Francis.

Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins University Press.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 392-479). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.

Bolt, D. M. (2000). A SIBTEST approach to testing DIF hypotheses using experimentally designed test items. Journal of Educational Measurement, 37, 307-327.

Brescia, W., & Fortune, J. C. (1988). Standardized testing of American Indian students. ERIC Clearinghouse on Rural Education and Small Schools, Las Cruces, NM. Retrieved January 31, 2012, from http://www.enc.org/topics/equity/articles/document.shtm?=ACQ111498-1498

Cai, L., Thissen, D., & du Toit, S. (2011). IRTPRO 2.1 [Computer software]. Lincolnwood, IL: Scientific Software International.

Cai, L. (2012). flexMIRT version 1.86: A numerical engine for multilevel item factor analysis and test scoring [Computer software]. Seattle, WA: Vector Psychometric Group.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.

Coffman, D. L., & Belue, R. (2009). Disparities in sense of community: True race differences or differential item functioning? Journal of Community Psychology, 37, 547-558.

Cohen, A. S., & Kim, S.-H. (1993). A comparison of Lord's χ2 and Raju's area measures in detection of DIF. Applied Psychological Measurement, 17, 39-52.

Cohen, A. S., Kim, S.-H., & Wollack, J. A. (1996). An investigation of the likelihood ratio test for detection of differential item functioning. Applied Psychological Measurement, 20, 15-26.

Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Mason, OH: Cengage Learning.

Czepiel, S. A. (2002). Maximum likelihood estimation of logistic regression models: Theory and implementation. Retrieved from http://czep.net/stat/mlelr.pdf

Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368.

Dorans, N. J., & Schmitt, A. P. (1991). Constructed response and differential item functioning: A pragmatic approach (ETS-RR-91-47). Princeton, NJ: Educational Testing Service.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Freedle, R., & Kostin, I. (1988). Relationship between item characteristics and an index of differential item functioning (DIF) for the four GRE verbal item types (ETS RR-88-29). Princeton, NJ: Educational Testing Service.

Georgia Department of Education. (2010). Test content descriptions based on the Georgia performance standards: Social studies. Retrieved from http://archives.gadoe.org/DMGetDocument.aspx/GHSGT%20Social%20Studies%20Content%20Descriptions%20GPS%20Version%20Update%20Oct%202010.pdf?p=6CC6799F8C1371F6A344D9C15C23A9D859A861593B934AB75F446073BD12714C&Type=D

Gronlund, N. E. (1993). How to make achievement tests and assessments (5th ed.). Boston, MA: Allyn and Bacon.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their application to test development. Educational Measurement: Issues and Practice, 12, 38-47.

Harwell, M. R., Baker, F. B., & Zwarts, M. (1988). Item parameter estimation via marginal maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics, 13, 247-271.

Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Kim, S.-H., Cohen, A. S., & Park, T. H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261-278.

Ling, S. E., & Lau, S. H. (2005). Detecting differential item functioning (DIF) in standardized multiple-choice test: An application of item response theory (IRT) using three parameter logistic model. Retrieved January 31, 2012, from http://www.ipbl.edu.my/inter/penyelidikan/seminarpapers/2005/lingUITM.pdf

Logan, J. R., Minca, E., & Adar, S. (2012, January 10). The geography of inequality: Why separate means unequal in American public schools. Sociology of Education. Advance online publication. doi:10.1177/0038040711431588

Lord, F. M. (1952). A theory of test scores. Psychometric Monograph No. 7.

Lord, F. M. (1953). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13, 517-548.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 39, 247-264.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.

Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-37.

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.

Raju, N. S., & Drasgow, F. (1993). An empirical comparison of the area method, Lord's chi-square test, and the Mantel-Haenszel technique for assessing differential item functioning. Educational and Psychological Measurement, 53, 301-314.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research.

Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). Biased item detection techniques. Journal of Educational Statistics, 5, 213-233.

Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27, 67-81.

Shealy, R. T., & Stout, W. F. (1993). An item response theory model for test bias and differential test functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197-239). Hillsdale, NJ: Lawrence Erlbaum Associates.

Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage.

Steinberg, L. (1994). Context and serial-order effects in personality measurement: Limits on the generality of measuring changes the measure. Journal of Personality and Social Psychology, 66, 341-349.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118-128.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-114). Hillsdale, NJ: Lawrence Erlbaum Associates.

Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [Computer software documentation]. Chapel Hill: L. L. Thurstone Psychometric Laboratory, University of North Carolina.

Van der Linden, W. J., & Hambleton, R. K. (1996). Handbook of modern item response theory. New York, NY: Springer-Verlag.

Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection.
Journal of Educational Measurement, 28, 197-219.
Wang, X.-B., Wainer, H., & Thissen, D. (1995). On the viability of some untestable assumptions in equating exams that allow examinee choice. Applied Measurement in Education, 8, 211-225.
Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33, 42-57.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337-347). Hillsdale, NJ: Lawrence Erlbaum Associates.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3 [Computer software]. Lincolnwood, IL: Scientific Software International.

APPENDICES

A. IRTPRO Input File for DIF Detection for Two Groups with 3PL

Project:
  Name = WALL;
Data:
  File = .\WALL.ssig;
Analysis:
  Name = 3PL;
  Mode = Calibration;
Title:
  Master Thesis 3PL DIF
Comments:
  3PL models fitted to each of the 79 items.
Estimation:
  Method = BAEM;
  E-Step = 500, 1e-005;
  SE = S-EM;
  M-Step = 50, 1e-006;
  Quadrature = 49, 6;
  SEM = 0.001;
  SS = 1e-005;
Scoring:
  Mean = 0;
  SD = 1;
Miscellaneous:
  Decimal = 2;
  Processors = 2;
  Print CTLD, P-Nums, Diagnostic;
  Min Exp = 1;
Groups:
  Variable = group;
Group G1:
  Value = (1);
  Dimension = 1;
  Items = Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14,
          Q15, Q16, Q17, Q18, Q19, Q20, Q21, Q22, Q23, Q24, Q25, Q26,
          Q27, Q28, Q29, Q30, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38,
          Q39, Q40, Q41, Q42, Q43, Q44, Q45, Q46, Q47, Q48, Q49, Q50,
          Q51, Q52, Q53, Q54, Q55, Q56, Q57, Q58, Q59, Q60, Q61, Q62,
          Q63, Q64, Q65, Q66, Q67, Q68, Q69, Q70, Q71, Q72, Q73, Q74,
          Q75, Q76, Q77, Q78, Q79;
  Codes(Q1) = 0(0), 1(1);
  Codes(Q2) = 0(0), 1(1);
  ⋮
  Codes(Q78) = 0(0), 1(1);
  Codes(Q79) = 0(0), 1(1);
  Model(Q1) = 3PL;
  Model(Q2) = 3PL;
  ⋮
  Model(Q78) = 3PL;
  Model(Q79) = 3PL;
  Referenced;
  Mean = 0.0;
  Covariance = 1.0;
Group G2:
  Value = (2);
  Dimension = 1;
  Items = Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14,
          Q15, Q16, Q17, Q18, Q19, Q20, Q21, Q22, Q23, Q24, Q25, Q26,
          Q27, Q28, Q29, Q30, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38,
          Q39, Q40, Q41, Q42, Q43, Q44, Q45, Q46, Q47, Q48, Q49, Q50,
          Q51, Q52, Q53, Q54, Q55, Q56, Q57, Q58, Q59, Q60, Q61, Q62,
          Q63, Q64, Q65, Q66, Q67, Q68, Q69, Q70, Q71, Q72, Q73, Q74,
          Q75, Q76, Q77, Q78, Q79;
  Codes(Q1) = 0(0), 1(1);
  Codes(Q2) = 0(0), 1(1);
  ⋮
  Codes(Q78) = 0(0), 1(1);
  Codes(Q79) = 0(0), 1(1);
  Model(Q1) = 3PL;
  Model(Q2) = 3PL;
  ⋮
  Model(Q78) = 3PL;
  Model(Q79) = 3PL;
  Mean = Free;
  Covariance = Free;
DIF All:
Constraints:
  Equal = (G1, Q1, Slope[0]), (G2, Q1, Slope[0]);
  Equal = (G1, Q1, Intercept[0]), (G2, Q1, Intercept[0]);
  Equal = (G1, Q1, Guessing[0]), (G2, Q1, Guessing[0]);
  Equal = (G1, Q2, Slope[0]), (G2, Q2, Slope[0]);
  Equal = (G1, Q2, Intercept[0]), (G2, Q2, Intercept[0]);
  Equal = (G1, Q2, Guessing[0]), (G2, Q2, Guessing[0]);
  ⋮
  Equal = (G1, Q78, Slope[0]), (G2, Q78, Slope[0]);
  Equal = (G1, Q78, Intercept[0]), (G2, Q78, Intercept[0]);
  Equal = (G1, Q78, Guessing[0]), (G2, Q78, Guessing[0]);
  Equal = (G1, Q79, Slope[0]), (G2, Q79, Slope[0]);
  Equal = (G1, Q79, Intercept[0]), (G2, Q79, Intercept[0]);
  Equal = (G1, Q79, Guessing[0]), (G2, Q79, Guessing[0]);
Priors:
  (G1, Q1, Slope[0]) = Lognormal, 0, 1;
  (G1, Q1, Intercept[0]) = Normal, 0, 3;
  (G1, Q1, Guessing[0]) = Beta, 4, 16;
  (G1, Q2, Slope[0]) = Lognormal, 0, 1;
  (G1, Q2, Intercept[0]) = Normal, 0, 3;
  (G1, Q2, Guessing[0]) = Beta, 4, 16;
  ⋮
  (G1, Q78, Slope[0]) = Lognormal, 0, 1;
  (G1, Q78, Intercept[0]) = Normal, 0, 3;
  (G1, Q78, Guessing[0]) = Beta, 4, 16;
  (G1, Q79, Slope[0]) = Lognormal, 0, 1;
  (G1, Q79, Intercept[0]) = Normal, 0, 3;
  (G1, Q79, Guessing[0]) = Beta, 4, 16;
  (G2, Q1, Slope[0]) = Lognormal, 0, 1;
  (G2, Q1, Intercept[0]) = Normal, 0, 3;
  (G2, Q1, Guessing[0]) = Beta, 4, 16;
  (G2, Q2, Slope[0]) = Lognormal, 0, 1;
  (G2, Q2, Intercept[0]) = Normal, 0, 3;
  (G2, Q2, Guessing[0]) = Beta, 4, 16;
  ⋮
  (G2, Q78, Slope[0]) = Lognormal, 0, 1;
  (G2, Q78, Intercept[0]) = Normal, 0, 3;
  (G2, Q78, Guessing[0]) = Beta, 4, 16;
  (G2, Q79, Slope[0]) = Lognormal, 0, 1;
  (G2, Q79, Intercept[0]) = Normal, 0, 3;
  (G2, Q79, Guessing[0]) = Beta, 4, 16;

B. BILOG-MG 3 Input File for DIF Detection for Two Groups with 3PL

Master Thesis All Races 3PL DIF
>COMMENT
An empirical comparison of the three programs is presented using the fall 2010
data of the GHSGPT. This study detects DIF for different ethnicities only in
social studies, which consists of 79 dichotomously scored items.
>GLOBAL DFName = 'D:\Thesis\Result\BL\WALL\WALL.1.dat', NPArm = 3;
>LENGTH NITems = (79);
>INPUT NTOtal = 79, NIDchar = 4, NGRoup = 4, DIF;
>ITEMS ;
>TEST1 TNAme = 'WALL3PL', INUmber = (1(1)79);
>GROUP1 GNAme = 'WRFGROUP', LENgth = 79, INUmbers = (1(1)79);
>GROUP2 GNAme = 'BFCGROUP', LENgth = 79, INUmbers = (1(1)79);
(4A1, 4X, I1, 4X, 79A1)
>CALIB CRIt = 0.0050, PLOt = 1.0000, ACCel = 1.0000, TPRIOR;
>SCORE ;

C. IRTLRDIF Input File for DIF Detection for Two Groups with 3PL

2654 79
1111111111111111111111111111111111111111111111111111111111111111111111111111111
WBLR.dat
4 1
5-83
WBLR3PL.out
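All three input files above fit the 3PL model, and IRTLRDIF and IRTPRO flag DIF by a likelihood-ratio test: a constrained run holds an item's parameters equal across the two groups, a free run lets them differ, and the drop in -2 log-likelihood is referred to a chi-square distribution. The sketch below is illustrative only — the function names are the author's of this note, abilities are treated as known rather than integrated out, and the code is not part of IRTPRO, BILOG-MG 3, or IRTLRDIF:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta
    (logistic form; the optional D = 1.7 scaling constant is omitted)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def neg2_loglik(responses, thetas, a, b, c):
    """-2 log-likelihood of dichotomous responses (1 = correct) to one
    item, for examinees whose abilities are taken as known."""
    ll = 0.0
    for u, th in zip(responses, thetas):
        p = p_3pl(th, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return -2.0 * ll

def flags_dif(neg2ll_constrained, neg2ll_free, crit=7.815):
    """Likelihood-ratio DIF check: G2 is the fit lost by constraining the
    item's a, b, and c to be equal across groups. 7.815 is the .05
    chi-square critical value for df = 3 (three freed parameters)."""
    g2 = neg2ll_constrained - neg2ll_free
    return g2, g2 > crit
```

For a 2PL item only a and b are freed (df = 2, critical value 5.991); for a 1PL item only b is freed (df = 1, critical value 3.841), which is why the programs report different degrees of freedom per model.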