An Analysis of Four Common Item Types Used in Testing EFL Reading Comprehension

Kyle Perkins
Southern Illinois University at Carbondale, USA

Psychometric research has shown that different factors can affect the reliability and validity of a test. Reliability can be affected by fluctuations in the subject and in test administration and test characteristics. Invalid application of tests, inappropriate selection of content, sample truncation, and poor criterion selection can pose threats to a test's validity. The research reported in this paper suggests that the readability level of a passage on which reading comprehension questions are based can affect empirical considerations of item analysis, reliability, and validity when the subject pool is a group of adult English as a Foreign Language students. Of the item types assessed, the true/false and multiple-choice items produced better test statistics than did the missing letters and grammar paraphrase items. Guidelines for a more tightly-controlled study are suggested.

1. Introduction

When conducting psychometric research in English as a Foreign Language (EFL) reading comprehension, a researcher must be aware of a number of factors which include, at minimum, the different skills or components of reading comprehension, the different item types which can be employed to assess reading skills or components, and the factors that can affect a reading test's reliability and validity.

Various attempts have been made to catalogue the skills and components which are thought to be crucial to the reading process. Perhaps one of the best known lists is Barrett's (1976) taxonomy of comprehension tasks. Barrett's taxonomy includes the following skills: literal recognition or recall of details, main ideas, sequence, comparisons, cause and effect relationships, and character traits; inference of supporting details, the main idea, sequence, comparisons, cause and effect relationships, character traits, outcomes, and figurative language; evaluation of reality or fantasy, fact or opinion, adequacy or validity, appropriateness, and worth, desirability or acceptability; and appreciation of emotional response to plot or theme, identification with characters and incidents, reactions to the author's use of language, and imagery.

Harrison and Dolan (1979:16) list the following putative skills of reading: word meaning (in isolation); word meaning (context): the meaning of a selected word as it is used in a particular context; literal comprehension: one which calls for only a verbatim response; inference (single string): an inference is made from a single sentence or group of words; inference (multiple strings): an inference is made from information drawn from a number of sentences/groups of words; metaphor: responses are sought from a passage wherein interpretation cannot be made at a literal level; salients: the ability to isolate the key points of the passage; evaluation: the ability to make a judgment or come to a decision, after assessing the content of a passage and setting this against knowledge gained from previous experience.
Sim and Bensoussan (1979:38) offer a complementary list of reading components that can be assessed: questions on function words, used in a logical sequence; questions on content words, used denotatively for meaning; questions on content words, used connotatively for tone and implication; part-text questions on the ability to recognize a paraphrase of short stretches of text; and whole-text questions concerning the author's purpose and manner of achieving that purpose.

A variety of item types has been employed by test writers to assess EFL reading comprehension. Heaton (1975) discusses the more commonly used types, which include word matching: the subject identifies the word from a list of options which is the same as the stimulus word; sentence matching: the examinee is required to recognize sentences which consist of the same words in the same word order in the same grammatical and rhetorical type of sentence; picture and sentence matching: the pupil chooses a sentence from the list of options which correctly describes the stimulus picture; true/false items which are complete in themselves and test general truths, and true/false items which are based on a text; multiple-choice items which are based on a few sentences or on a reading passage; grammar paraphrase items which require the subject to identify the correct paraphrase of the stimulus from four or five options; completion items in which certain letters of missing words are given and each dash in the blank signifies a letter; and the cloze procedure.

There are many factors which pose threats to the reliability of any test, and Henning (in press) discusses the more commonly cited phenomena which are known to affect the reliability of a test: fluctuations in the learner due to temporary psychological or physiological changes; fluctuations in test administration, including regulatory fluctuations and changes in the administrative environment; test characteristics, including length, difficulty and boundary effects, discriminability, speededness, and homogeneity; and examinee factors, which include response arbitrariness, test wiseness, and familiarity.

Factors that can specifically affect a test's validity include invalid application of tests; inappropriate selection of content; imperfect cooperation of the examinee; inappropriate referent or norming population; poor criterion selection; sample truncation; and use of invalid constructs (Henning, in press).

The Present Study

Purpose. The purpose of the present study was to determine the extent to which different item types commonly used in EFL reading comprehension tests and quizzes generate item analysis information and the extent to which the differences in empirical considerations of item analysis can be attributed to the readability of the texts on which the reading comprehension questions are based.

Subjects. The data for this research were collected at the English Language Institute, The American University in Cairo, where the author recently spent a sabbatical leave. The subjects were 19 Egyptian adults who were enrolled in an intermediate-level English as a Foreign Language class at AUC. They were enrolled full-time and were university-bound upon their passing a standardized English language proficiency test.

Materials and Procedures.
Four different item types to test EFL reading comprehension were employed in this research: true/false items based on a reading passage; multiple-choice comprehension questions based on a reading passage; a missing letters format in which certain letters of missing words are given and each dash in the blank signifies a letter; and a grammar paraphrase test which required the subjects to identify the correct paraphrase of the stimulus from four options.

The true/false test consisted of 12 items which were based on a 232-word reading selection. The items tested literal comprehension and inference (single and multiple strings). After the students had read the passage, it was collected; then the true/false questions were distributed. Total working time was 30 minutes. The following is an example question:

True False Looking at someone else's eyes or looking away from them means a person is thinking very deeply.

The multiple-choice comprehension test consisted of 8 items based on a 240-word reading selection. The items tested literal comprehension, inference (single and multiple strings), key points, and evaluation. The subjects answered the questions after the reading passage had been read and collected. Total working time was 30 minutes. The following is an example multiple-choice question:

We can endure the hero's suffering because we know
A. things cannot get worse.
B. the crew will mutiny.
C. good will win in the end.
D. the hero is very brave.

The missing-letters format test consisted of 24 items, and the reading selection was 245 words long. The first three and the last three sentences were left intact. The subjects were instructed to read the entire passage first before they began to fill in the blanks. On average the blanks occurred every 10.2 words. Thirty minutes' time was allotted for the entire exercise. The following sentence comes from the missing letters test: "Last week a team of sc-----st- from the U.S. and Egypt made an announcement: they had definitive e------- that long ago a region of the vast desert in southern Egypt and northern Sudan was a lacy n--w--- of major waterways."

The grammar paraphrase test consisted of 50 items which tested various grammatical structures including epistemic modals, bound morphology, passive voice, multiple embedded relative clauses, extraposition, layered possessives, gerunds, present participles, presupposition, and entailment. Total working time was 30 minutes. The following item appeared as number 21 on the grammar paraphrase test:

"I didn't know that Mac hadn't been killed after all.
A. Mac was killed, but I didn't know it.
B. Mac wasn't killed, and I knew it.
C. I knew that Mac was dead.
D. I didn't know that Mac was alive."

Analyses and Results. For the four tests the following test statistics were calculated: mean, SD, SEM, KR-20 observed reliability, 100-item reliability, item difficulty as proportion correct, item discriminability for each item computed as a point biserial correlation coefficient between item responses and total scores for each test, item variance, internal construct validity proportion, maximum validity, and SMOG grade (except for the grammar paraphrase test).
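The statistics listed above follow standard classical test theory formulas. The following is a minimal sketch, assuming a dichotomously scored (1 = correct, 0 = incorrect) examinees-by-items response matrix; the function and variable names are illustrative and are not taken from the original study.

    import numpy as np

    def item_statistics(responses):
        # responses: examinees x items matrix of 0/1 scores (an assumed layout).
        responses = np.asarray(responses, dtype=float)
        n_examinees, n_items = responses.shape
        totals = responses.sum(axis=1)                 # each examinee's total score

        p = responses.mean(axis=0)                     # item difficulty: proportion correct
        item_variance = p * (1.0 - p)                  # pq; maximum .25 when p = .5

        # Discriminability: point biserial correlation of each item with the total.
        # For dichotomous items this equals the Pearson r between item and total.
        r_pbi = np.array([np.corrcoef(responses[:, i], totals)[0, 1]
                          for i in range(n_items)])

        # KR-20 reliability for dichotomously scored items.
        kr20 = (n_items / (n_items - 1.0)) * (1.0 - item_variance.sum() / totals.var())

        # Standard error of measurement from the observed SD and KR-20.
        sem = totals.std() * np.sqrt(1.0 - kr20)

        return {'mean': totals.mean(), 'sd': totals.std(), 'sem': sem,
                'difficulty': p, 'variance': item_variance,
                'discriminability': r_pbi, 'kr20': kr20}

Note that the discriminability coefficients computed this way include each item in its own total; the correction for that part-whole overlap is taken up below in connection with the validity estimates.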
TABLE 1. Test Statistics, Number of Examinees and Items, and Estimates of Reliability

Table 1 presents the number of examinees and items, test statistics, and estimates of reliability. Since it is known that adding more items of similar kind and difficulty usually improves test reliability up to a point of asymptote, the observed estimates of reliability and estimates of the reliability of each test if extended to 100 items are reported.

TABLE 2. Item Difficulty Indices as Proportion Correct (acceptable range: .33 to .67)

Table 2 presents the item difficulty indices as proportion correct for each test. Tests can exhibit low reliability when they are too easy or too difficult for a particular sample of examinees. In general, it is recommended that items with a proportion of correct answers less than .33 or greater than .67 be rejected; therefore each item with an item difficulty index outside this range carries a mark in the rejection column.

TABLE 3. Item Discriminability: Point Biserial Correlations (rpbi of .25 and above acceptable)

Table 3 displays the item discriminability indices for each item, calculated as point biserial correlation coefficients between item responses and total scores. Since point biserial correlation coefficients of .25 and above are considered acceptable for these purposes, a coefficient below .25 is the criterion for rejection.

TABLE 4. Item Variance (maximum pq = .25)

Table 4 lists the item variances for each item in the four tests. The author is not aware of any criterion for rejecting an item on the basis of the amount of its information function, save for those items which all the examinees get correct or miss, therefore generating 0 variance. The maximum variance which any item can generate is .25, when exactly half the subjects pass and the other half fail an item.

TABLE 5. Validity Estimate Coefficients and SMOG Grade Readability Estimates

Table 5 displays the validity estimate coefficients and SMOG grade readability estimates for the three reading passages. Following the procedure discussed in Henning (in press), the author calculated the internal construct validity proportion for the true/false, multiple-choice comprehension, and missing letters tests. This procedure assumes that, for example, if the true/false items have internal construct validity, the point biserial correlation between each true/false item and the total scores for true/false should be higher than the point biserial correlations of the same items with the total scores for the grammar paraphrase test. This relationship can be expressed as follows:

rpbi (item, own-test total) > rpbi (item, grammar paraphrase total)

The generalization which these symbols are intended to convey is that the correlation coefficients of individual items with their own tests should be greater than the correlation coefficients of the same items with other test totals. An important step in this procedure is to correct each item-total coefficient for part-whole overlap, because items produce an artificially high correlation with their own total.
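One way to carry out the comparison just described is sketched below. The part-whole overlap correction used here, correlating each item with its own test total minus that item, is a common classical approach and is only assumed to approximate Henning's procedure; the deleted-item correction and all names are illustrative, not taken from the source.

    import numpy as np

    def construct_validity_proportion(own_responses, criterion_totals):
        # own_responses: examinees x items 0/1 matrix for one reading test.
        # criterion_totals: the same examinees' totals on the grammar paraphrase test.
        own = np.asarray(own_responses, dtype=float)
        criterion = np.asarray(criterion_totals, dtype=float)
        n_items = own.shape[1]
        own_totals = own.sum(axis=1)

        favouring_own = 0
        for i in range(n_items):
            deleted_total = own_totals - own[:, i]     # remove the item from its own total
            r_own = np.corrcoef(own[:, i], deleted_total)[0, 1]
            r_other = np.corrcoef(own[:, i], criterion)[0, 1]
            if r_own > r_other:                        # item aligns more with its own test
                favouring_own += 1

        # Proportion of items whose corrected own-test correlation exceeds their
        # correlation with the grammar paraphrase total.
        return favouring_own / n_items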
Again, following Henning's procedure, the author selected the proportion of items for each test that exhibited a higher item-total correlation (corrected for part-whole overlap) than item-grammar paraphrase total correlation. This single coefficient is intended to reflect the internal construct validity for each test. The maximum validity coefficient possible in the most optimum situation is equal to the square root of the reliability estimate of the test.

To estimate the readability level of the three reading passages, the author used the SMOG readability formula. McLaughlin (1969) gives a complete explication of the SMOG grading formula but, basically, the procedure entails four steps: 1) count ten consecutive sentences each at the beginning, middle, and end of the reading passage; a sentence is considered as any string of words ending with a terminal punctuation mark, i.e., period, question mark, or exclamation mark. 2) Count every word containing three or more syllables in the 30 sentences. 3) Estimate the square root of the number of words containing three or more syllables by taking the square root of the nearest perfect square. 4) Add 3 to the square root, which indicates the grade level that a person must have reached if s/he is to comprehend fully the reading passage in question.

TABLE 6. Spearman Rank Correlations: Test Statistics and Readability Levels

To answer the research questions, the author averaged the item difficulty, item discriminability, and item variance indices for each test and calculated the proportion of rejection for each. Each test statistic was rank ordered, and the readability estimates were rank ordered for the three reading passages. The Spearman rank-order correlations are presented in Table 6.

Discussion

As Table 1 indicates, not one of the reading tests employed in this research exhibited an acceptable observed estimate of reliability. One expects a teacher-made, nonstandardized test to produce a reliability estimate between 0.60 and 0.80. Part of the explanation may lie with the fact that two of the tests have few items, i.e., true/false, 12; multiple-choice, 8. Research has shown that it takes at least 12 to 15 good items to produce a half-way decent reliability estimate (cf Downie 1967). Even after the Spearman-Brown Prophecy Formula was used to estimate the reliability of these tests if they were extended to 100 items, only two exhibit acceptable reliability estimates, i.e., true/false and multiple-choice.

Table 2 shows that the majority of items from the four tests should be discarded. The mean item difficulty for the multiple-choice and missing letters items is below .33 and for the grammar paraphrase test, above .67; the former were too difficult for this subject pool and the latter was too easy. The mean item difficulty for the true/false items falls within the acceptable range. If the author knew how to correct the true/false binary responses for guessing, the verdict for the true/false items might have been different.

The item discriminability and item variance data in Tables 3 and 4 simply reflect what has been previously stated about the tests: the multiple-choice and true/false items produce more useful information about the subject pool than do the missing letters and grammar paraphrase items.
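As a concrete illustration of two computations referred to in this discussion, the sketch below implements the four-step SMOG estimate described earlier and the Spearman-Brown projection used for the 100-item reliability estimates. The regular-expression word and syllable heuristics are assumptions of this sketch, not McLaughlin's (1969) exact counting rules, and all names are hypothetical.

    import re

    def count_syllables(word):
        # Rough vowel-group heuristic; adequate only for a ballpark polysyllable count.
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

    def smog_grade(sampled_sentences):
        # sampled_sentences: the 30 sentences drawn from the beginning, middle,
        # and end of the passage (step 1).
        polysyllables = sum(1 for sentence in sampled_sentences
                            for word in re.findall(r"[A-Za-z']+", sentence)
                            if count_syllables(word) >= 3)       # step 2
        nearest_root = round(polysyllables ** 0.5)               # step 3: root of the nearest perfect square
        return nearest_root + 3                                  # step 4: add 3 for the grade level

    def spearman_brown(observed_reliability, current_items, target_items=100):
        # Projected reliability if the test were lengthened to target_items
        # with items of similar kind and difficulty.
        k = target_items / current_items
        return (k * observed_reliability) / (1.0 + (k - 1.0) * observed_reliability)

For example, spearman_brown(0.45, 12) projects what a 12-item test with an observed reliability of 0.45 would yield at 100 items; the 0.45 is an arbitrary illustration, not a figure from Table 1.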
Of the three tests which were based on reading passages, the true/false and multiple-choice tests exhibit the highest validity coefficients calculated by the internal construct validity proportion and maximum validity methods (Table 5). Though the two validity coefficients were calculated by different methods, it is interesting to note that the results in terms of rank ordering the item types by highest validity are quite similar.

There is evidence in Table 6 to suggest that the quality of test statistic data covaries with the readability level of the reading passage on which the test items are based. For example, the test items based on the easiest passage had the highest construct validity proportion, while the test items based on the most difficult passage had the lowest construct validity proportion. These findings may be explained by making reference to Hirsch's (1977:85) definition of relative readability: "assuming that two texts convey the same meaning, the more readable text will take less time and effort to understand." In this particular study the results suggest that the more readable passages entailed less peripheral processing time on the part of the readers, so they could spend more time attending to the comprehension process and the components of the passage. As a result, the test items based on the more readable passage generated truer scores than the other items, thereby giving a more reliable assessment of the true differences in reading ability between subjects.

As appealing as this explanation may be, the author cannot categorically state that the more readable the passage, the better the test statistics, because, in this study, it is impossible to determine whether the item type or the readability level of the passage is responsible for the covariance of the test statistics. Put another way, the true/false items were based on a passage with a SMOG grade of 9; the multiple-choice items were based on a passage with a SMOG grade of 10; and the missing letters test was based on a passage with a SMOG grade of 11. The readability levels varied and so did the item types.

The author believes that the readability level of a passage does affect the reliability and validity of the test items based on the passage. To confirm or disconfirm that hypothesis, a more tightly controlled study would have to be conducted. Such a study would entail the use of different reading passages which had exactly the same readability level, the same number of propositions per passage, words of the same frequency, sentences of similar syntactic and semantic complexity, the same discourse characteristics, the same story structure (cf Rumelhart 1975; Stein and Glenn 1976; Thorndyke 1977), and the same thematic information (cf Bransford and Johnson 1972). In addition, one would have to control for response set, test-retest contamination, practice effect, maturation, and instrument decay. If there were significant differences between the test statistics according to item type, then one could conclude that the nature of the item type affects reliability and validity.

If one were to adhere strictly to the criteria of item rejection based on item analysis of difficulty, discriminability, and variance, a large proportion of the test items employed in this research would be discarded. However, there are good reasons why a test constructor might wish to retain some of them.
Henning (in press:67-68) mentions the following constraints which may need to be imposed on the decision to reject items as too easy or too difficult: 1) the need to include specific content; rejection of all items that are at the extremes of the difficulty continuum may result in a test that is insensitive to the objectives of instruction (cf Popham 1978); 2) the need to provide an easy introduction to overcome psychological inertia on the part of the subject; 3) the need to shape the test information curve by systematically sampling items at a specific difficulty level to cause the test to be more sensitive or discriminating at a given cut-off score or scores.

A final word must be added about reliability, validity, and the purposes for which any test is used: any given test may be reliable and valid for some samples and for some purposes, but not for others. The results of this study seem to indicate that item type and readability level affected both the reliability and validity of EFL reading comprehension tests with an Egyptian EFL subject pool.

REFERENCES

Barrett, T. Taxonomy of reading comprehension. In Smith, R., and Barrett, T. (Eds.). Testing reading in the middle grades. Reading, MA: Addison-Wesley, 1976.

Bransford, J., and Johnson, M. Considerations of some problems of comprehension. In Chase, W. (Ed.). Visual information processing. New York: Academic Press, 1972.

Downie, N. Fundamentals of measurement: techniques and practices (2nd ed.). New York: Oxford University Press, 1967.

Harrison, C., and Dolan, T. Reading comprehension — a psychological viewpoint. In Mackay, R., Barkman, B., and Jordan, R. (Eds.). Reading in a second language: hypotheses, organization and practice. Rowley, MA: Newbury House, 1979.

Heaton, J. Writing English language tests. London: Longman, 1975.

Henning, G. Language test development. Rowley, MA: Newbury House, in press.

Hirsch, E. D., Jr. The philosophy of composition. Chicago: The University of Chicago Press, 1977.

McLaughlin, G. SMOG grading — a new readability formula. Journal of Reading, 1969, 12, 639-646.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall, 1978.

Rumelhart, D. Notes on a schema for stories. In Bobrow, D., and Collins, A. (Eds.). Representation and understanding: studies in cognitive science. New York: Academic Press, 1975.

Sim, D., and Bensoussan, M. Control of contextualized function and content words as it affects English as a foreign language reading comprehension test scores. In Mackay, R., Barkman, B., and Jordan, R. (Eds.). Reading in a second language: hypotheses, organization and practice. Rowley, MA: Newbury House, 1979.

Stein, N., and Glenn, C. An analysis of story comprehension in elementary school children. In Freedle, R. (Ed.). New directions in discourse processing. Norwood, NJ: Ablex, 1979.

Thorndyke, P. Cognitive structures in comprehension and memory of narrative discourse. Cognitive Psychology, 1977, 9, 77-110.