An Analysis of Four Common Item Types used in Testing EFL Reading Comprehension

Kyle Perkins
Southern Illinois University at Carbondale
USA
Psychometric research has shown that different factors can affect the reliability and validity of a test. Reliability can be affected by fluctuations in the subject and in test administration and test characteristics. Invalid application of tests, inappropriate selection of content, sample truncation, and poor criterion selection can pose threats to a test’s validity. The research reported in this paper suggests that the readability level of a passage on which reading comprehension questions are based can affect empirical considerations of item analysis, reliability, and validity when the subject pool is a group of adult English as a Foreign Language students. Of the item types assessed, the true/false and multiple-choice items produced better test statistics than did the missing letters and grammar paraphrase items. Guidelines for a more tightly-controlled study are suggested.
1. Introduction
When conducting psychometric research in English as a Foreign
Language (EFL) reading comprehension, a researcher must be aware of
a number of factors which include, at minimum, the different skills or
components of reading comprehension, the different item types which can
be employed to assess reading skills or components, and the factors that
can affect a reading test’s reliability and validity.
Various attempts have been made to catalogue the skills and
components which are thought to be crucial to the reading process. Perhaps
one of the best known lists is Barrett’s (1976) taxonomy of comprehension tasks. Barrett’s taxonomy includes the following skills: literal
recognition or recall of details, main ideas, sequence, comparisons, cause
and effect relationships and character traits; inference of supporting details,
the main idea, sequence, comparisons, cause and effect relationships,
character traits, outcomes, and figurative language; evaluation of reality
or fantasy, fact or opinion, adequacy or validity, appropriateness, and
worth, desirability or acceptability; appreciation of emotional response
to plot or theme, identification with characters and incidents, reactions
to the author’s use of language; and imagery.
Harrison and Dolan (1979:16) list the following putative skills of reading: word meaning: in isolation; word meaning (context): the meaning of a selected word as it is used in a particular context; literal comprehension: one which calls for only a verbatim response; inference (single string): an inference is made from a single sentence or group of words; inference (multiple strings): an inference is made from information drawn from a number of sentences/groups of words; metaphor: responses are sought from a passage wherein interpretation cannot be made at a literal level; salients: the ability to isolate the key points of the passage; evaluation: the ability to make a judgment or come to a decision, after assessing the content of a passage and setting this against knowledge gained from previous experience.
Sim and Bensoussan (1979:38) offer a complementary list of reading
components that can be assessed: questions on function words, used in
a logical sequence; questions on content words, used denotatively for
meaning; questions on content words, used connotatively for tone and
implication; part-text questions on the ability to recognize a paraphrase
of short stretches of text; and whole-text questions concerning the author’s
purpose and manner of achieving that purpose.
A variety of item types has been employed by test writers to assess
EFL reading comprehension. Heaton (1975) discusses the more commonly
used types which include word matching: the subject identifies the word
from a list of options which is the same as the stimulus word; sentence
matching: the examinee is required to recognize sentences which consist
of the same words in the same word order in the same grammatical and
rhetorical type of sentence; pictures and sentence matching: the pupil
chooses a sentence from the list of options which correctly describes the
stimulus picture; true/false items which are complete in themselves and
test general truths and true/false items which are based on a text; multiple-choice items which are based on a few sentences or on a reading passage;
grammar paraphrase items which require the subject to identify the
correct paraphrase of the stimulus from four or five options; completion
items in which certain letters of missing words are given and each dash
in the blank signifies a letter; and the cloze procedure.
There are many factors which pose threats to the reliability of any
test, and Henning (in press) discusses the more commonly cited phenomena
which are known to affect the reliability of a test: fluctuations in the learner
due to temporary psychological or physiological changes; fluctuations in
test administration including regulatory fluctuations and changes in the
administrative environment; test characteristics including length, difficulty,
and boundary effects, discriminability, speededness, and homogeneity;
and examinee factors which include response arbitrariness, test wiseness
and familiarity.
Factors that can specifically affect a test’s validity include invalid
application of tests; inappropriate selection of content; imperfect cooperation of the examinee; inappropriate referent or norming population; poor
criterion selection; sample truncation; and use of invalid constructs
(Henning, in press).
The Present Study
Purpose. The purpose of the present study was to determine the extent to which different item types commonly used in EFL reading comprehension tests and quizzes generate item analysis information and the extent to which the differences in empirical considerations of item analysis can be attributed to the readability of the texts on which the reading comprehension questions are based.
Subjects. The data for this research were collected at the English Language
Institute, The American University in Cairo, where the author recently
spent a sabbatical leave. The subjects were 19 Egyptian adults who were
enrolled in an intermediate-level English as a Foreign Language class at
AUC. They were enrolled full-time and were university-bound upon their
passing a standardized English language proficiency test.
Materials and Procedures. Four different item types to test EFL reading
comprehension were employed in this research: true/false items based on
a reading passage; multiple-choice comprehension questions based on a
reading passage; a missing letters format in which certain letters of
missing words are given and each dash in the blank signifies a letter; and
a grammar paraphrase test which required the subjects to identify the
correct paraphrase of the stimulus from four options.
The true/false test consisted of 12 items which were based on a
232 word reading selection. The items tested literal comprehension and
inference (single and multiple strings). After the students had read the
passage, it was collected; then the true/false questions were distributed.
Total working time was 30 minutes. The following is an example question:
True    False    Looking at someone else’s eyes or looking away from them means a person is thinking very deeply.
The multiple-choice comprehension test consisted of 8 items based on a 240 word reading selection. The items tested literal comprehension, inference (single and multiple strings), key points and evaluation. The subjects answered the questions after the reading passage had been read and collected. Total working time was 30 minutes. The following is an example multiple-choice question:

We can endure the hero’s suffering because we know
A. things cannot get worse.
B. the crew will mutiny.
C. good will win in the end.
D. the hero is very brave.
The missing-letters format test consisted of 24 items, and the reading
selection was 245 words long. The first three and the last three sentences
were left intact. The subjects were instructed to read the entire passage
first before they began to fill in the blanks. On average the blanks
occurred every 10.2 words. Thirty minutes’ time was allotted for the
entire exercise. The following sentence comes from the missing letters test:
"Last week a team of sc-----st- from the U.S. and Egypt made an announcement: they had definitive e------- that long ago a region of the vast desert in southern Egypt and Northern Sudan was a lacy n--w--- of major waterways."
The grammar paraphrase test consisted of 50 items which tested various grammatical structures including epistemic modals, bound morphology, passive voice, multiple embedded relative clauses, extraposition, layered possessives, gerunds, present participles, presupposition, and entailment. Total working time was 30 minutes. The following item appeared as number 21 on the grammar paraphrase test:

"I didn’t know that Mac hadn’t been killed after all.
A. Mac was killed, but I didn’t know it.
B. Mac wasn’t killed, and I knew it.
C. I knew that Mac was dead.
D. I didn’t know that Mac was alive."
Analyses and Results. For the four tests the following test statistics were calculated: mean, SD, SEM, KR-20 observed reliability, 100-item reliability, item difficulty as proportion correct, item discriminability for each item computed as a point biserial correlation coefficient between item responses and total scores for each test, item variance, internal construct validity proportion, maximum validity, and SMOG grade (except for the grammar paraphrase test).
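For readers who wish to reproduce these computations, a minimal sketch of the KR-20 and SEM calculations from a 0/1 response matrix is given below; the data and variable names are hypothetical and this is not the author’s original scoring routine.

# Minimal sketch (hypothetical data): KR-20 reliability and SEM from a 0/1 matrix.
# Rows are examinees, columns are items.
import math

def kr20(responses):
    k = len(responses[0])                               # number of items
    totals = [sum(row) for row in responses]
    n = len(totals)
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n        # item difficulty (proportion correct)
        sum_pq += p * (1 - p)                           # item variance pq
    return (k / (k - 1)) * (1 - sum_pq / var_t)

def sem(responses):
    totals = [sum(row) for row in responses]
    n = len(totals)
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    return sd_t * math.sqrt(1 - kr20(responses))        # SEM = SD x sqrt(1 - reliability)

data = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
print(round(kr20(data), 3), round(sem(data), 3))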
TABLE 1
Test Statistics, Number of Examinees and Items
and Estimates of Reliability
Table 1 presents the number of examinees and items, test statistics,
and estimates of reliability. Since it is known that adding more items of
similar kind and difficulty usually improves test reliability up to a point
of asymptote, the observed estimates of reliability and estimates of the
reliability of each test if extended to 100 items are reported.
TABLE 2
Item Difficulty Indices as Proportion Correct
.33 - .67 Acceptable Range
Table 2 presents the item difficulty indices as proportion correct for each test. Tests can exhibit low reliability when they are too easy or too difficult for a particular sample of examinees. In general, it is recommended that items with a proportion of correct answers less than .33 or greater than .67 be rejected; therefore each item with an item difficulty index outside this range carries a mark in the rejection column.
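Applied mechanically, this rejection rule amounts to only a few lines of code; the sketch below uses invented responses and simply flags items whose proportion correct falls outside the .33 - .67 band.

# Flag items whose difficulty (proportion correct) falls outside .33 - .67.
# Hypothetical responses: rows = examinees, columns = items.
responses = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 1], [1, 0, 1]]
n = len(responses)
for j in range(len(responses[0])):
    p = sum(row[j] for row in responses) / n
    verdict = "retain" if 0.33 <= p <= 0.67 else "reject"
    print(f"item {j + 1}: p = {p:.2f} -> {verdict}")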
TABLE 3
Item Discriminability: Point Biserial Correlations
rpbi .25 and Above Acceptable
Table 3 displays the item discriminability indices for each item calculated as point biserial correlation coefficients between item responses and total scores. Since point biserial correlation coefficients of .25 and above are considered acceptable for these purposes, a coefficient below .25 is the criterion for rejection.
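As an illustration of how such a discriminability index can be obtained, the sketch below computes the point biserial correlation for a single item using the standard (Mp - Mq)/st x sqrt(pq) form and applies the .25 cut-off; the item responses and total scores are invented.

# Point biserial correlation between one dichotomous item and total scores,
# with the .25 acceptability cut-off applied (hypothetical data).
import math

def point_biserial(item, totals):
    n = len(totals)
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    p = sum(item) / n                                   # proportion passing the item
    q = 1 - p
    mean_pass = sum(t for i, t in zip(item, totals) if i == 1) / (p * n)
    mean_fail = sum(t for i, t in zip(item, totals) if i == 0) / (q * n)
    return (mean_pass - mean_fail) / sd_t * math.sqrt(p * q)

item = [1, 0, 1, 1, 0, 1]
totals = [9, 4, 8, 10, 5, 7]
r = point_biserial(item, totals)
print(round(r, 3), "acceptable" if r >= 0.25 else "reject")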
TABLE 4
Item Variance
.25 pq Maximum
Table 4 lists the item variances for each item in the four tests. The author is not aware of any criterion for rejecting an item on the basis of the amount of its information function save for those items which all the examinees get correct or miss, therefore generating 0 variance. The maximum variance which any item can generate is .25, when exactly half the subjects pass and the other half fail an item.
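In symbols, the variance of a dichotomously scored item is determined entirely by its difficulty $p$: $\sigma^2_{\text{item}} = pq = p(1 - p)$, which reaches its maximum value of $(.5)(.5) = .25$ at $p = .5$.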
TABLE 5
Validity Estimate Coefficients and SMOG Grade Readability Estimates
Table 5 displays the validity estimate coefficients and SMOG grade
readability estimates for the three reading passages. Following the procedure discussed in Henning (in press), the author calculated the internal construct validity proportion for the true/false, multiple-choice comprehension, and missing letters tests. This procedure assumes that, for
example, if the true/false items have internal construct validity, the point
biserial correlation between each true/false item and the total scores for
true/false should be higher than the point biserial correlations of the
same items with the total scores for the grammar paraphrase test. This
relationship can be expressed as follows:
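In point biserial notation (the symbols below are chosen to match the description above rather than Henning's own notation), the relationship for the true/false test, for example, is $r_{pbi}(i, T_{\text{T/F}}) > r_{pbi}(i, T_{\text{GP}})$ for every true/false item $i$, where $T_{\text{T/F}}$ is the total score on the true/false test and $T_{\text{GP}}$ is the total score on the grammar paraphrase test.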
The generalization which these symbols are intended to convey is
that the correlation coefficients of individual items with their own tests
should be greater than the correlation coefficients of the same items with
other test totals. An important step in this procedure is to correct each
item-total coefficient for part-whole overlap because items produce
artificially high correlations with their own totals. Again, following
Henning’s procedure, the author selected the proportion of items for each
test that exhibited a higher item-total correlation (corrected for part-whole
overlap) than item-grammar paraphrase total correlation. The single
coefficient is intended to reflect the internal construct validity for each
test. The maximum validity coefficient possible in the optimal situation is equal to the square root of the reliability estimate of the test.
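A sketch of this procedure appears below, assuming a 0/1 response matrix for the target test and a vector of grammar paraphrase totals; the part-whole correction is approximated here by correlating each item with the total of the remaining items, and all data and the reliability value are invented.

# Sketch: internal construct validity proportion and maximum validity.
# Part-whole overlap is handled by correlating each item with the total of the
# *other* items on its own test. Hypothetical data throughout.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def construct_validity_proportion(own, other_totals):
    """Proportion of items correlating more with their own (corrected) total
    than with the other test's total."""
    hits = 0
    for j in range(len(own[0])):
        item = [row[j] for row in own]
        rest = [sum(row) - row[j] for row in own]       # part-whole correction
        if pearson(item, rest) > pearson(item, other_totals):
            hits += 1
    return hits / len(own[0])

def maximum_validity(reliability):
    return math.sqrt(reliability)                       # square root of reliability

tf = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]   # 5 examinees, 3 items
gp_totals = [30, 42, 25, 38, 20]                               # grammar paraphrase totals
print(construct_validity_proportion(tf, gp_totals), maximum_validity(0.49))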
To estimate the readability level of the three reading passages, the author used the SMOG readability formula. McLaughlin (1969) gives a complete explication of the SMOG grading formula but, basically, the procedure entails four steps: 1) count ten consecutive sentences each at the beginning, middle, and end of the reading passage. A sentence is considered as any string of words ending with a terminal punctuation mark, i.e., period, question mark, or exclamation mark. 2) Count every word containing three or more syllables in the 30 sentences. 3) Estimate the square root of the number of words containing three or more syllables by taking the square root of the nearest perfect square. 4) Add 3 to the square root; the result indicates the grade level that a person must have reached if s/he is to comprehend fully the reading passage in question.
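A rough sketch of these four steps, assuming the 30 sampled sentences have already been selected and using a crude vowel-group count in place of true syllable counting, might look like this:

# Rough SMOG grade sketch: assumes `sentences` already holds the 30 sampled
# sentences (10 each from the beginning, middle, and end of the passage).
import math
import re

def syllables(word):
    # crude approximation: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(sentences):
    polysyllables = sum(
        1 for s in sentences for w in re.findall(r"[A-Za-z']+", s) if syllables(w) >= 3
    )
    # square root of the nearest perfect square (approximated by rounding), plus 3
    return round(math.sqrt(polysyllables)) + 3

sample = ["The archaeologists announced a remarkable discovery yesterday."] * 30
print(smog_grade(sample))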
TABLE 6
Spearman Rank Correlations: Test Statistics and Readability Levels
To answer the research questions, the author averaged the item difficulty, item discriminability, and item variance indices for each test and calculated the proportion of rejection for each. Each test statistic was rank ordered and the readability estimates were rank ordered for the three reading passages. The Spearman rank-order correlations are presented in Table 6; two sets of ranks are illustrated below:
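Rank-order correlations of this kind can be reproduced with standard tools; the example below pairs a hypothetical set of mean item difficulties with the SMOG grades of 9, 10, and 11 reported later in the paper for the three passages.

# Spearman rank-order correlation between a test statistic and readability level.
# The mean difficulties are hypothetical; the SMOG grades are those of the
# three passages (true/false, multiple-choice, missing letters).
from scipy.stats import spearmanr

mean_item_difficulty = [0.52, 0.31, 0.29]
smog_grade = [9, 10, 11]
rho, p_value = spearmanr(mean_item_difficulty, smog_grade)
print(rho)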
Discussion

As Table 1 indicates, not one of the reading tests employed in this research exhibited an acceptable, observed estimate of reliability. One expects a teacher-made, nonstandardized test to produce a reliability estimate between 0.60 and 0.80. Part of the explanation may lie with the fact that two of the tests have few items, i.e., true/false, 12; multiple-choice, 8. Research has shown that it takes at least 12 to 15 good items to produce a half-way decent reliability estimate (cf. Downie 1967). Even after the Spearman-Brown Prophecy Formula was used to estimate the reliability of these tests if they were extended to 100 items, only two exhibit acceptable reliability estimates, i.e., true/false and multiple-choice.
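The prophecy formula used for these 100-item projections is the standard one: if $r$ is the observed reliability of an $n$-item test and $k = 100/n$ is the factor by which the test is lengthened, the projected reliability is $r_{100} = \dfrac{k\,r}{1 + (k - 1)\,r}$. For the 12-item true/false test, for instance, $k = 100/12 \approx 8.33$.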
Table 2 shows that the majority of items from the four tests should
be discarded. The mean item difficulty for the multiple-choice and
missing letters items is below .33 and for the grammar paraphrase test,
above .67; the former were too difficult for this subject pool and the
latter was too easy. The mean item difficulty for the true-false items
falls within the acceptable range. If the author knew how to correct the
true/false binary responses for guessing, the verdict for the true-false
items might have been different.
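For reference, the conventional correction for guessing, which was not applied here, scores each examinee as $R - \frac{W}{k - 1}$, where $R$ is the number of right answers, $W$ the number of wrong answers, and $k$ the number of response options; for two-choice true/false items this reduces to $R - W$.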
The item discriminability and item variance data in Tables 3 and 4
simply reflect what has been previously stated about the tests: the multiple-choice and true/false items produce more useful information about the
subject pool than do the missing letters and grammar paraphrase items.
Of the three tests which were based on reading passages, the true/false and multiple-choice tests exhibit the highest validity coefficients calculated by the internal construct validity proportion and maximum validity methods (Table 5). Though the two validity coefficients were calculated by different methods, it is interesting to note that the results in terms of rank ordering the item types by highest validity are quite similar.
There is evidence in Table 6 to suggest that the quality of test statistic
data covaries with the readability level of the reading passage on which
the test items are based. For example, the test items based on the easiest
passage had the highest construct validity proportion, while the test items
based on the most difficult passage had the lowest construct validity
proportion. These findings may be explained by making reference to
Hirsch’s (1977:85) definition of relative readability: "assuming that two texts convey the same meaning, the more readable text will take less time and effort to understand." In this particular study the results suggest that
the more readable passages entailed less peripheral processing time on
the part of the readers so they could spend more time attending to the
comprehension process and the components of the passage. As a result,
the test items based on the more readable passage generated truer scores
than the other items, thereby giving a more reliable assessment of the true
differences in reading ability between subjects.
As appealing as this explanation may be, the author cannot
categorically state that the more readable the passage, the better the test
statistics, because, in this study, it is impossible to determine whether the
item type or the readability level of the passage is responsible for the
covariance of the test statistics. Put another way, the true/false items were
based on a passage with a SMOG grade of 9; the multiple-choice items
were based on a passage with a SMOG grade of 10; and the missing letters test was based on a passage with a SMOG grade of 11. The readability
levels varied and so did the item types.
The author believes that the readability level of a passage does
affect the reliability and validity of the test items based on the passage.
To confirm or disconfirm that hypothesis, a more tightly controlled study
would have to be conducted. Such a study would entail the use of
different reading passages which had exactly the same readability level,
the same number of propositions per passage, words of the same frequency, sentences of similar syntactic and semantic complexity, the same
discourse characteristics, the same story structure (cf Rumelhart 1975; Stein
and Glenn 1976; Thorndyke 1977), and the same thematic information
(cf Bransford and Johnson 1972). In addition one would have to control
for response set, test-retest contamination, practice effect, maturation,
and instrument decay. If there were significant differences between the
test statistics according to item type, then one could conclude that the
nature of the item type affects reliability and validity.
If one were to adhere strictly to the criteria of item rejection based on item analysis of difficulty, discriminability and variance, a large proportion of the test items employed in this research would be discarded. However, there are good reasons why a test constructor might wish to retain some of them. Henning (in press:67-68) mentions the following
constraints which may need to be imposed on the decision to reject items
as too easy or too difficult: 1) the need to include specific content.
Rejection of all items that are at the extremes of the difficulty continuum
may result in a test that is insensitive to the objectives of instruction [cf
Popham 1978]; 2) the need to provide an easy introduction to overcome
psychological inertia on the part of the subject; 3) the need to shape the
test information curve by systematically sampling items at a specific
difficulty level to cause the test to be more sensitive or discriminating at
a given cut off score or scores.
A final word must be added about reliability, validity, and the
purposes for which any test is used: any given test may be reliable and
valid for some samples and for some purposes, but not for others. The
results of this study seem to indicate that item type and readability level
affected both the reliability and validity of EFL reading comprehension
tests with an Egyptian EFL subject pool.
REFERENCES
Barrett, T. Taxonomy of reading comprehension. In Smith, R., and
Barrett, T. (Eds.). Testing reading in the middle grades. Reading,
MA: Addison-Wesley, 1976.
Bransford, J., and Johnson, M. Considerations of some problems of
comprehension. In Chase, W. (Ed.). Visual information processing.
New York: Academic Press, 1972.
Downie, N. Fundamentals of measurement: techniques and practices
(2nd ed.). New York: Oxford University Press, 1967.
Harrison, C., and Dolan, T. Reading comprehension — a psychological
viewpoint. In Mackay, R., Barkman, B., and Jordan, R. (Eds.).
Reading in a second language: hypotheses, organization and practice. Rowley, MA: Newbury House, 1979.
Heaton, J. Writing English language tests. London: Longman, 1975.
Henning, G. Language test development. Rowley, MA: Newbury House,
in press.
Hirsch, E. D., Jr. The philosophy of composition. Chicago: The University of Chicago Press, 1977.
McLaughlin, G. SMOG grading — a new readability formula. Journal of
Reading, 1969, 12, 639-646.
Popham, W. J. Criterion-referenced measurement. Englewood Cliffs,
NJ: Prentice-Hall, 1978.
Rumelhart, D. Notes on schema for stories. In Bobrow, D., and Collins,
A. (Eds.). Representation and understanding: studies in cognitive
science. New York: Academic Press, 1975.
Sim, D., and Bensoussan, M. Control of contextualized function and content words as it affects English as a foreign language reading comprehension test scores. In Mackay, R., Barkman, B., and Jordan, R. (Eds.). Reading in a second language: hypotheses, organization and practice. Rowley, MA: Newbury House, 1979.
Stein, N., and Glenn, C. An analysis of story comprehension in elementary school children. In Freedle, R. (Ed.). New directions in discourse processing. Norwood, NJ: Ablex, 1979.
Thorndyke, P. Cognitive structures in comprehension and memory of
narrative discourse. Cognitive Psychology, 1977, 9, 77-110.