AN ITEM RESPONSE THEORY FRAMEWORK FOR COMBINED ABILITY
ESTIMATION AND QUESTION/HINT SELECTION
by
PRAPAN SHEWINVANAKITKUL
Submitted in partial fulfillment of the requirements
For the degree of Master of Science
Thesis Adviser: Professor Marc Buchner
Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
January 2012
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis/dissertation of
__________________Prapan Shewinvanakitkul______________
candidate for the _____Master of Science_____________degree *.
(signed)_________Marc Buchner__________________________
(chair of the committee)
_____________Vira Chankong_______________________
_____________Swarup Bhunia_______________________
_____________Daniel Saab__________________________
________________________________________________
(date) ____10/28/2011_________________
*We also certify that written approval has been obtained for any
proprietary material contained therein.
Table of Contents
1 Introduction
2 Literature Review
2.1 Item Response Theory
2.2 Computerized adaptive testing
2.3 Hint value Estimation
2.4 Adaptive Learning
3 Solutions and Results
3.1 The Data
3.2 Estimation of parameters
3.3 Item selection methods
3.4 Increasing examinee ability
4 Conclusions and Future Work
LIST OF TABLES

Table 3.1 Results of estimation of theta by using MLE (all parameters were generated from uniform distribution)
Table 3.2 Results of estimation of theta by using MLE (beta and alpha were generated from uniform distribution and theta was generated from normal distribution)
Table 3.3 Results of estimation of theta by using MLE (theta and beta were generated from uniform distribution and alpha was generated from exponential distribution)
Table 3.4 Results of estimation of theta by using Newton-Raphson
Table 3.5 Results of Estimation of Question Parameters by using MLE
Table 3.6 Results of Joint Maximum Likelihood Estimation
Table 3.7 Results of Estimation of Examinee Ability with Item Selection Methods
Table 3.8 Results of given hint from three models
Table 3.9 Results of finding hint value from 1000 examinees
Table 3.10 Results of finding hint value from 5000 examinees
Table 3.11 Results of finding hint value with 2 questions from 5000 examinees (each examinee gave two responses)
Table 3.12 Results of finding hint value with 5 questions from 5000 examinees (each examinee gave two responses)
Table 3.13 Results of finding expected value with Tree sequence
Table 3.14 Results of finding average and SD of expected value from 100 items
Table 3.15 Results of finding average and SD of estimated theta from 100 items
Table 3.16 Results of finding average and SD of expected value from 250 items
Table 3.17 Results of finding average and SD of estimated theta from 250 items
Table 3.18 Results of finding average and SD of expected value from 250 items
Table 3.19 Results of finding average and SD of estimated theta from 250 items
LIST OF FIGURES

Figure 2.1 Item Characteristic Curve
Figure 3.1 Pseudocode for Generating Item Response by Monte Carlo Simulation
Figure 3.2 Pseudocode for Estimation of Examinee Ability by using MLE
Figure 3.3 Pseudocode for Estimation of Examinee Ability by using Newton-Raphson
Figure 3.4 Pseudocode for Estimation of Parameters (β and α) by using MLE
Figure 3.5 Pseudocode for Joint Maximum Likelihood Estimation
Figure 3.6 Pseudocode for Maximum Information
Figure 3.7 Pseudocode for Kullback Leibler Information
Figure 3.8 Given hint from three models
Figure 3.9 Pseudocode for Computation of Hint Value by using MLE
Figure 3.10 Example Tree Sequences
Figure 3.11 Pseudocode for Computation of Expected Value by using Objective Function
An Item Response Theory Framework for Combined Ability Estimation and
Question/Hint Selection
Abstract
by
PRAPAN SHEWINVANAKITKUL
This study examines a new approach to the combined problem of question
parameter/examinee ability estimation and examinee learning in an analytical framework using
computerized adaptive testing (CAT), item response theory (IRT), and adaptive learning. We
investigate how to estimate examinee ability coupled with how to increase examinee learning by
providing suitably chosen hints to arrive at a learning goal. The overall objective is to increase
examinee ability with the minimum number of questions and hints in an adaptive testing
framework. Monte Carlo simulation experiments are conducted in order to validate the model
and test algorithm performance. Results show that estimated examinee abilities are increased to specified set points by providing a suitably chosen number of appropriate hints.
1 Introduction
In this research, the goal is to improve the test-taking ability and test scores of examinees. Current examinee abilities are estimated in order to select the appropriate information, or hints, to provide to an examinee so as to optimally increase examinee ability. To estimate an examinee's ability, computerized adaptive testing (CAT) has been shown to be a powerful technique. CAT has been studied extensively by many researchers over the past twenty years and provides an excellent approach to adaptive testing. Compared to standard fixed-length tests, i.e., tests having a fixed number of questions, CAT can reduce the number of questions needed to provide a precise estimate of examinee ability.
There are three main components in the CAT framework. First is a calibrated item pool. The
item pool contains the questions used for testing examinee ability. Each question has its own
parameter values that describe its characteristic curve. The second component is the item
selection algorithm. To best arrive at a proper estimation of the ability level for each examinee,
questions should be provided which have a difficulty level close to the examinee ability. The
third component is the termination criterion that is used to specify when the estimation process is
finished. All of these components can be constructed in the context of Item Response Theory
(IRT). IRT is a mathematical model that can be used to analytically describe the relationship
between examinee ability and question (item) parameters. IRT is based upon calculating the
probability of a correct response to a question given an examinee's ability level. In this research, experiments were conducted to investigate the efficiency of IRT models using maximum likelihood estimation (MLE), with both iterative and direct numerical optimization.
In this research the learning process was explicitly included in the investigation. Examinee ability was increased by providing suitably chosen hints that are matched to the examinee's ability level, under the assumption that the examinee can understand the hints. Questions in the item bank were divided into several sets; each set of questions contained a hint that should be helpful to an examinee in answering any of the questions in the set. The problem of optimizing the choice of questions in order to estimate examinee ability, together with the choice of hint to provide to the examinee, is similar to the problem of dual control found in the control systems literature. When to give a hint, which hint to give, and which question to ask in order to satisfy the objective can be considered the underlying control problem. However, in order to make this decision, a good estimate of the examinee's ability needs to be available, and this can only be achieved by asking questions of the examinee. Thus the goal was to increase examinee ability to a set point as well as to accurately estimate examinee ability. As in controlled processes, the point to which examinee ability should be increased is fixed: the examinee ability set point. In the "control" algorithm, the hints used to increase examinee ability should also depend on the desired objective function value. Thus an objective function was used to determine whether to give a question or a hint to the examinee. In the learning/estimation, i.e., dual control, approach that is proposed and studied in this research, the value of both giving a question and giving a hint is computed at each selection step so as to minimize the objective function. The objective function considers two important values: the difference between the examinee ability goal and the current estimated ability, and the standard error obtained from maximum likelihood ability estimation.
The generalized ability estimation and hint selection framework developed in this research can be applied in many different learning situations, such as a system that constructs questions for patients to test how well they understand diabetes. The system could first provide a question about diabetes: "How low a value of your blood sugar should prompt you to see a doctor?" If a patient selects a correct answer, the system will estimate their ability to understand the basic concepts of diabetes control and provide a follow-up question such as "What could be a symptom of low blood sugar?" However, if they give an incorrect answer, the system could provide the patient with a hint about critical blood sugar levels, depending on their estimated ability. As each patient has a different ability to learn, the system selects appropriate information that the patient can understand. The system then re-estimates their ability in light of the hints that were provided and selects the next question to refine the estimate of their ability. This helps each patient get precisely the information they need with a minimum number of questions and hints. Another example of a learning situation to which the proposed framework can be applied is an intelligent tutoring system. Each student might have different problems with their subjects, and this framework could provide a flexible approach for a broad group of students who want to improve their knowledge. For example, in the context of high school geometry, the system could provide a question about computing the area of a right triangle: "Find the area of the triangle shown in Figure 1." If a student needed a hint, the system could provide the formula for the area of a right triangle. After that, if the student gave a correct answer, the system would give a more difficult question about computing the area of an isosceles triangle. However, if the student gave an incorrect answer, the system could provide an example of how to calculate the area of a simple right triangle. Giving appropriate information at the appropriate time in a test could help students "process" many questions and simultaneously increase their understanding of the subject.
In this thesis, chapter 2 provides an overview of several important frameworks that are
used heavily in this work. How item response theory provides the relationship between examinee
ability and the probability of correct response is described. In addition, the general ideas of computerized adaptive testing and adaptive learning that are used to solve our problem are also examined here. In chapter 3, the experimental framework for testing the proposed estimation and learning techniques is described in detail and pseudocode for the algorithms is discussed. Experimental results and immediate observations about the results are also provided here.
Finally in chapter 4, the major conclusions stemming from the research and future directions for
the work are presented.
2 Literature Review
2.1 Item Response Theory
The field of Item Response Theory (IRT) has a long history of development and
application in intelligence testing. In testing situations, examinee performance on a test can be
predicted by estimating examinee characteristics, referred to as traits or abilities. In this
framework, an examinee has an ability score which can be used to predict or explain test
performance. An IRT model can be used to describe the relationship between examinee ability
and their performance on a given question on a test. This relationship is described by a
mathematical function called the item characteristic curve (ICC) [1]. This curve defines the
probability that an examinee with a given ability will provide a correct answer to an item. The
proportion of correct response can therefore be plotted as a function of ability. For the standard
models used in IRT, the result is a smooth S-shaped curve as shown in Figure 2.1.
Figure 2.1 Item Characteristic Curve
In Figure 2.1, the x-axis represents examinee ability and the y-axis represents the probability of a correct response to a single test item. The S-shaped curve shows the probability of a correct response for an examinee having an ability level of θ. Each item in a test has its own item characteristic curve. There are two technical parameters of an item characteristic curve that are used to describe it. The first is the difficulty (β) of the item. In item response theory, the difficulty of an item describes where the item generally lies along the ability scale; it represents the ability at which the examinee has a 50/50 chance of getting the question correct (or incorrect). For example, an easy item has a low difficulty parameter value, corresponding to low-ability examinees being able to answer the question with a reasonable probability, and a hard item has a high difficulty parameter value, corresponding to only high-ability examinees being able to answer the question with a reasonable probability; difficulty is therefore a location index. The second technical parameter is discrimination (α), which describes how well, or how "sharply", the item can differentiate between examinees having abilities below the item location and those having abilities above the item location. This property essentially reflects the steepness of the item characteristic curve around the value of the difficulty parameter. The steeper the curve, the better the item can discriminate among examinee abilities near the difficulty value. The flatter the curve, the less the item is able to discriminate, since the probability of a correct response at low ability levels is nearly the same as at high ability levels. These two parameters describe the form of the two-parameter item characteristic model. In a three-parameter item characteristic model, the third technical parameter is the guessing value (c), which describes the probability of a correct response for examinees with very little ability: although examinees might have low ability, they still have a chance to respond with the correct answer.
When we want to measure examinee ability, it is necessary to provide a scale for this measurement. This scale can be defined in an arbitrary manner. For this work we assume that, as is standard in IRT, whatever the value of the ability, it is constrained to lie on a scale having a midpoint of zero, a unit of measurement of one, and a range from negative infinity to positive infinity. The underlying idea here is that if one could physically ascertain the ability of a person, this scale could be used to tell how much ability a given person has, and the abilities of several different examinees could be compared. The usual approach taken to measure ability is to develop a test consisting of a number of items (questions). Each of these items measures some facet of the particular ability of interest. The person scoring the test must then decide whether each response is correct or not. Under item response theory, the primary interest is in whether the examinee gets each individual item correct or not, rather than in the raw test score. This is because the basic concepts of item response theory rest upon the individual items of a test rather than upon some aggregate of the item responses such as a test score. Items scored dichotomously are referred to as binary items: the correct answer receives a score of one, and each incorrect answer receives a score of zero. Thus, one can consider each examinee to have a numerical value, i.e., a score, that places him somewhere on the ability scale. This ability score will be denoted by theta (θ). At each ability level, there will be a certain probability that an examinee with that ability will give a correct answer to the item. This probability will be denoted by P(θ). In a typical test item, this probability will be small for examinees of low ability and large for examinees of high ability. The probability of correct response P(θ) for a two-parameter IRT model is calculated from the two-parameter logistic model [1] given below:

$$P(\theta) = \frac{1}{1 + e^{-\alpha(\theta - \beta)}} \tag{2.1}$$

where β is the difficulty parameter, α is the discrimination parameter, and θ is the ability level.

In the three-parameter logistic model, the guessing parameter (c) is included so that examinees can provide correct responses by guessing. Thus, the probability of correct response includes a small component that is due to guessing. The equation for the three-parameter model [1] is given below:

$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-\alpha(\theta - \beta)}} \tag{2.2}$$

The parameter c is the probability of getting the item correct by guessing alone.
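For concreteness, equations (2.1) and (2.2) are simple to compute directly. A minimal Python sketch (illustrative only; the function and variable names are not from the thesis):

    import math

    def icc_probability(theta, alpha, beta, c=0.0):
        """Item characteristic curve: probability of a correct response.
        c = 0 gives the two-parameter model (2.1); c > 0 gives the
        three-parameter model (2.2) with guessing parameter c."""
        logistic = 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))
        return c + (1.0 - c) * logistic

    # An item with difficulty beta = 1 and discrimination alpha = 1.5:
    print(icc_probability(0.0, 1.5, 1.0))          # 2PL: about 0.18
    print(icc_probability(0.0, 1.5, 1.0, c=0.25))  # 3PL: about 0.39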
2.1.1 Estimation of ability

The estimation of an examinee's ability when the item parameters are known is accomplished in a straightforward manner using maximum likelihood estimation [2]. Each examinee provides responses to questions, with each response being either correct or incorrect. Therefore, the probability that an examinee with a given ability level produces a given response on a question can be computed. The probability that an examinee with ability θ gives a response u_i on item i, where u_i = 1 for a correct response and u_i = 0 for an incorrect response, is denoted by P(u_i|θ). For a correct response, the probability P(u_i = 1|θ) is denoted by P_i(θ). As u_i is a binomial variable, the probability of a response u_i can be expressed as

$$P(u_i \mid \theta) = P_i^{u_i}(1 - P_i)^{1-u_i} = P_i^{u_i} Q_i^{1-u_i} \tag{2.3}$$

where Q_i = 1 - P_i. If an examinee with ability θ responds to n items, the joint probability of the responses u_1, u_2, ..., u_n can be denoted by P(u_1, u_2, ..., u_n | θ). Under a local independence assumption, u_1, u_2, ..., u_n are statistically independent. This implies that

$$P(u_1, u_2, \ldots, u_n \mid \theta) = P(u_1 \mid \theta)P(u_2 \mid \theta)\cdots P(u_n \mid \theta)$$

The probability of the vector of item responses for a given examinee ability is given by the likelihood function

$$\mathrm{Prob}(u \mid \theta) = \prod_{i=1}^{n} P_i^{u_i} Q_i^{1-u_i} \tag{2.4}$$

Taking the natural logarithm of the likelihood function yields [1]

$$L = \log \mathrm{Prob}(u \mid \theta) = \sum_{i=1}^{n} \left[ u_i \log P_i + (1 - u_i) \log Q_i \right] \tag{2.5}$$

The value of θ that maximizes log Prob(u|θ) is then

$$\hat{\theta} = \arg\max_{\theta} \{\log \mathrm{Prob}(u \mid \theta)\} \tag{2.6}$$

The large-sample variance of θ̂ [1] is given by

$$\sigma^2_{\hat{\theta}} = \frac{1}{\sum_{i=1}^{n} \alpha_i^2 P_i Q_i} \tag{2.7}$$

and the standard error [1] is

$$SE(\hat{\theta}) = \sqrt{\sigma^2_{\hat{\theta}}} \tag{2.8}$$
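As a small worked example (constructed here for illustration, not taken from the thesis data), suppose an examinee answers two items with α_1 = α_2 = 1, β_1 = 0, β_2 = 1, getting item 1 correct (u_1 = 1) and item 2 incorrect (u_2 = 0). Then L(0) = ln 0.500 + ln 0.731 ≈ -1.006, L(1) = ln 0.731 + ln 0.500 ≈ -1.006, and L(0.5) = 2 ln 0.622 ≈ -0.948, so the log-likelihood is largest between the two difficulties; by symmetry the maximum is exactly θ̂ = 0.5, where dL/dθ = Σ α_i(u_i - P_i) = 0.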
2.1.2 Estimation of item parameters

In parameter estimation procedures [1], examinees are grouped according to their ability. Suppose that k groups of m_j subjects possessing known ability scores θ_j, j = 1, ..., k, are drawn at random from a population of persons. Each subject has responded to a single dichotomously scored item. Out of the m_j subjects having ability θ_j, r_j give the correct response and m_j - r_j give an incorrect response. Let R = (r_1, ..., r_k) be the vector of the observed numbers of correct responses. The observed proportion of correct responses at ability θ_j is

$$p(\theta_j) = p_j = \frac{r_j}{m_j} \tag{2.9}$$

and the observed proportion of incorrect responses is

$$q(\theta_j) = q_j = \frac{m_j - r_j}{m_j} \tag{2.10}$$

The probability of R is given by the likelihood function

$$\mathrm{Prob}(R) = \prod_{j=1}^{k} \frac{m_j!}{r_j!\,(m_j - r_j)!}\; P_j^{r_j} Q_j^{m_j - r_j} \tag{2.11}$$

and the natural logarithm of the likelihood is [1]

$$L = \log \mathrm{Prob}(R) = \sum_{j=1}^{k} r_j \log P_j + \sum_{j=1}^{k} (m_j - r_j) \log Q_j \tag{2.12}$$

The values of the item parameters that maximize log Prob(R) are then given by

$$(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha,\beta)} \{\log \mathrm{Prob}(R)\} \tag{2.13}$$
2.1.3 Estimation of item and ability parameters

In the parameter estimation process, the number of parameters that need to be estimated depends on the item response model. For the two-parameter model, when N examinees take a test that has n items, the number of item parameters is 2n and the number of ability parameters is N. Thus the total number of parameters to be estimated for a two-parameter model is N + 2n. As both the item parameters and the abilities are unknown, there is a certain degree of indeterminacy in the model [2]. As the number of examinees increases, the number of estimated parameters increases, and this presents a potential estimation problem. Suppose that there are k normal populations having differing means μ_1, μ_2, ..., μ_k, but the same variance σ², and that x_ij is the i-th observation in the j-th population. Then

$$x_{ij} \sim N(\mu_j, \sigma^2), \qquad i = 1, \ldots, n;\; j = 1, \ldots, k. \tag{2.14}$$

Here N(μ, σ²) indicates a normally distributed variable with mean μ and variance σ². Since the density function of x_ij, given by f(x_ij | μ_j, σ²), is

$$f(x_{ij} \mid \mu_j, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{(x_{ij} - \mu_j)^2}{2\sigma^2}\right\}, \tag{2.15}$$

the likelihood function of the observations [x_1, x_2, ..., x_j, ..., x_k], where x_j = (x_1j, x_2j, ..., x_nj), is given by

$$L(x_1, \ldots, x_k \mid \mu_1, \ldots, \mu_k, \sigma^2) = \prod_{j=1}^{k}(2\pi\sigma^2)^{-n/2}\exp\left[-\sum_{i=1}^{n}\frac{(x_{ij}-\mu_j)^2}{2\sigma^2}\right] = (2\pi\sigma^2)^{-nk/2}\exp\left[-\sum_{j=1}^{k}\sum_{i=1}^{n}\frac{(x_{ij}-\mu_j)^2}{2\sigma^2}\right] \tag{2.16}$$

Taking logarithms, differentiating, and solving the resulting likelihood equations, we obtain the following estimators for μ_j and σ²:

$$\hat{\mu}_j = \sum_{i=1}^{n} x_{ij}/n \tag{2.17}$$

$$\hat{\sigma}^2 = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij} - \hat{\mu}_j)^2 / nk \tag{2.18}$$

Clearly,

$$E(\hat{\mu}_j) = \mu_j, \tag{2.19}$$

but

$$E(\hat{\sigma}^2) = \frac{\sigma^2(nk - k)}{nk} = \sigma^2(1 - 1/n) \tag{2.20}$$

This result shows that while μ̂_j is an unbiased estimator of μ_j, σ̂² is not an unbiased estimator of σ². Moreover, σ̂² is not a consistent estimator of σ², since the bias does not vanish as k → ∞ with n fixed. The number of unknown parameters increases as k increases. In this situation, the parameters μ_j, whose number increases with k, are called incidental parameters, while the parameter σ² is called the structural parameter. This problem has implications for the simultaneous estimation of item and ability parameters in item response models. With known item parameters, the maximum likelihood estimator of θ converges to the true value as the number of items increases. Similarly, when the true ability values are known, the maximum likelihood estimators of the item parameters converge to their true values as the number of examinees increases. However, when simultaneous estimation of item and ability parameters is attempted, the item parameters are the structural parameters and the ability parameters are the incidental parameters, since their number increases with the number of examinees. Just as in the normal-populations example, the estimators of the item (structural) parameters need not converge to their true values as the number of incidental (ability) parameters increases.
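The bias in (2.20) is easy to verify numerically. Below is a small simulation sketch (illustrative only, not one of the thesis experiments), with n = 2 observations per population, so that E(σ̂²) = σ²(1 - 1/2):

    import numpy as np

    rng = np.random.default_rng(0)
    k, n, sigma2 = 5000, 2, 1.0      # many populations, few observations each

    mu = rng.normal(size=k)                           # incidental parameters mu_j
    x = rng.normal(mu[:, None], np.sqrt(sigma2), size=(k, n))

    mu_hat = x.mean(axis=1)                           # unbiased for each mu_j
    sigma2_hat = ((x - mu_hat[:, None]) ** 2).sum() / (n * k)

    print(round(sigma2_hat, 3))      # near 0.5, not the true value 1.0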
2.1.4 Joint Maximum Likelihood Estimation

Because the actual values of the parameters of the items in a test are unknown, one of the important tasks performed when a test is examined within item response theory is to estimate these parameter values. The parameters of a single item are estimated under the assumption that the examinees' ability scores are known. In reality, however, these scores are not known. On the other hand, estimating examinee ability based on the examinee's responses to all the items of the test assumes that the values of the parameters of these items are known a priori. It is at this point that the Birnbaum [1] paradigm comes into play. Birnbaum proposed using a back-and-forth two-stage procedure to solve for the estimates of the parameters. In the first stage, the item parameters are estimated assuming known examinee abilities. In the second stage, the examinees' abilities are estimated assuming that the item parameter values are known. To initiate the process, a rough estimate of each examinee's ability is obtained; the standardized raw test score is commonly used as the initial known value of the examinee's ability. If each item is considered separately, then the single-item, known-ability situation can be used to estimate the item parameters using a maximum likelihood estimation procedure.

Although the estimation procedures could be derived on the basis of individual examinees, the logic of the estimation process is simpler if examinees are grouped according to ability. Suppose k groups of m_j subjects possessing known ability scores θ_j, j = 1, ..., k, are drawn at random from a population. Each subject responds to a single dichotomously scored item. Out of the m_j subjects having ability θ_j, r_j give the correct response and m_j - r_j give an incorrect response. Let R = (r_1, ..., r_k) be the vector of the observed numbers of correct responses. The observed proportion of correct responses at ability θ_j is then

$$p(\theta_j) = p_j = \frac{r_j}{m_j} \tag{2.21}$$

and the observed proportion of incorrect responses is

$$q(\theta_j) = q_j = \frac{m_j - r_j}{m_j} \tag{2.22}$$

It will be assumed that the observed r_j at each ability θ_j are binomially distributed with parameters m_j, P_j, where P_j is the true probability of a correct response. For the two-parameter logistic ICC model, the cumulative logistic distribution function is given by

$$P_j = P(\beta, \alpha, \theta_j) = \frac{1}{1 + e^{-(\beta + \alpha\theta_j)}} \tag{2.23}$$

The likelihood function is

$$\mathrm{Prob}(R) = \prod_{j=1}^{k} \frac{m_j!}{r_j!\,(m_j - r_j)!}\; P_j^{r_j} Q_j^{m_j - r_j} \tag{2.24}$$

and the natural logarithm of the likelihood function is

$$L = \log \mathrm{Prob}(R) = \sum_{j=1}^{k} r_j \log P_j + \sum_{j=1}^{k} (m_j - r_j) \log Q_j \tag{2.25}$$

First the item parameters are computed individually for each item. In the second stage, these parameter estimates are treated as the true item parameters, and the procedure for estimating examinee ability is carried out. A given examinee responds to the n items of a test and the responses are dichotomously scored, u_ij = 0, 1, where i designates the item, i = 1, ..., n, and j designates the examinee, j = 1, ..., N, yielding a vector of item responses of length n denoted by U_j = (u_1j, u_2j, u_3j, ..., u_nj | θ_j). Under a local independence assumption, the u_ij are statistically independent. Thus, the probability of the vector of item responses for a given examinee is given by the likelihood function

$$\mathrm{Prob}(U_j \mid \theta_j) = \prod_{i=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1-u_{ij}} \tag{2.26}$$

Taking the natural logarithm of the likelihood function yields

$$L = \log \mathrm{Prob}(U_j \mid \theta_j) = \sum_{i=1}^{n} \left[ u_{ij} \log P_{ij} + (1 - u_{ij}) \log Q_{ij} \right] \tag{2.27}$$

After this estimation procedure has been performed for each examinee, a vector of maximum likelihood estimates θ̂_j of length N is obtained. At this point a single cycle of the Birnbaum paradigm has been completed, and the initial crude estimate of each examinee's ability has been replaced by the second-stage set of maximum likelihood estimates θ̂_j. An overall convergence criterion is needed to determine when a sufficient number of cycles has been performed and a final set of item and ability parameter estimates has been obtained.
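A compact sketch of the Birnbaum cycle for the two-parameter model is given below (illustrative Python using coarse grid searches; the thesis's own stage-by-stage pseudocode appears in Chapter 3, and this sketch ignores issues such as anchoring the ability scale between cycles):

    import numpy as np

    def jmle(U, n_cycles=5):
        """Joint (Birnbaum) estimation for the 2PL model from an
        (examinees x items) 0/1 response matrix U."""
        thetas = np.linspace(-5, 5, 101)          # ability grid
        alphas = np.linspace(0.1, 2.0, 20)        # discrimination grid
        betas = np.linspace(-5, 5, 101)           # difficulty grid
        n_exam, n_item = U.shape

        # Stage 0: crude initial abilities from standardized raw scores
        raw = U.mean(axis=1)
        theta_hat = (raw - raw.mean()) / (raw.std() + 1e-9)
        alpha_hat, beta_hat = np.ones(n_item), np.zeros(n_item)

        for _ in range(n_cycles):
            # Stage 1: item parameters, abilities treated as known
            for i in range(n_item):
                best, best_ll = (1.0, 0.0), -np.inf
                for a in alphas:
                    for b in betas:
                        P = 1 / (1 + np.exp(-a * (theta_hat - b)))
                        ll = np.sum(U[:, i] * np.log(P) + (1 - U[:, i]) * np.log(1 - P))
                        if ll > best_ll:
                            best, best_ll = (a, b), ll
                alpha_hat[i], beta_hat[i] = best
            # Stage 2: abilities, item parameters treated as known
            P = 1 / (1 + np.exp(-alpha_hat * (thetas[:, None] - beta_hat)))
            for j in range(n_exam):
                ll = (U[j] * np.log(P) + (1 - U[j]) * np.log(1 - P)).sum(axis=1)
                theta_hat[j] = thetas[np.argmax(ll)]
        return theta_hat, alpha_hat, beta_hat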
2.2 Computerized adaptive testing

Computerized adaptive testing (CAT) is a powerful application of adaptive testing methodology that selects questions to maximize the precision of the estimate of an examinee's ability level. From the full collection of items, CAT selects the items that match the estimated ability level of the examinee. If the examinee gives the correct response to an item, a more difficult item is provided next; if the examinee gives an incorrect response, a less difficult item is provided. This is the advantage of CAT: it can shorten the duration of a test while still retaining good precision in the estimate of the ability level.

For an item bank that contains dichotomous items, e.g., multiple-choice questions, CAT selects the first item from the item bank and presents it to the examinee. The first item is usually chosen to be somewhat on the less difficult side [3]. This gives the examinee an initial feeling of accomplishment while still leaving them feeling challenged. After the examinee provides a response, the estimation process is carried out, resulting in an estimated ability level. An item of nearly the same level of difficulty as this estimate is then provided next. If the examinee gives the correct response, the estimate of the examinee's ability is increased; if the examinee gives an incorrect response, the estimated ability is decreased. The process continues until the stopping criterion is met. This criterion is very important in CAT. If the test is too short, the estimated ability may be inaccurate. If the test is too long, time and resources are wasted, and the examinee may tire and perform worse, leading to erroneous test results. The stopping criterion is usually either that the maximum test length is reached or that the ability measure is estimated with sufficient precision.
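The adaptive loop just described can be sketched as follows (an illustrative Python sketch: the selection rule is simply "closest difficulty", and a grid-search MLE of ability, as in section 2.1.1, stands in for the estimate and its standard error):

    import numpy as np

    def estimate_theta(items, responses):
        """Grid-search MLE of ability and its standard error (eqs. 2.6-2.8)."""
        thetas = np.linspace(-5, 5, 201)
        alpha = np.array([a for a, b in items])
        beta = np.array([b for a, b in items])
        u = np.array(responses)
        P = 1 / (1 + np.exp(-alpha * (thetas[:, None] - beta)))
        ll = (u * np.log(P) + (1 - u) * np.log(1 - P)).sum(axis=1)
        t = thetas[np.argmax(ll)]
        Pt = 1 / (1 + np.exp(-alpha * (t - beta)))
        return t, 1 / np.sqrt(np.sum(alpha**2 * Pt * (1 - Pt)))

    def run_cat(bank, true_theta, rng, max_items=30, se_target=0.3):
        """Administer the unused item whose difficulty is closest to the
        current estimate; stop on maximum length or sufficient precision."""
        theta_hat, used, resp = 0.0, [], []
        while len(used) < max_items:
            i = min((k for k in range(len(bank)) if k not in used),
                    key=lambda k: abs(bank[k][1] - theta_hat))
            a, b = bank[i]
            p = 1 / (1 + np.exp(-a * (true_theta - b)))
            resp.append(int(rng.random() < p))     # simulated response
            used.append(i)
            theta_hat, se = estimate_theta([bank[k] for k in used], resp)
            if se < se_target:
                break
        return theta_hat, se, len(used)

    rng = np.random.default_rng(1)
    bank = [(rng.uniform(0.5, 2.0), rng.uniform(-5, 5)) for _ in range(200)]
    print(run_cat(bank, true_theta=1.2, rng=rng))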
2.2.1 Information Function

While the variances and standard errors described by equation (2.8) are of interest when a given examinee's ability is estimated, it is also of interest to examine them over the whole ability scale. In IRT, this is accomplished through the item and test information functions, which reflect how well the individual items, and the test as a whole, estimate ability over the ability scale. One can conceptualize a conditional sampling distribution of ability estimates θ̂ about a common underlying ability level θ. The maximum likelihood estimator θ̂ has an asymptotically normal distribution with mean θ and variance σ² = 1/I(θ), where I(θ) is the amount of information and √(1/I(θ)) is referred to as the standard error [1]. Thus, the larger this variance is, the less precise the estimate of θ. The information function I(θ) is defined as

$$I(\theta) = -E\left\{\frac{d^2}{d\theta^2}\left[\ln L(u \mid \theta)\right]\right\}, \tag{2.28}$$

where L(u|θ) is the likelihood function. I(θ) can be expressed as

$$I(\theta) = \sum_{i=1}^{n} \frac{\left[P_i'(\theta)\right]^2}{P_i(\theta)\,Q_i(\theta)} \tag{2.29}$$

where P_i'(θ) is the derivative of P_i(θ). Since the right-hand side of the information function is a sum, it can be decomposed into the contribution of each item of the test to the amount of test information. The amount of information contributed by an individual item is therefore given by

$$I_i(\theta) = \frac{\left[P_i'(\theta)\right]^2}{P_i(\theta)\,Q_i(\theta)} \tag{2.30}$$

Expressing the test information function in terms of the variance of the conditional distribution of the maximum likelihood estimates of ability provides an interpretation of the amount of information: the greater the amount of information at a given ability level, the more closely the maximum likelihood estimates of ability will cluster around the true, but unknown, ability level, and the more precise the estimate will be.
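For the two-parameter logistic model the derivative is P_i'(θ) = α_i P_i(θ)Q_i(θ), so (2.30) reduces to I_i(θ) = α_i² P_i(θ)Q_i(θ), the quantity summed in the variance formula (2.7). A small sketch (illustrative):

    import numpy as np

    def item_information(theta, alpha, beta):
        """2PL item information: I_i(theta) = alpha^2 * P * Q (eq. 2.30)."""
        p = 1 / (1 + np.exp(-alpha * (theta - beta)))
        return alpha**2 * p * (1 - p)

    def test_information(theta, alphas, betas):
        """Test information is the sum of the item informations (eq. 2.29)."""
        return sum(item_information(theta, a, b) for a, b in zip(alphas, betas))

    # A 2PL item is most informative at theta = beta:
    print(item_information(np.array([-1.0, 0.0, 1.0]), alpha=1.5, beta=0.0))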
2.2.2 Local and Global Information

If the information around a small region of θ_0 is viewed as local information, then the information outside that region can be viewed as global information [4]. When θ_0 is unknown, local information may serve as a benchmark for item selection when there is sufficient knowledge about the location of θ_0, and global information might be preferred when such knowledge is lacking. In IRT, local information corresponds to the information function. For global information, given an examinee's responses x_1, x_2, ..., x_n to the n items in a test, the quantity that summarizes all the information about the examinee's θ is the likelihood function L(θ) = L(θ; x_1, x_2, ..., x_n). To distinguish any fixed θ_1 from θ_0, one can examine the difference between the values of L at θ_1 and θ_0. Such a difference can be captured by the ratio of these two values, resulting in the likelihood ratio method, which is optimal for testing the hypothesis θ = θ_0 versus the hypothesis θ = θ_1. In other words, it is the best way to tell θ_1 from θ_0 when the IRT model is assumed for x_1, x_2, ..., x_n. Kullback-Leibler (KL) information can be used to measure the discrepancy between the two probability distributions specified by θ_0 and θ_1. Let θ_0 be the true parameter. For any θ, the KL information of the i-th item (with response x_i) is defined by

$$K_i(\theta \parallel \theta_0) \equiv E_{\theta_0} \log\left[\frac{L_i(\theta_0;\, x_i)}{L_i(\theta;\, x_i)}\right], \tag{2.31}$$

where E_{θ_0} denotes expectation over x_i and

$$L_i(\theta_0;\, x_i) = P_i^{x_i}(\theta_0)\, Q_i^{1-x_i}(\theta_0) \tag{2.32}$$

is the likelihood function for the i-th item. KL information can be expressed explicitly as

$$K_i(\theta \parallel \theta_0) = P_i(\theta_0)\log\left[\frac{P_i(\theta_0)}{P_i(\theta)}\right] + \left[1 - P_i(\theta_0)\right]\log\left[\frac{1 - P_i(\theta_0)}{1 - P_i(\theta)}\right] \tag{2.33}$$

The purpose of CAT is to accurately estimate an examinee's θ_0 by efficiently selecting items, so it is desirable to have a quantity that distinguishes all θ ≠ θ_0 from θ_0. K is such a quantity: it is the weighted average of the log-likelihood ratio measures. For item i, as θ varies over the parameter space, K generates a global profile of the discrimination power of the item. For each θ_0, K is a function of θ, while the local information I is a fixed number; this is one of the key distinctions between K and I. Global information should be used when the number of items administered is small, and local information should be used when it is large. Thus, to design a good CAT, both global and local information are needed at different stages of the test.
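Equation (2.33) translates directly into code; a sketch for a two-parameter item (illustrative):

    import numpy as np

    def kl_information(theta, theta0, alpha, beta):
        """Kullback-Leibler item information K_i(theta || theta0), eq. (2.33)."""
        def p(t):
            return 1 / (1 + np.exp(-alpha * (t - beta)))
        p0, p1 = p(theta0), p(theta)
        return p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1))

    # K is zero at theta = theta0 and grows as theta moves away from it:
    print([round(kl_information(t, 0.0, 1.5, 0.0), 4) for t in (-1.0, 0.0, 1.0)])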
2.2.3 Termination criteria

The termination criterion is what determines the length of the CAT. Termination criteria studies in computerized adaptive testing [5] compared a large number of CAT termination criteria using multiple item banks. Four termination rules were examined to determine which led to the best CAT estimation of examinee ability: fixed-length termination, standard error termination, minimum information termination, and θ convergence. From these results, it was observed that CATs that administered more items yielded better θ estimates. CATs that were too short (e.g., fewer than 15 items) did not give good θ estimates in low ranges of θ. Termination criteria that terminated very quickly were also not stable. These results indicated that CAT administrators using standard error termination should use a standard error equal to or smaller than 0.315 for accurate measurement of θ in terms of bias. The variable termination criteria that performed best when taking test length and accuracy into consideration were the conditions that used a standard error below 0.315 as part of the termination rule. Ability-level convergence termination performed slightly worse than the standard error termination conditions. It was also found that minimum information termination administered too many items for large item banks. This study suggested that the best solution to the CAT termination issue might be to use one or more variable termination criteria in combination with a minimum-number-of-items constraint. Based on this research, 15 to 20 items appears to be a reasonable minimum number of items for variable-length CAT termination, depending on the precision needs of the test user.
2.3 Hint value Estimation

Previous work on measuring student learning with item response theory [6] demonstrated that giving hints could increase an examinee's ability, generally by about 0.8 standard deviations. Therefore, measured ability changes can be used to find the hint value. In that work the IRT two-parameter logistic model was used to calculate the probability of a correct response for item i, denoted by P(θ):

$$P(\theta) = \frac{1}{1 + e^{-\alpha_i(\theta - \beta_i)}} \tag{2.34}$$

The change in ability on each item (δ_hint) was obtained by fitting the end-of-path response data to the resulting equation:

$$\sum_{j=1}^{n_i} \frac{1}{1 + e^{-\alpha_i\left[(\theta_j + \delta_{hint}) - \beta_i\right]}} = n_{correct} \tag{2.35}$$

where the examinees' abilities (θ_j) and the item parameters (α_i, β_i) are found from the IRT estimation process. n_correct is the number of examinees who got the item correct out of the n_i examinees who answered item i in the test.

This study [6] presents a mathematical model and a procedure that generalize IRT to measure an examinee's ability change due to learning that occurs between successive attempts to answer a single item. It was shown that the ability change depended on the examinees' path through the tutoring, such that with help examinees performed 0.84 standard deviations better on their second attempt. It was also found that low-ability examinees benefited more from the tutoring than examinees having high ability.
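Since the left-hand side of (2.35) is monotonically increasing in δ_hint, the hint value can be recovered with any one-dimensional root finder. A sketch using scipy's Brent method (illustrative; Chapter 3 describes the MLE-based computation actually used in this thesis):

    import numpy as np
    from scipy.optimize import brentq

    def hint_value(thetas, alpha_i, beta_i, n_correct):
        """Solve eq. (2.35) for the ability shift delta_hint on item i.
        thetas: estimated abilities of the examinees who attempted the item;
        n_correct: how many of them answered it correctly."""
        def excess_correct(delta):
            p = 1 / (1 + np.exp(-alpha_i * ((thetas + delta) - beta_i)))
            return p.sum() - n_correct
        return brentq(excess_correct, -10.0, 10.0)   # monotone in delta

    thetas = np.random.default_rng(2).uniform(-2, 2, size=100)
    print(hint_value(thetas, alpha_i=1.2, beta_i=0.5, n_correct=70))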
2.4 Adaptive Learning

The concept of adaptation has recently become an important issue of research in the area of learning systems. The application of adaptation can provide a better learning environment for the examinee. With adaptation there is an attempt to provide a more appropriate learning experience based on a custom, or personalized, model of the goals, preferences, and level of knowledge of the individual examinee. Adaptive systems can thus attempt to offer different experiences to different examinees by taking into account information accumulated about an individual examinee during the testing process; this can be based on estimates of the examinee's ability. Many adaptive systems focus their adaptation efforts on assessment instead of on content presentation. For example, SIETTE [7] can offer hints with the question or provide feedback with the answer, focusing on cognitive diagnosis. An adaptive learning system has been developed for learning Chinese keyboarding skills [8] based on a combination of computerized adaptive testing (CAT) and item response theory (IRT), selecting the items to be presented to the student according to the student's estimated individual abilities. In a study of a personalized e-learning system using item response theory [9], a personalized e-learning system based on IRT was proposed. This system can dynamically estimate examinee ability using maximum likelihood estimation (MLE) by collecting examinee feedback after the examinee studies the recommended course materials. Based on the estimates of examinee abilities, the system can then recommend appropriate course materials to examinees. The proposed system provides learning paths that can be adapted to various levels of difficulty of course materials and various abilities of examinees. The experimental results of that study show that the proposed system can precisely provide personalized course material recommendations based on examinee abilities. In an application of an adaptive, web-based learning environment to oxidation-reduction reactions [10], the authors present an adaptive learning environment that offers two levels of adaptation. At the beginning of the course, students take a Group Embedded Figures Test (GEFT) in order to be classified according to their learning pattern as field-independent or field-dependent learners. In this way, students are divided into three groups (types) having three associated learning theories or web-design models: (1) a situated model for field-dependent students, (2) a constructivist model for field-independent science college students, and (3) a scaffolding model for field-independent non-science college students. Each learning system adapts dynamically to the student's progress by means of scenarios or storytelling, with feedback according to an expert's conceptual map and scaffolding assistance. Some intelligent tutoring systems (ITS) in the area of intelligent collaborative learning [11] use technologies for adaptive group formation and peer help [12]. Here a grouping strategy is followed based on students' evaluated conceptual graphs, which are calculated using Bayesian analysis. Students with complementary concepts of the curriculum are then grouped together in order to learn from each other.

In our research, which is presented next, we want to increase examinee ability to a predefined set point. In order to accomplish this we also need to estimate examinee ability with small errors. Therefore, we approach this problem, which is similar in concept to the dual control problem, by controlling the learning process and identifying, and then using, the error from the estimation process. The proposed system can accelerate examinee learning efficiency and effectiveness while simultaneously estimating the examinee's ability precisely.
3 Solutions and Results
3.1 The Data

For the experimental study in this research, both the two-parameter IRT model and the three-parameter IRT model are used. In the two-parameter IRT model, each question has two parameters that determine its characteristic curve. The first parameter is difficulty, denoted by beta (β), with each question having a different difficulty value. This value determines the location of the characteristic curve along the examinee ability scale. The second parameter is discrimination, denoted by alpha (α). This parameter describes how strongly the question can differentiate between different examinees. Generally speaking, one wants to select questions that have good discrimination values so as to differentiate between low and high ability. Finally, for each examinee, there is an ability level denoted by theta (θ). The measurement scale for examinee ability and question difficulty is limited to the range from -5 to +5: the lowest ability (and question difficulty) is set at -5 and the highest at +5. The discrimination value range is set to be between 0 and 2; the higher the value of discrimination, the more strongly the question discriminates between low and high ability. The three-parameter IRT model has the same parameters as the two-parameter IRT model but includes a third, guessing parameter. Guessing can be described as the probability of a correct response for examinees with very little ability, as an examinee who has very low ability still has a chance to give a correct answer. In our experiments, the guessing parameter value was fixed at 0.25 to model a multiple-choice question with 4 possible responses.

Apart from the guessing parameter, all of the parameter values were considered to be unknown initially. Only the individual responses to the questions were available for each examinee from the test. Each response was scored dichotomously: a correct answer receives a score of one and an incorrect answer receives a score of zero. Therefore, each examinee has a score for the test that places him somewhere on the ability scale. At each ability level, there is a certain probability that an examinee with that ability level will give a correct response to a given question. An IRT model can then be used to estimate all of the parameters and to estimate the probability of a correct response to each question given the examinee's ability level.

In the estimation process, we used Monte Carlo simulation to conduct numerical statistical experiments. As IRT provides an estimate of the probability of a correct response to each question for a given value of examinee ability, a response is generated for each question by using the question parameters and the examinee ability to calculate the probability of a correct response. The probability of a correct response is then compared with a random number sampled uniformly over the range 0 to 1 [1]. If the numerical value of the sample is less than the probability of a correct response, the generated response is scored as correct; otherwise it is scored as incorrect. This algorithm is described in more detail in Figure 3.1. For the experimental results in this thesis, almost all of the parameter distributions are chosen to be uniform distributions over fixed ranges. Using different distributions in the Monte Carlo experiments would result in different numerical values for the experimental results and study conclusions; however, the sensitivity of the initial experiments to other distributions, such as normally distributed parameter values, was investigated and shown to have minimal impact on the results.
Figure 3.1 Pseudocode for Generating Item Response by Monte Carlo Simulation

Input: θ and α_i, β_i, c_i for each question
Output: item response (u_i)

For i = 1 to (# of questions)
    z_i = α_i * (θ - β_i)
    If (two-parameter model) then prob_i = 1/(1 + exp(-z_i))
    If (three-parameter model) then prob_i = c_i + (1 - c_i) * 1/(1 + exp(-z_i))
    μ = random()                     # uniform sample on [0, 1]
    If (μ < prob_i) then u_i = 1
    else u_i = 0
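A runnable counterpart to Figure 3.1, vectorized over the question pool (a sketch; array names are illustrative):

    import numpy as np

    def generate_responses(theta, alpha, beta, c=None, rng=None):
        """Monte Carlo item responses per Figure 3.1: draw mu ~ U[0,1] and
        score the response correct (1) when mu falls below the model
        probability, incorrect (0) otherwise."""
        if rng is None:
            rng = np.random.default_rng()
        p = 1 / (1 + np.exp(-alpha * (theta - beta)))    # two-parameter model
        if c is not None:                                # three-parameter model
            p = c + (1 - c) * p
        return (rng.random(len(alpha)) < p).astype(int)

    rng = np.random.default_rng(0)
    alpha, beta = rng.uniform(0, 2, 100), rng.uniform(-5, 5, 100)
    u = generate_responses(theta=1.0, alpha=alpha, beta=beta, rng=rng)
    print(u[:10], u.mean())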
3.2 Estimation of parameters

3.2.1 Estimation of examinee ability

To estimate examinee ability, the model was simulated using randomly generated examinee abilities and question parameters. Difficulty values were generated from a uniform distribution with a range of -5 to 5, i.e., for μ ~ U[0,1], β = -5 + (5 - (-5))μ. Discrimination values were generated from a uniform distribution with a range of 0 to 2, i.e., μ ~ U[0,1], α = 0 + (2 - 0)μ. Examinee abilities were generated from a uniform distribution with a range of -5 to 5, i.e., μ ~ U[0,1], θ = -5 + (5 - (-5))μ. The true ability value was used to generate all of the responses: the two-parameter and three-parameter logistic models were used to calculate the probability of a correct response, which was compared with a random number generated from μ ~ U[0,1]. The probability of a correct response P(θ) for the two-parameter IRT model is calculated using the two-parameter logistic model [1] given below:

$$P(\theta) = \frac{1}{1 + e^{-\alpha(\theta - \beta)}} \tag{3.1}$$

where β is the difficulty parameter, α is the discrimination parameter, and θ is the ability level. In the three-parameter logistic model, examinees can also produce correct responses by guessing. Thus, the probability of a correct response includes a small component that is due to guessing. The equation for the three-parameter model [1] is given below:

$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-\alpha(\theta - \beta)}} \tag{3.2}$$

The parameter c is the probability of getting the item correct by guessing alone.

In maximum likelihood estimation procedures [2], each examinee provides responses to all of the questions, with responses scored as either correct or incorrect. Therefore, the probability that an examinee with a given ability level produces a given response on a question can be computed. The probability that an examinee with ability θ gives a response u_i on item i, where u_i = 1 for a correct response and u_i = 0 for an incorrect response, is denoted by P(u_i|θ). For a correct response, the probability P(u_i = 1|θ) is denoted by P_i(θ). Since u_i is a binomial variable, the probability of a response u_i can be expressed as

$$P(u_i \mid \theta) = P_i^{u_i}(1 - P_i)^{1-u_i} = P_i^{u_i} Q_i^{1-u_i} \tag{3.3}$$

where Q_i = 1 - P_i. If an examinee with ability θ responds to n items, the joint probability of the responses u_1, u_2, ..., u_n can be denoted by P(u_1, u_2, ..., u_n | θ). Under the local independence assumption, the u_1, u_2, ..., u_n are statistically independent. This implies that

$$P(u_1, u_2, \ldots, u_n \mid \theta) = P(u_1 \mid \theta)P(u_2 \mid \theta)\cdots P(u_n \mid \theta) \tag{3.4}$$

The probability of the vector of item responses for a given examinee ability is given by the likelihood function

$$\mathrm{Prob}(u \mid \theta) = \prod_{i=1}^{n} P_i^{u_i} Q_i^{1-u_i} \tag{3.5}$$

Taking the natural logarithm of the likelihood function [1] yields

$$L = \log \mathrm{Prob}(u \mid \theta) = \sum_{i=1}^{n} \left[ u_i \log P_i + (1 - u_i) \log Q_i \right] \tag{3.6}$$

The estimate is the value of θ that maximizes log Prob(u|θ):

$$\hat{\theta} = \arg\max_{\theta} \{\log \mathrm{Prob}(u \mid \theta)\} \tag{3.7}$$

The large-sample variance of θ̂ [1] is given by

$$\sigma^2_{\hat{\theta}} = \frac{1}{\sum_{i=1}^{n} \alpha_i^2 P_i Q_i} \tag{3.8}$$

and the standard error [1] is

$$SE(\hat{\theta}) = \sqrt{\sigma^2_{\hat{\theta}}} \tag{3.9}$$

This is described in more detail in Figure 3.2.
Figure 3.2 Pseudocode for Estimation of Examinee Ability by using MLE

Input: α_i, β_i, c_i for all questions;
       given responses (u_i) for all questions
Output: estimated ability (θ̂), standard error (SE)

max = -infinity; θ = -5
While (θ ≤ 5)
    sum(θ) = 0
    For i = 1 to (# of questions)
        z_i = α_i * (θ - β_i)
        If (two-parameter model) then prob_i = 1/(1 + exp(-z_i))
        If (three-parameter model) then prob_i = c_i + (1 - c_i) * 1/(1 + exp(-z_i))
        sum(θ) = sum(θ) + u_i * log(prob_i) + (1 - u_i) * log(1 - prob_i)
    If (sum(θ) > max) then max = sum(θ); θ̂ = θ
    θ += 0.1

SE = sqrt(1 / Σ_{i=1}^{n} α_i² * prob_i(θ̂) * (1 - prob_i(θ̂)))
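Figure 3.2 steps θ over a grid in increments of 0.1; the same maximization can also be done with a bounded one-dimensional optimizer, corresponding to the direct numerical optimization mentioned in Chapter 1. A sketch (illustrative):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def mle_theta(u, alpha, beta, c=None):
        """Direct MLE of ability on [-5, 5] plus the standard error (3.8)-(3.9)."""
        def prob(theta):
            p = 1 / (1 + np.exp(-alpha * (theta - beta)))
            return c + (1 - c) * p if c is not None else p

        def neg_log_lik(theta):
            p = prob(theta)
            return -np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

        theta_hat = minimize_scalar(neg_log_lik, bounds=(-5, 5), method="bounded").x
        p = prob(theta_hat)
        se = 1 / np.sqrt(np.sum(alpha**2 * p * (1 - p)))
        return theta_hat, se

    # Estimate from responses simulated at a true ability of 1.0:
    rng = np.random.default_rng(0)
    alpha, beta = rng.uniform(0, 2, 100), rng.uniform(-5, 5, 100)
    u = (rng.random(100) < 1 / (1 + np.exp(-alpha * (1.0 - beta)))).astype(int)
    print(mle_theta(u, alpha, beta))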
Experimental results were generated by using maximum likelihood estimation, maximizing the log-likelihood function to estimate examinee ability. Delta values are defined as the average difference between the true ability and the estimated ability over 100 examinees. We also examined different numbers of questions to see the impact of changing the question pool size on the estimation. In the three-parameter model, we included the guessing parameter, denoted by c, in the estimation. The guessing parameter value was fixed at 0.25, which represents the probability of giving a correct answer among four choices. Note that the literature suggests that the guessing parameter value should not be chosen too high, as it might affect the estimation process [3]. In this initial set of experiments, additional parameter distributions were used alongside the uniform distribution to determine the sensitivity of the Monte Carlo simulation results to the choice of parameter distributions. Looking at the results in Table 3.1, when the number of questions is increased, there is an improvement in the average accuracy of the estimates, and the standard deviations of the experimental deltas also decrease. The deltas using the three-parameter model were higher than when using the two-parameter model. We also note that the two-parameter logistic model was more robust than the three-parameter logistic model. All data distributions gave similar results in these experiments.
Table 3.1 Results of estimation of theta by using MLE (all parameters were generated from uniform distribution)

Model                              Average Delta   SD Delta
2-parameter with 100 questions     0.2878          0.2097
2-parameter with 1,000 questions   0.0933          0.0705
3-parameter with 100 questions     0.3599          0.2808
3-parameter with 1,000 questions   0.1211          0.0939
Table 3.2 Results of estimation of theta by using MLE (beta and alpha were generated from uniform distribution and theta was generated from normal distribution)

Model                              Average Delta   SD Delta
2-parameter with 100 questions     0.2679          0.1799
2-parameter with 1,000 questions   0.0815          0.0614
3-parameter with 100 questions     0.3231          0.2417
3-parameter with 1,000 questions   0.1163          0.0949
Table 3.3 Results of estimation of theta by using MLE (theta and beta were generated from uniform distribution and alpha was generated from exponential distribution)

Model                              Average Delta   SD Delta
2-parameter with 100 questions     0.2617          0.2013
2-parameter with 1,000 questions   0.0874          0.0701
3-parameter with 100 questions     0.3462          0.2909
3-parameter with 1,000 questions   0.1161          0.0987
We also performed experiments using the iterative Newton-Raphson procedure [1] to maximize the log-likelihood function and obtain the estimated examinee ability. The first and second derivatives of the log-likelihood with respect to ability are needed for this. We note that [1]

$$\frac{\partial P_{ij}}{\partial \theta_j} = \alpha_i P_{ij} Q_{ij} \qquad\text{and}\qquad \frac{\partial Q_{ij}}{\partial \theta_j} = -\alpha_i P_{ij} Q_{ij}$$

The first derivative of the log-likelihood with respect to θ_j is

$$\frac{\partial L}{\partial \theta_j} = \sum_{i=1}^{n} u_{ij}\,\frac{1}{P_{ij}}\left(\alpha_i P_{ij} Q_{ij}\right) - \sum_{i=1}^{n} (1 - u_{ij})\,\frac{1}{Q_{ij}}\left(\alpha_i P_{ij} Q_{ij}\right) = \sum_{i=1}^{n} \alpha_i\left(u_{ij} - P_{ij}\right) \tag{3.10}$$

Letting W_{ij} = P_{ij} Q_{ij},

$$\frac{\partial L}{\partial \theta_j} = \sum_{i=1}^{n} \alpha_i W_{ij}\left[\frac{u_{ij} - P_{ij}}{P_{ij} Q_{ij}}\right] \tag{3.11}$$

The second-order derivative of the log-likelihood function with respect to θ_j is

$$\frac{\partial^2 L}{\partial \theta_j^2} = \frac{\partial}{\partial \theta_j}\left[\sum_{i=1}^{n} \alpha_i\left(u_{ij} - P_{ij}\right)\right] = -\sum_{i=1}^{n} \alpha_i^2 P_{ij} Q_{ij} = -\sum_{i=1}^{n} \alpha_i^2 W_{ij} \tag{3.12}$$

The Newton-Raphson technique is used to obtain the estimate of the ability parameter via an iterative procedure. For a given examinee, a Newton-Raphson equation can be established and solved iteratively for the maximum likelihood estimate of ability:

$$[\hat{\theta}_j]_{t+1} = [\hat{\theta}_j]_t - \left[\frac{\partial^2 L}{\partial \theta_j^2}\right]_t^{-1}\left[\frac{\partial L}{\partial \theta_j}\right]_t \tag{3.13}$$

Substituting the derivatives above, the Fisher scoring equation for estimating ability is

$$[\hat{\theta}_j]_{t+1} = [\hat{\theta}_j]_t - \left[\frac{\sum_{i=1}^{n} \alpha_i W_{ij}\left[\dfrac{u_{ij} - P_{ij}}{P_{ij} Q_{ij}}\right]}{-\sum_{i=1}^{n} \alpha_i^2 W_{ij}}\right]_t \tag{3.14}$$

This is described in more detail in Figure 3.3.
Figure 3.3 Pseudocode for Estimation of Examinee Ability by using Newton-Raphson

Input: α_i, β_i, c_i for all questions;
       given responses (u_i) for all questions
Output: estimated ability (θ̂)

θ = 0
For it = 1 to (# of iterations)
    sumnum = 0; sumden = 0
    For i = 1 to (# of questions)
        z_i = α_i * (θ - β_i)
        prob_i = 1/(1 + exp(-z_i))
        If (two-parameter model) then
            w = prob_i * (1 - prob_i)
            v = u_i - prob_i
            sumnum = sumnum + α_i * v
            sumden = sumden + α_i * α_i * w
        If (three-parameter model) then
            p3_i = c_i + (1 - c_i) * prob_i
            w = p3_i * (1 - p3_i)
            v = u_i - p3_i
            ratio = prob_i / p3_i
            sumnum = sumnum + α_i * v * ratio
            sumden = sumden + α_i * α_i * w * ratio * ratio
    delta = sumnum / sumden
    θ = θ + delta
θ̂ = θ
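A runnable counterpart to Figure 3.3 for the two-parameter case (a sketch). When Σ α_i² P_i Q_i is very small, i.e., when every item is far too easy or far too hard for the current estimate, the step Δθ becomes huge, which is consistent with the out-of-range estimates reported below:

    import numpy as np

    def newton_theta(u, alpha, beta, n_iter=20, theta=0.0):
        """Newton-Raphson / Fisher scoring for ability, eq. (3.14), 2PL case."""
        for _ in range(n_iter):
            p = 1 / (1 + np.exp(-alpha * (theta - beta)))
            num = np.sum(alpha * (u - p))           # first derivative dL/d(theta)
            den = np.sum(alpha**2 * p * (1 - p))    # Fisher information
            theta = theta + num / den               # no clamping: can leave [-5, 5]
        return theta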
It was found that the estimated abilities produced by the Newton-Raphson method were often outside our defined range. Thus, this procedure might not be reliable. Therefore, in the remainder of this thesis we performed direct maximum likelihood estimation, i.e., direct numerical maximization of the log-likelihood function L, for all estimation procedures in our experiments.
Table 3.4 Results of estimation of theta by using Newton-Raphson

Examinee   True Theta   Estimated (3-param)   Delta (3-param)   Estimated (2-param)   Delta (2-param)
1          4.2962       4.3041                0.0079            4.5951                0.2990
2          -1.8362      -2.4423               0.6061            -1.8153               0.0210
3          -3.1608      -3.1781               0.0172            -3.1781               0.0172
4          -2.9544      -1.7610               1.1934            -2.9444               0.0100
5          0.6773       0.6931                0.0159            0.5754                0.1019
6          0.9554       0.7538                0.2017            0.8473                0.1081
7          4.6451       25.0000               20.3549           25.0000               20.3549
8          1.5318       1.5622                0.0304            1.5856                0.0539
9          2.4891       2.8764                0.3873            3.1781                0.6890
10         1.5357       1.2272                0.3085            1.0986                0.4371
3.2.2 Estimation of item parameters

In order to estimate item (or question) parameters, the model is simulated using randomly generated question parameters and examinee abilities. Difficulty values were generated from a uniform distribution with a range of -5 to 5, i.e., μ ~ U[0,1] and β = -5 + (5 - (-5))μ. Discrimination values were generated from a uniform distribution with a range of 0 to 2, i.e., μ ~ U[0,1] and α = 0 + (2 - 0)μ. Examinee abilities were generated from a uniform distribution with a range of -5 to 5, i.e., μ ~ U[0,1] and θ = -5 + (5 - (-5))μ. The true ability is used to generate all of the responses, using the two-parameter and three-parameter logistic models to calculate the probability of a correct response and comparing it with a random number drawn from U[0,1]. In parameter estimation procedures [1], examinees are grouped according to their ability. Suppose that k groups of m_j subjects possessing known ability scores θ_j, j = 1, ..., k, are drawn at random from a population of persons. Each subject has responded to a single dichotomously scored item. Out of the m_j subjects having ability θ_j, r_j gave the correct response and m_j - r_j the incorrect response. Let R = (r_1, ..., r_k) be the vector of the observed numbers of correct responses. The observed proportion of correct responses at ability θ_j is

$$p(\theta_j) = p_j = \frac{r_j}{m_j} \tag{3.15}$$

and the observed proportion of incorrect responses is

$$q(\theta_j) = q_j = \frac{m_j - r_j}{m_j} \tag{3.16}$$

The probability of R is given by the likelihood function

$$\mathrm{Prob}(R) = \prod_{j=1}^{k} \frac{m_j!}{r_j!\,(m_j - r_j)!}\; P_j^{r_j} Q_j^{m_j - r_j} \tag{3.17}$$

and the natural logarithm of the likelihood [1] is

$$L = \log \mathrm{Prob}(R) = \sum_{j=1}^{k} r_j \log P_j + \sum_{j=1}^{k} (m_j - r_j) \log Q_j \tag{3.18}$$

The estimates are the values of the parameters that maximize log Prob(R):

$$(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha,\beta)} \{\log \mathrm{Prob}(R)\} \tag{3.19}$$

This is described in more detail in Figure 3.4.
Figure 3.4 Pseudocode for Estimation of Parameters (β and α) by using MLE

Input: θ_j for all examinees; given responses (u_j) to one question
Output: estimated parameters (β̂, α̂)

# Group the examinees by ability and count correct/incorrect responses
For j = 1 to (# of examinees)
    k = ability group containing θ_j
    If (u_j == 1) then correct_k += 1
    else incorrect_k += 1

max = -infinity; β = -5
While (β ≤ 5)
    α = 0.01
    While (α ≤ 2)
        sum(β, α) = 0
        For k = 1 to (# of ability groups)
            z = α * (θ_k - β)
            If (two-parameter model) then prob = 1/(1 + exp(-z))
            If (three-parameter model) then prob = c + (1 - c) * 1/(1 + exp(-z))
            sum(β, α) = sum(β, α) + correct_k * log(prob) + incorrect_k * log(1 - prob)
        If (sum(β, α) > max) then max = sum(β, α); β̂ = β; α̂ = α
        α += 0.1
    β += 0.1
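For reference, the grid search of Figure 3.4 can be sketched in Python as follows. This is a minimal sketch, not the thesis code: it assumes the responses have already been tallied into per-group correct/incorrect counts, and it uses the difficulty/discrimination form z = α(θ - β) directly, so no slope-intercept conversion is needed.

import numpy as np

def estimate_item(theta, correct, incorrect, c=0.0):
    """Grid-search MLE of (beta, alpha) from grouped response counts.
    theta[k]: ability of group k; correct[k]/incorrect[k]: response counts.
    c > 0 gives the three-parameter model with a fixed guessing parameter."""
    best, best_val = (None, None), -np.inf
    for beta in np.arange(-5.0, 5.0 + 1e-9, 0.1):
        for alpha in np.arange(0.01, 2.0 + 1e-9, 0.1):
            p = c + (1.0 - c) / (1.0 + np.exp(-alpha * (theta - beta)))
            val = np.sum(correct * np.log(p) + incorrect * np.log(1.0 - p))
            if val > best_val:
                best, best_val = (beta, alpha), val
    return best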
From the results, when the number of examinees is increased, the deltas between the true
parameter values and the estimated values decrease. As examinees are grouped within a
specified ability range, changing the range had an impact on the estimation process: reducing
the range size generally increased the accuracy of the estimation process. However, the
standard deviation typically did not change with different ranges, because reducing the range
size did not reduce the variance in the estimation process. In the three-parameter logistic
model, the guessing parameter was fixed at 0.25 in order to find the difficulty (β) and
discrimination (α) values. The results from using the two-parameter logistic model were close
to those of the three-parameter logistic model for the same number of examinees.
Table 3.5 Results of Estimation of Question Parameters by using MLE

Model and number of examinees    | Range | Beta Avg Delta | Beta SD Delta | Alpha Avg Delta | Alpha SD Delta
2-parameter, 100 examinees       | 0.5   | 0.4446 | 0.4183 | 0.2162 | 0.1400
2-parameter, 1,000 examinees     | 0.5   | 0.2457 | 0.1538 | 0.0654 | 0.0926
2-parameter, 100 examinees       | 0.1   | 0.4588 | 0.4347 | 0.2262 | 0.1597
2-parameter, 1,000 examinees     | 0.1   | 0.1467 | 0.1298 | 0.0654 | 0.0923
2-parameter, 100 examinees       | 0.05  | 0.4615 | 0.4263 | 0.2262 | 0.1597
2-parameter, 1,000 examinees     | 0.05  | 0.1249 | 0.1319 | 0.0654 | 0.0926
3-parameter, 100 examinees       | 0.5   | 0.4219 | 0.3206 | 0.2491 | 0.1938
3-parameter, 1,000 examinees     | 0.5   | 0.2827 | 0.1361 | 0.1157 | 0.1457
3-parameter, 100 examinees       | 0.1   | 0.3619 | 0.3343 | 0.2391 | 0.1975
3-parameter, 1,000 examinees     | 0.1   | 0.1427 | 0.0933 | 0.1257 | 0.1440
3-parameter, 100 examinees       | 0.05  | 0.3519 | 0.3412 | 0.2491 | 0.1937
3-parameter, 1,000 examinees     | 0.05  | 0.1127 | 0.0935 | 0.1113 | 0.1509
3.2.3 Estimation of ability and item parameters
When both ability and item parameters are unknown, it is possible to perform a joint
maximum likelihood estimation procedure to estimate these parameters. When question
parameters were estimated (Section 3.2.2), it was assumed that we knew all examinee
abilities. In reality, these values are not known a priori. On the other hand, when examinee
ability was estimated based on the examinee responses to all questions, it was assumed that
we knew the question parameters. Here the Birnbaum [1] paradigm comes into play. Birnbaum
proposed using a back-and-forth, two-stage procedure to solve for the estimates of the
parameters. In the first stage, all question parameters are estimated assuming the examinee
abilities are known. In the second stage, examinee abilities are estimated assuming that the
question parameters are known. Consider N examinees responding to the n questions of the
test, with the responses dichotomously scored, u_ij = 0, 1, where i designates the question,
i = 1, ..., n, and j designates the examinee, j = 1, ..., N. The resulting N by n matrix of
item responses is denoted by U = [u_ij], and θ is the vector of the N examinee ability
scores, θ = (θ_1, ..., θ_N). Thus, the probability of the N × n matrix of item responses is
given by the likelihood function

    Prob(U|θ) = \prod_{j=1}^{N} \prod_{i=1}^{n} P_i(θ_j)^{u_{ij}} Q_i(θ_j)^{1 - u_{ij}}    (3.20)

To simplify the notation, let P_i(θ_j) = P_ij and Q_i(θ_j) = Q_ij. Taking the logarithm of
the likelihood function yields

    L = \log Prob(U|θ) = \sum_{j=1}^{N} \sum_{i=1}^{n} [u_{ij} \log P_{ij} + (1 - u_{ij}) \log Q_{ij}]    (3.21)
The procedure is initialized by assigning all examinee abilities to zero. All question
parameters are then estimated separately by using the maximum likelihood estimation procedure
described in the section above. Then examinee abilities are estimated separately by using
these estimated parameters. This procedure is continued for a given number of iterations.
This is described in more detail in Figure 3.5.
Figure 3.5 Pseudocode for Joint Maximum Likelihood Estimation
Input: Given responses (u_ij) for all questions from all examinees
Output: Estimated abilities (θ̂), estimated parameters (β̂, α̂)
Initialize θ̂ = 0
For it = 1 to (# of iterations)
    For i = 1 to (# of questions)
        β̂_i, α̂_i = run question parameter estimation procedure(θ̂)
    θ̂ = run examinee estimation procedure(β̂, α̂)
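A minimal sketch of Birnbaum's two-stage alternation is shown below. It reuses the estimate_theta helper from the earlier ability-estimation sketch, and the per-item step here maximizes the 2PL likelihood with a generic scipy optimizer rather than the grid search of Figure 3.4.

import numpy as np
from scipy.optimize import minimize

def estimate_item_2pl(theta, u):
    """MLE of (beta, alpha) for one item with abilities held fixed."""
    def neg_ll(params):
        beta, alpha = params
        p = 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))
        p = np.clip(p, 1e-9, 1 - 1e-9)          # guard the logarithms
        return -np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    res = minimize(neg_ll, x0=[0.0, 1.0], bounds=[(-5, 5), (0.01, 2)])
    return res.x

def joint_mle(U, n_iters=50):
    """U: N-by-n 0/1 response matrix (examinees by questions)."""
    N, n = U.shape
    theta = np.zeros(N)                          # initialize all abilities to 0
    beta, alpha = np.zeros(n), np.ones(n)
    for _ in range(n_iters):
        for i in range(n):                       # stage 1: item parameters
            beta[i], alpha[i] = estimate_item_2pl(theta, U[:, i])
        for j in range(N):                       # stage 2: abilities
            theta[j] = estimate_theta(alpha, beta, U[j, :])
    return theta, beta, alpha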
From the results, the deltas between the true values and the estimated values are seen to be
very large. This could be a function of the inherent indeterminacy in the model. Another
possibility is that the problem has local maxima: the described procedure might find a local
maximum but not the global maximum. This is a common problem in back-and-forth, iterative
solution procedures such as the one proposed by Birnbaum.
Table 3.6 Results of Joint Maximum Likelihood Estimation

                                                       Average delta
Iterations | Number of Examinees and Questions  | Theta  | Beta   | Alpha
50         | 300 examinees and 50 questions     | 3.4537 | 3.3576 | 0.4746
50         | 300 examinees and 100 questions    | 2.9010 | 3.0901 | 0.4837
100        | 300 examinees and 50 questions     | 3.4537 | 3.3577 | 0.4746
100        | 300 examinees and 100 questions    | 2.9010 | 3.0901 | 0.4837
50         | 1,000 examinees and 50 questions   | 3.4316 | 3.2966 | 0.4386
3.3 Item selection methods

The estimation of examinee ability and question parameters is described above. Next, the
item selection procedures used in CAT are described. In the experiments that were conducted,
a single question is first given to examinees. This question is selected based on the initial
examinee ability value being zero, i.e., θ̂_0 = 0. After the first response is obtained from
the examinee, one can obtain a rough estimate of the examinee ability using the estimation
process described above. Next, questions are selected sequentially by using an item selection
method, according to the current estimate of the examinee ability. The advantage of CAT is
that it can provide more efficient estimates with fewer questions than fixed-sequence paper
testing. We used two item selection methods in our experiments: one based on Maximum
Information [1] and one based on the Kullback-Leibler [4] information measure. We compared
the results from both of these methods to determine which one performed better. For the
Maximum Information method, the Item Information Function is used to find the amount of
information for each question in the current item pool based on the currently estimated
ability level (θ̂). For each item k, the Item Information Function [1] is defined as

    I_k(θ̂) = \frac{[P_k'(θ̂)]^2}{P_k(θ̂) Q_k(θ̂)} = α_k^2 P_k(θ̂) Q_k(θ̂)    (3.22)

This is described in more detail in Figure 3.6.
Figure 3.6 Pseudocode for Maximum Information
Input: θ̂ and α_k, β_k, c_k for each question
Output: next question
For k = 1 to (# of questions)
    z_k = α_k * (θ̂ - β_k)
    If (two-parameter model) then prob_k = 1/(1 + exp(-1 * z_k))
    If (three-parameter model) then prob_k = c_k + (1 - c_k) * 1/(1 + exp(-1 * z_k))
    I_k(θ̂) = (α_k)² * prob_k * (1 - prob_k)
    If (I_k(θ̂) > max) then max = I_k(θ̂) and next question = question k
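A minimal sketch of this selection rule for the two-parameter model, using the information function of Equation 3.22:

import numpy as np

def select_max_information(theta_hat, alpha, beta, available):
    """Return the index (from `available`) of the most informative item."""
    a, b = alpha[available], beta[available]
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    info = a**2 * p * (1 - p)                 # I_k = alpha_k^2 * P_k * Q_k
    return available[int(np.argmax(info))]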
The information measure is computed for all questions, and the question that provides the
greatest value is selected. For the Kullback-Leibler method, note that at the early stages of
CAT only a few questions have been used to estimate ability, so the estimated ability is not
close to the true ability. This item selection method is therefore based on average global
information, where the Kullback-Leibler (KL) [4] measure is defined as follows. Given the
current estimate θ̂_{k-1}, the kth item i_k is selected from the set R_k of candidate items
remaining in the pool as the one with the greatest integrated KL information:

    i_k ≡ \arg\max_i \left\{ \int_{θ̂_{k-1} - δ_k}^{θ̂_{k-1} + δ_k} K_i(θ ‖ θ̂_{k-1}) \, dθ : i ∈ R_k \right\}, \quad δ_k = 3/\sqrt{k}    (3.23)

where

    K_i(θ ‖ θ_0) = P_i(θ_0) \log\left(\frac{P_i(θ_0)}{P_i(θ)}\right) + [1 - P_i(θ_0)] \log\left(\frac{1 - P_i(θ_0)}{1 - P_i(θ)}\right)    (3.24)

This is described in more detail in Figure 3.7.
Figure 3.7 Pseudocode for Kullback Leibler Information
Input: θ̂ and α_k, β_k for each question; n = # of questions answered so far
Output: next question
delta = 3/sqrt(n)
integral from = θ̂ - delta
integral to = θ̂ + delta
For k = 1 to (# of questions)
    z_quiz = α_k * (θ̂ - β_k)
    prob_quiz = 1/(1 + exp(-1 * z_quiz))
    sum K_k = 0
    θ = integral from
    while (θ <= integral to)
        z = α_k * (θ - β_k)
        prob = 1/(1 + exp(-1 * z))
        K_k(θ) = prob_quiz * log(prob_quiz/prob)
                 + (1 - prob_quiz) * log((1 - prob_quiz)/(1 - prob))
        sum K_k = sum K_k + K_k(θ)
        θ += 0.1
    If (sum K_k > max) then max = sum K_k and next question = question k
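A sketch of this computation, approximating the integral in Equation 3.23 with a rectangle rule of step 0.1 as in Figure 3.7 (n_answered is the number of items administered so far):

import numpy as np

def select_kl(theta_hat, alpha, beta, available, n_answered, step=0.1):
    """Pick the item maximizing the integrated KL information."""
    delta = 3.0 / np.sqrt(max(n_answered, 1))
    grid = np.arange(theta_hat - delta, theta_hat + delta + 1e-9, step)
    best, best_val = available[0], -np.inf
    for k in available:
        p0 = 1.0 / (1.0 + np.exp(-alpha[k] * (theta_hat - beta[k])))
        p = 1.0 / (1.0 + np.exp(-alpha[k] * (grid - beta[k])))
        kl = p0 * np.log(p0 / p) + (1.0 - p0) * np.log((1.0 - p0) / (1.0 - p))
        val = np.sum(kl) * step               # rectangle-rule integral
        if val > best_val:
            best, best_val = k, val
    return best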
A CAT system uses the selection method to decide which question is the most appropriate to
give in order to estimate examinee ability. It checks all questions in the item bank and
determines the question that is expected to provide the greatest information based on the
current estimate of examinee ability. The question is then removed from the item pool after
it is given to the examinee. The estimation process and selection process are then continued
until the system reaches the stop criterion. The stop criterion used in this experiment is a
limit on the number of questions that are provided to estimate examinee ability.
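Putting the pieces together, the overall loop might look like the following sketch. It reuses the select_max_information and estimate_theta helpers from the earlier sketches, and it simulates the examinee's responses from the true ability, as in the Monte Carlo experiments.

import numpy as np

def run_cat(true_theta, alpha, beta, n_questions, rng):
    """One simulated CAT session; returns the final ability estimate."""
    available = list(range(len(alpha)))
    answered, responses = [], []
    theta_hat = 0.0                                  # initial ability estimate
    for _ in range(n_questions):                     # stop criterion: count
        k = int(select_max_information(theta_hat, alpha, beta,
                                       np.array(available)))
        available.remove(k)                          # item leaves the pool
        p = 1.0 / (1.0 + np.exp(-alpha[k] * (true_theta - beta[k])))
        responses.append(float(rng.uniform() < p))   # simulated response
        answered.append(k)
        theta_hat = estimate_theta(alpha[answered], beta[answered],
                                   np.array(responses))
    return theta_hat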
The results show that for both methods, when the number of questions (in the stop criterion)
is increased, the estimation of examinee abilities improves, i.e., the delta values decrease
correspondingly. The results showed a slight improvement from using the Maximum Information
method as compared to the Kullback-Leibler method.
Table 3.7 Results of Estimation of Examinee Ability with Item Selection Methods

Method              | Number of Questions | Average Delta | SD Delta
Maximum Information | 10  | 0.3330 | 0.2504
Maximum Information | 20  | 0.2303 | 0.1636
Maximum Information | 50  | 0.1620 | 0.1185
Maximum Information | 100 | 0.1510 | 0.1086
Kullback Leibler    | 10  | 0.3462 | 0.2418
Kullback Leibler    | 20  | 0.2515 | 0.1879
Kullback Leibler    | 50  | 0.1956 | 0.1438
Kullback Leibler    | 100 | 0.1257 | 0.0946
3.4 Increasing examinee ability

In this research, the goal was to propose an approach to increase the examinee ability;
thus, learning capabilities need to be provided to the examinee. Examinee ability is
increased by providing appropriate hints that are matched to the examinee ability level, with
the assumption that the examinee can understand these hints. In the learning process, all
questions in the item bank are divided into several sets, such that each set of questions has
a hint which will help an examinee respond to any question in that set. The proposed
algorithm decides when to give a hint and when to give a question in order to satisfy the
objective goal. If the algorithm decides to give a question, the examinee ability estimation
process described previously is used. If the system decides to provide a hint, the hint
associated with the next selected question is used. If an examinee understands the hint, the
examinee ability is increased from the estimated ability to a new ability, which can then be
estimated by adding the hint value to the estimated ability.
To apply this approach, parameter values that characterize the hints are needed, with each
hint being characterized by two parameters: the first is the hint value and the second is the
hint difficulty. The hint value specifies how much the examinee ability will be increased
after the hint is given to an examinee. The hint difficulty determines the extent to which an
examinee understands the hint. Three different hint learning/understanding models were
examined. In all of these models the assumption was made that if the examinee ability was not
in a specified delta range, then the examinee would not understand the hint, with delta
defined as the difference between the examinee ability and the hint difficulty. Figure 3.8
illustrates the three different probability models for whether or not a given hint is
understood; it depicts the probability of understanding the hint as a function of delta. In
the first model, values of delta falling within the range 0 to 5 have a probability of 1 of
understanding the hint, and 0 otherwise. In the second model, values of delta falling within
the range 0 to 2 have a probability of 1, and 0 otherwise. In the third model, the possible
values of delta were divided into three ranges, -1 to 1, 1 to 2, and 2 to 3: in the middle
range the probability of understanding the hint is constant at 0.5, while over -1 to 1 (and
2 to 3) the probability linearly increases (decreases), as shown. Thus, using these models,
the examinee ability will be increased by the product of the probability of understanding and
the hint value.
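The three models can be written down directly, as in the sketch below; note that the exact plateau height and breakpoints of model 3 are read off Figure 3.8 and are assumptions here.

def p_understand(delta, model):
    """Probability of understanding a hint as a function of delta."""
    if model == 1:                          # understood iff delta in [0, 5]
        return 1.0 if 0.0 <= delta <= 5.0 else 0.0
    if model == 2:                          # understood iff delta in [0, 2]
        return 1.0 if 0.0 <= delta <= 2.0 else 0.0
    if model == 3:                          # piecewise linear over [-1, 3]
        if -1.0 <= delta < 1.0:
            return 0.5 * (delta + 1.0) / 2.0   # linear rise to the plateau
        if 1.0 <= delta < 2.0:
            return 0.5                         # assumed plateau height
        if 2.0 <= delta <= 3.0:
            return 0.5 * (3.0 - delta)         # linear fall back to 0
        return 0.0
    raise ValueError("model must be 1, 2, or 3")

# Expected ability gain = p_understand(delta, model) * hint_value,
# matching the product rule stated in the text.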
Figure 3.8 Given hint from three models

In the Monte Carlo experiments, several numbers of questions with 20 hints and three
different examinee ability values were examined. From the results, model 1 and model 3 had a
slightly greater chance of increasing examinee ability than model 2. For a high ability such
as 3.694, all models gave similar results, since there was very little chance of increasing
examinee ability further; the abilities therefore remained the same as without learning.
Table 3.8 Results of given hint from three models

Ability | Number of questions | Without learning | Learning model 1 | Learning model 2 | Learning model 3
-3.694  | 10 | -3.4 | -3.4 | -3.4 | -3.1
-3.694  | 30 | -3.8 | -3.7 | -3.7 | -3.6
-3.694  | 50 | -3.9 | -3.6 | -3.5 | -3.5
 0      | 10 |  0.4 |  0.8 |  0.4 |  0.8
 0      | 30 | -0.3 |  0.1 |  0    |  0.09
 0      | 50 | -0.2 |  0.1 |  0.2 |  0.1
 3.694  | 10 |  3.9 |  3.9 |  3.9 |  3.9
 3.694  | 30 |  3.5 |  3.5 |  3.5 |  3.5
 3.694  | 50 |  3.7 |  3.7 |  3.7 |  3.7

3.4.1 Estimation of hint value
To estimate each hint value, the proportion of correct responses to total responses is used
as data for each question that is affected by the hint. Questions are divided into sets that
may have an associated (single) hint. These sets are not required to be mutually exclusive;
thus, the same question may be in several sets. For a given hint, the proportion of correct
responses is counted as one when an examinee gives all correct responses to the questions in
the set; otherwise the proportion is counted as zero. This proportion is then compared to the
summation over examinees of the product of the probabilities of a correct response for each
question. The simulation experiments used different numbers of examinees, different numbers
of questions in one set, and different difficulty values for each question. In the first
validation simulation, only one question in one set was simulated. The theoretical
expectation was computed for the uniform distribution and was compared to the numerically
computed value of this quantity and to the sample expectation based on the experimental runs
in the case of known abilities from 1,000 examinees generated from a uniform distribution.
Difficulty values were generated from a uniform distribution over -5 to 5, i.e., for
u = U[0,1], β = -5 + (5 - (-5))u. Discrimination values were generated from a uniform
distribution over 0 to 2, i.e., for u = U[0,1], α = (2 - 0)u. Examinee abilities were
generated from a uniform distribution over -5 to 5, i.e., for u = U[0,1],
θ = -5 + (5 - (-5))u. Responses were generated by using Monte Carlo simulation, and the
proportion of correct responses for each set was counted. Therefore, if any delta hint is
added, it is known by how much examinee ability was increased, as the increased ability is
reflected in the number of correct responses in that set.
To find a delta hint for one question, the following expected value equation can be used by
fitting the response data to the expected values [6]:

    \sum_{s=1}^{N_s} \frac{1}{1 + e^{-α[(θ_s + δ_{hint}) - β]}} = N_{correct}    (3.25)

where N_correct is the number of correct responses for that question. As one hint can
represent several questions in one set, a suitable delta hint can be found for several
questions by using the following equation relating the number of correct responses of each
examinee for all questions in that set:

    \sum_{s=1}^{N_s} \left( \frac{1}{1 + e^{-α_1[(θ_s + δ_{hint}) - β_1]}} \cdot \frac{1}{1 + e^{-α_2[(θ_s + δ_{hint}) - β_2]}} \right) = N_{correct2}    (3.26)

where N_correct2 is the number of correct responses for all questions in that set. We set the
initial guess for the delta hint value to zero; therefore, the expected counts should be
close to the observed counts. This is described in more detail in Figure 3.9.
Figure 3.9 Pseudocode for Computation of Hint Value by using MLE
Input: θ_s and α_k, β_k for each question
Output: hint value (δ_hint)
Initialize test δ_hint = 0
while (test δ_hint <= 5)
    sum examinees = 0
    for s = 1 to (# of examinees)
        product of questions = 1
        for k = 1 to (# of questions)
            z_k = α_k * ((θ_s + test δ_hint) - β_k)
            prob_k = 1/(1 + exp(-1 * z_k))
            product of questions = product of questions * prob_k
        sum examinees = sum examinees + product of questions
    contrast δ_hint = (sum examinees - # of correct responses)/# of examinees
    If (|contrast δ_hint| is the minimum so far) then δ_hint = test δ_hint
    test δ_hint += 0.1
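A sketch of the grid search of Figure 3.9, generalized to a set of questions as in Equation 3.26; this is a minimal illustration, not the thesis code.

import numpy as np

def estimate_hint_value(theta, alpha, beta, n_correct, step=0.1):
    """theta: ability per examinee; alpha, beta: parameters of the questions
    in the hint's set; n_correct: observed count of all-correct examinees."""
    best, best_err = 0.0, np.inf
    for d in np.arange(0.0, 5.0 + 1e-9, step):
        # P(all questions in the set correct) per examinee, abilities shifted by d
        p = 1.0 / (1.0 + np.exp(-alpha[None, :] *
                                ((theta[:, None] + d) - beta[None, :])))
        expected = np.sum(np.prod(p, axis=1))
        err = abs(expected - n_correct)       # match expected to observed count
        if err < best_err:
            best, best_err = d, err
    return best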
From the experimental results one can see that when using one question in a set, the expected
values of examinee ability (based on the uniform distribution case) and the expected values
from using 1,000 examinees with known ability values were very close to the observed values.
For many questions in one set, we used the summation over examinees of the product of the
probabilities of a correct response for all questions in that set, compared against the
number of examinees for whom all questions in that set were correct. For a large number of
questions in one set, the expected values remained very close to the observed values.
Table 3.9 Results of finding hint value from 1000 examinees

Beta | Expectation (uniform distribution) | Expectation (known ability) | Observed | delta/1000 (dist&Obs) | delta/1000 (known&Obs)
-4 | 868.686 | 868.725 | 872 | -0.003 | -0.003
-3 | 787.341 | 783.447 | 799 | -0.012 | -0.016
-2 | 695.232 | 685.738 | 712 | -0.017 | -0.026
-1 | 598.433 | 583.889 | 581 |  0.017 |  0.003
 0 | 500.000 | 483.203 | 484 |  0.016 | -0.001
 1 | 401.567 | 386.149 | 388 |  0.014 | -0.002
 2 | 304.768 | 293.737 | 299 |  0.006 | -0.005
 3 | 212.659 | 207.184 | 215 | -0.002 | -0.008
 4 | 131.314 | 130.114 | 128 |  0.003 |  0.002
Table 3.10 Results of finding hint value from 5000 examinees

Beta | Expectation (uniform distribution) | Expectation (known ability) | Observed | delta/5000 (dist&Obs) | delta/5000 (known&Obs)
-4 | 4343.431 | 4346.559 | 4376 | -0.007 | -0.006
-3 | 3936.704 | 3931.394 | 3947 | -0.002 | -0.003
-2 | 3476.162 | 3455.012 | 3494 | -0.004 | -0.008
-1 | 2992.163 | 2954.191 | 2953 |  0.008 |  0.000
 0 | 2500.000 | 2454.018 | 2452 |  0.010 |  0.000
 1 | 2007.837 | 1966.504 | 1960 |  0.010 |  0.001
 2 | 1523.838 | 1495.296 | 1495 |  0.006 |  0.000
 3 | 1063.296 | 1048.327 | 1048 |  0.003 |  0.000
 4 |  656.569 |  650.962 |  646 |  0.002 |  0.001
Table 3.11 Results of finding hint value with 2 questions from 5000 examinees (each examinee
gave two responses)

Beta  | Expectation (known ability) | Observed | delta (known&obs) | Variance | SD    | SD (dist)/10 | delta
-4,4  | 0.130 | 0.128 |  0.002 | 0.113 | 0.336 | 0.034 |  650.000
-3,3  | 0.208 | 0.206 |  0.002 | 0.165 | 0.406 | 0.041 | 1040.000
-2,2  | 0.292 | 0.292 |  0.000 | 0.207 | 0.455 | 0.045 | 1460.000
-1,1  | 0.362 | 0.363 | -0.001 | 0.231 | 0.481 | 0.048 | 1810.000
-4,0  | 0.484 | 0.484 |  0.000 | 0.250 | 0.500 | 0.050 | 2420.000
 0,4  | 0.123 | 0.122 |  0.001 | 0.108 | 0.328 | 0.033 |  615.000
 2,4  | 0.104 | 0.104 |  0.000 | 0.093 | 0.305 | 0.031 |  520.000
-3,-1 | 0.560 | 0.558 |  0.002 | 0.246 | 0.496 | 0.050 | 2800.000
-2,4  | 0.129 | 0.127 |  0.002 | 0.112 | 0.335 | 0.034 |  645.000
Table 3.12 Results of finding hint value with 5 questions from 5000 examinees (each examinee
gave five responses)

Beta          | Expectation (known ability) | Observed | delta (known&obs) | Variance | SD    | SD (dist)/10 | delta
-2,4,2,-4,1   | 0.096 | 0.095 |  0.001 | 0.087 | 0.295 | 0.029 |  480.000
-3,3,2,1,-4   | 0.144 | 0.143 |  0.001 | 0.123 | 0.351 | 0.035 |  720.000
1,2,3,4,0     | 0.067 | 0.068 | -0.001 | 0.063 | 0.250 | 0.025 |  335.000
-4,-3,-2,-1,0 | 0.410 | 0.414 | -0.004 | 0.242 | 0.492 | 0.049 | 2050.000
-2,-1,0,1,2   | 0.223 | 0.221 |  0.002 | 0.173 | 0.416 | 0.042 | 1115.000
-4,-3,0,3,4   | 0.082 | 0.081 |  0.001 | 0.075 | 0.274 | 0.027 |  410.000
2,4,-1,-3,-2  | 0.102 | 0.101 |  0.001 | 0.092 | 0.303 | 0.030 |  510.000
-3,-1,1,2,2   | 0.182 | 0.176 |  0.006 | 0.149 | 0.386 | 0.039 |  910.000
-2,4,0,4,-1   | 0.057 | 0.054 |  0.003 | 0.054 | 0.232 | 0.023 |  285.000
3.4.2 Objective Function

The fundamental problem addressed in this thesis has characteristics similar to the
problems faced in dual control: estimating the current state at the expense of moving the
state to its desired value vs. moving the state to its desired value at the expense of an
inaccurate estimate of the current state. In this research the goal is to accurately estimate
examinee ability so that examinee ability can be moved toward a desired ability level. In
accomplishing this, tradeoffs similar to those in the dual control problem arise: estimating
the current ability requires more testing, while moving the examinee to the desired set point
requires hints/learning, which changes the examinee ability dynamically and thereby
complicates the ability estimation process. To accomplish this, one needs to control the
learning process and account for the error from the estimation process. In the control
process, we fixed the set point toward which examinee ability is to be increased. Hints
should be chosen according to an objective function that allows for the tradeoff between
giving a question and giving a hint. In our proposed approach to the dual estimation/control
problem, the value of both giving a question and giving a hint is computed, and the resulting
choice is the one that minimizes the objective function. The proposed objective function
considers two important aspects: the difference between the desired ability goal (the ability
set point) and the current ability estimate, and the standard error of the estimated ability:

    J = a \cdot [\max(|θ_{goal} - θ̂|, h)]^2 + b \cdot (SE)^2    (3.27)

where a and b are weights given to the control and estimation problems respectively,
[max(|θ_goal - θ̂|, h)]² is the objective function contribution from the difference between
the desired ability goal and the current estimate of the examinee ability (h being a small
threshold inside the max), and (SE)² is the objective function contribution from the standard
error of the current estimate of the examinee ability.
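In code, evaluating J is a one-liner; the threshold h inside the max term is treated here as an assumed constant.

def objective(theta_goal, theta_hat, se, a, b, h=0.0):
    """Equation 3.27: weighted control error plus estimation error.
    h is the threshold inside the max term, assumed constant here."""
    return a * max(abs(theta_goal - theta_hat), h)**2 + b * se**2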
To validate this approach, we first tested this objective function by using a tree to compute
the objective function values while storing the resulting sequence of optimal question
choices and selected hints. The assumption was that the use of this objective function will
select the smallest expected value from all possible sequences. We tested a sub-tree
consisting of the first three layers of possible sequences, which shows how well the approach
performs with short sequences. For this experiment we used a total of six items: four
questions and two hints. We generated all possible three-item sequences and computed the
objective value for each sequence. We used several different a and b values for weighting the
chance of either giving a hint or giving a question. The proposed algorithm was then used to
select three "optimal" items, with the selection determined by the objective function. We
performed 1,000 Monte Carlo runs and counted the number of times a given sequence was
selected (as being optimal). We then compared using the true ability vs. the estimated
ability in the objective function.
Figure 3.10 Example Tree Sequences

[Figure: a decision tree rooted at S; the first layer branches to the candidate items
(Q1-Q4, H1, H2) and each subsequent layer branches to the items remaining, enumerating the
possible three-item sequences.]
From the results, the computed sequences from using the true ability were almost the same as
the sequences from using the estimated ability. With 1,000 runs, the results show that the
proposed objective function is able to maintain the integrity of the selection. Note that the
expected values from using the true ability value vs. the estimated ability value in the
objective function (with the same sequences) were close to each other. Each experiment in
Table 3.13 consisted of 1,000 Monte Carlo runs using 6 items (4 questions and 2 hints).
Questions 1 through 4 are labeled "1" through "4", while hint 1 is labeled "5" and hint 2 is
labeled "6".
Table 3.13 Results of finding expected value with Tree sequence

Weight         | Theta used      | Item1 | Item2 | Item3 | Count | Expected Value
a=0.3, b=0.7   | True Theta      | 4 | 3 | 5 | 501 |  0.6410
               |                 | 4 | 3 | 1 | 499 | 22.3840
               | Estimated Theta | 4 | 3 | 2 | 248 |  0.0970
               |                 | 4 | 3 | 5 | 243 |  0.9957
               |                 | 4 | 3 | 1 | 509 | 40.8020
a=0.7, b=0.3   | True Theta      | 4 | 3 | 5 | 776 |  0.8770
               |                 | 4 | 3 | 1 | 224 |  0.3930
               | Estimated Theta | 4 | 3 | 2 | 220 |  0.5490
               |                 | 4 | 3 | 5 | 270 |  0.8840
               |                 | 4 | 3 | 1 | 510 | 22.5510
a=0.1, b=0.9   | True Theta      | 4 | 3 | 2 | 222 |  0.6230
               |                 | 4 | 3 | 5 | 268 |  0.6960
               |                 | 4 | 3 | 1 | 510 | 26.9330
               | Estimated Theta | 4 | 3 | 2 | 249 |  0.7500
               |                 | 4 | 3 | 5 | 247 |  1.0510
               |                 | 4 | 3 | 1 | 504 | 50.0710
a=0.9, b=0.1   | True Theta      | 4 | 3 | 5 | 770 |  0.7920
               |                 | 4 | 3 | 1 | 230 |  0.2970
               | Estimated Theta | 4 | 3 | 2 | 228 |  0.5410
               |                 | 4 | 3 | 5 | 282 |  0.8280
               |                 | 4 | 3 | 1 | 490 | 14.0010
a=0.99, b=0.01 | True Theta      | 4 | 3 | 5 | 238 |  0.0970
               |                 | 4 | 3 | 1 | 242 |  0.2580
               |                 | 4 | 3 | 2 | 256 |  0.4820
               | Estimated Theta | 4 | 3 | 5 | 498 | 18.4360
               |                 | 4 | 3 | 1 | 246 |  0.5660
a=0.01, b=0.99 | True Theta      | 4 | 3 | 2 | 475 |  0.6330
               |                 | 4 | 3 | 1 | 525 | 29.7540
               | Estimated Theta | 4 | 3 | 2 | 509 |  0.7340
               |                 | 4 | 3 | 1 | 491 | 48.2140
Additional simulation experiments were performed using the proposed approach of selecting
either a question or a hint by minimizing the objective function. If the system selects a
question, the item selection method described earlier is performed. If the system selects a
hint, the hint model described earlier is applied. In this experiment we used Maximum
Information as the item selection method and used hint model 1 to determine whether a given
hint is understood. We ran the same procedure for 1,000 Monte Carlo simulations to find the
expected value. We compared using the true ability value and the estimated ability to compute
the expected value. We also used different weight values and compared the results. For the
experiment, 100 items were used, including hints and questions, to compute the expected value
and estimate the ability with different weight values. The 100 items were chosen from a
collection of 90 questions and 10 hints. This is described in more detail in Figure 3.11.
Figure 3.11 Pseudocode for Computation of Expected Value by using Objective Function
Input: θ̂ and α_k, β_k, a, b for each question
Output: expected value
compute θ̂_next, SE_next for the next question by using Maximum Information
z_k = α_k * (θ̂ - β_k)
prob_k = 1/(1 + exp(-1 * z_k))
SE = sqrt(1 / Σ_{k=1}^{n} α_k * α_k * prob_k * (1 - prob_k))
expected value for question = a * [max(|θ_goal - θ̂_next|, h)]² + b * (SE_next)²
expected value for hint = a * [max(|θ_goal - θ̂|, h)]² + b * (SE)²
If (expected value for question < expected value for hint)
    then select question
    else select hint
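A sketch of the resulting decision rule, where the standard error is computed from the test information of the items answered so far and objective is the helper defined above:

import numpy as np

def standard_error(theta_hat, alpha, beta):
    """SE = 1/sqrt(test information) over the items answered so far."""
    p = 1.0 / (1.0 + np.exp(-alpha * (theta_hat - beta)))
    return np.sqrt(1.0 / np.sum(alpha**2 * p * (1 - p)))

def choose_action(theta_goal, theta_hat, theta_next, se, se_next, a, b, h=0.0):
    """Compare the objective value of asking a question vs. giving a hint."""
    j_question = objective(theta_goal, theta_next, se_next, a, b, h)
    j_hint = objective(theta_goal, theta_hat, se, a, b, h)
    return "question" if j_question < j_hint else "hint"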
From the results, the system used all 10 hints to increase examinee ability from 0 to 0.7485.
The expected values from using the true ability as compared to the estimated ability were
very close, with delta values below 0.045. The estimation process performed very well: the
resulting delta values between the true abilities and the estimated abilities were very
small. We increased the number of items to 250, consisting of 230 questions and 20 hints. The
results were good but not significantly different from the results from using 100 items.
Table 3.14 Results of finding average and SD of expected value from 100 items

Weights        | True Avg | Estimate Avg | delta Avg | True SD | Estimate SD
a=0.3, b=0.7   | 0.0391 | 0.0465 | -0.0074 | 0.0368 | 0.0289
a=0.7, b=0.3   | 0.0436 | 0.0721 | -0.0285 | 0.0454 | 0.0613
a=0.1, b=0.9   | 0.0290 | 0.0338 | -0.0048 | 0.0112 | 0.0109
a=0.9, b=0.1   | 0.0424 | 0.0871 | -0.0447 | 0.0229 | 0.0825
a=0.99, b=0.01 | 0.0442 | 0.0897 | -0.0455 | 0.0231 | 0.0832
a=0.01, b=0.99 | 0.0255 | 0.0267 | -0.0012 | 0.0016 | 0.0017
Table 3.15 Results of finding average and SD of estimated theta from 100 items

Weights        | True Avg | Estimate Avg | delta Avg | True SD | Estimate SD
a=0.3, b=0.7   | 0.7485 | 0.7217 | 0.0268 | 0.0000 | 0.1583
a=0.7, b=0.3   | 0.7485 | 0.7290 | 0.0195 | 0.0000 | 0.1483
a=0.1, b=0.9   | 0.7485 | 0.6810 | 0.0675 | 0.0000 | 0.1534
a=0.9, b=0.1   | 0.7485 | 0.7270 | 0.0215 | 0.0000 | 0.1513
a=0.99, b=0.01 | 0.7485 | 0.7308 | 0.0177 | 0.0000 | 0.1507
a=0.01, b=0.99 | 0.7485 | 0.5209 | 0.2276 | 0.0000 | 0.1456
All data below are averages over 100 runs, using 230 questions and 20 hints.
Table 3.16 Results of finding average and SD of expected value from 250 items

Weights        | True Avg | Estimate Avg | delta Avg | True SD | Estimate SD
a=0.3, b=0.7   | 0.1341 | 0.1012 | 0.0329 | 0.1514 | 0.0448
a=0.7, b=0.3   | 0.3193 | 0.1962 | 0.1231 | 0.4322 | 0.0817
a=0.1, b=0.9   | 0.0516 | 0.0424 | 0.0092 | 0.0467 | 0.0132
a=0.9, b=0.1   | 0.3971 | 0.2346 | 0.1625 | 0.5109 | 0.1280
a=0.99, b=0.01 | 0.3346 | 0.2602 | 0.0744 | 0.3737 | 0.1257
a=0.01, b=0.99 | 0.0226 | 0.0194 | 0.0032 | 0.0080 | 0.0081
Table 3.17 Results of finding average and SD of estimated theta from 250 items

Weights        | True Avg | Estimate Avg | delta Avg | True SD | Estimate SD
a=0.3, b=0.7   | 1.5583 | 1.4653 | 0.0930 | 0.0000 | 0.1373
a=0.7, b=0.3   | 1.5583 | 1.4883 | 0.0700 | 0.0000 | 0.1159
a=0.1, b=0.9   | 1.5583 | 1.4783 | 0.0800 | 0.0000 | 0.1255
a=0.9, b=0.1   | 1.5583 | 1.5053 | 0.0530 | 0.0000 | 0.1359
a=0.99, b=0.01 | 1.5583 | 1.5023 | 0.0560 | 0.0000 | 0.1274
a=0.01, b=0.99 | 1.5583 | 1.3723 | 0.1860 | 0.0000 | 0.1356
All data below are averages over 100 runs, using 230 questions and 20 hints (after giving a
hint, a question is then given) and a 1000-item pool.
Table 3.18 Results of finding average and SD of expected value from 250 items

Weights        | True Avg | Estimate Avg | delta Avg | True SD | Estimate SD
a=0.3, b=0.7   | 0.2019 | 0.1463 | 0.0556 | 0.1670 | 0.0526
a=0.7, b=0.3   | 0.3932 | 0.3089 | 0.0843 | 0.3814 | 0.1180
a=0.1, b=0.9   | 0.0743 | 0.0611 | 0.0132 | 0.0564 | 0.0167
a=0.9, b=0.1   | 0.4727 | 0.3913 | 0.0814 | 0.4673 | 0.1616
a=0.99, b=0.01 | 0.5744 | 0.4241 | 0.1503 | 0.5702 | 0.1713
a=0.01, b=0.99 | 0.0242 | 0.0216 | 0.0026 | 0.0077 | 0.0022
Table 3.19 Results of finding average and SD of estimated theta from 250 items

Weights        | True Avg | Estimate Avg | delta Avg | True SD | Estimate SD
a=0.3, b=0.7   | 1.5583 | 1.4233 | 0.1350 | 0.0000 | 0.0999
a=0.7, b=0.3   | 1.5583 | 1.4013 | 0.1570 | 0.0000 | 0.0998
a=0.1, b=0.9   | 1.5583 | 1.3843 | 0.1740 | 0.0000 | 0.1060
a=0.9, b=0.1   | 1.5583 | 1.4113 | 0.1470 | 0.0000 | 0.1068
a=0.99, b=0.01 | 1.5583 | 1.4123 | 0.1460 | 0.0000 | 0.1039
a=0.01, b=0.99 | 1.5583 | 1.3393 | 0.2190 | 0.0000 | 0.0992
4 Conclusions and Future Work
A combined Item Response Theory (IRT) and Computerized Adaptive Testing (CAT)
framework is developed and studied in this thesis to estimate examinee ability and to provide for
the appropriate selection of hints to an examinee in a test/learning process. The proposed
approach has a primary objective of improving an examinee's ability to reach a set point for
learning ability. In order to do this through providing hints to the examinee, it is critical
to have a good estimate of the examinee's learning ability. The results presented in this
research came from experiments based on Monte Carlo simulation data.
IRT models were used in the estimation processes; both two-parameter and three-parameter
models were studied. The two-parameter model allowed for better parameter estimation accuracy
than the three-parameter model under maximum likelihood estimation. However, in the
estimation of examinee ability using an iterative Newton-Raphson approach, as taken in the
literature and in practice, the results were often not reliable: with this iterative
optimization approach, estimated parameter and ability values routinely fell outside the
defined ranges. This problem could be addressed by constraining the parameter value ranges;
however, this increases the complexity of the iterative optimization method while also
imposing artificial constraints that are not part of the original problem. Experimental
observations show that when the number of examinees is increased for estimating item
parameter values, the results from the estimation process improve, as expected. In the joint
maximum likelihood estimation, the results were not nearly as good as in the independent item
parameter estimation problem or the independent ability estimation problem. This is likely
related to the inherent indeterminacy in the model. Other methods can be used to estimate
examinee ability and item parameters simultaneously. Bayesian estimation is an interesting
approach in which, unlike the maximum likelihood procedure, no constraints need to be imposed
on the parameter space, since outward drifts of the estimates are naturally and effectively
controlled by the priors [2]. However, the assessment of the posterior variance of the
Bayesian estimator still needs to be done.
In the item selection experiments, both the Maximum Information (MI) and Kullback-Leibler
(KL) measures provided similar results. The Kullback-Leibler information measure provides
global information; therefore, using the Kullback-Leibler method during the early estimation
process (e.g., the first 5 to 10 questions) and then switching to the Maximum Information
procedure for the remainder of the questions might give better results for estimating
examinee ability. However, the transition from Kullback-Leibler to Maximum Information needs
to be considered carefully.
In the learning process, all three proposed hint models provided reasonable results.
However, these results might be significantly different from those found using actual
examinees in practice. These models considered only the examinee ability and the hint
difficulty; for actual examinees, the level of understanding of a hint is expected to be more
complicated, and many factors such as feedback from examinees, background knowledge, and
expert suggestions might have to be included in the model. For the estimation of hint value,
the two-parameter model was used to measure the change in the examinee ability based on the
hint that was provided to the examinee. From experimental results, this model is found to be
very robust: the final estimated hint values were very close to the true values. In this
thesis an objective function was proposed to control the estimation process and the learning
process, and the value of this objective function was tested by using a tree structure
defined in terms of decision sequences, with the expected values compared across sequences. A
selection algorithm was used to choose either a question or a hint by considering which one
gave the smaller expected value. We then compared the expected values obtained using the true
ability and the estimated ability. The results showed that the use of the objective function
was able to maintain the integrity of the selection, and the results from using estimated
abilities were close to those from using the true ability. In the learning process, we
incorporated adaptive learning in order to analyze the effectiveness of our model in
improving examinee ability. There are many methods and techniques that can be used in our
model. Some have been proven to be feasible and useful, but they have also demonstrated some
pitfalls and problems that need to be resolved. Particularly when both adaptive learning and
adaptive testing are considered together, many issues remain to be solved. This work
represents only an initial investigation into this area, and while a small number of other
studies in this area have been conducted, much more research still needs to be done. We are
also interested in improving the proposed algorithm for the learning process. However, this
should be done after the algorithm has been tested using real examinees; at that point it
would be appropriate to further improve and integrate the algorithm and our model.
Another point that should be considered involves the notion of dual control. The model
formulation used in our approach is similar in principle to the dual control problem. The
approach taken in this thesis did not search for an analytical solution to the underlying
joint estimation/control learning problem; a more direct search for a mathematical solution
of the underlying dual problem might be beneficial. The presented results should be
considered mainly as an initial step. Future research and investigation is needed into how
adaptive and dual control problem solutions might be applied to the learning problem. This
approach may offer more powerful results for our model, improving the examinee's ability
while also providing good estimation precision.
Bibliography

[1] Frank B. Baker and Seock-Ho Kim, Item Response Theory: Parameter Estimation Techniques,
    Second Edition.
[2] Ronald K. Hambleton and Hariharan Swaminathan, Item Response Theory: Principles and
    Applications.
[3] John Michael Linacre, Computer-Adaptive Testing: A Methodology Whose Time Has Come.
[4] Hua-Hua Chang and Zhiliang Ying, A Global Information Approach to Computerized Adaptive
    Testing.
[5] Babcock, B. and Weiss, D. J. (2009), Termination Criteria in Computerized Adaptive
    Tests: Variable-Length CATs Are Not Biased.
[6] Young-Jin Lee, David J. Palazzo, Rasil Warnakulasooriya, and David E. Pritchard,
    Measuring Student Learning with Item Response Theory.
[7] Guzman, E., Conejo, R., and Perez-de-la-Cruz, J. L., Improving Student Performance Using
    Self-Assessment Tests, IEEE Intelligent Systems, No. 22, 2007, pp. 46-52.
[8] Tai, D. W-S., Tsai, T-A., and Chen, F. M-C., Performance Study on Learning Chinese
    Keyboarding Skills Using the Adaptive Learning System, Global Journal of Engineering
    Education, Vol. 5, No. 2, 2001, pp. 153-161.
[9] Chih-Ming Chen, Hahn-Ming Lee, and Ya-Hui Chen, Personalized E-Learning System Using
    Item Response Theory.
[10] Own, Z., The Application of an Adaptive, Web-based Learning Environment on
    Oxidation-Reduction Reactions, International Journal of Science and Mathematics
    Education, Vol. 4, No. 1, 2006, pp. 73-96.
[11] Jong, B-S., Chan, T-Y., Wu, T-L., and Lin, T-W., Applying the Adaptive Learning
    Material Producing Strategy to Group Learning, Lecture Notes in Computer Science (LNCS),
    Vol. 3942, 2006, pp. 39-49.
[12] Brusilovsky, P. and Peylo, C., Adaptive and Intelligent Web-based Educational Systems,
    International Journal of Artificial Intelligence in Education, Vol. 13, 2003,
    pp. 156-169.