Writing posters efficiently: how to compare learning rates

Christine Lefebvre and Denis Cousineau
Université de Montréal
[email protected]
Alice and Bob both learned to write posters. At first, preparing a poster took around 40 hours for both of them. At the end of training, both Alice and Bob were performing at an average preparation time of 12 hours. Does that mean both learned in the same way? If two different methods were used to train them, does that mean both were equally efficient? Not necessarily. In fact, if we look at their learning curves, we can see that Alice learned at a much faster rate than Bob. If the aim is to find the most efficient training method, then the learning rate is the critical value.

Figure 1: Alice and Bob's learning curves (preparation time in hours as a function of the number of posters written).
What is a learning curve?

The learning curve describes the change in performance throughout training. Data accumulated during training, for example response times, are plotted as a function of trials (or blocks of trials). The simplest form of learning curve would be a plot of the raw data as a function of trials (see left side of figure 2). However, this curve is usually very noisy. It can be better to plot summary values as a function of blocks of trials. For example, means and standard deviations are often used (see right side of figure 2). Data points can be fitted with an ideal curve, a learning curve. All learning curves have two scaling parameters in common: the first one is the asymptote (a) and the second one the amplitude (b). The identity of the curve is given by a « core function » described by one parameter, the learning rate (c) (see eq. 1).

Eq. 1: raw-data functions, with raw data as a function of trial t

Learning curve: $f(t) = a + b\, g(t)$
A → power curve: $g_{PC}(t) = t^{-c}$
B → exponential curve: $g_{EX}(t) = e^{-ct}$
Eq. 2: averaged-data functions, with averaged data as a function of block n

Averaged learning curve: $f(n) = a + b\, g(n)$
A → power curve: $g_{PC}(n) = \dfrac{n_F^{-(c-1)} - n_L^{-(c-1)}}{(c-1)\, N}$
B → exponential curve: $g_{EX}(n) = \dfrac{e^{-c\, n_F} - e^{-c\, n_L}}{c\, N}$

(here $n_F$ and $n_L$ denote the first and last trials of block $n$, and $N$ the number of trials per block).
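The poster presents no code, but eq. 2 is easy to evaluate. A minimal sketch (function names are our own, and we read $n_F$, $n_L$ as the first and last trials of a block of $N$ trials):

```python
import numpy as np

def g_pc_avg(n_f, n_l, c):
    """Block-averaged power core function (eq. 2A).
    n_f, n_l: first and last trial of the block (our reading of n_F, n_L)."""
    N = n_l - n_f + 1                      # trials per block (assumed convention)
    return (n_f**-(c - 1) - n_l**-(c - 1)) / ((c - 1) * N)

def g_ex_avg(n_f, n_l, c):
    """Block-averaged exponential core function (eq. 2B)."""
    N = n_l - n_f + 1
    return (np.exp(-c * n_f) - np.exp(-c * n_l)) / (c * N)

# Example: the third block of 20 trials (trials 41 to 60), with c = 0.7
print(g_pc_avg(41, 60, 0.7))
print(g_ex_avg(41, 60, 0.7))
```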
Box 1: Fitting averages

Although learning is thought to occur on a trial-by-trial basis (Logan, 1988), learning curves are often plotted on data averaged by blocks of trials. Averaged data are used because they are less noisy than raw data. However, averaging raw data results in a different curve. For example, if raw data are best fitted with a power curve, the averaged data will no longer be. We show in figure 3 and equation 2 what the resulting core function is when raw data are averaged; fitting averaged learning functions is easy once that core function is given:
→ with raw data as a function of trial t, fit eq. 1;
→ with averaged data as a function of block N, fit eq. 2 instead.
The example shown is for a power curve, but eq. 2 shows how it can be done with an exponential curve. All methods proposed here allow the use of averaged or raw data. Therefore, it doesn't make a difference whether raw or averaged data are used when comparing curvatures (a small simulation of this point is sketched below, after figure 3).

Figure 2: Scatter plot of raw vs averaged data (response times in ms as a function of trial t, left, and averaged by blocks, right).

Figure 3: Raw and averaged data fitted with a power function. Power curve on raw values (eq. 1A): $f(t) = 94.2\, t^{-0.633} + 0.0$, $r^2 = 0.973$; averaged power curve (eq. 2A): $f(t) = 350\, t^{-0.455} + 0$.
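As announced in Box 1, the effect of averaging can be checked with a minimal simulation sketch. All numbers here (generating parameters, the 40 ms noise, the 20-trial blocks) are illustrative assumptions, not the poster's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_curve(t, a, b, c):
    # eq. 1A: f(t) = a + b * t^(-c)
    return a + b * t**(-c)

rng = np.random.default_rng(0)
t = np.arange(1.0, 401.0)
raw = power_curve(t, 250, 400, 0.7) + rng.normal(0, 40, t.size)

# Average into blocks of 20 trials and refit eq. 1 on the block means
means = raw.reshape(-1, 20).mean(axis=1)
n = np.arange(1.0, means.size + 1)
(a, b, c), _ = curve_fit(power_curve, n, means, p0=(200, 400, 1.0))
print(c)  # differs from the generating c = 0.7: averaging changed the curve
```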
Many models of automatization make predictions about the three parameters defining the learning curve. For example, Logan (1988) asserts that curvature (parameter c) should be equal for means and standard deviations. Cousineau and Larochelle's model makes the same prediction, while Rickard's (1997) extension of Logan's instance-based model predicts differing curvatures for means and standard deviations. One can also argue that strength theories and neural networks predict no change in curvature, but only in amplitude or asymptote, throughout different tasks.
Method A: Estimation and comparison

In order to test the predictions of the models cited above, one has to compare the curvatures of learning curves. The most obvious method is to estimate the parameters from the data, and then compare the c parameters for each curve. The method is described in detail in box 2.

Box 2: Estimation and comparison of the c parameter
• Minimization of the root mean square deviation or χ², achieved by a minimization algorithm.
• Statistical tests on the obtained c estimates
→ example: coefficient of correlation of the observed data as a function of the data predicted by the estimated curve (Logan, 1988).
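A minimal sketch of Box 2 in Python (the function names and starting values are our own; the poster does not prescribe an implementation): least-squares estimation of a, b and c, followed by the Logan-style correlation between observed and predicted data.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

def power_curve(t, a, b, c):
    # eq. 1A: f(t) = a + b * t^(-c)
    return a + b * t**(-c)

def fit_power(t, y):
    """Estimate a, b and c by least squares (minimizing the squared
    deviation, hence also the RMS deviation), then compute the
    Logan-style index: r^2 between observed and predicted data."""
    p0 = (y.min(), np.ptp(y), 0.5)                 # rough start values (assumed)
    (a, b, c), _ = curve_fit(power_curve, t, y, p0=p0, maxfev=10000)
    r, _ = pearsonr(y, power_curve(t, a, b, c))
    return a, b, c, r**2

# Method A: fit each learning curve separately, then compare the two c's.
```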
This method is flawed by two major biases. The first was identified by Heathcote and Mewhort (1995), who called it the « low-asymptote bias ». They noticed, in Logan's (1988) data, that the estimated asymptotes seemed unrealistically low, even though the fit of the parameters with the data was very good. They found a non-linear relation between parameters a and c. The consequence of this relation is that a change in one of the two parameters will be compensated by a change in the other without any decrement of fit. The curvature can therefore be poorly estimated, invalidating the following comparison.

The second bias is called the « extended-range bias ». Heathcote and Mewhort (1995), and then Rickard (1997), pointed out that it occurs when Logan's method is applied to means and standard deviations (SDs), resulting in a linear regression in order to determine the quality of fit. However, the data of those two curves have by definition different ranges, SDs being smaller than means. As can be seen on the right side of figure 4, this extended range improves r² when both curves are plotted together rather than when plotted separately. So the correlation between the two curves is significant despite their being different, as can be seen on the left side of figure 4.

Figure 4: Illustration of the extended-range bias. Expected vs observed SRT (ms): fitted separately, R² = 0.9531 for MN vs EST and R² = 0.6413 for SD vs EST; fitted together, R² = 0.9762 for MN & SD vs EST.
Method B: Assessment of proportionality

These biases prompted the quest for alternative methods of comparing curvatures. The second method proposed here is easy to apply. As is demonstrated in box 3, if and only if two curves have the same curvature will they be proportional, so plotting one against the other will result in a linear function. Box 4 illustrates how to apply the method.
Box 3: Proof of proportionality

If we create a scatter plot of f1 as a function of f2, we note that $f_2(t) = m f_1(t) + h$. Therefore,

$$m = \frac{\Delta f_2}{\Delta f_1} = \frac{f_2(t') - f_2(t'')}{f_1(t') - f_1(t'')} = \frac{b_2 g_2(t') + a_2 - b_2 g_2(t'') - a_2}{b_1 g_1(t') + a_1 - b_1 g_1(t'') - a_1} = \frac{b_2\left(g_2(t') - g_2(t'')\right)}{b_1\left(g_1(t') - g_1(t'')\right)}$$

The solution to this equation is a constant if:
1) there is no curvature in g1 and g2 → impossible if g1 and g2 are learning curves; or
2) their difference is a constant → impossible because the gs do not, by definition, have an asymptote; or
3) g1 and g2 are in fact the same curve → the only possible solution.
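Box 3 can be verified numerically. A small sketch, reusing the parameters of the simulated curves f1, f2 and f3 shown after Box 4: curves sharing c are exactly linear in each other, curves with different c are not.

```python
import numpy as np

t = np.arange(1.0, 101.0)
f1 = 250 + 400 * t**-0.7     # same c as f3
f2 = 250 + 400 * t**-1.2     # different c
f3 = 150 + 200 * t**-0.7

def max_dev_from_line(x, y):
    """Largest residual of y regressed linearly on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return np.abs(y - (slope * x + intercept)).max()

print(max_dev_from_line(f1, f3))  # ~0: same curvature, exactly proportional
print(max_dev_from_line(f2, f3))  # clearly > 0: curvatures differ
```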
Box 4: Applying the proportionality method
• Perform a regression between the two data sets.
• Compute the Durbin-Watson D statistic (available in most statistical packages) to test the linearity of the function.
Note that parameters are not estimated (no low-asymptote bias).
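One way to carry out Box 4 in Python, a sketch assuming the statsmodels package (the poster only says « most statistical packages »): regress one curve on the other and compute the Durbin-Watson D on the residuals, which must be kept in temporal order.

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def proportionality_test(y1, y2):
    """Method B: regress one curve on the other and test the linearity
    of the relation with the Durbin-Watson D. y1, y2 are the two curves'
    values at the same trials or blocks, in temporal order."""
    fit = sm.OLS(y2, sm.add_constant(y1)).fit()
    return fit.rsquared, durbin_watson(fit.resid)

# A systematic bend in the scatter plot yields positively autocorrelated
# residuals, i.e. D well below 2; significance comes from Durbin-Watson tables.
```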
Simulated learning curves: parameters used to generate the curves were f1: a = 250, b = 400, c = 0.7; f2: a = 250, b = 400, c = 1.2; f3: a = 150, b = 200, c = 0.7. Plotting f3 against f1 (equal c) gives r² = 1.000 and D = 1.634, conclusion: the curvatures are the same. Plotting f3 against f2 (unequal c) gives r² = .975 but D = .947, p < .05, conclusion: the curvatures differ.

There is a limit to method B: systematic deviation from linearity is very difficult to detect. Hence, the D test is not very powerful. Therefore, it is not recommended to use this method with less than 200 trials or 40 blocks of trials. Method C is not subject to that limit and can be used with small amounts of data.
Method C: Normalizing curves

The normalization method is based on the assumption that if two curves have the same curvature, they should superimpose perfectly after normalization. Box 5 illustrates how to apply method C.

Box 5: Normalization of the curves
• Estimate parameters a and b (visual inspection of the curves is recommended to detect inaccuracies in the normalizing process):
→ estimate a and b as shown for method A; or
→ assume the lowest data point to be the asymptote (vulnerable to outliers); or
→ estimate the percentage of fast errors, find the corresponding quantile and assume it to be the asymptote.
• Proceed with the following transformation: $f_N(t) = \dfrac{f(t) - a}{b}$, which isolates the core function.
• If summary values (means or SDs) are used, compute standard errors (SE):
→ if the SEs of both curves overlap for all points of the curve, then the two curves have the same curvature (but this is not a statistical test);
→ perform an ANOVA of the normalized data with type of curve and N as variables.
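A minimal sketch of Box 5's transformation. The helper below implements the « lowest data point as asymptote » option; taking the first point for the amplitude is our own simplification, not the poster's prescription:

```python
import numpy as np

def normalize(f, a, b):
    """Box 5 transformation: f_N(t) = (f(t) - a) / b isolates the core."""
    return (np.asarray(f, dtype=float) - a) / b

def guess_a_b(f):
    """Lowest-data-point option from Box 5 for a; first-point height
    above it as a rough amplitude b (our assumption)."""
    f = np.asarray(f, dtype=float)
    a = f.min()          # vulnerable to outliers, as the poster warns
    b = f[0] - a
    return a, b

# Normalize both curves, then compare them point by point (SE overlap,
# or an ANOVA with type of curve and N as factors).
```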
Advantages and limitations for each method

Method A: Estimation and comparison
  Advantages: most obvious.
  Limitations: weak, because parameter estimation is slippery.

Method B: Assessment of proportionality
  Advantages: simple to apply; no parameter estimation necessary → not subject to biases.
  Limitations: limited power → not recommended for less than 200 trials or 40 blocks of trials.

Method C: Normalizing
  Advantages: powerful with a restricted amount of data.
  Limitations: needs estimation of parameters a and b.
Discussion

Two new methods for comparing learning curves have been described. They both offer interesting alternatives to the obvious but weak parameter-estimation method. They can be used with both raw and averaged data. However, we recommend the use of averaged data; they are less noisy, and more accurate tests can be performed on them.