Writing posters efficiently: how to compare learning rates

Christine Lefebvre and Denis Cousineau
Université de Montréal
[email protected]

Alice and Bob both learned to write posters. At first, preparing a poster took around 40 hours for both of them. At the end of training, both Alice and Bob were performing at an average preparation time of 12 hours. Does that mean both learned in the same way? If two different methods were used to train them, does that mean both methods were equally efficient? Not necessarily. In fact, if we look at their learning curves, we can see that Alice learned at a much faster rate than Bob. If the aim is to find the most efficient training method, then the learning rate is the critical value.

[Figure 1: Alice and Bob's learning curves: preparation time (N hours) as a function of posters prepared (N posters). Both curves start near 40 hours and end near 12 hours, but Alice's drops much faster than Bob's.]

What is a learning curve?

The learning curve describes the change in performance throughout training. Data accumulated during training, for example response times, are plotted as a function of trials (or blocks of trials). The simplest form of learning curve is a plot of the raw data as a function of trials (see the left side of figure 2). However, this curve is usually very noisy. It is often better to plot summary values as a function of blocks of trials; means and standard deviations, for example, are often used (see the right side of figure 2). The data points can then be fitted with an ideal curve: a learning curve.

All learning curves have two scaling parameters in common: the asymptote (a) and the amplitude (b). The identity of the curve is given by a «core function», often described by one parameter, the learning rate (c) (see eq. 1). Many models of automatization make predictions about the three parameters defining the learning curve. For example, Logan (1988) asserts that curvature (parameter c) should be equal for means and standard deviations; Cousineau and Larochelle's model makes the same prediction, while Rickart's (1997) extension of Logan's instance-based model predicts differing curvatures for means and standard deviations. One can also argue that strength theories and neural networks predict no change in curvature, but only in amplitude or asymptote, across different tasks.

Eq. 1: Raw-data functions

Learning curve: $f(t) = a + b\,g(t)$
A → Power curve: $g_{PC}(t) = t^{-c}$
B → Exponential curve: $g_{EX}(t) = e^{-ct}$

[Figure 2: Raw data (left) and averaged data (right): response times (ms) as a function of trial t and of block n, respectively.]

Box 1: Fitting averages

Although learning is thought to occur on a trial-by-trial basis (Logan, 1988), learning curves are often plotted on data averaged by blocks of trials, because averaged data are less noisy than raw data. However, averaging raw data results in a different curve: if the raw data are best fitted with a power curve, the averaged data will no longer be. Figure 3 and equation 2 give the resulting core function for averaged blocks; the example shown is for a power curve, but eq. 2 also covers the exponential case.

→ With raw data as a function of trial t, fit eq. 1.
→ With averaged data as a function of block n, fit eq. 2 instead.

Eq. 2: Averaged-data functions (for a block n spanning trials $n_F$ to $n_L$, with $N$ trials per block)

Averaged learning curve: $f(n) = a + b\,g(n)$
A → Power curve: $g_{PC}(n) = \dfrac{n_F^{-(c-1)} - n_L^{-(c-1)}}{(c-1)\,N}$
B → Exponential curve: $g_{EX}(n) = \dfrac{e^{-c\,n_F} - e^{-c\,n_L}}{c\,N}$

[Figure 3: Raw and averaged data fitted with a power function. Power curve (eq. 1A): $f(t) = 94.2\,t^{-0.633} + 0.02$, r = 0.973; averaged power curve (eq. 2A): $f(t) = 350\,t^{-0.455} + 0$.]

All methods proposed here allow the use of either raw or averaged data (see box 1): it makes no difference which is used when comparing curvatures. Fitting averaged learning functions is easily achieved by a minimization algorithm, as the sketch below illustrates.
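As a concrete illustration of the fitting step, here is a minimal Python sketch, assuming simulated data: it fits eq. 1A by minimizing the root mean square deviation, the criterion named in box 2 below. The function names, starting values, and noise level are our own assumptions, not material from the poster.

```python
# Sketch: fit the power learning curve (eq. 1A) by minimizing the RMSD.
# Simulated data and all names are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def power_raw(t, a, b, c):
    # Eq. 1A: f(t) = a + b * t**(-c), raw data as a function of trial t
    return a + b * t ** (-c)

def power_avg(n_F, n_L, a, b, c):
    # Eq. 2A: averaged power curve for a block spanning trials n_F..n_L
    # (assumes c != 1, otherwise the denominator vanishes)
    N = n_L - n_F + 1
    return a + b * (n_F ** (-(c - 1)) - n_L ** (-(c - 1))) / ((c - 1) * N)

def rmsd(params, t, y):
    # Root mean square deviation between the data and a candidate curve
    a, b, c = params
    return np.sqrt(np.mean((y - power_raw(t, a, b, c)) ** 2))

# Simulated raw response times (ms), true parameters a=250, b=400, c=0.7
rng = np.random.default_rng(0)
t = np.arange(1, 361, dtype=float)
y = power_raw(t, 250, 400, 0.7) + rng.normal(0, 20, t.size)

fit = minimize(rmsd, x0=[200.0, 300.0, 0.5], args=(t, y), method="Nelder-Mead")
a_hat, b_hat, c_hat = fit.x
print(f"a = {a_hat:.1f}, b = {b_hat:.1f}, c = {c_hat:.3f}")
```

For block averages, substitute power_avg (eq. 2A) for power_raw inside the deviation function; the minimization itself is unchanged.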
Method A: Estimation and comparison

In order to test the predictions of the models cited above, one has to compare the curvatures of learning curves. The most obvious method is to estimate the parameters from the data, and then compare the c parameters of the curves. The method is described in detail in box 2.

Box 2: Estimation and comparison of the c parameter

• Fit the curve by minimizing the root mean square deviation or the χ².
• Perform statistical tests on the estimated c parameters → for example, compute the coefficient of correlation of the observed data as a function of the data predicted by the estimated curve (Logan, 1988), resulting in a linear regression that indexes the quality of the fit.

This method is flawed by two major biases. The first was identified by Heathcote and Mewhort (1995), who called it the «low asymptote bias»: they noticed, in Logan's (1988) data, that the estimated asymptotes seemed unrealistically low even though the fit of the data was very good. They found a non-linear relation between the parameters a and c; the consequence of this relation is that a change in one of the two parameters is compensated by a change in the other without any decrement of fit. The curvature can therefore be poorly estimated, which invalidates the subsequent comparison.

The second bias is called the «extended-range bias». It occurs when Logan's method is applied to means and standard deviations (SDs). Heathcote and Mewhort (1995), and then Rickart (1997), pointed out that the data of those two curves have, by definition, different ranges, SDs being smaller than means. As can be seen on the right side of figure 4, this extended range improves r² when both curves are plotted together rather than separately, so the correlation between the two curves is significant despite the curves being different, as the left side of figure 4 shows.

[Figure 4: Illustration of the extended-range bias. Left: simulated learning curves, means (MN) and standard deviations (SD) of SRT (ms), as a function of block n. Right: observed vs. expected SRT (ms); SD vs. EST: R² = .64; MN vs. EST: R² = .95; MN and SD plotted together vs. EST: R² = .98.]

These biases prompted the quest for alternative methods of comparing curvatures.

Method B: Assessment of proportionality

The second method proposed here is easy to apply. As is demonstrated in box 3, two curves are proportional if and only if they have the same curvature, so plotting one against the other results in a linear function. Box 4 illustrates how to apply the method.

Box 3: Demonstration

If we create a scatter plot of f2 as a function of f1 and the relation is linear, then $f_2(t) = m f_1(t) + h$. Therefore,

$m = \dfrac{\Delta f_2}{\Delta f_1} = \dfrac{f_2(t') - f_2(t'')}{f_1(t') - f_1(t'')} = \dfrac{b_2 g_2(t') + a_2 - b_2 g_2(t'') - a_2}{b_1 g_1(t') + a_1 - b_1 g_1(t'') - a_1} = \dfrac{b_2 \left[ g_2(t') - g_2(t'') \right]}{b_1 \left[ g_1(t') - g_1(t'') \right]}$

The solution to this equation is a constant only if:
1) there is no curvature in g1 and g2 → impossible if g1 and g2 are learning curves;
2) their difference is a constant → impossible because the gs do not, by definition, have an asymptote;
3) g1 and g2 are in fact the same curve → the only possible solution.

Box 4: Applying the proportionality method

• Perform a regression between the two data sets.
• Compute the Durbin-Watson D statistic (available in most statistical packages) to test the linearity of the function; a code sketch follows below.
• If summary values (means or SDs) are used, compute standard errors (SE) → if the SEs of both curves overlap for all points of the curve, then the two curves have the same curvature (but this is not a statistical test).

Note that no parameters are estimated, so there is no low asymptote bias.

Parameters used to generate the example curves:

  curve    a    b    c
  f1      250  400  0.7
  f2      250  400  1.2
  f3      150  200  0.7

Plotting two curves generated with the same c, f3 against f1: r² = 1.000, D = 1.634 → conclusion: the curvatures are the same. Plotting curves generated with different cs, f3 against f2: r² = .975, D = .947, p < .05 → conclusion: the curvatures differ.
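Here is a minimal Python sketch of the proportionality test, using the generating parameters from the example above. The durbin_watson function is from statsmodels; the number of blocks, the added noise, and the helper names are our own assumptions (with perfectly noiseless curves, the same-c pair is exactly linear and D becomes numerically meaningless).

```python
# Sketch of Method B (boxes 3-4): regress one curve on the other and use
# the Durbin-Watson D statistic to assess linearity. Curve parameters are
# taken from the poster's example; noise level and names are assumptions.
import numpy as np
from scipy.stats import linregress
from statsmodels.stats.stattools import durbin_watson

def power_curve(t, a, b, c):
    return a + b * t ** (-c)

rng = np.random.default_rng(0)
t = np.arange(1.0, 41.0)            # 40 blocks of trials (assumed)
def noise():
    return rng.normal(0, 5, t.size) # small measurement noise (assumed)

f1 = power_curve(t, 250, 400, 0.7) + noise()
f2 = power_curve(t, 250, 400, 1.2) + noise()
f3 = power_curve(t, 150, 200, 0.7) + noise()

def proportionality(x, y):
    # Regress y on x; D near 2 suggests a linear relation (same curvature),
    # a low D flags systematic curvature in the residuals (different c).
    fit = linregress(x, y)
    resid = y - (fit.intercept + fit.slope * x)
    return fit.rvalue ** 2, durbin_watson(resid)

print("f3 vs f1:", proportionality(f1, f3))  # same c: high r2, D near 2
print("f3 vs f2:", proportionality(f2, f3))  # different c: low D
```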
Method C: Normalizing the curves

The normalization method is based on the assumption that if two curves have the same curvature, they should superimpose perfectly after normalization. Box 5 illustrates how to apply method C.

Box 5: Normalization of the curves

• Estimate the parameters a and b (visual inspection of the curves is recommended to detect inaccuracies in the normalizing process):
  → estimate a and b as shown for method A, or
  → assume the lowest data point to be the asymptote (vulnerable to outliers), or
  → estimate the percentage of fast errors, find the corresponding quantile and assume it to be the asymptote.
• Proceed with the following transformation, which isolates the core function:
  $f_N(t) = \dfrac{f(t) - a}{b} = g(t)$
• Perform an ANOVA on the normalized data, with type of curve and N as variables (see the sketch below).
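To make box 5 concrete, here is a rough Python sketch: two simulated curves with the same c but different a and b are normalized and then submitted to a curve x block ANOVA using statsmodels. The per-subject noise, the sample sizes, and the use of the true a and b in the normalization are simplifying assumptions; in practice, a and b must first be estimated as described above.

```python
# Sketch of Method C (box 5): normalize f_N(t) = (f(t) - a) / b, then test
# whether the normalized curves superimpose with a curve x block ANOVA.
# The simulation and all names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
blocks = np.arange(1.0, 11.0)  # 10 blocks of trials (assumed)

def simulate_subjects(a, b, c, n_subjects=20, sd=15.0):
    # One row per simulated subject: ideal power curve plus noise
    ideal = a + b * blocks ** (-c)
    return ideal + rng.normal(0, sd, (n_subjects, blocks.size))

def normalize(data, a_hat, b_hat):
    # Box 5 transformation: isolates the core function g(t)
    return (data - a_hat) / b_hat

# Two curves with the same c (0.7) but different scaling; the true a and b
# are used here for simplicity, although they would normally be estimated.
y1 = normalize(simulate_subjects(250, 400, 0.7), 250, 400)
y2 = normalize(simulate_subjects(150, 200, 0.7), 150, 200)

rows = [{"curve": name, "block": n, "fN": y[s, j]}
        for name, y in (("c1", y1), ("c2", y2))
        for s in range(y.shape[0])
        for j, n in enumerate(blocks)]
df = pd.DataFrame(rows)

# Equal curvature predicts no curve x block interaction
model = ols("fN ~ C(curve) * C(block)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```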
Advantages and limitations of each method

Method A: Estimation and comparison
• The most obvious method.
• Weak, because parameter estimation is slippery.

Method B: Assessment of proportionality
• Simple to apply.
• No parameter estimation necessary → not subject to the biases above.
• Limited power → not recommended for less than 200 trials or 40 blocks of trials.

Method C: Normalizing
• Powerful with a restricted amount of data.
• Needs estimation of the parameters a and b.

Discussion

Two new methods for comparing learning curves have been described. Both offer interesting alternatives to the obvious but weak parameter-estimation method, and both can be used with raw as well as averaged data. However, we recommend the use of averaged data: they are less noisy, and more accurate tests can be performed on them.

There is a limit to method B: systematic deviation from linearity is very difficult to detect, so the D test is not very powerful. It is therefore not recommended to use this method with less than 200 trials or 40 blocks of trials. Method C is not subject to that limit and can be used with small amounts of data.