
Does Size Matter? A Study on the Use of
Netbooks in K-12 Assessments
Annual Meeting of the American Educational Research
Association
New Orleans, LA
Leslie Keng
Xiaojing Jadie Kong
Bryan Bleil
April 2011
Abstract
One of the newest trends in school-based computing is the advent of netbooks, or “mini-laptops.”
Because of their price and mobility, they have generated significant interest from school districts
and campuses. The presence of netbooks and other mobile computing devices in schools and
their use for online testing will likely expand. Investigation is, therefore, warranted on the
suitability of such devices for online testing, especially with respect to screen sizes that are
substantially smaller than traditional displays. A study was conducted during the spring 2010
administration of the Texas End-of-Course (EOC) assessments to evaluate the feasibility of using
netbooks in the context of K-12 assessments. Samples of students from across four campuses in
two school districts were randomly assigned to one of two netbook screen size conditions (10.1-inch or 11.6-inch) in one of three subject areas: world geography, geometry, and English I. Each
netbook condition was compared with a matched “large screen” control condition drawn from
the statewide online testing population. The study found no statistically significant differences at
the test level between the netbook conditions and their matched large screen conditions for any
of the subject areas and little evidence of differential item functioning (DIF) due to screen size at
the item level. The study findings provide an initial evaluation of the impact of the smaller
screen sizes on test performance. Implications of these findings as well as study limitations and
direction for future research are discussed.
Correspondence concerning this article should be addressed to Leslie Keng, Pearson,
400 Center Ridge Drive, Austin, TX 78753, E-mail: [email protected].
Does Size Matter? A Study on the Use of Netbooks in K-12 Assessments
Background
The use of computers in assessment has grown tremendously in recent years. The
motivations for moving to computer-based testing (CBT) include greater flexibility in
administration, reduction in the use of paper documents in testing and its associated
administrative burdens, and the possibility of faster score reporting. In general, the movement
toward CBT in K-12 assessment programs is picking up momentum as school districts and
campuses increase their technology capabilities and students become more comfortable using the
computer for a variety of educational tasks. In Texas, for example, over two million tests have
been delivered on computers since 2006. This growth in CBT is anticipated to continue over the
next several years.
One of the newest trends in educational mobile computing is the advent of
netbooks, or “mini-laptops.” These are smaller, and generally much cheaper, versions of laptops.
Because of their price and mobility, they have generated significant interest from campuses and
districts, many of whom have already begun purchasing netbooks, especially in conjunction with
one-to-one computing initiatives. Testing personnel are also considering the use of netbooks in
their assessments. As the presence of netbooks in schools continues to expand, the question
about their suitability for CBT, especially with respect to screen sizes that are roughly half that
of traditional displays, warrants further research.
Current professional testing standards indicate that whenever paper and computer-based
assessments of the same content are both administered, the comparability across paper and
computer-based administrations needs to be studied (APA, 1986; AERA, APA, NCME, 1999,
Standard 4.10). Much CBT research to date, therefore, has been devoted to evaluating the
comparability of test scores and item properties across paper and computer modes (e.g. Bennett,
Braswell, Oranje, Sandene, Kaplan & Yan, 2008; Horkay, Bennett, Allen, Kaplan, & Yan, 2006;
Way, Davis, & Fitzpatrick, 2006; Poggio, Glasnapp, Yang, & Poggio, 2005; Yu, Livingston,
Larkin, & Bonett, 2004).
However, because of their relatively recent introduction to the market and to educational
settings, no published research to date has been conducted on the use of netbooks in assessment.
Specifically, very little research has been done in the area of screen size comparability.
Bridgeman, Lennon and Jackenthal (2001, 2003) evaluated the impact of variation in screen size
and screen resolution on SAT® test performance. They found no significant effects on math scores, but verbal scores were approximately a quarter of a standard deviation higher for students who took the test on the higher-resolution, larger screens.
The screen display conditions investigated by Bridgeman et al. included 15-inch monitors
with a 640480 resolution, 17-inch monitors with a 640480 resolution, and 17-inch monitors
with a 1024768 resolution. The newest netbooks, however, have screen sizes ranging from 10
to 12 inches; while being able to support a 1024768 resolution. Table 1 compares the
approximate screen areas for three typical display sizes found on computing devices.
Table 1: Approximate Screen Areas for Various Computing Devices

Display Size*       Approx. Screen Area    Typical Display Size For…
19-inch display     180 inch²              Desktop Computers
15-inch display     110 inch²              Laptop Computers
10-inch display     43 inch²               Netbooks

*The "display size" typically represents the length of the diagonal for the screen.
Note that the screen area of a 10-inch netbook is approximately a quarter of that on a
desktop computer with a 19-inch monitor and less than half of the area for a laptop with a 15-
inch screen. As computing technology moves towards mobile devices with smaller screens, such as tablets and handheld devices, the gulf between the largest and smallest
display sizes will continue to increase. This may have a profound effect on the “screen
experiences” for users of the various devices. Figures 1 and 2 provide a visual illustration of the
potential impact display sizes could have on computer-based testers, assuming equivalent screen
resolutions. The overall area of the test item in Figure 2 is approximately half the area of the
same item in Figure 1. As seen in Table 1, this is the approximate ratio of the screen area for
a 10-inch netbook compared to that of a 15-inch laptop.
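As a rough check on these figures, the screen area implied by a diagonal measurement can be computed from the Pythagorean relation once an aspect ratio is assumed. The sketch below illustrates the calculation; the aspect ratios are assumptions, and the exact values in Table 1 depend on the ratios chosen.

```python
import math

def screen_area(diagonal_inches, aspect=(4, 3)):
    """Approximate screen area (square inches) implied by a diagonal
    measurement, using d^2 = w^2 + h^2 for the assumed aspect ratio."""
    w_ratio, h_ratio = aspect
    scale = diagonal_inches / math.hypot(w_ratio, h_ratio)
    return (scale * w_ratio) * (scale * h_ratio)

# Rough reproduction of Table 1 (the aspect ratios are assumptions)
print(round(screen_area(19, (4, 3))))   # ~173 sq in for a 19-inch desktop monitor
print(round(screen_area(15, (4, 3))))   # ~108 sq in for a 15-inch laptop
print(round(screen_area(10, (16, 9))))  # ~43 sq in for a 10-inch netbook
```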
Figure 1: Example Science Test Item (Original Size)
Figure 2: Example Science Test Item (Half of Original Size)
The primary concern for computer-based testing in the context of K-12 assessments is
that if there are, in fact, comparability effects related to a substantial reduction in display size (such as effects due to reading fatigue on small screens or due to extremely detailed images like maps, pictures, and graphs), this impact should be identified early in order to better construct
appropriate mitigation strategies or inform assessment policies around CBT.
Study Purpose
The purpose of the current research is to evaluate whether the significant reduction in
screen size for netbooks has an impact on test performance. Specifically, the study is designed
to:
1. Determine whether student test performance differs significantly when tests are administered on 10- to 12-inch displays versus the larger (14- to 21-inch) display sizes more commonly found on the desktop and laptop computers used by most computer-based testers.
2. Identify high-level directional characteristics of the tested content for which display size
may have specific comparability impacts (such as by subject area, item type, or by
extremity of size reduction).
Method
Participants
The study was conducted on a sample of middle- and high-school students during the
spring 2010 administration of the Texas End-of-Course (EOC) assessments. School districts in
the greater Austin area were asked if they would volunteer any schools and students to test on
netbooks. Each campus volunteered by its school district was supplied with netbooks as well
as support and setup logistics during the test administration window. A total of 1,547 students
from across four campuses in two school districts participated in the study over a four-week
testing window.
Conditions
Students in the sample were randomly assigned to one of two netbook conditions: 10.1-inch netbooks with a high-definition widescreen display (1366 x 768 screen resolution) or 11.6-inch netbooks with a high-definition WLED display (1366 x 768 screen resolution), for one of three subject areas:
geometry, world geography and English I. These subject areas were chosen because it was
hypothesized that certain characteristics of the tested content may lead to performance
differences on reduced display sizes. The geometry test contains items with tables and graphs;
world geography items often include charts and maps; while the English I test incorporates
longer reading passages. Each of these test features, particularly the fine details displayed on
maps and the reading load of the passages, may be more difficult for students to engage with on
the smaller netbook screens.
It should be noted that, of the three EOC assessments included in the study, geometry and
world geography were given as operational tests during the spring 2010 administration, whereas
the English I assessment was given as a first-time stand-alone field test. This latter fact has
implications for the analysis results and is one of the limitations of the study design.
Sample Sizes
Table 2 summarizes the sample sizes analyzed for each EOC assessment in the study.
Note that the English I EOC assessment was administered over two consecutive days. The
English I writing component was administered on the first day, and the reading component on the second day. Not all students who took writing on the first day returned to take reading on the second day. As such, the writing and reading portions of the English I samples were analyzed separately. Also, because of the voluntary nature of participation, the samples in the study should be considered convenience samples rather than random samples.
Table 2: Participation Summary for the Netbook Study

                      Sample Sizes
EOC Assessment        10.1-inch    11.6-inch    Total
Geometry              238          202          440
World Geography       236          281          517
English I writing     109          98           207
English I reading     94           89           183
A “large screen” control condition was also needed for each of the EOC assessments in
the study and was drawn from the statewide population of students that took the test in the online
(i.e., computer-based) mode. It is assumed, because of the testing policies in Texas, that such
students would have taken their test on 14- to 21-inch screens with a minimum screen resolution
of 1024 x 768.
Table 3 summarizes the number of students who took the EOC assessments in the online
mode during the spring 2010 administration.
Table 3: Summary of Statewide Online Students (Spring 2010 EOC Assessments)

EOC Assessment        Number of Statewide Online Students
Geometry              81,913
World Geography       62,270
English I writing     29,888
English I reading     31,162
Because of the substantial difference in sample sizes between the control condition and
netbook conditions and the fact that the netbook conditions were samples of convenience, it was
necessary to create matched samples out of the larger control condition prior to conducting any
comparisons. The process for forming the matched samples is described next.
Matched Samples Creation
The use of matched samples to make appropriate comparisons between conditions is a
procedure commonly found in quasi-experimental studies (see, for example, Way, Davis, &
Fitzpatrick, 2006). For each EOC assessment in this study, one “large-screen” matched sample
was formed by matching to the characteristics of students in the 10.1-inch netbook condition;
another “large-screen” matched sample was created by matching to the characteristics of students
in the 11.6-inch netbook condition.
Each matched sample was created using the propensity score matching method (Dehejia & Wahba, 1998; Rosenbaum & Rubin, 1983; Rubin, 1997). The propensity score matching
method works best when the pool of students available for selecting the matched sample is substantially larger than the sample under study, because the large pool offers many candidates from which a close match can be found for each student. Such was the scenario for this
study, and a three-step process was used to generate each matched sample:
1. First, a propensity score was calculated for each student in the netbook condition and for
each student who took the test on a large screen using the following student
characteristics: grade level, gender, ethnicity, economic disadvantage status and test score
on a related Texas Assessment of Knowledge and Skills (TAKS) assessment taken in the
same year. Texas high school students are currently required to take the grade-level
TAKS assessment; whereas the EOC assessments are optional. For geometry, the TAKS
assessment used for matching was the TAKS mathematics test; for world geography and
English I, both the TAKS reading/English language arts (ELA) test and the mathematics
test were used.
2. Next, each student in the netbook condition was matched to a large screen student using
the propensity score. When multiple students matched a netbook student, the matched
student included in the study was randomly selected.
3. Finally, for students in the netbook condition for whom an exact propensity score match could
not be found, a “nearest neighbor” was randomly chosen from the group of large screen
students with similar propensity scores.
Because of the substantial difference in sample sizes between the large screen and
netbook conditions, exact propensity score matches were found in all but a handful of cases for
each of the two netbook conditions and three EOC assessments in this study. As such, very few
students in the netbook condition required the “nearest neighbor” match described in step 3.
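The following sketch illustrates the matching logic described above. It is not the study's actual implementation; the column names (e.g., grade, ethnicity, taks_score), the use of a logistic regression to estimate propensity scores, and the three-decimal rounding used to define an "exact" match are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def build_matched_sample(netbook, large_screen, covariates, seed=0):
    """Return one matched large-screen student per netbook student.

    Mirrors the three steps above: (1) estimate propensity scores from the
    student characteristics, (2) randomly pick an exact-score match for each
    netbook student, (3) otherwise fall back to the nearest neighbor.
    Covariates are assumed to be numerically coded.
    """
    rng = np.random.default_rng(seed)
    combined = pd.concat([netbook.assign(group=1), large_screen.assign(group=0)])
    model = LogisticRegression(max_iter=1000).fit(combined[covariates], combined["group"])

    netbook = netbook.assign(
        pscore=model.predict_proba(netbook[covariates])[:, 1].round(3))
    large_screen = large_screen.assign(
        pscore=model.predict_proba(large_screen[covariates])[:, 1].round(3))

    matches = []
    for score in netbook["pscore"]:
        exact = large_screen[large_screen["pscore"] == score]
        if len(exact) > 0:
            pool = exact                                    # step 2: exact match
        else:
            nearest_idx = (large_screen["pscore"] - score).abs().idxmin()
            pool = large_screen.loc[[nearest_idx]]          # step 3: nearest neighbor
        matches.append(pool.sample(1, random_state=int(rng.integers(10**9))))
    return pd.concat(matches)

# Hypothetical usage, assuming numerically coded student characteristics:
# matched = build_matched_sample(netbook_df, statewide_online_df,
#                                ["grade", "gender", "ethnicity", "econ_dis", "taks_score"])
```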
Analysis Methods
For each EOC assessment in the study, two independent-samples t-tests were conducted on
the pair of matched samples (i.e. 10.1-inch sample vs. its matched large-screen sample and 11.6-
inch sample vs. its matched large-screen sample). A type I error rate (α) of 0.05 was used for
each comparison.
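A minimal sketch of one such comparison, assuming the raw scores for a subject area are available as simple arrays, might look like the following (the function and variable names are placeholders, not the study's actual code).

```python
import numpy as np
from scipy import stats

def compare_conditions(netbook_scores, matched_scores, alpha=0.05):
    """Two-sided independent-samples t-test comparing a netbook condition
    with its matched large-screen condition at alpha = .05."""
    netbook_scores = np.asarray(netbook_scores, dtype=float)
    matched_scores = np.asarray(matched_scores, dtype=float)
    t_stat, p_value = stats.ttest_ind(netbook_scores, matched_scores)
    return {
        "mean_difference": netbook_scores.mean() - matched_scores.mean(),
        "t": t_stat,
        "p": p_value,
        "significant": p_value < alpha,
    }

# Hypothetical usage with raw-score arrays for one subject area:
# result = compare_conditions(scores_10_inch, scores_matched_large_screen)
```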
Additionally, item-level comparability analyses were performed using the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988) to determine whether items were equally appropriate for assessing the targeted construct across student subgroups (i.e., screen size conditions). The
combined netbook condition (10.1-inch and 11.6-inch) was defined as the focal group and the
matched large-screen condition was defined as the reference group. The estimated EOC
assessment raw score was used as the external matching criterion in the computation of MH
statistics. Educational Testing Service’s (ETS) three-level differential item functioning (DIF)
classification system (Zieky, 1993; Zwick & Ercikan, 1989), which takes into consideration a
combination of effect size and statistical significance, was used to identify DIF items for each
EOC assessment. Any items showing DIF were reviewed to determine if certain item
characteristics may have been differentially impacted by varying computer screen sizes.
Note that because the English I assessment was a stand-alone field test administered in multiple spiraled forms, only the linking items appeared on all test forms; the remaining items appeared on only a subset of forms. This significantly reduced the sample size for each non-linking item on the writing and reading components. As such, only the linking items for each component, which were all multiple-choice items, were included in the item-level analysis for English I.
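To illustrate the item-level analysis, the sketch below computes the MH common odds ratio and the corresponding MH D-DIF (delta) statistic for a single dichotomous item and applies the effect-size portion of the ETS A/B/C rule. The assumed data layout and the omission of the rule's statistical-significance component are simplifications for illustration, not the study's actual implementation.

```python
import numpy as np
import pandas as pd

def mh_d_dif(item_data: pd.DataFrame) -> float:
    """Mantel-Haenszel D-DIF (delta) statistic for one dichotomous item.

    `item_data` is assumed to have one row per student with columns:
      'correct' - 1 if the item was answered correctly, else 0
      'group'   - 'focal' (netbook) or 'reference' (matched large screen)
      'score'   - matching criterion (estimated EOC raw score)
    """
    num = 0.0  # sum over score strata of (ref correct * focal incorrect) / n
    den = 0.0  # sum over score strata of (ref incorrect * focal correct) / n
    for _, stratum in item_data.groupby("score"):
        ref = stratum[stratum["group"] == "reference"]["correct"]
        foc = stratum[stratum["group"] == "focal"]["correct"]
        n = len(stratum)
        num += (ref == 1).sum() * (foc == 0).sum() / n
        den += (ref == 0).sum() * (foc == 1).sum() / n
    alpha_mh = num / den              # MH common odds ratio
    return -2.35 * np.log(alpha_mh)   # ETS delta metric

def ets_category(delta: float) -> str:
    """Effect-size part of the ETS classification (A = negligible,
    B = moderate, C = large); the significance test is omitted here."""
    if abs(delta) < 1.0:
        return "A"
    return "C" if abs(delta) >= 1.5 else "B"
```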
Results
Test-Level Comparisons
For each EOC assessment and netbook condition in the study, the means, 95%
confidence intervals, and results of each pair of independent t-tests are shown in Tables 4 to 11.
Table 4: Test-Level Results for 10.1-inch Netbook Condition – Geometry

Condition                        Mean    95% Confidence Interval
10.1-inch netbook                29.4    (28.2, 30.6)
Matched large-screen             27.3    (25.9, 28.7)
Difference (10.1-inch – Large)   2.1     Not statistically significant
Table 5: Test-Level Results for 11.6-inch Netbook Condition – Geometry

Condition                        Mean    95% Confidence Interval
11.6-inch netbook                29.1    (27.7, 30.3)
Matched large-screen             27.6    (26.1, 29.0)
Difference (11.6-inch – Large)   1.5     Not statistically significant
Table 6: Test-Level Results for 10.1-inch Netbook Condition – World Geography

Condition                        Mean    95% Confidence Interval
10.1-inch netbook                32.5    (30.9, 34.1)
Matched large-screen             35.1    (33.3, 36.9)
Difference (10.1-inch – Large)   -2.6    Not statistically significant
Table 7: Test-Level Results for 11.6-inch Netbook Condition – World Geography

Condition                        Mean    95% Confidence Interval
11.6-inch netbook                38.0    (36.4, 39.5)
Matched large-screen             39.2    (37.6, 40.8)
Difference (11.6-inch – Large)   -1.2    Not statistically significant
Table 8: Test-Level Results for 10.1-inch Netbook Condition – English I Writing

Condition                        Mean    95% Confidence Interval
10.1-inch netbook                18.8    (18.0, 19.6)
Matched large-screen             18.6    (17.8, 19.4)
Difference (10.1-inch – Large)   0.2     Not statistically significant
Table 9: Test-Level Results for 11.6-inch Netbook Condition – English I Writing

Condition                        Mean    95% Confidence Interval
11.6-inch netbook                18.2    (17.4, 19.1)
Matched large-screen             19.3    (18.4, 20.1)
Difference (11.6-inch – Large)   -1.0    Not statistically significant
Table 10: Test-Level Results for 10.1-inch Netbook Condition – English I Reading

Condition                        Mean    95% Confidence Interval
10.1-inch netbook                21.6    (20.3, 22.9)
Matched large-screen             22.6    (21.5, 23.6)
Difference (10.1-inch – Large)   -0.9    Not statistically significant
Table 11: Test-Level Results for 11.6-inch Netbook Condition – English I Reading

Condition                        Mean    95% Confidence Interval
11.6-inch netbook                21.6    (20.2, 22.9)
Matched large-screen             22.8    (21.4, 24.1)
Difference (11.6-inch – Large)   -1.2    Not statistically significant
In summary, for all EOC assessments examined in this study, no statistically significant
differences were found between the netbook conditions (10.1-inch or 11.6-inch) and their
respective matched large-screen conditions. Thus, there was no evidence to show that student
performance at the test level differs for the EOC subject areas in this study when students are tested on 10.1-inch or 11.6-inch netbooks versus the larger screen sizes found on laptops and desktops.
Item-Level Comparisons
Using ETS’ DIF classification system, all 68 items on the world geography operational
test were identified as level A items (showing little or no DIF). Also, the majority of the items
(40 out of 44, or 91%) were classified as level A items for geometry. Only four items were
classified as level B items (showing moderate DIF) and one item was classified as a level C item
(showing severe DIF). The direction of the DIF classification indicated that items appeared to be
easier for students taking the test on netbooks than those taking the test on larger screens. For
English I reading, of the 8 linking items, only one item was classified as a B level item in favor
of the large screen size condition. Similarly, only one out of the seven linking items was
classified as a B level item for English I writing, in favor of the large screen size condition. In
examining the content and online presentation of each DIF item, however, no consistent pattern
could be identified to explain the differential impact of screen sizes for any of the subject areas.
Thus, in general, there was little evidence to suggest that screen size impacted student
performance at the item level for the subject areas examined in this study.
Educational Implications and Limitations
The findings of this study have both immediate operational and long-term research
implications. Operationally, the study results have helped inform the state's response to questions
about the use of netbooks for CBT. Several other statewide assessment programs have also
inquired about the study results to help inform their CBT policies. From a research perspective,
this study serves as an initial investigation into the use of computing devices with smaller screen
sizes in K-12 statewide assessments and provides the basis for future follow-up research that
may be warranted.
A few limitations should be noted about this study. First, even though random assignment
was conducted at the campuses to form each netbook condition and a rigorous matching
methodology was used to create the matched large-screen conditions, the study was still
conducted on a convenience sample drawn from four campuses in the greater Austin area on
three EOC assessments. The study was structured this way in order to minimize the impact on
the volunteering district and campuses during the EOC assessment testing window. However,
there is evidence to suggest that students in this study had relatively high familiarity with the
smaller display sizes found on netbooks. This was evident through informal conversations with
teachers and students who participated in the study, as well as through the advanced technical
infrastructure observed on at least one of the campuses. As such, caution should be taken in
generalizing the study results to the entire statewide test-taking population as well as to other
subject areas that are not in the study.
Secondly, the EOC assessments were not high stakes for Texas high school students
during the spring 2010 administration. Thus, student motivation was likely lower than it will be
when the EOC assessments become high stakes, starting in spring 2012. This may have an
impact on the generalizability of the study results to the high-stakes testing environment.
Lastly, as stated earlier, spring 2010 was the initial stand-alone field test administration
of the English I EOC assessment. Twelve stand-alone field test forms were spiraled at the
student level during the administration and the set of items on each test form was different.
Consequently, the raw scores used to compute the statistics in Tables 8 to 11 and used as the
matching criterion in the MH computation were not all based on the same set of items. As such,
the results for English I need to be interpreted with particular caution. Furthermore, the English I
raw scores used in the study were based on the multiple-choice items only and therefore did not
take into account student test performance on the essay and short answer items of the writing and
reading components respectively.
Even with the limitations above, the study design and results serve as an example and
guide for further research to help better inform CBT policies with respect to netbooks or mobile
computing devices with even smaller screen sizes. Future research can include a larger and more
representative sample in terms of region, district and campus size, student proficiency level,
ethnic composition and other key characteristics. More elementary- and middle school-grade
assessments could also be used to further determine whether any impact from using smaller-sized screens is more noticeable at younger grades than at older grades. Greater control over the assignment of students
to the smaller screen conditions and assessment subject areas would help the study’s
experimental design and improve the generalizability of the results. Finally, additional system
configuration data (such as monitor size, screen resolution, operating system, and platform)
and student-level information (such as computer proficiency level and the stakes of the assessment) could be captured to provide a more comprehensive examination of the suitability of smaller-screen devices for online testing.
References
American Educational Research Association (AERA), American Psychological Association
(APA), and the National Council on Measurement in Education (NCME). (1999).
Standards for educational and psychological testing. Washington, DC: AERA.
American Psychological Association Committee on Professional Standards and Committee on
Psychological Tests and Assessments (APA) (1986). Guidelines for computer-based tests
and interpretations. Washington, DC: Author.
Bennett, R. E., Braswell, J., Oranje, A., Sandene, B., Kaplan, B., & Yan, F. (2008). Does it
matter if I take my mathematics test on computer? A second empirical study of mode
effects in NAEP. Journal of Technology, Learning, and Assessment, 6(9). Available from
http://www.jtla.org.
Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2001). Effects of screen size, screen resolution,
and display rate on computer-based test performance (ETS-RR-01-23). Princeton, NJ:
Educational Testing Service.
Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2003). Effects of screen size, screen resolution,
and display rate on computer-based test performance. Applied Measurement in
Education, 16(3), 191-205.
Dehejia, R. H., & Wahba, S. (1998). Propensity Score Matching Methods for Non-experimental
Causal Studies. Cambridge, MA: National Bureau of Economic Research.
Horkay, N., Bennett, R. E., Allen, N., Kaplan, B., & Yan, F. (2006). Does it matter if I take my
writing test on computer? An empirical study of mode effects in NAEP. Journal of
Technology, Learning, and Assessment, 5(2). Available from http://www.jtla.org.
Holland, P., & Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative evaluation of score
results from computerized and paper and pencil mathematics testing in a large scale state
assessment program. Journal of Technology, Learning, and Assessment, 3(6). Available
from http://www.jtla.org.
Rosenbaum, P.R., & Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70, 41–55.
Rubin, D. B. (1997). Estimating Causal Effects from Large Data Sets Using Propensity Scores.
Annals of Internal Medicine, 127(8S), 757–763.
Russell, M. (1999). Testing on computers: a follow-up study comparing performance on
computer and on paper. Education Policy Analysis Archives (online). Retrieved July 12,
2010 from http://epaa.asu.edu/epaa/v7n20/
Russell, M., & Haney, W. (1997). Testing Writing on Computers: Results of a Pilot Study to
Compare Student Writing Test Performance via Computer or Via Paper-and-Pencil.
Education Policy Analysis Archives, 5(3).
Way, W. D., Davis, L. L., & Fitzpatrick, S. (2006, April). Score comparability of online and
paper administrations of the Texas Assessment of Knowledge and Skills. Paper presented
at the Annual Meeting of the National Council on Measurement in Education, San
Francisco, CA. Available at
http://www.pearsonassessments.com/hai/images/tmrs/Score_Comparability_of_Online_a
nd_Paper_Administrations_of_TAKS_03_26_06_final.pdf.
Yu, L., Livingston, S. A., Larkin, K. C., & Bonett, J. (2004). Investigating differences in
examinee performance between computer-based and handwritten essays (RR-04-18).
Princeton, NJ: Educational Testing Service.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P.
Holland & H. Wainer (Eds.) Differential item functioning (pp. 337 – 348). Hillsdale, NJ:
Erlbaum.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history
assessment. Journal of Educational Measurement, 26, 44-66.