Does Size Matter? A Study on the Use of Netbooks in K-12 Assessments

Annual Meeting of the American Educational Research Association, New Orleans, LA

Leslie Keng, Xiaojing Jadie Kong, and Bryan Bleil

April 2011

Abstract

One of the newest trends in school-based computing is the advent of netbooks, or “mini-laptops.” Because of their price and mobility, they have generated significant interest from school districts and campuses. The presence of netbooks and other mobile computing devices in schools, and their use for online testing, will likely expand. Investigation is therefore warranted into the suitability of such devices for online testing, especially with respect to screen sizes that are substantially smaller than traditional displays. A study was conducted during the spring 2010 administration of the Texas End-of-Course (EOC) assessments to evaluate the feasibility of using netbooks in the context of K-12 assessments. Samples of students from four campuses in two school districts were randomly assigned to one of two netbook screen size conditions (10.1-inch or 11.6-inch) in one of three subject areas: world geography, geometry, and English I. Each netbook condition was compared with a matched “large screen” control condition drawn from the statewide online testing population. The study found no statistically significant differences at the test level between the netbook conditions and their matched large-screen conditions for any of the subject areas, and little evidence of differential item functioning (DIF) due to screen size at the item level. The findings provide an initial evaluation of the impact of smaller screen sizes on test performance. Implications of these findings, study limitations, and directions for future research are discussed.

Correspondence concerning this article should be addressed to Leslie Keng, Pearson, 400 Center Ridge Drive, Austin, TX 78753. E-mail: [email protected].

Does Size Matter? A Study on the Use of Netbooks in K-12 Assessments

Background

The use of computers in assessment has grown tremendously in recent years. The motivations for moving to computer-based testing (CBT) include greater flexibility in administration, a reduction in paper documents and their associated administrative burdens, and the possibility of faster score reporting. In general, the movement toward CBT in K-12 assessment programs is picking up momentum as school districts and campuses increase their technology capabilities and students become more comfortable using computers for a variety of educational tasks. In Texas, for example, over two million tests have been delivered on computers since 2006, and this growth in CBT is anticipated to continue over the next several years.

One of the newest trends in educational mobile computing is the advent of netbooks, or “mini-laptops,” which are smaller, and generally much cheaper, versions of laptops. Because of their price and mobility, they have generated significant interest from campuses and districts, many of which have already begun purchasing netbooks, especially in conjunction with one-to-one computing initiatives. Testing personnel are also considering the use of netbooks in their assessments. As the presence of netbooks in schools continues to expand, the question of their suitability for CBT, especially with respect to screen sizes that are roughly half those of traditional displays, warrants further research.
Current professional testing standards indicate that whenever paper and computer-based assessments of the same content are both administered, comparability across the paper and computer-based administrations needs to be studied (APA, 1986; AERA, APA, & NCME, 1999, Standard 4.10). Much CBT research to date, therefore, has been devoted to evaluating the comparability of test scores and item properties across paper and computer modes (e.g., Bennett, Braswell, Oranje, Sandene, Kaplan, & Yan, 2008; Horkay, Bennett, Allen, Kaplan, & Yan, 2006; Way, Davis, & Fitzpatrick, 2006; Poggio, Glasnapp, Yang, & Poggio, 2005; Yu, Livingston, Larkin, & Bonett, 2004). However, because netbooks were only recently introduced to the market and to educational settings, no published research to date has examined their use in assessment. More generally, very little research has been done in the area of screen size comparability. Bridgeman, Lennon, and Jackenthal (2001, 2003) evaluated the impact of variation in screen size and screen resolution on SAT® test performance. They found no significant effects on mathematics scores, but verbal scores were higher by approximately a quarter of a standard deviation for students who took the test on the higher-resolution, larger screens. The screen display conditions investigated by Bridgeman et al. included 15-inch monitors with a 640 × 480 resolution, 17-inch monitors with a 640 × 480 resolution, and 17-inch monitors with a 1024 × 768 resolution. The newest netbooks, however, have screen sizes ranging from 10 to 12 inches while still supporting a 1024 × 768 resolution. Table 1 compares the approximate screen areas for three typical display sizes found on computing devices.

Table 1: Approximate Screen Areas for Various Computing Devices

Display Size*      Approx. Screen Area    Typically Found On
19-inch display    180 square inches      Desktop computers
15-inch display    110 square inches      Laptop computers
10-inch display    43 square inches       Netbooks

* The “display size” typically represents the length of the screen’s diagonal.

Note that the screen area of a 10-inch netbook is approximately a quarter of that of a desktop computer with a 19-inch monitor and less than half of the area of a laptop with a 15-inch screen. As computing technology moves toward mobile devices with smaller screens, such as tablets and handheld devices, the gulf between the largest and smallest display sizes will continue to increase. This may have a profound effect on the “screen experiences” of users of the various devices. Figures 1 and 2 provide a visual illustration of the potential impact display sizes could have on computer-based testers, assuming equivalent screen resolutions. The overall area of the test item in Figure 2 is approximately half the area of the same item in Figure 1. As seen in Table 1, this is the approximate ratio of the screen area of a 10-inch netbook to that of a 15-inch laptop.
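Because display size is quoted as a diagonal length, the screen areas in Table 1 can be approximated from the diagonal and an assumed aspect ratio (the aspect ratios below are assumptions for illustration, as the table does not state them). The following Python sketch illustrates the computation and the roughly four-to-one area ratio between a 19-inch desktop monitor and a 10-inch netbook screen.

```python
import math

def screen_area(diagonal_inches, aspect_ratio=(4, 3)):
    """Approximate screen area (square inches) from a diagonal measurement.

    The width-to-height aspect ratio is an assumption; desktop monitors are
    often 4:3 or 16:10, while most netbooks use a 16:9 widescreen panel.
    """
    w, h = aspect_ratio
    # For a diagonal d and aspect w:h, height = d * h / sqrt(w^2 + h^2).
    height = diagonal_inches * h / math.hypot(w, h)
    width = height * w / h
    return width * height

if __name__ == "__main__":
    desktop = screen_area(19, (4, 3))     # ~173 sq. in. (Table 1 lists ~180)
    laptop = screen_area(15, (4, 3))      # ~108 sq. in. (Table 1 lists ~110)
    netbook = screen_area(10.1, (16, 9))  # ~44 sq. in. (Table 1 lists ~43)
    print(f"desktop: {desktop:.0f}, laptop: {laptop:.0f}, netbook: {netbook:.0f}")
    print(f"netbook / desktop area ratio: {netbook / desktop:.2f}")  # roughly 0.25
```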
Figure 1: Example Science Test Item (Original Size)

Figure 2: Example Science Test Item (Half of Original Size)

The primary concern for computer-based testing in the context of K-12 assessments is that if there are, in fact, comparability effects related to a substantial reduction in display size (such as effects due to reading fatigue on small screens or to extremely detailed images such as maps, pictures, and graphs), this impact should be identified early in order to construct appropriate mitigation strategies or inform assessment policies around CBT.

Study Purpose

The purpose of the current research is to evaluate whether the significant reduction in screen size for netbooks has an impact on test performance. Specifically, the study is designed to:

1. Determine whether student test performance differs significantly when tests are administered on 10- to 12-inch displays versus the larger (14- to 21-inch) display sizes more commonly found on the desktop and laptop computers used by most computer-based testers.

2. Identify high-level directional characteristics of the tested content for which display size may have specific comparability impacts (such as by subject area, item type, or extremity of size reduction).

Method

Participants

The study was conducted on a sample of middle- and high-school students during the spring 2010 administration of the Texas End-of-Course (EOC) assessments. School districts in the greater Austin area were asked whether they would volunteer any schools and students to test on netbooks. Each campus volunteered by its school district was supplied with netbooks as well as support and setup logistics during the test administration window. A total of 1,547 students from four campuses in two school districts participated in the study over a four-week testing window.

Conditions

Students in the sample were randomly assigned to one of two netbook conditions, 10.1-inch netbooks with a high-definition widescreen display (1366 × 768 screen resolution) or 11.6-inch netbooks with a high-definition WLED display (1366 × 768 screen resolution), for one of three subject areas: geometry, world geography, and English I. These subject areas were chosen because it was hypothesized that certain characteristics of the tested content might lead to performance differences on reduced display sizes. The geometry test contains items with tables and graphs; world geography items often include charts and maps; and the English I test incorporates longer reading passages. Each of these test features, particularly the fine details displayed on maps and the reading load of the passages, may be more difficult for students to engage with on the smaller netbook screens.

It should be noted that, of the three EOC assessments included in the study, geometry and world geography were given as operational tests during the spring 2010 administration, whereas the English I assessment was given as a first-time stand-alone field test. This latter fact has implications for the analysis results and is one of the limitations of the study design.

Sample Sizes

Table 2 summarizes the sample sizes analyzed for each EOC assessment in the study. Note that the English I EOC assessment was administered over two consecutive days: the writing component was administered on the first day and the reading component on the second day. Not all students who took writing on the first day returned to take reading on the second day.
As such, the writing and reading portions of the English I samples were analyzed separately. Also, because of the voluntary nature of the participants, the samples in the study should be considered convenience samples rather than random samples.

Table 2: Participation Summary for the Netbook Study

EOC Assessment       10.1-inch   11.6-inch   Total
Geometry             238         202         440
World Geography      236         281         517
English I writing    109         98          207
English I reading    94          89          183

A “large screen” control condition was also needed for each of the EOC assessments in the study and was drawn from the statewide population of students who took the test in the online (i.e., computer-based) mode. Because of the testing policies in Texas, it is assumed that such students would have taken their test on 14- to 21-inch screens with a minimum screen resolution of 1024 × 768. Table 3 summarizes the number of students who took the EOC assessments in the online mode during the spring 2010 administration.

Table 3: Summary of Statewide Online Students (Spring 2010 EOC Assessments)

EOC Assessment       Number of Statewide Online Students
Geometry             81,913
World Geography      62,270
English I writing    29,888
English I reading    31,162

Because of the substantial difference in sample sizes between the control condition and the netbook conditions, and because the netbook conditions were samples of convenience, it was necessary to create matched samples out of the larger control condition prior to conducting any comparisons. The process for forming the matched samples is described next.

Matched Samples Creation

The use of matched samples to make appropriate comparisons between conditions is a procedure commonly found in quasi-experimental studies (see, for example, Way, Davis, & Fitzpatrick, 2006). For each EOC assessment in this study, one “large-screen” matched sample was formed by matching to the characteristics of students in the 10.1-inch netbook condition; another “large-screen” matched sample was created by matching to the characteristics of students in the 11.6-inch netbook condition. Each matched sample was created using the propensity score matching method (Dehejia & Wahba, 1998; Rosenbaum & Rubin, 1983; Rubin, 1997). The propensity score matching method works best when the pool of students from which the matched sample is selected is substantially larger than the sample of students under study, because a large pool offers many candidates for finding a close match to each student in the sample under study. Such was the scenario for this study, and a three-step process was used to generate each matched sample:

1. First, a propensity score was calculated for each student in the netbook condition and for each student who took the test on a large screen, using the following student characteristics: grade level, gender, ethnicity, economic disadvantage status, and test score on a related Texas Assessment of Knowledge and Skills (TAKS) assessment taken in the same year. Texas high school students are currently required to take the grade-level TAKS assessment, whereas the EOC assessments are optional. For geometry, the TAKS assessment used for matching was the TAKS mathematics test; for world geography and English I, both the TAKS reading/English language arts (ELA) test and the mathematics test were used.
2. Next, each student in the netbook condition was matched to a large-screen student using the propensity score. When multiple students matched a netbook student, the matched student included in the study was randomly selected.

3. Finally, for students in the netbook condition for whom an exact propensity score match could not be found, a “nearest neighbor” was randomly chosen from the group of large-screen students with similar propensity scores.

Because of the substantial difference in sample sizes between the large-screen and netbook conditions, exact propensity score matches were found in all but a handful of cases for each of the two netbook conditions and three EOC assessments in this study. As such, very few students in the netbook condition required the “nearest neighbor” match described in step 3.

Analysis Methods

For each EOC assessment in the study, two independent-samples t-tests were conducted on the pairs of matched samples (i.e., the 10.1-inch sample vs. its matched large-screen sample and the 11.6-inch sample vs. its matched large-screen sample). A Type I error rate (α) of 0.05 was used for each comparison.

Additionally, item-level comparability analyses were performed using the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988) to determine whether items were equally appropriate for assessing the targeted constructs across student subgroups (i.e., screen size conditions). The combined netbook condition (10.1-inch and 11.6-inch) was defined as the focal group, and the matched large-screen condition was defined as the reference group. The estimated EOC assessment raw score was used as the external matching criterion in the computation of the MH statistics. Educational Testing Service’s (ETS) three-level differential item functioning (DIF) classification system (Zieky, 1993; Zwick & Ercikan, 1989), which takes into consideration a combination of effect size and statistical significance, was used to identify DIF items for each EOC assessment. Any items showing DIF were reviewed to determine whether certain item characteristics may have been differentially impacted by varying computer screen sizes.

Note that because the English I assessment was a stand-alone field test, only the linking items appeared on all test forms, and per-item sample sizes for the remaining items on the writing and reading components were therefore significantly reduced. As such, only the linking items for each component, which were all multiple-choice items, were included in the item-level analysis for English I.
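To make the matching and test-level comparison steps concrete, the sketch below works through a minimal example with hypothetical data. The use of Python with scikit-learn and SciPy is an assumption for illustration (the study does not report its analysis software), and details such as the full covariate set and matching without replacement are simplified; it is a sketch of the general approach, not the operational code used in the study.

```python
# Illustrative sketch only: hypothetical data and simplified logic, not the
# operational analysis code used for the Texas EOC netbook study.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical covariate (standing in for the prior TAKS score; the study also
# matched on grade level, gender, ethnicity, and economic disadvantage status).
n_netbook, n_pool = 240, 20000
x_netbook = rng.normal(52, 9, size=(n_netbook, 1))   # netbook (focal) students
x_pool = rng.normal(50, 10, size=(n_pool, 1))        # statewide online pool

# Step 1: propensity scores from a logistic regression of condition on covariates.
X = np.vstack([x_netbook, x_pool])
in_netbook = np.concatenate([np.ones(n_netbook), np.zeros(n_pool)])
propensity = LogisticRegression().fit(X, in_netbook).predict_proba(X)[:, 1]
p_netbook, p_pool = propensity[:n_netbook], propensity[n_netbook:]

# Steps 2-3: match each netbook student to the large-screen student with the
# nearest propensity score (ties and matching without replacement are ignored
# here for brevity).
matched_idx = np.abs(p_pool[None, :] - p_netbook[:, None]).argmin(axis=1)

# Test-level comparison: independent-samples t-test at alpha = .05 on raw scores.
scores_netbook = rng.binomial(44, 0.66, size=n_netbook)            # hypothetical
scores_matched = rng.binomial(44, 0.64, size=n_pool)[matched_idx]  # hypothetical
t, p = stats.ttest_ind(scores_netbook, scores_matched)
print(f"t = {t:.2f}, p = {p:.3f}, significant at .05: {p < 0.05}")
```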
Results

Test-Level Comparisons

For each EOC assessment and netbook condition in the study, the means, 95% confidence intervals, and results of each pair of independent-samples t-tests are shown in Tables 4 to 11.

Table 4: Test-Level Results for 10.1-inch Netbook Condition – Geometry

Condition                         Mean    95% Confidence Interval
10.1-inch netbook                 29.4    (28.2, 30.6)
Matched large-screen              27.3    (25.9, 28.7)
Difference (10.1-inch - Large)     2.1    Not statistically significant

Table 5: Test-Level Results for 11.6-inch Netbook Condition – Geometry

Condition                         Mean    95% Confidence Interval
11.6-inch netbook                 29.1    (27.7, 30.3)
Matched large-screen              27.6    (26.1, 29.0)
Difference (11.6-inch - Large)     1.5    Not statistically significant

Table 6: Test-Level Results for 10.1-inch Netbook Condition – World Geography

Condition                         Mean    95% Confidence Interval
10.1-inch netbook                 32.5    (30.9, 34.1)
Matched large-screen              35.1    (33.3, 36.9)
Difference (10.1-inch - Large)    -2.6    Not statistically significant

Table 7: Test-Level Results for 11.6-inch Netbook Condition – World Geography

Condition                         Mean    95% Confidence Interval
11.6-inch netbook                 38.0    (36.4, 39.5)
Matched large-screen              39.2    (37.6, 40.8)
Difference (11.6-inch - Large)    -1.2    Not statistically significant

Table 8: Test-Level Results for 10.1-inch Netbook Condition – English I Writing

Condition                         Mean    95% Confidence Interval
10.1-inch netbook                 18.8    (18.0, 19.6)
Matched large-screen              18.6    (17.8, 19.4)
Difference (10.1-inch - Large)     0.2    Not statistically significant

Table 9: Test-Level Results for 11.6-inch Netbook Condition – English I Writing

Condition                         Mean    95% Confidence Interval
11.6-inch netbook                 18.2    (17.4, 19.1)
Matched large-screen              19.3    (18.4, 20.1)
Difference (11.6-inch - Large)    -1.0    Not statistically significant

Table 10: Test-Level Results for 10.1-inch Netbook Condition – English I Reading

Condition                         Mean    95% Confidence Interval
10.1-inch netbook                 21.6    (20.3, 22.9)
Matched large-screen              22.6    (21.5, 23.6)
Difference (10.1-inch - Large)    -0.9    Not statistically significant

Table 11: Test-Level Results for 11.6-inch Netbook Condition – English I Reading

Condition                         Mean    95% Confidence Interval
11.6-inch netbook                 21.6    (20.2, 22.9)
Matched large-screen              22.8    (21.4, 24.1)
Difference (11.6-inch - Large)    -1.2    Not statistically significant

In summary, for all EOC assessments examined in this study, no statistically significant differences were found between the netbook conditions (10.1-inch or 11.6-inch) and their respective matched large-screen conditions. Thus, there was no evidence that student performance at the test level differs for the EOC subject areas in this study when students are tested on 10.1-inch or 11.6-inch netbooks rather than on the larger screen sizes of laptops and desktops.

Item-Level Comparisons

Using the ETS DIF classification system, all 68 items on the world geography operational test were identified as level A items (showing little or no DIF). The majority of the items on the geometry test (40 out of 44, or 91%) were also classified as level A items. Only four items were classified as level B items (showing moderate DIF), and one item was classified as a level C item (showing severe DIF). The direction of the DIF classification indicated that these items appeared to be easier for students taking the test on netbooks than for those taking the test on larger screens. For English I reading, only one of the 8 linking items was classified as a level B item, in favor of the large-screen condition. Similarly, only one of the seven linking items for English I writing was classified as a level B item, in favor of the large-screen condition.

In examining the content and online presentation of each DIF item, however, no consistent pattern could be identified to explain the differential impact of screen sizes for any of the subject areas. Thus, in general, there was little evidence to suggest that screen size impacted student performance at the item level for the subject areas examined in this study.
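For reference, the sketch below shows one way the MH D-DIF statistic and the effect-size portion of the ETS A/B/C classification used above might be computed. The data are hypothetical, the code is an illustrative assumption rather than the study’s operational DIF procedure, and the statistical significance component of the operational ETS rule (Zieky, 1993) is omitted for brevity.

```python
# Illustrative sketch only: hypothetical response data, not the study's
# operational DIF analysis code.
import numpy as np

def mh_d_dif(item_correct, total_score, is_focal):
    """Mantel-Haenszel D-DIF = -2.35 * ln(common odds ratio),
    stratified by the matching criterion (here, total raw score)."""
    num = den = 0.0
    for k in np.unique(total_score):
        s = total_score == k
        a = np.sum(item_correct[s] & ~is_focal[s])   # reference group correct
        b = np.sum(~item_correct[s] & ~is_focal[s])  # reference group incorrect
        c = np.sum(item_correct[s] & is_focal[s])    # focal group correct
        d = np.sum(~item_correct[s] & is_focal[s])   # focal group incorrect
        n_k = a + b + c + d
        if n_k > 0:
            num += a * d / n_k
            den += b * c / n_k
    return -2.35 * np.log(num / den)

def ets_category(d_dif):
    """Effect-size component of the ETS A/B/C rule (negligible/moderate/large)."""
    if abs(d_dif) < 1.0:
        return "A"
    if abs(d_dif) >= 1.5:
        return "C"
    return "B"

# Hypothetical example: 500 netbook (focal) and 500 large-screen (reference)
# examinees, a 44-item test, and one studied item answered correctly with the
# same probability in both groups (so the item should land in category A).
rng = np.random.default_rng(1)
is_focal = np.repeat([True, False], 500)
total_score = rng.binomial(44, 0.65, size=1000)
item_correct = rng.random(1000) < 0.70
d = mh_d_dif(item_correct, total_score, is_focal)
print(f"MH D-DIF = {d:.2f}, ETS category = {ets_category(d)}")
```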
Educational Implications and Limitations

The findings of this study have both immediate operational and longer-term research implications. Operationally, the study results have helped inform the state’s response to questions about the use of netbooks for CBT. Several other statewide assessment programs have also inquired about the study results to help inform their own CBT policies. From a research perspective, this study serves as an initial investigation into the use of computing devices with smaller screen sizes in K-12 statewide assessments and provides the basis for future follow-up research that may be warranted.

A few limitations of this study should be noted. First, even though random assignment was used at the campuses to form each netbook condition and a rigorous matching methodology was used to create the matched large-screen conditions, the study was still conducted on a convenience sample drawn from four campuses in the greater Austin area on three EOC assessments. The study was structured this way in order to minimize the impact on the volunteering districts and campuses during the EOC assessment testing window. However, there is evidence to suggest that students in this study had relatively high familiarity with the smaller display sizes found on netbooks. This was evident through informal conversations with teachers and students who participated in the study, as well as through the advanced technical infrastructure observed on at least one of the campuses. As such, caution should be taken in generalizing the study results to the entire statewide test-taking population and to subject areas not included in the study.

Second, the EOC assessments were not high stakes for Texas high school students during the spring 2010 administration. Student motivation was therefore likely lower than it will be when the EOC assessments become high stakes, starting in spring 2012. This may affect the generalizability of the study results to a high-stakes testing environment.

Lastly, as stated earlier, spring 2010 was the initial stand-alone field test administration of the English I EOC assessment. Twelve stand-alone field test forms were spiraled at the student level during the administration, and the set of items on each test form was different. Consequently, the raw scores used to compute the statistics in Tables 8 to 11 and used as the matching criterion in the MH computation were not all based on the same set of items. As such, the results for English I need to be interpreted with particular caution. Furthermore, the English I raw scores used in the study were based on the multiple-choice items only and therefore did not take into account student performance on the essay and short-answer items of the writing and reading components, respectively.

Even with the limitations above, the study design and results serve as an example and guide for further research to help better inform CBT policies with respect to netbooks and mobile computing devices with even smaller screen sizes.
Future research could include a larger and more representative sample in terms of region, district and campus size, student proficiency level, ethnic composition, and other key characteristics. Assessments at the elementary and middle school grades could also be included to determine whether the impact of using smaller screens is more noticeable at younger grades than at older grades. Greater control over the assignment of students to the smaller-screen conditions and assessment subject areas would strengthen the study’s experimental design and improve the generalizability of the results. Finally, additional system configuration data (such as monitor size, screen resolution, operating system, and platform) and student-level information (such as computer proficiency level and stakes of the assessment) could be captured to provide a more comprehensive examination of the suitability of smaller-screen devices for online testing.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: AERA.

American Psychological Association Committee on Professional Standards and Committee on Psychological Tests and Assessments (APA). (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.

Bennett, R. E., Braswell, J., Oranje, A., Sandene, B., Kaplan, B., & Yan, F. (2008). Does it matter if I take my mathematics test on computer? A second empirical study of mode effects in NAEP. Journal of Technology, Learning, and Assessment, 6(9). Available from http://www.jtla.org.

Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2001). Effects of screen size, screen resolution, and display rate on computer-based test performance (ETS-RR-01-23). Princeton, NJ: Educational Testing Service.

Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2003). Effects of screen size, screen resolution, and display rate on computer-based test performance. Applied Measurement in Education, 16(3), 191-205.

Dehejia, R. H., & Wahba, S. (1998). Propensity score matching methods for non-experimental causal studies. Cambridge, MA: National Bureau of Economic Research.

Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Holland, P., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Horkay, N., Bennett, R. E., Allen, N., Kaplan, B., & Yan, F. (2006). Does it matter if I take my writing test on computer? An empirical study of mode effects in NAEP. Journal of Technology, Learning, and Assessment, 5(2). Available from http://www.jtla.org.

Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative evaluation of score results from computerized and paper and pencil mathematics testing in a large scale state assessment program. Journal of Technology, Learning, and Assessment, 3(6). Available from http://www.jtla.org.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127(8S), 757-763.

Russell, M. (1999). Testing on computers: A follow-up study comparing performance on computer and on paper. Education Policy Analysis Archives, 7(20). Retrieved July 12, 2010, from http://epaa.asu.edu/epaa/v7n20/
Russell, M., & Haney, W. (1997). Testing writing on computers: Results of a pilot study to compare student writing test performance via computer or via paper-and-pencil. Education Policy Analysis Archives, 5(3).

Way, W. D., Davis, L. L., & Fitzpatrick, S. (2006, April). Score comparability of online and paper administrations of the Texas Assessment of Knowledge and Skills. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA. Available at http://www.pearsonassessments.com/hai/images/tmrs/Score_Comparability_of_Online_and_Paper_Administrations_of_TAKS_03_26_06_final.pdf

Yu, L., Livingston, S. A., Larkin, K. C., & Bonett, J. (2004). Investigating differences in examinee performance between computer-based and handwritten essays (RR-04-18). Princeton, NJ: Educational Testing Service.

Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337-348). Hillsdale, NJ: Lawrence Erlbaum Associates.

Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 44-66.