A META-ANALYSIS OF THE ROSSELL AND BAKER REVIEW OF BILINGUAL EDUCATION RESEARCH

Jay P. Greene
University of Texas at Austin

Abstract

In 1996, Christine Rossell and Keith Baker conducted a review of the literature on the effectiveness of bilingual education and concluded that the majority of 75 methodologically acceptable studies showed that bilingual education was not beneficial. This study re-examines their literature review to verify the Rossell and Baker list of methodologically acceptable studies. After identifying only 11 studies that actually meet the standards for being methodologically acceptable, this study aggregates the results of those studies by a technique known as meta-analysis. The conclusion of the meta-analysis is that the use of at least some native language in the instruction of limited English proficient children has moderate beneficial effects on those children relative to their being taught only in English.

During the debate over Proposition 227 in California, which sought to eliminate the use of native language in the instruction of children with limited English proficiency (LEP), competing claims were made about what the research in the area concluded. Christine Rossell, for example, argued that the review of the literature she conducted with Keith Baker suggested that children learn English best when they are taught in English (Rossell & Baker, 1996). Kenji Hakuta, on the other hand, argued that the review of the literature he conducted as part of the National Research Council report on bilingual education suggested that native language approaches are indeed beneficial for children learning English (National Research Council, 1997). Bewildered by these conflicting claims, the media and electorate in California generally paid little attention to researchers, and Proposition 227 was passed into law.
The fact that summaries of the literature on bilingual education can be used to support diametrically opposite conclusions suggests that interpretations of research findings can be ambiguous or inconsistent. One technique to reduce the ambiguity and inconsistency in reviews of research literature is meta-analysis, a systematic and statistical aggregation of research findings. While meta-analysis does not eliminate subjective factors in interpretation, it does tend to make the assumptions of interpretation more explicit and the conclusions more rigorous. The meta-analysis reported in this article consists of a re-examination of the studies reviewed by Rossell and Baker (1996). The conclusion of this meta-analysis is that the use of at least some native language in the instruction of LEP children tends to produce moderate improvements in standardized test scores taken in English.

Reviewing the Rossell and Baker Review

In their 1996 review of the literature, Rossell and Baker identified 75 studies that they determined were "methodologically acceptable."1 Studies that were determined to be methodologically acceptable had to: (a) compare students in a bilingual program to a control group of similar students; (b) statistically control for differences between the treatment and control groups, or assign students to treatment and control groups at random; (c) base results on standardized test scores in English; and (d) determine differences between the scores of treatment and control groups by applying appropriate statistical tests. These requirements for selecting methodologically acceptable studies seem like a reasonable start, but some of the items need clarification. For example, what constitutes a bilingual program and a comparable control group? What constitutes sufficient statistical control for differences between treatment and control groups?
The reanalysis of the Rossell and Baker literature review presented here adds one additional requirement and more clearly defines some of the other requirements to determine whether the 75 studies identified by Rossell and Baker are, in fact, methodologically acceptable. The additional requirement is that studies had to measure the effects of bilingual programs after at least one academic year of participation in a bilingual program. Test results after a few weeks of participation in a program should not be used to assess the effects of that program. This additional requirement only causes two studies to be excluded from the meta-analysis, one that measured outcomes after seven weeks in a bilingual program (Barclay, 1969) and another that measured outcomes after ten weeks in a bilingual program (Layden, 1972). The requirements for a study to be considered methodologically acceptable were also clarified for this reanalysis in a few ways. First, bilingual programs were defined broadly as those programs in which LEP students were taught at least some of the time in their native language. Rossell and Baker subdivided bilingual programs into various program categories, such as transitional bilingual education (TBE), English as a second language (ESL), and maintenance bilingual education (MBE). The difficulty with these subdivisions is that program labels are notoriously unreliable descriptions of the content of the approach employed. What is called TBE in some places might be called ESL in others. The descriptions of the programs in the studies were often inadequate for drawing finer distinctions. Whether native language was part of the approach, however, is much easier to detect in each study and is therefore a more reliable basis for labeling the programs. Besides, the policy-relevant question raised by Proposition 227 and most policy discussions is whether native language techniques in general are beneficial, not whether TBE is better than ESL.
That is, Proposition 227 did not call for the abolition of a particular approach to bilingual education; it called for an end to native language instruction. Whether the literature supports that policy is an important question to address. Second, for the purposes of this reanalysis of the Rossell and Baker literature review, an acceptable control group was defined as one where students were taught almost entirely in English. This way all comparisons would be between programs in which students were taught using at least some of their native language and programs in which students were taught almost entirely in English. The Rossell and Baker literature review contained a number of studies in which both treatment and control groups received varying amounts and types of native language instruction. Those studies were excluded from this reanalysis because they do not help address whether native language instruction is generally superior to English-only instruction in advancing the academic achievement of LEP students. One cannot infer from a comparison of different amounts or types of a treatment whether that treatment is better than no treatment. By analogy, very large doses of acetaminophen can be lethal while low doses can help alleviate pain. The observation that large doses are worse than low doses does not mean that no treatment is better than low dosage. Requiring studies to compare programs with at least some native language instruction to programs taught almost entirely in English causes 14 of the 75 studies in Rossell and Baker's list of methodologically acceptable studies to be excluded.2 These studies are Barik, Swain, & Nwanunobi (1977), Bruck, Lambert, & Tucker (1977), Burkheimer, Conger, Dunteman, Elliot & Mowbray (1989), Day & Shapson (1988), El Paso Independent School District (1987), Genesee & Lambert (1983), Genesee, Holobow, Lambert, & Chartrand (1989), Gersten (1985), Malherbe (1946), McConnell (1980a), Medina & Escamilla (1992), Melendez (1980), Stern (1975), and Vasquez (1990).
A third clarification of the standards is defining what constitutes a sufficient statistical control for differences in the backgrounds of the treatment and control groups. To avoid omitted variable bias it is important that all pre-treatment differences between groups be controlled statistically or by random assignment. Because background characteristics can have such a strong effect on educational outcomes, failing to control fully for background differences can easily lead to an erroneous conclusion. Unfortunately, a large number of studies listed as methodologically acceptable by Rossell and Baker failed to control for any background characteristic other than prior test scores. As Campbell and Erlebacher (1970) demonstrated many years ago, controlling only for prior test scores is often inadequate because background differences usually influence the rate of test score growth, not just the level of test scores. Controlling only for prior test scores without any other controls for background differences does not adjust for what the test score trajectory would have been in the absence of treatment. Thus, studies controlling only for prior test scores are likely to be plagued by omitted variable bias. To be included in this meta-analysis studies had to assign students at random to treatment and control groups or control statistically for prior test scores and at least one other background characteristic, such as family income, parent's education, and so on. Requiring only one control in addition to prior test score is a very lax standard. Even studies that control for prior test scores and one other background characteristic may still suffer from omitted variable bias. Nevertheless, setting the standard in this way provides a clear decisive rule for including studies without having to make more subjective judgments about whether the set of background controls fully adjusts for pre-treatment differences. 
If the study made the effort to control for prior test scores and at least some other background characteristics, it was considered sufficient. Defining sufficient control for background differences in this way caused 25 studies to be excluded from this re-analysis of Rossell and Baker's literature review: Alvarez (1975); Ames & Bicks (1978); Balasubramonian, Seelye, Elizondo, de Weffer (1973); Barik & Swain (1975); Bates (1970); Carsrud & Curtis (1980); Ciriza (1990a); Cohen (1975); Cotrell (1971); Curiel (1979); de Weffer (1972); de la Garza & Marcella (1985); Educational Operations Concepts (1991a); Lampman (1973); Legerreta (1979); Lum (1971); Maldonado (1974); Matthews (1979); Moore & Parr (1978); Pena-Hughes & Solis (1979); Prewitt Diaz (1979); Stebbins, St. Pierre, Proper, Anderson, & Carva (1977); Valladolid (1991); Yap, Enoki, & Ishitani (1988); and Zirkel (1972). Two of these studies, Valladolid (1991) and Yap et al. (1988), did not control statistically for any pre-treatment differences, not even prior test scores. The other 23 studies controlled only for prior test scores. Three other considerations caused studies listed as methodologically acceptable by Rossell and Baker to be excluded from this meta-analysis. First, there were several studies that could not be found.3 If a study was not available to be reviewed it could not be included in this reanalysis, causing the following five studies to be excluded: American Institutes for Research (1975b); Lambert & Tucker (1972); McSpadden (1979); Morgan (1971); and Ramos, Aguilar, & Sibayan (1967). Second, Rossell and Baker included in their list of 75 methodologically acceptable studies several citations for studies by the same authors of the same programs. That is, Rossell and Baker would count the release of each year's results by the same authors of the same program as if it were an independent and new study.
This is problematic because results from multiple years of a program combined into one report would be counted less heavily than the same results reported separately after each year. For the purposes of this meta-analysis, the results of the same program by the same authors were combined into one observation regardless of whether the authors released their results in one report or many. A total of fifteen studies were eliminated as independent observations as a result: Ariza (1988); Barik & Swain (1978); Cohen, Fathman, & Marino (1976); Curiel, Stenning, & Cooper-Stenning (1980); Danoff, Coles, McLaughlin, & Reynolds (1977b, 1978a, 1978b); Educational Operations Concepts (1991b); El Paso Independent School District (1990, 1992); Genesee, Lambert, & Tucker (1977); McConnell (1980b, 1980c); McSpadden (1980); and Teschner (1990). Third, three additional studies from Rossell and Baker's list of 75 were excluded from this meta-analysis because they were not evaluations of bilingual programs. One is about "direct instruction" (Becker, 1982) and makes no mention of second language learning. Another is a list of exemplary bilingual programs (Campeau, 1975), not an evaluation of the programs. Yet another is primarily about the effects of retention (being held back a grade) (Webb, 1987). Beginning with the list of 75 "methodologically acceptable" studies compiled by Rossell and Baker and applying clarified versions of the standards that Rossell and Baker had used for selecting those studies leaves us with 11 studies that actually meet those standards. Of course, there is no reason that a meta-analysis should only consider methodologically acceptable studies. It would also be appropriate to include all relevant studies or a sample of all relevant studies in a meta-analysis. This meta-analysis looks only at studies that meet certain criteria in an effort to determine the reliability of the literature review conducted by Rossell and Baker.
That literature review established certain criteria for inclusion, and this meta-analysis attempts to follow the guidelines established by Rossell and Baker. A meta-analysis of all studies of the effectiveness of bilingual education would not only have to include more than the 75 studies that Rossell and Baker describe as methodologically acceptable, but would also have to examine the more than 300 studies that Rossell and Baker say they reviewed to identify their list of 75 acceptable studies. Additionally, there are probably several hundred more studies that could be included if one wished to review a comprehensive list of studies. This meta-analysis does not do so because it is primarily an effort to determine the reliability of the Rossell and Baker review and so takes their list of acceptable studies and their criteria as the points of departure. A comprehensive meta-analysis could and should be done, but it was beyond the scope of the present study.

Conducting a Meta-Analysis on the 11 "Methodologically Acceptable" Studies

The procedures employed for aggregating the results of these 11 studies followed conventional meta-analytical strategies (Rosenthal, 1991). An effect size and a z-score were calculated for each study for all results measured in English, reading results measured in English, math results measured in English, and, where available, all tests taken in Spanish. To calculate the effect size for each study for all tests taken in English, for example, the average of all English test results reported in a study was computed. That average was adjusted for the sample size and standardized in units of standard deviation to produce a single effect size for each study, known as Hedges' g.
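The per-study effect size computation can be sketched as follows. This is a minimal illustration of the standard Hedges' g formula with its small-sample correction; the group means, standard deviations, and sample sizes below are hypothetical, not figures drawn from any study in the review:

```python
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference between treatment and control groups,
    adjusted for small samples (Hedges' g)."""
    # Pooled standard deviation across the two groups
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    d = (mean_t - mean_c) / sd_pooled      # raw standardized difference
    j = 1 - 3 / (4 * (n_t + n_c) - 9)      # small-sample correction factor
    return j * d

# Hypothetical example: treatment group scores 4 points higher on a test
# with a standard deviation of 20, so g is just under .2
g = hedges_g(mean_t=54.0, mean_c=50.0, sd_t=20.0, sd_c=20.0, n_t=40, n_c=40)
```

With equal groups of 40 the correction factor is close to 1, so g is only slightly below the uncorrected difference of .20 standard deviations.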
An effect size of 1 would mean that the average result in that study indicated that students taught at least some of the time in their native language outperformed a control group taught only in English by 1 standard deviation on some measure of academic achievement. The effect sizes were combined across the 11 studies by simply taking an average of the 11 effect sizes. A single z-score indicating the confidence we have in the effect size was calculated for each subject area for each study. That z-score for each study was calculated by converting whatever statistical measure of confidence in results was reported into a z-score and then taking the average of those z-scores within each study. In this way all of the results within each study were condensed into a single effect size and z-score for each study in each subject category. Table 1 contains the effect sizes and z-scores for each study for all tests taken in English, reading results in English, and all Spanish test results.4 This table tells us, for example, that the Bacon (1982) study had an average effect size for all results on English tests of .79 standard deviations. That is, students in a bilingual program outperformed their English-only counterparts by .79 standard deviations on average for all tests taken in English. The z-score for this effect size is 2.39, suggesting that this positive result was unlikely to have occurred by chance. The z-scores of the 11 studies were combined by computing the sum of each study's average z-score and then dividing by the square root of the number of z-scores that were combined, in this case 11 (Rosenthal, 1991, p. 85). This formula calculates a combined z-score by measuring how different the observed distribution of z-scores is from a normal distribution with a mean of 0 and a standard deviation of 1.
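The combination step can be sketched with the eleven per-study z-scores for all tests taken in English as listed in Table 1. Because the tabled z-scores are rounded to two decimals, the result matches the 2.41 reported in the text only approximately:

```python
import math

# Per-study average z-scores for all tests taken in English (Table 1)
study_z = [2.39, 2.94, -0.39, 0.83, 0.72, 1.34, 0.01, 0.08, 0.03, 0.24, -0.18]

# Combine by summing the z-scores and dividing by the square root of the
# number of z-scores combined (Rosenthal, 1991, p. 85)
combined_z = sum(study_z) / math.sqrt(len(study_z))
# combined_z is close to the 2.41 reported for all English tests
```

Under the null hypothesis of no program effect, each study's z-score is a draw from a standard normal distribution, so their sum has standard deviation equal to the square root of the number of studies; dividing by that quantity returns the combined statistic to the standard normal scale.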
That is, if there were no effect of bilingual education on test scores, we would expect that the distribution of z-scores from the analyzed studies would be normally distributed with an average of 0. By chance, some will be positive and some negative. But a combined z-score greater than 1.96 suggests that the observed distribution of z-scores was unlikely to have occurred by chance. From this statistic we can determine whether there is a significantly positive or negative pattern to the results from a number of studies.

Table 1. Summary of Results from Studies Included in Meta-Analysis

                           English        Reading        Spanish       Treatment  Control  Random
Study                      ES      Z      ES      Z      ES      Z     N          N        Assignment
Bacon et al., 1982         .79    2.39    .68    2.07    NA     NA      18         18      No
Covey, 1973                .34    2.94    .74    4.87    NA     NA      86         86      Yes
Danoff et al., 1977a      -.03    -.39   -.12   -1.50    NA     NA     955        523      No
Huzar, 1973                .18     .83    .18     .83    NA     NA      43         43      Yes
Kaufman, 1968              .20     .72    .20     .72   1.65   6.05     43         31      Yes
Plante, 1976               .52    1.34    .52    1.34   1.09   2.89     16         12      Yes
Powers, 1978               .001    .01   -.33   -1.53    NA     NA      44         43      No
Ramirez et al., 1991      -.01     .08    .12     .73    NA     NA      88        160      No
Rossell, 1990              .01     .03   -.05    -.20    NA     NA     174        173      No
Rothfarb et al., 1987      .05     .24    NA      NA     .01    .09     70         49      Yes
Skoczylas, 1972           -.05    -.18    .13     .46    .20    .68     25         25      No

ES = Average effect size measured in standard deviations (Hedges' g). N = Largest number of subjects in any analysis in the study. For Huzar, 1973 and Rossell, 1990 the number of subjects in the treatment and control groups had to be estimated by halving the total reported sample.

Combining the results from the 11 studies produces an average gain for bilingual students relative to English-only students on all tests measured in English of .18 standard deviations with a combined z-score of 2.41. (See Table 2.) Looking only at English test scores that measure reading, we observe an average benefit of having at least some native language instruction of .21 standard deviations with a combined z-score of 2.46.
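The 1.96 threshold mentioned above is simply the two-tailed .05 cutoff for a standard normal variable. The conversion from a combined z-score to a p-value can be sketched as:

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a standard-normal z-score."""
    # Phi(|z|) via the error function; p is the probability of a value at
    # least this far from 0 in either direction under the null hypothesis.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p_at_threshold = two_tailed_p(1.96)   # approximately .05, the conventional cutoff
p_english = two_tailed_p(2.41)        # approximately .016, well below .05
```

Any combined z-score larger in magnitude than 1.96 therefore yields a two-tailed p-value below the conventional .05 level.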
Both of these results meet conventional standards of statistical significance, suggesting that we can be confident that exposure to at least some native language instruction has positive effects on these English test results.

Table 2. Results from the Meta-Analysis of the Effects of Bilingual Education

                         Benefit of Bilingual Programs in
                         Standard Deviations (Hedges' g)    z-score    p-value <
All tests in English           .18                           2.41        .05
Reading (in English)           .21                           2.46        .05
Math (in English)              .12                           1.65        .10
All tests in Spanish           .74                           3.53        .01

The gain on math tests measured in English is .12 with a combined z-score of 1.65, which falls short of conventional standards of statistical significance. This means that the effects of at least some native language instruction may be positive, but we cannot be confident in that conclusion. The benefit of native language instruction on Spanish test scores is considerably larger, .74 standard deviations, and we can be highly confident of this positive result given the combined z-score of 3.53. To put the size of these effects in perspective, education researchers generally consider gains of .1 standard deviation as slight, .2 or .3 of a standard deviation as moderate, and .5 of a standard deviation as large (Hanushek, 1996; Hedges & Greenwald, 1996). For readers more familiar with normal curve equivalent (NCE) points on standardized tests, an effect of .18 standard deviations is equivalent to 3.8 NCEs. An advantage of .21 standard deviations would be equivalent to 4.4 NCEs. A gain of .12 standard deviations would be equal to 2.5 NCEs and a gain of .74 would be equal to 15.6 NCEs. The effects measured in the studies included in this meta-analysis occurred after exposure to at least some native language instruction for no less than one academic year and no more than five academic years. The average length of exposure when tested was 2.2 academic years. The gains observed here occurred, on average, within that period of time.
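The NCE conversions above follow from the definition of the normal curve equivalent scale, which has a mean of 50 and a standard deviation of 21.06; an effect expressed in standard-deviation units translates into NCE points by simple multiplication:

```python
NCE_SD = 21.06  # the NCE scale is defined with a standard deviation of 21.06

# Effect sizes reported in Table 2, converted to NCE points
for effect in (0.18, 0.21, 0.12, 0.74):
    print(f"{effect:.2f} SD = {effect * NCE_SD:.1f} NCEs")
# prints 3.8, 4.4, 2.5, and 15.6 NCEs respectively
```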
The average grade in which students were tested was 2.7.

A Meta-Analysis of the 5 Studies With Random Assignment Experimental Design

Five of the 11 studies that would be considered "methodologically acceptable" according to a reasonable interpretation of Rossell and Baker's standards were of a higher quality experimental design because they randomly assigned students to native language and English-only approaches. Random assignment is a significantly better research design for evaluating the effects of native language instruction because it greatly reduces the dangers of omitted variable bias (Campbell & Stanley, 1963). When students are not assigned at random to different programs there is always the possibility that different outcomes for students are caused by differences in their backgrounds, not the effectiveness of the programs. Given how strong the effects of differences in background are on educational outcomes, failing to control for all background differences may very well bias estimates of program effects. It is important to keep in mind that "methodologically acceptable studies" were defined here as controlling for any one background difference in addition to prior test scores, not controlling for a full set of background differences. The 11 methodologically acceptable studies, therefore, might include some whose effects are seriously distorted by omitted variable bias. If we focus on the 5 studies that avoid this bias by having the stronger research design of random assignment, we might get a more accurate estimate of the effects of bilingual education. Interestingly, the effect sizes are more strongly positive and the combined z-scores are higher when we examine only the 5 random assignment studies. For all test scores measured in English the combined effect increases to .26 standard deviations with a z-score of 2.71. (See Table 3.) The effect for reading scores measured in English almost doubles when we focus on the random assignment studies.
The average benefit of at least some native language instruction is .41 standard deviations with a combined z-score of 3.47. For math scores measured in English the effect increases slightly to .15 standard deviations, but the combined z-score drops to 1.25. And for all tests measured in Spanish the average effect size in the random assignment studies is .92 standard deviations with a combined z-score of 5.21. When we look at the higher quality research design studies we see more significantly positive benefits from native language instruction.

Table 3. Results from the Meta-Analysis of the Effects of Bilingual Education for Studies with Random Assignment to Bilingual and Control Programs

                         Benefit of Bilingual Programs in
                         Standard Deviations (Hedges' g)    z-score    p-value <
All tests in English           .26                           2.71        .01
Reading (in English)           .41                           3.47        .01
Math (in English)              .15                           1.25        .21
All tests in Spanish           .92                           5.21        .01

Comparing the Results of This Meta-Analysis to Other Reviews of the Literature

The results of a meta-analysis of the Rossell and Baker literature review clearly differ from the conclusions they draw. This difference, however, is not produced by the complications of a meta-analysis or even by the elimination of studies that fail to meet their criteria for being methodologically acceptable. Of the 38 studies that evaluate bilingual versus English-only programs in Rossell and Baker's list, 21 have an average positive estimated effect and 17 have an average negative estimated effect. Simply counting positive and negative findings, a technique known as "vote counting," is less precise than a meta-analysis because it does not consider the magnitude or confidence level of effects. In addition, once we include unacceptable studies from Rossell and Baker's list, we would also have to consider the methodologically unacceptable studies advanced by supporters of bilingual education.
Nevertheless, even when the studies from Rossell and Baker's list with inadequate background controls and short measurement periods are included, we still find that the scholarly literature favors the use of native language in instruction. Rossell and Baker report a different number of positive and negative studies for a few reasons. First, they include in their report studies that are redundant with other studies, not available, not evaluations of bilingual programs, or lacking English-only control groups. Second, they do not apply any consistent rule for classifying studies as positive or negative. For example, Ramirez et al. (1991) is classified as showing "no difference" despite having significant, positive effects for bilingual instruction in reading. Similarly, Educational Operations Concepts (1991a, 1991b) is classified as showing that bilingual education has a negative effect on reading scores despite having no statistically significant effects (and the average effect is actually positive, not negative). One of the advantages of meta-analysis is that it forces one to be consistent in summarizing research. It is clear that Rossell and Baker's review of studies is useful as a pool for a meta-analysis, but the lack of rigor and consistency in how they classify studies and summarize results prevents their conclusions from being reliable. The differences between Rossell and Baker's conclusions and the findings of this meta-analysis are largely a product of their lack of rigor and consistency, not the machinations of a complicated statistical technique. In the mid-eighties, Ann Willig (1985) conducted a meta-analysis of a literature review by Baker and de Kanter (1981). Like Rossell and Baker, Baker and de Kanter concluded that there were more negative studies than positive studies about the effects of bilingual education on English test scores.
As in this meta-analysis, Willig had difficulty locating studies from the Baker and de Kanter review, found that the interpretation of whether studies had positive or negative results was sometimes unreliable, and found that a number of studies lacked an adequate methodological design. Rather than exclude methodologically inadequate studies, as was done here, Willig adjusted the weighting of studies in her meta-analysis based on the quality of their research design. That technique has the benefit of including results from more studies but is vulnerable to concerns about the validity of the weightings. In any event, Willig found that a systematic analysis of the literature suggested positive effects for bilingual education similar to those found in this meta-analysis. The National Research Council's (1997) review of the bilingual research came to conclusions similar to this meta-analysis and the one conducted by Willig, but that literature review did not attempt to be as systematic as these meta-analyses.

Reasons for Caution

Caution needs to be exercised when interpreting the results of this meta-analysis. First, the list from which studies were selected to be included in this meta-analysis is not necessarily a representative sample of all studies on this question, nor is it necessarily representative of bilingual programs in this country. The list of studies examined here was adopted from a review of the literature conducted by vocal critics of bilingual education, raising the possibility that the sample understates the benefits of native language instruction. On the other hand, it is also possible that there is a positive bias in the types of programs that are selected for study by researchers, raising the possibility that the average bilingual program in this country is less beneficial than our results suggest. Caution is also warranted given the age of the studies included in the meta-analysis.
Eight of the eleven studies analyzed in this meta-analysis were conducted before 1983. Current bilingual programs, on average, may no longer resemble programs that were evaluated in the late 1960s and 1970s. It is possible that refinements in bilingual education techniques have improved upon the approaches of some of the older programs included in this meta-analysis, meaning that the benefits of current programs may be larger. But it is also possible that the institutionalization of bilingual education has over time drained some of the enthusiasm and vigor that may have been found in earlier programs, making current programs less effective than estimated here. And caution is warranted given the limited amount of data upon which conclusions can be drawn. Several of the studies in this meta-analysis have limited sample sizes. While these small sample sizes are already discounted by the statistical tests they employ and further discounted by adjustments in calculating Hedges' g, it is nevertheless true that confidence in these results would be greater if there were more subjects studied. In addition to the limited sample sizes within several of the studies included in this meta-analysis, we should also be cautious given the limited number of studies that are examined here. More recent studies with larger samples would certainly help increase confidence in any conclusions that might be made about the effects of native language instruction.

Conclusions

While there are reasons to be cautious about the findings of this meta-analysis, there are also some conclusions that can reasonably be made. First, it is quite clear that the findings of the literature review conducted by Rossell and Baker are simply not reliable. They include studies in their review that do not meet their own standards for inclusion. Some of the studies they include cannot be found, even by them.
Some of the studies they include in their literature review of bilingual education are not about bilingual education. Many of the studies they include compare different native language approaches to each other, making it very difficult if not impossible to make inferences about the effects of English-only approaches. Many of the studies they include fail to control for even the most obvious differences between students assigned to bilingual and English-only programs. In addition, Rossell and Baker sometimes claim that studies have negative or neutral results for bilingual education when the actual results of those studies show otherwise. If this meta-analysis proves anything, it is that the Rossell and Baker literature review should not be the basis for policy decisions about bilingual education. Given the prominence of the Rossell and Baker literature review in electoral and legal discussions of this issue, documenting the unreliability of that review is by itself an important contribution of this meta-analysis. Second, it is reasonable to conclude from this meta-analysis that the use of at least some native language in instruction for LEP students is more likely to help the average student's achievement, as measured by standardized tests in English, than the use of only English in the instruction of those LEP students. Because this meta-analysis only compares the use of at least some native language to English-only approaches, we cannot draw conclusions about whether certain native language approaches are better than others. That is, this meta-analysis does not tell us whether it is better to have a large portion of the day devoted to native language instruction or a small portion of the day. Nor can this meta-analysis tell us whether students should be exposed to instruction in their native language for many years or few years. 
While there are many things that we cannot conclude from this meta-analysis, we do know that native language instruction can be part of beneficial approaches to teaching LEP students. Therefore, efforts to eliminate the use of the native language in instruction, such as Proposition 227 in California, harm children by denying them access to beneficial approaches.

Third, it is reasonable to conclude from this meta-analysis that there is a limited amount of high-quality research on this issue that can be used to persuade skeptics. The methodological rigor necessary to persuade skeptics is generally higher than what is needed to persuade those already inclined to believe in the benefits of bilingual education. If supporters of bilingual education want stronger evidence to fend off future efforts like Proposition 227, they would be helped by new, rigorously designed studies addressing this issue. A series of closely studied, random-assignment experiments should be commissioned to compare different approaches to teaching LEP students (including English-only approaches) so that we can know with greater certainty the effects of those different approaches. Those most afraid of high-quality research are those who depend on ignorance to advance their agendas.

Annotated Bibliography

Methodologically Acceptable Studies Included In The Meta-Analysis

Bacon, H. L., Kidd, G. D., et al. (1982, February). The effectiveness of bilingual instruction with Cherokee Indian students. Journal of American Indian Education, 34-43.

Covey, D. D. (1973). An analytical study of secondary freshmen bilingual education and its effects on academic achievement and attitudes of Mexican American students. Doctoral dissertation, Arizona State University. Random assignment.

Danoff, M. N., Arias, B. M., Coles, G. J., and others. (1977a). Evaluation of the impact of ESEA Title VII Spanish/English bilingual education program. Palo Alto: American Institutes for Research.

Huzar, H.
(1973). The effects of an English-Spanish primary grade reading program on second and third grade students. Master's thesis, Rutgers University. Random assignment.

Kaufman, M. (1968). Will instruction in reading Spanish affect ability in reading English? Journal of Reading, 11, 521-527. Random assignment.

Plante, A. J. (1976). A study of effectiveness of the Connecticut "Pairing" model of bilingual/bicultural education. Hamden: Connecticut Staff Development Cooperative. Random assignment.

Powers, S. (1978). The influence of bilingual instruction on academic achievement and self-esteem of selected Mexican American junior high school students. Doctoral dissertation, University of Arizona.

Ramirez, J. D., Pasta, D. J., Yuen, S., Billings, D. K., & Ramey, D. R. (1991). Final report: Longitudinal study of structured English immersion strategy, early-exit, and late-exit transitional bilingual education programs for language-minority children. Report to the U.S. Department of Education. San Mateo: Aguirre International.

Rossell, C. H. (1990). The effectiveness of educational alternatives for limited-English-proficient children. In G. Imhoff (Ed.), Learning in two languages. New Brunswick: Transaction Publishers.

Rothfarb, S. H., Ariza, M. J., & Urrutia, R. (1987). Evaluation of the Bilingual Curriculum Content (BCC) project: A three-year study, final report. Dade County: Office of Educational Accountability.

Skoczylas, R. V. (1972). An evaluation of some cognitive and affective aspects of a Spanish bilingual education program. Doctoral dissertation, University of New Mexico.

Studies Excluded Because They Are Redundant

Ariza, M. (1988). Evaluating limited English proficient students' achievement: Does curriculum content in the home language make a difference? Paper presented at the April meetings of the American Educational Research Association, New Orleans. Redundant with Rothfarb et al., 1987.

Barik, H., & Swain, M. (1978).
Evaluation of a bilingual education program in Canada: The Elgin study through grade six. Switzerland: Commission Interuniversitaire Suisse de Linguistique Appliquée. Redundant with Barik et al., 1977.

Cohen, A. D., Fathman, A. K., & Merino, B. (1976). The Redwood City bilingual education report, 1971-1974: Spanish and English proficiency, mathematics, and language-use over time. Toronto: Ontario Institute for Studies in Education. Redundant with Cohen, 1975.

Curiel, H., Stenning, W., & Cooper-Stenning, P. (1980). Achieved reading level, self-esteem, and grades as related to length of exposure to bilingual education. Hispanic Journal of Behavioral Sciences, 2, 389-400. Redundant with Curiel, 1979.

Danoff, M. N., Coles, G. J., McLaughlin, D. H., & Reynolds, D. J. (1977b). Evaluation of the impact of ESEA Title VII Spanish/English bilingual education programs, Vol. I: Study design and interim findings. Palo Alto: American Institutes for Research. Redundant with Danoff et al., 1977a.

Danoff, M. N., Coles, G. J., McLaughlin, D. H., & Reynolds, D. J. (1978a). Evaluation of the impact of ESEA Title VII Spanish/English bilingual education programs, Vol. III: Year two impact designs. Palo Alto: American Institutes for Research.

Danoff, M. N., Coles, G. J., McLaughlin, D. H., & Reynolds, D. J. (1978b). Evaluation of the impact of ESEA Title VII Spanish/English bilingual education programs, Vol. IV: Overview of the study and findings. Palo Alto: American Institutes for Research.

Educational Operations Concepts, Inc. (1991b). An evaluation of the Title VII ESEA bilingual education program for Hmong and Cambodian students in kindergarten and first grade. St. Paul. Redundant with Educational Operations Concepts, Inc., 1991a.

El Paso Independent School District. (1990). Bilingual education evaluation: The sixth year in a longitudinal study. El Paso: Office for Research and Evaluation.

El Paso Independent School District. (1992). Bilingual education evaluation. El Paso: Office for Research and Evaluation. Redundant with El Paso, 1987.

Genesee, F., Lambert, W. E., & Tucker, G. E. (1977).
An experiment in trilingual education. Montreal: McGill University. Redundant with Genesee et al., 1983.

McConnell, B. B. (1980b). Individualized bilingual instruction, final evaluation, 1978-1979 program. Pullman. Redundant with McConnell, 1980a.

McConnell, B. B. (1980c). Individualized bilingual instruction for migrants. Paper presented at the October meeting of the International Congress for Individualized Instruction, Windsor.

McSpadden, J. R. (1980). Arcadia bilingual bicultural education program: Interim evaluation report, 1979-80. Lafayette Parish. Redundant with McSpadden, 1979.

Teschner, R. V. (1990). Adequate motivation and bilingual education. Southwest Journal of Instruction, 9, 1-42. Redundant with El Paso, 1990.

Studies Excluded Because They Are Unavailable

American Institutes for Research. (1975b). Bilingual education program (Aprendamos En Dos Idiomas), Corpus Christi. Palo Alto: Identification and Description of Exemplary Bilingual Education Programs.

Lambert, W. E., & Tucker, G. R. (1972). Bilingual education of children: The St. Lambert experience. Rowley, MA: Newbury House.

McSpadden, J. R. (1979). Arcadia bilingual bicultural education program: Interim evaluation report, 1978-79. Lafayette Parish.

Morgan, J. C. (1971). The effects of bilingual instruction on the English language arts achievement of first grade children. Doctoral dissertation, Northwestern State University of Louisiana.

Ramos, M., Aguilar, J. V., & Sibayan, B. F. (1967). The determination and implementation of language policy (Monograph Series 2). Quezon City: Philippine Center for Language Study.

Studies Excluded Because They Are Not Evaluations Of Bilingual Programs

Becker, W. C., & Gersten, R. (1982). A follow-up of Follow Through: The later effects of the Direct Instruction Model on children in fifth and sixth grades. American Educational Research Journal, 19, 75-92.

Campeau, P. L., Roberts, A. O. H., Bowers, J. E., Austin, M., & Roberts, S. J. (1975).
The identification and description of exemplary bilingual education programs. Palo Alto: American Institutes for Research.

Webb, J. A., Clerc, R. J., & Gavito, A. (1987). Houston Independent School District: Comparison of bilingual and immersion programs using structural modeling. Houston Independent School District.

Studies Excluded Because There Is Not An Appropriate Control Group

Barik, H., Swain, M., & Nwanunobi, E. A. (1977). English-French bilingual education: The Elgin study through grade five. Canadian Modern Language Review, 33, 459-475.

Bruck, M., Lambert, W. E., & Tucker, G. R. (1977). Cognitive consequences of bilingual schooling: The St. Lambert project through grade six. Linguistics, 24, 13-33.

Burkheimer, G. J., Conger, A. J., Dunteman, G. H., Elliott, B. G., & Mowbray, K. A. (1989). Effectiveness of services for language-minority limited-English-proficient students. Report to the U.S. Department of Education.

Day, E. M., & Shapson, S. M. (1988). Provincial assessment of early and late French immersion programs in British Columbia, Canada. Paper presented at the April meetings of the American Educational Research Association, New Orleans. No background controls or individual-level data reported.

El Paso Independent School District. (1987). Interim report of the five-year bilingual education pilot, 1986-1987 school year. El Paso: Office for Research and Evaluation. No background or pretest controls.

Genesee, F., & Lambert, W. E. (1983). Trilingual education for majority-language children. Child Development, 54, 105-114. No background controls.

Genesee, F., Holobow, N. E., Lambert, W. E., & Chartrand, L. (1989). Three elementary school alternatives for learning through a second language. The Modern Language Journal, 73, 250-263. No background controls.

Gersten, R. (1985). Structured immersion for language-minority students: Results of a longitudinal evaluation. Educational Evaluation and Policy Analysis, 7, 187-196. No background controls.

Malherbe, E. C.
(1946). The bilingual school. London: Longmans Green. No background or pretest controls.

McConnell, B. B. (1980a). Effectiveness of individualized bilingual instruction for migrant students. Doctoral dissertation, Washington State University.

Medina, M., & Escamilla, K. (1992). Evaluation of transitional and maintenance bilingual programs. Urban Education, 27, 263-290.

Melendez, W. A. (1980). The effect of the language of instruction on the reading achievement of limited English speakers in secondary schools. Doctoral dissertation, Loyola University of Chicago. No background controls.

Stern, C. (1975). Final report to the Compton Unified School District's Title VII Bilingual/Bicultural Project: September 1969 through June 1975. Compton: Compton City Schools.

Vasquez, M. (1990). A longitudinal study of cohort academic success and bilingual education. Doctoral dissertation, University of Rochester. No background controls.

Studies Excluded Because The Effects Are Measured After An Unreasonably Short Period

Barclay, L. (1969). The comparative efficacies of Spanish, English, and bilingual cognitive verbal instruction with Mexican American Head Start children. Doctoral dissertation, Stanford University. Positive Average Effect.

Layden, R. G. (1972). The relationship between the language of instruction and the development of self-concept, classroom climate, and achievement of Spanish-speaking Puerto Rican children. Doctoral dissertation, University of Maryland. Negative Average Effect.

Studies Excluded Because They Inadequately Control For Differences Between Bilingual And English-Only Students

Alvarez, J. (1975). Comparison of academic aspirations and achievement in bilingual versus monolingual classrooms. Doctoral dissertation, University of Texas at Austin. Negative Average Effect.

Ames, J., & Bicks, P. (1978). An evaluation of Title VII Bilingual/Bicultural Program, 1977-1978 school year, final report. Community School District 22, Brooklyn, New York.
Positive Average Effect.

Balasubramonian, K., Seelye, H., & Elizondo de Weffer, R. (1973). Do bilingual education programs inhibit English language achievement: A report on an Illinois experiment. Paper presented at the 7th Annual Convention of Teachers of English to Speakers of Other Languages, San Juan. Positive Average Effect.

Barik, H., & Swain, M. (1975). Three-year evaluation of a large-scale early grade French immersion program: The Ottawa study. Language Learning, 25, 1-30. Negative Average Effect.

Bates, E. M. B. (1970). The effects of one experimental bilingual program on verbal ability and vocabulary of first grade pupils. Doctoral dissertation, Texas Tech University. Negative Average Effect.

Carsrud, K., & Curtis, J. (1980). ESEA Title VII bilingual program: Final report. Austin: Austin Independent School District. No statistical tests reported. Positive Average Effect.

Ciriza, F. (1990a). Evaluation report of the Preschool Project for Spanish-speaking children, 1989-1990. San Diego: Planning, Research and Evaluation Division, San Diego City Schools. Positive Average Effect.

Cohen, A. D. (1975). A sociolinguistic approach to bilingual education. Rowley, MA: Newbury House Press. Negative Average Effect.

Cottrell, M. C. (1971). Bilingual education in San Juan Co., Utah: A cross-cultural emphasis. Paper presented at the April meetings of the American Educational Research Association, New York City. Negative Average Effect.

Curiel, H. (1979). A comparative study investigating achieved reading level, self-esteem, and achieved grade point average given varying participation. Doctoral dissertation, Texas A&M. Negative Average Effect.

de la Garza, J. V., & Marcella, M. (1985). Academic achievement as influenced by bilingual instruction for Spanish-dominant Mexican American children. Hispanic Journal of Behavioral Sciences, 7, 247-259. Positive Average Effect.

de Weffer, R. C. E. (1972).
Effects of first language instruction in academic and psychological development of bilingual children. Doctoral dissertation, Illinois Institute of Technology. Positive Average Effect.

Educational Operations Concepts, Inc. (1991a). An evaluation of the Title VII ESEA Bilingual Education Program for Hmong and Cambodian students in junior and senior high school. St. Paul. Positive Average Effect.

Lampman, H. P. (1973). Southeastern New Mexico bilingual program: Final report. Artesia: Artesia Public Schools. Positive Average Effect.

Legarreta, D. (1979). The effects of program models on language acquisition by Spanish-speaking children. TESOL Quarterly, 13, 521-534. Positive Average Effect.

Lum, J. B. (1971). An effectiveness study of English as a second language (ESL) and Chinese bilingual methods. Doctoral dissertation, University of California, Berkeley. Negative Average Effect.

Maldonado, J. R. (1974). The effect of the ESEA Title VII Program on the cognitive development of Mexican American students. Doctoral dissertation, University of Houston. Negative Average Effect.

Matthews, T. (1979). An investigation of the effects of background characteristics and special language services on the reading achievement and English fluency of bilingual students. Seattle: Seattle Public Schools, Department of Planning, Research and Evaluation. Negative Average Effect.

Moore, F. B., & Parr, G. D. (1978). Models of bilingual education: Comparisons of effectiveness. The Elementary School Journal, 79, 93-97. Negative Average Effect.

Pena-Hughes, E., & Solis, J. (1980). ABC's. McAllen: McAllen Independent School District. Positive Average Effect.

Prewitt Diaz, J. O. (1979). An analysis of the effects of a bicultural curriculum on monolingual Spanish ninth graders as compared with monolingual English and bilingual ninth graders with regard to language development, attitude toward school, and self-concept. Doctoral dissertation, University of Connecticut. Positive Average Effect.
Stebbins, L. B., St. Pierre, R. G., Proper, E. C., Anderson, R. B., & Cerva, T. R. (1977). Education as experimentation: A planned variation model, Vol. IV-A. An evaluation of Follow Through. Cambridge: Abt Associates. Positive Average Effect.

Valladolid, L. A. (1991). The effects of bilingual education on students' academic achievement as they progress through a bilingual program. Doctoral dissertation, United States International University. No background or pretest controls. Negative Average Effect.

Yap, K. O., Enoki, D. Y., & Ishitani, P. (1988). SLEP student achievement: Some pertinent variables and policy implications. Paper presented at the April meetings of the American Educational Research Association, New Orleans. No background or pretest controls. Negative Average Effect.

Zirkel, P. A. (1972). An evaluation of the effectiveness of selected experimental bilingual education programs in Connecticut. Doctoral dissertation, University of Connecticut. Positive Average Effect.

Other Sources

Baker, K. A., & de Kanter, A. A. (1981). Effectiveness of bilingual education: A review of the literature. Washington, DC: U.S. Department of Education, Office of Planning, Budget and Evaluation.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Campbell, D. T., & Erlebacher, A. E. (1970). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), Compensatory education: A national debate, Vol. 3: Disadvantaged child. New York: Brunner/Mazel.

Greene, J. (1998). A meta-analysis of the effectiveness of bilingual education. Tomás Rivera Policy Institute, the Public Policy Clinic of the Department of Government at the University of Texas at Austin, and the Program on Education Policy and Governance at Harvard University.
Available at http://ourworld.compuserve.com/homepages/jwcrawford/greene.htm

Hanushek, E. A. (1996). School resources and student performance. In G. Burtless (Ed.), Does money matter? (pp. 43-73). Washington, DC: Brookings.

Hedges, L. V., & Greenwald, R. (1996). Have times changed? In G. Burtless (Ed.), Does money matter? (pp. 74-92). Washington, DC: Brookings.

National Research Council. (1997). Improving schooling for language-minority children: A research agenda. Washington, DC: National Academy Press.

Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park: Sage Publications.

Rossell, C. H., & Baker, K. (1996). The educational effectiveness of bilingual education. Research in the Teaching of English, 30.

Willig, A. (1985). A meta-analysis of selected studies on the effectiveness of bilingual education. Review of Educational Research, 55.

Author's Note

This research was made possible with the support of the Tomás Rivera Policy Institute, the Harvard Program on Education Policy and Governance, and the Public Policy Clinic of the Department of Government at the University of Texas at Austin. An earlier version of this article appeared as Greene (1998). The author would like to thank Rudy de la Garza, Elsa Del Valle-Gaster, Luis Guevara, Kenji Hakuta, Christine Rossell, and Jim Yates for their helpful comments and assistance with this project.

Notes

1. The published article actually lists 72, but a mimeo of the citations provided by Christine Rossell lists 75.

2. Some of these 14 studies would have been excluded from this reanalysis for other reasons as well.

3. Christine Rossell graciously agreed to swap studies that she had that were otherwise difficult to find for studies she was missing. Yet even she did not have copies of the five studies that could not be found.

4. Math results were not reported in order to enhance the readability of the table, given that the combined effects for math are not statistically significant.