
A META-ANALYSIS OF THE ROSSELL AND BAKER REVIEW OF
BILINGUAL EDUCATION RESEARCH
Jay P. Greene
University of Texas at Austin
Abstract
In 1996, Christine Rossell and Keith Baker conducted a review of
the literature on the effectiveness of bilingual education and
concluded that the majority of 75 methodologically acceptable
studies showed that bilingual education was not beneficial. This
study re-examines their literature review to verify the Rossell and
Baker list of methodologically acceptable studies. After
identifying only 11 studies that actually meet the standards for
being methodologically acceptable, this study aggregates the
results of those studies by a technique known as meta-analysis.
The conclusion of the meta-analysis is that the use of at least
some native language in the instruction of limited English
proficient children has moderate beneficial effects on those
children relative to their being taught only in English.
During the debate over Proposition 227 in California that sought to eliminate
the use of native language in the instruction of children with limited English
proficiency (LEP), competing claims were made about what the research in
the area concluded. Christine Rossell, for example, argued that the review of
the literature she conducted with Keith Baker suggested that children learn
English best when they are taught in English (Rossell & Baker, 1996). Kenji
Hakuta, on the other hand, argued that the review of the literature he
conducted as part of the National Research Council report on bilingual
education, suggested that native language approaches are indeed beneficial for
children learning English (National Research Council, 1997). Bewildered by
these conflicting claims, the media and electorate in California generally paid
little attention to researchers and Proposition 227 was passed into law.
The fact that summaries of the literature on bilingual education can be used to
support diametrically opposite conclusions suggests that interpretations of
research findings can be ambiguous or inconsistent. One technique to reduce
the ambiguity and inconsistency in reviews of research literature is
meta-analysis, a systematic and statistical aggregation of research findings.
While meta-analysis does not eliminate subjective factors in interpretation, it
does tend to make the assumptions of interpretation more explicit and the
conclusions more rigorous. The meta-analysis reported in this article consists
of a re-examination of the studies reviewed by Rossell and Baker (1996). The
conclusion of this meta-analysis is that the use of at least some native language
in the instruction of LEP children tends to produce moderate improvements in
standardized test scores taken in English.
Reviewing the Rossell and Baker Review
In their 1996 review of the literature, Rossell and Baker identified 75 studies
that they determined were "methodologically acceptable."1 Studies that were
determined to be methodologically acceptable had to: (a) compare students in
a bilingual program to a control group of similar students; (b) statistically
control for differences between the treatment and control groups, or assignment
to treatment and control groups had to be done at random; (c) base results on
standardized test scores in English; and (d) determine differences between the
scores of treatment and control groups by applying appropriate statistical tests.
These requirements for selecting methodologically acceptable studies seem
like a reasonable start, but some of the items need clarification. For example,
what constitutes a bilingual program and a comparable control group? What
constitutes sufficient statistical control for differences between treatment and
control groups?
The reanalysis of the Rossell and Baker literature review presented here adds
one additional requirement and more clearly defines some of the other
requirements to determine whether the 75 studies identified by Rossell and
Baker are, in fact, methodologically acceptable. The additional requirement is
that studies had to measure the effects of bilingual programs after at least one
academic year of participation in a bilingual program. Test results after a few
weeks of participation in a program should not be used to assess the effects of
that program. This additional requirement only causes two studies to be
excluded from the meta-analysis, one that measured outcomes after seven
weeks in a bilingual program (Barclay, 1969) and another that measured
outcomes after ten weeks in a bilingual program (Layden, 1972).
The requirements for a study to be considered methodologically acceptable
were also clarified for this reanalysis in a few ways. First, bilingual programs
were defined broadly as those programs in which LEP students were taught at
least some of the time in their native language. Rossell and Baker subdivided
bilingual programs into various program categories, such as transitional
bilingual education (TBE), English as a second language (ESL), and
maintenance bilingual education (MBE). The difficulty with these
subdivisions is that program labels are notoriously unreliable descriptions of
the content of the approach employed. What is called TBE in some places
might be called ESL in others. The descriptions of the programs in the studies
were often inadequate for drawing finer distinctions. Focusing on whether
native language was part of the approach, however, is much easier to detect in
each study and therefore is more likely to be a more reliable label for the
programs. Besides, the policy-relevant question raised by Proposition 227 and
most policy discussions is whether native language techniques in general are
beneficial, not whether TBE is better than ESL. That is, Proposition 227 did
not call for the abolition of a particular approach to bilingual education; it
called for an end to all native language instruction. Whether the
literature supports that policy is an important question to address.
Second, for the purposes of this reanalysis of the Rossell and Baker literature
review, an acceptable control group was defined as one where students were
taught almost entirely in English. This way all comparisons would be between
programs in which students were taught using at least some of their native
language to programs in which students were taught almost entirely in
English. The Rossell and Baker literature review contained a number of
studies in which both treatment and control groups received varying amounts
and types of native language instruction. Those studies were excluded from
this reanalysis because they do not help address whether native language
instruction is generally superior to English-only instruction in advancing the
academic achievement of LEP students. One cannot infer from a comparison
of different amounts or types of a treatment whether that treatment is
better than no treatment. By analogy, very large doses of acetaminophen can
be lethal while low doses can help alleviate pain. The observation that large
doses are worse than low doses does not mean that no treatment is better than
low dosage.
Requiring studies to compare programs with at least some native language
instruction to programs taught almost entirely in English causes 14 of the 75
studies in Rossell and Baker's list of methodologically acceptable studies to be
excluded.2 These studies are Barik, Swain, & Nwanunobi (1977), Bruck,
Lambert, & Tucker (1977), Burkheimer, Conger, Dunteman, Elliot &
Mowbray (1989), Day & Shapson (1988), El Paso Independent School District
(1987), Genesee & Lambert (1983), Genesee, Holobow, Lambert, &
Chartrand (1989), Gersten (1985), Malherbe (1946), McConnell (1980a),
Medina & Escamilla (1992), Melendez (1980), Stern (1975), and Vasquez
(1990).
A third clarification of the standards is defining what constitutes a sufficient
statistical control for differences in the backgrounds of the treatment and
control groups. To avoid omitted variable bias it is important that all
pre-treatment differences between groups be controlled statistically or by
random assignment. Because background characteristics can have such a
strong effect on educational outcomes, failing to control fully for background
differences can easily lead to an erroneous conclusion. Unfortunately, a large
number of studies listed as methodologically acceptable by Rossell and Baker
failed to control for any background characteristic other than prior test scores.
As Campbell and Erlebacher (1970) demonstrated many years ago, controlling
only for prior test scores is often inadequate because background differences
usually influence the rate of test score growth, not just the level of test scores.
Controlling only for prior test scores without any other controls for
background differences does not adjust for what the test score trajectory would
have been in the absence of treatment. Thus, studies controlling only for prior
test scores are likely to be plagued by omitted variable bias.
To be included in this meta-analysis studies had to assign students at random
to treatment and control groups or control statistically for prior test scores and
at least one other background characteristic, such as family income, parent's
education, and so on. Requiring only one control in addition to prior test score
is a very lax standard. Even studies that control for prior test scores and one
other background characteristic may still suffer from omitted variable bias.
Nevertheless, setting the standard in this way provides a clear decisive rule for
including studies without having to make more subjective judgments about
whether the set of background controls fully adjusts for pre-treatment
differences. If the study made the effort to control for prior test scores and at
least some other background characteristics, it was considered sufficient.
Defining sufficient control for background differences in this way caused 25
studies to be excluded from this re-analysis of Rossell and Baker's literature
review: Alvarez (1975); Ames & Bicks (1978); Balasubramonian, Seelye,
Elizondo, de Weffer (1973); Barik & Swain (1975); Bates (1970); Carsrud &
Curtis (1980); Ciriza (1990a); Cohen (1975); Cotrell (1971); Curiel (1979); de
Weffer (1972); de la Garza & Marcella (1985); Educational Operations
Concepts (1991a); Lampman (1973); Legerreta (1979); Lum (1971);
Maldonado (1974); Matthews (1979); Moore & Parr (1978); Pena-Hughes &
Solis (1979); Prewitt Diaz (1979); Stebbins, St. Pierre, Proper, Anderson, &
Carva (1977); Valladolid (1991); Yap, Enoki, & Ishitani (1988); and Zirkel
(1972). Two of these studies, Valladolid (1991) and Yap et al. (1988), did not
control statistically for any pre-treatment differences, not even prior test
scores. The other 23 studies controlled only for prior test scores.
Three other considerations caused studies listed as methodologically
acceptable by Rossell and Baker to be excluded from this meta-analysis. First,
there were several studies that could not be found.3 If a study was not
available to be reviewed it could not be included in this reanalysis, causing the
following five studies to be excluded: American Institutes for Research
(1975b); Lambert & Tucker (1972); McSpadden (1979); Morgan (1971); and
Ramos, Aguilar, & Sibayan (1967).
Second, Rossell and Baker included in their list of 75 methodologically
acceptable studies several citations for studies by the same authors of the same
programs. That is, Rossell and Baker would count the release of each year's
results by the same authors of the same program as if it were an independent
and new study. This is problematic because results from multiple years of a
program combined into one report would count for less than the same results
released in a separate report each year. For the purposes of this meta-analysis, the
results of the same program by the same authors were combined into one
observation regardless of whether the authors released their results in one
report or many. A total of fifteen studies were eliminated as independent
observations as a result: Ariza (1988); Barik & Swain (1978); Cohen,
Fathman, & Marino (1976); Curiel, Stenning, & Cooper-Stenning (1980);
Danoff, Coles, McLaughlin, & Reynolds (1977b, 1978a, 1978b); Educational
Operations Concepts (1991b); El Paso Independent School District (1990,
1992); Genesee, Lambert, & Tucker (1977); McConnell (1980b, 1980c);
McSpadden (1980); and Teschner (1990).
Third, three additional studies from Rossell and Baker's list of 75 were
excluded from this meta-analysis because they were not evaluations of
bilingual programs. One is about "direct instruction" (Becker, 1982) and
makes no mention of second language learning. Another is a list of exemplary
bilingual programs (Campeau, 1975), not an evaluation of the programs. Yet
another is primarily about the effects of retention (being held back a grade)
(Webb, 1987).
Beginning with a list of 75 "methodologically acceptable" studies compiled by
Rossell and Baker and applying clarified standards that Rossell and Baker had
used for selecting those studies leaves us with 11 studies that actually meet
those standards.
Of course, there is no reason that a meta-analysis should only consider
methodologically acceptable studies. It would also be appropriate to include
all relevant studies or a sample of all relevant studies in a meta-analysis. This
meta-analysis looks only at studies that meet certain criteria in an effort to
determine the reliability of the literature review conducted by Rossell and
Baker. That literature review established certain criteria for inclusion and this
meta-analysis attempts to follow the guidelines established by Rossell and
Baker.
A meta-analysis of all studies of the effectiveness of bilingual education would
not only have to include more than the 75 studies that Rossell and Baker
describe as methodologically acceptable, but would also have to examine the
more than 300 studies that Rossell and Baker say they reviewed to identify
their list of 75 acceptable studies. Additionally, there are probably several
hundred more studies that could be included if one wished to review a
comprehensive list of studies. This meta-analysis does not do so because it is
primarily an effort to determine the reliability of the Rossell and Baker review
and so takes their list of acceptable studies and their criteria as the points of
departure. A comprehensive meta-analysis of several hundred studies could
and should be done, but it was beyond the scope of the present study.
Conducting a Meta-Analysis on the 11 "Methodologically Acceptable"
Studies
The procedures employed for aggregating the results of these 11 studies
followed conventional meta-analytical strategies (Rosenthal, 1991). An effect
size and a z-score were calculated for each study for all results measured in
English, reading results measured in English, math results measured in
English, and, where available, all tests taken in Spanish. To calculate the effect
size for each study for all tests taken in English, for example, the average of all
English test results reported in a study was computed. That average was
adjusted for the sample size and standardized in units of standard deviation to
produce a single effect size for each study, known as Hedges' g. An effect size of 1 would mean
that the average result in that study indicated that students taught at least some
of the time in their native language outperformed a control group taught only
in English by 1 standard deviation on some measure of academic achievement.
The effect size was combined across the 11 studies by simply taking an
average of the 11 effect sizes.
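The Hedges' g computation described above can be sketched in a few lines. The means, standard deviations, and group sizes in the example are hypothetical; the summary statistics actually available varied from study to study.

```python
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with Hedges' small-sample correction."""
    # Pooled standard deviation across the treatment and control groups
    df = n_t + n_c - 2
    sp = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sp
    # Correction factor J shrinks d slightly when samples are small
    j = 1 - 3 / (4 * df - 1)
    return j * d

# Hypothetical example: the treatment group outscores the control
# group by half a standard deviation in two groups of 18 students.
g = hedges_g(55.0, 50.0, 10.0, 10.0, 18, 18)  # about .49
```

The correction factor matters here because several of the 11 studies have very small samples; without it, the simple standardized difference would slightly overstate their effects.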
A single z-score indicating the confidence we have in the effect size was
calculated for each subject area for each study. That z-score for each study
was calculated by converting whatever statistical measure of confidence in
results was reported into a z-score and then taking the average of those
z-scores within each study. In this way all of the results within each study
were condensed into a single effect size and z-score for each study in each
subject category. Table 1 contains the effect sizes and z-scores for each study
for all tests taken in English, reading results in English, and all Spanish test
results.4 This table tells us, for example, that the Bacon et al. (1982) study had an
average effect size for all results on English tests of .79 standard deviations.
That is, students in a bilingual program outperformed their English-only
counterparts by .79 standard deviations on average for all tests taken in
English. The z-score for this effect size is 2.39, suggesting that this positive
result was unlikely to have occurred by chance.
The z-scores of the 11 studies were combined by computing the sum of each
study's average z-score and then dividing by the square root of the number of
z-scores that were combined, in this case 11 (Rosenthal 1991, p. 85). This
formula calculates a combined z-score by measuring how different the
observed distribution of z-scores is from a normal distribution with a mean of
0 and a standard deviation of 1. That is, if there were no effect of bilingual
education on test scores, we would expect that the distribution of z-scores
from the analyzed studies should be normally distributed with an average of 0.
By chance, some will be positive and some negative. But a combined z-score
greater than 1.96 suggests that the observed distribution of z-scores was
unlikely to have occurred by chance. From this statistic we can determine
whether there is a significantly positive or negative pattern to the results from
a number of studies.
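This combining formula (often called the Stouffer method) can be sketched directly. Feeding it the rounded per-study English z-scores from Table 1 approximately reproduces the combined z-score of 2.41 reported in Table 2; the small discrepancy reflects rounding in the tabled values.

```python
import math

def stouffer_combined_z(z_scores):
    """Combine per-study z-scores: their sum divided by the square root
    of the number of scores (Rosenthal, 1991, p. 85). Under the null
    hypothesis the combined statistic is approximately N(0, 1)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

# Per-study average z-scores for all tests taken in English (Table 1)
english_z = [2.39, 2.94, -0.39, 0.83, 0.72, 1.34,
             0.01, 0.08, 0.03, 0.24, -0.18]
combined = stouffer_combined_z(english_z)  # about 2.4
```

A combined value above 1.96 corresponds to a two-tailed p-value below .05, which is the sense in which the pattern of results is unlikely to have occurred by chance.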
Table 1. Summary of Results from Studies Included in Meta-Analysis

Study                    English       Reading       Spanish      Treatment  Control  Random
                         ES      z     ES      z     ES      z        N         N     Assignment
Bacon et al., 1982       .79   2.39    .68   2.07    NA     NA       18        18     No
Covey, 1973              .34   2.94    .74   4.87    NA     NA       86        86     Yes
Danoff et al., 1977a    -.03   -.39   -.12  -1.50    NA     NA      955       523     No
Huzar, 1973              .18    .83    .18    .83    NA     NA       43        43     Yes
Kaufman, 1968            .20    .72    .20    .72   1.65   6.05      43        31     Yes
Plante, 1976             .52   1.34    .52   1.34   1.09   2.89      16        12     Yes
Powers, 1978             .001   .01   -.33  -1.53    NA     NA       44        43     No
Ramirez et al., 1991    -.01    .08    .12    .73    NA     NA       88       160     No
Rossell, 1990            .01    .03   -.05   -.20    NA     NA      174       173     No
Rothfarb et al., 1987    .05    .24    NA     NA     .01    .09      70        49     Yes
Skoczylas, 1972          .05   -.18    .13    .46    .20    .68      25        25     No

ES = Average effect size measured in standard deviations (Hedges' g)
N = Largest number of subjects in any analysis in the study. For Huzar, 1973
and Rossell, 1990 the number of subjects in the treatment and control groups
had to be estimated by halving the total reported sample.
Combining the results from the 11 studies produces an average gain for
bilingual students relative to English-only students on all tests measured in
English of .18 standard deviations with a combined z-score of 2.41. (See Table
2.) Looking only at English test scores that measure reading, we
observe an average benefit of having at least some native language instruction
of .21 standard deviations with a combined z-score of 2.46. Both of these
results meet conventional standards of statistical significance, suggesting that
we can be confident that exposure to at least some native
language instruction has positive effects on these English test results.
Table 2. Results from the Meta-Analysis of the Effects of Bilingual Education

                                All tests    Reading        Math          All tests
                                in English   (in English)   (in English)  in Spanish
Benefit of bilingual programs
in standard deviations
(Hedges' g)                        .18          .21            .12           .74
z-score                           2.41         2.46           1.65          3.53
p-value <                          .05          .05            .10           .01
The gain on Math tests measured in English is .12 with a combined z-score of
1.65, which falls short of conventional standards of statistical significance.
This means that the effects of at least some native language instruction may be
positive, but we cannot be confident in that conclusion. The benefit of native
language instruction on Spanish test scores is considerably larger, .74 standard
deviations, and we can be highly confident of this positive result given the
combined z-score of 3.53.
To put the size of these effects in perspective, education researchers generally
consider gains of .1 standard deviation as slight, .2 or .3 of a standard
deviation as moderate, and .5 of a standard deviation as large (Hanushek,
1996; Hedges & Greenwald, 1996). For readers more familiar with normal
curve equivalent (NCE) points on standardized tests, an effect of .18 standard
deviations is equivalent to 3.8 NCEs. An advantage of .21 standard deviations
would be equivalent to 4.4 NCEs. A gain of .12 standard deviation would be
equal to 2.5 NCEs and a gain of .74 would be equal to 15.6 NCEs.
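The NCE conversions above follow directly from the NCE scale, which has a mean of 50 and a standard deviation of approximately 21.06, so an effect expressed in standard deviation units converts to NCE points by simple multiplication:

```python
# The normal curve equivalent (NCE) scale has SD of roughly 21.06,
# so an effect size in SD units times 21.06 gives NCE points.
NCE_SD = 21.06

for g in (0.18, 0.21, 0.12, 0.74):
    print(f"effect of {g:.2f} SD  ->  {g * NCE_SD:.1f} NCE points")
```

Running this reproduces the figures in the text: 3.8, 4.4, 2.5, and 15.6 NCE points respectively.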
The effects measured in the studies included in this meta-analysis occurred
after exposure to at least some native language instruction for no less than one
academic year and no more than 5 academic years. The average length of
exposure when tested was 2.2 academic years. The gains observed here
occurred, on average, within that period of time. The average grade in which
students were tested was 2.7.
A Meta-Analysis of the 5 Studies With Random Assignment Experimental
Design
Five of the 11 studies that would be considered "methodologically acceptable"
according to a reasonable interpretation of Rossell and Baker's standards were
of a higher quality experimental design because they randomly assigned
students to native language and English-only approaches. Random assignment
is a significantly better research design for evaluating the effects of native
language instruction because it greatly reduces the dangers of omitted variable
bias (Campbell & Stanley, 1963). When students are not assigned at random to
different programs there is always the possibility that different outcomes for
students are caused by differences in their backgrounds, not the effectiveness
of the programs. Given how strong the effects of differences in background are
on educational outcomes, failing to control for all background differences may
very well bias estimates of program effects. It is important to keep in mind that
"methodologically acceptable studies" were defined here as controlling for any
one background difference in addition to prior test scores, not controlling for a
full set of background differences. The 11 methodologically acceptable
studies, therefore, might include some whose effects are seriously distorted by
omitted variable bias.
If we focus on the 5 studies that avoid this bias by having the stronger research
design of random assignment, we might get a more accurate estimate of the
effects of bilingual education. Interestingly, the effect sizes are more strongly
positive and the combined z-scores are higher when we examine only the 5
random assignment studies. For all test scores measured in English the
combined effect increases to .26 standard deviations with a z-score of 2.71.
(See Table 3.) The effect for reading scores measured in English almost
doubles when we focus on the random assignment studies. The average benefit
of at least some native language instruction is .41 standard deviations with a
combined z-score of 3.47. For math scores measured in English the effect
increases slightly to .15 standard deviations, but the combined z-score drops to
1.25. And for all tests measured in Spanish the average effect size in the
random assignment studies is .92 standard deviations with a combined z-score
of 5.21. When we look at the higher quality research design studies we see
more significantly positive benefits from native language instruction.
Table 3. Results from the Meta-Analysis of the Effects of Bilingual Education
for Studies with Random Assignment to Bilingual and Control Programs.

                                All tests    Reading        Math          All tests
                                in English   (in English)   (in English)  in Spanish
Benefit of bilingual programs
in standard deviations
(Hedges' g)                        .26          .41            .15           .92
z-score                           2.71         3.47           1.25          5.21
p-value                            .01          .01            .21           .01
Comparing the Results of This Meta-Analysis to Other Reviews of the
Literature
The results of a meta-analysis of the Rossell and Baker literature review
clearly differ from the conclusions they draw. This difference, however, is not
produced by the complications of a meta-analysis or even by the elimination
of studies that fail to meet their criteria for being methodologically acceptable.
Of the 38 studies that evaluate bilingual versus English-only programs in
Rossell and Baker's list, 21 have an average positive estimated effect and 17
have an average negative estimated effect. Simply counting positive and
negative findings, a technique known as "vote counting," is less precise than a
meta-analysis because it does not consider the magnitude or confidence level
of effects. In addition, once we include unacceptable studies from Rossell and
Baker's list, we would also have to consider the methodologically
unacceptable studies advanced by supporters of bilingual education.
Nevertheless, even when studies from Rossell and Baker's list with inadequate
background controls and short measurement periods are included, we
still find that the scholarly literature favors the use of native language in
instruction.
Rossell and Baker report a different number of positive and negative studies
for a few reasons. First, they include in their report studies that are
redundant with other studies, not available, not evaluations of bilingual
programs, and do not have English-only control groups. Second, they do not
apply any consistent rule for classifying studies as positive or negative. For
example, Ramirez et al. (1991) is classified as showing "no difference" despite
having significant, positive effects for bilingual instruction in reading.
Similarly, Educational Operations Concepts (1991a, 1991b) is classified as
showing that bilingual education has a negative effect on reading scores
despite having no statistically significant effects (and the average effect is
actually positive, not negative). One of the advantages of meta-analysis is that
it forces one to be consistent in summarizing research. It is clear that Rossell
and Baker's review of studies is useful as a pool for a meta-analysis, but the
lack of rigor and consistency in how they classify studies and summarize
results prevents their conclusions from being reliable. The differences between
Rossell and Baker's conclusions and the findings of this meta-analysis are
largely a product of their lack of rigor and consistency and not the
machinations of a complicated statistical technique.
In the mid-eighties, Ann Willig (1985) conducted a meta-analysis of a
literature review by Baker and de Kanter (1981). Like Rossell and Baker,
Baker and de Kanter concluded that there were more negative studies than
positive studies about the effects of bilingual education on English test scores.
Like this meta-analysis, Willig had difficulty locating studies from the Baker
and de Kanter review, found that the interpretation of whether studies had
positive or negative results was sometimes unreliable, and found that a number
of studies were lacking an adequate methodological design. Rather than
exclude methodologically inadequate studies, as was done here, Willig
adjusted the weighting of studies in her meta-analysis based on the quality of
their research design. That technique has the benefits of including results from
more studies but is vulnerable to concerns about the validity of the weightings.
In any event, Willig found that a systematic analysis of the literature suggested
positive effects for bilingual education similar to those found in this
meta-analysis. The National Research Council's (1997) review of the bilingual
research came to conclusions similar to this meta-analysis and the one
conducted by Willig, but that literature review did not attempt to be as
systematic as these meta-analyses.
Reasons for Caution
Caution needs to be exercised when interpreting the results of this
meta-analysis. First, the list from which studies were selected to be included in
this meta-analysis is not necessarily a representative sample of all studies on
this question nor is it necessarily representative of bilingual programs in this
country. The list of studies examined here was adopted from a review of the
literature conducted by vocal critics of bilingual education, raising the
possibility that the sample understates the benefits of native language
instruction. On the other hand, it is also possible that there is a positive bias in
the types of programs that are selected for study by researchers, raising the
possibility that the average bilingual program in this country is less beneficial
than our results suggest.
Caution is also warranted given the age of the studies included in the
meta-analysis. Eight of the eleven studies analyzed in this meta-analysis were
conducted before 1983. Current bilingual programs, on average, may no
longer resemble programs that were evaluated in the late 1960s and 1970s. It
is possible that refinements in bilingual education techniques have improved
upon the approaches of some of the older programs included in this
meta-analysis, meaning that benefits of current programs may be larger. But it
is also possible that the institutionalization of bilingual education has over
time drained some of the enthusiasm and vigor that may have been found in
earlier programs, making current programs less effective than estimated here.
And caution is warranted given the limited amount of data upon which
conclusions can be drawn. Several of the studies in this meta-analysis have
limited sample sizes. While these small sample sizes are already discounted by
the statistical tests they employ and further discounted by adjustments in
calculating Hedges' g, it is nevertheless true that confidence in these results
would be greater if there were more subjects studied. In addition to the fact
that there are limited sample sizes within several of the studies included in this
meta-analysis, we should also be cautious given the limited number of studies
that are examined here. Additionally, more recent studies with larger samples
would certainly help increase confidence in any conclusions that might be
made about the effects of native language instruction.
Conclusions
While there are reasons to be cautious about the findings of this meta-analysis,
there are also some conclusions that can reasonably be made. First, it is quite
clear that the findings of the literature review conducted by Rossell and Baker
are simply not reliable. They include studies in their review that do not meet
their own standards for inclusion. Some of the studies they include cannot be
found, even by them.
Some of the studies they include in their literature review of bilingual
education are not about bilingual education. Many of the studies they include
compare different native language approaches to each other, making it very
difficult if not impossible to make inferences about the effects of English-only
approaches. Many of the studies they include fail to control for even the most
obvious differences between students assigned to bilingual and English-only
programs. In addition, Rossell and Baker sometimes claim that studies have
negative or neutral results for bilingual education when the actual results of
those studies show otherwise. If this meta-analysis proves anything, it is that
the Rossell and Baker literature review should not be the basis for policy
decisions about bilingual education. Given the prominence of the Rossell and
Baker literature review in electoral and legal discussions of this issue,
documenting the unreliability of that review is by itself an important
contribution of this meta-analysis.
Second, it is reasonable to conclude from this meta-analysis that the use of at
least some native language in instruction for LEP students is more likely to
help the average student's achievement, as measured by standardized tests in
English, than the use of only English in the instruction of those LEP students.
Because this meta-analysis only compares the use of at least some native
language to English-only approaches, we cannot draw conclusions about
whether certain native language approaches are better than others. That is, this
meta-analysis does not tell us whether it is better to have a large portion of the
day devoted to native language instruction or a small portion of the day. Nor
can this meta-analysis tell us whether students should be exposed to
instruction in their native language for many years or few years. While there
are many things that we cannot conclude from this meta-analysis, we do know
that native language can be part of beneficial approaches to teaching LEP
students. Therefore, efforts to eliminate the use of the native language in
instruction, such as Proposition 227 in California, harm children by denying
them access to beneficial approaches.
Third, it is reasonable to conclude from this meta-analysis that there is a
limited amount of high-quality research on this issue that can be used to
persuade skeptics. The methodological rigor necessary to persuade skeptics is
generally higher than what is needed to persuade those already inclined to
believe in the benefits of bilingual education. If supporters of bilingual
education want to have stronger evidence to fend off future efforts like
Proposition 227, then they would be helped if new, rigorously designed
studies could be initiated to address this issue. A series of closely studied,
random-assignment experiments should be commissioned to compare different
approaches to teaching LEP students (including English-only approaches) so
that we can know with greater certainty the effects of those different
approaches. Those most afraid of high-quality research are those who depend
on ignorance to advance their agendas.
Annotated Bibliography
Methodologically Acceptable Studies Included In The Meta-Analysis
Bacon, H. L., Kidd, G. D., et al. (1982, February). The effectiveness of
bilingual instruction with Cherokee Indian students. Journal of American
Indian Education, 34-43.
Covey, D. D. (1973). An analytical study of secondary freshmen bilingual
education and its effects on academic achievement and attitudes of Mexican
American students. Doctoral dissertation, Arizona State University. Random
assignment.
Danoff, M. N., Arias, B. M., Coles, G. J., et al. (1977a). Evaluation of
the impact of ESEA Title VII Spanish/English bilingual education program.
Palo Alto: American Institutes for Research.
Huzar, H. (1973). The effects of an English-Spanish primary grade reading
program on second and third grade students. Master's thesis, Rutgers
University. Random assignment.
Kaufman, M. (1968). Will instruction in reading Spanish affect ability in
reading English? Journal of Reading, 11, 521-527. Random assignment.
Plante, A. J. (1976). A study of effectiveness of the Connecticut "Pairing"
model of bilingual/bicultural education. Hamden: Connecticut Staff
Development Cooperative. Random assignment.
Powers, S. (1978). The influence of bilingual instruction on academic
achievement and self-esteem of selected Mexican American junior high school
students. Doctoral dissertation, University of Arizona.
Ramirez, J. D., Pasta, D. J., Yuen, S., Billings, D. K., & Ramey, D. R. (1991).
Final report: Longitudinal study of structured English immersion strategy, early-exit,
and late-exit transitional bilingual education programs for language-minority
children. Report to the U.S. Department of Education. San Mateo: Aguirre
International.
Rossell, C. H. (1990). The effectiveness of educational alternatives for
limited-English-proficient children. In G. Imhoff (Ed.), Learning in two
languages. New Brunswick: Transaction Publishers.
Rothfarb, S. H., Ariza, M. J. & Urrutia, R. (1987). Evaluation of the Bilingual
Curriculum Content (BCC) project: A three-year study, final report. Dade
County: Office of Educational Accountability.
Skoczylas, R. V. (1972). An evaluation of some cognitive and affective
aspects of a Spanish bilingual education program. Doctoral dissertation,
University of New Mexico.
Studies Excluded Because They Are Redundant
Ariza, M. (1988). Evaluating limited English proficient students' achievement:
Does curriculum content in the home language make a difference? Paper
presented at the April meetings of the American Educational Research
Association, New Orleans. Redundant with Rothfarb et al., 1987.
Barik, H., and Swain, M. (1978). Evaluation of a bilingual education program
in Canada: The Elgin Study through grade six. Switzerland: Commission
Interuniversitaire Suisse de Linguistique Appliquee. Redundant with Barik et
al. 1977.
Cohen, A. D., Fathman, A. K., & Merino, B. (1976). The Redwood City
bilingual education report, 1971-1974: Spanish and English proficiency,
mathematics, and language-use over time. Toronto: Ontario Institute for
Studies in Education. Redundant with Cohen 1975.
Curiel, H., Stenning, W., & Cooper-Stenning, P. (1980). Achieved reading level,
self-esteem, and grades as related to length of exposure to bilingual education.
Hispanic Journal of Behavioral Sciences, 2, 389-400. Redundant with Curiel,
1979.
Danoff, M. N., Coles, G. J., McLaughlin, D. H., & Reynolds, D. J. (1977b).
Evaluation of the impact of ESEA Title VII Spanish/English bilingual
education programs, Vol. I: Study design and interim findings. Palo Alto:
American Institutes for Research. Redundant with Danoff et al. 1977a.
Danoff, M. N., Coles, G. J., McLaughlin, D. H., & Reynolds, D. J. (1978a).
Evaluation of the impact of ESEA Title VII Spanish/English bilingual
education programs, Vol. III: Year two impact designs. Palo Alto: American
Institutes for Research.
Danoff, M. N., Coles, G. J., McLaughlin, D. H., & Reynolds, D. J. (1978b).
Evaluation of the impact of ESEA Title VII Spanish/English bilingual
education programs, Vol. IV: Overview of the study and findings. Palo Alto:
American Institutes for Research.
Educational Operations Concepts, Inc. (1991b). An evaluation of the Title VII
ESEA bilingual education program for Hmong and Cambodian students in
kindergarten and first grade. St. Paul. Redundant with Educational Operations
Concepts, Inc. 1991a.
El Paso Independent School District. (1990). Bilingual education evaluation:
The sixth year in a longitudinal study. El Paso: Office for Research and
Evaluation.
El Paso Independent School District. (1992). Bilingual education evaluation.
El Paso: Office for Research and Evaluation. Redundant with El Paso 1987.
Genesee, F., Lambert, W. E., & Tucker, G. R. (1977). An experiment in
trilingual education. Montreal: McGill University. Redundant with Genesee et
al 1983.
McConnell, B. B. (1980b). Individualized bilingual instruction, final
evaluation, 1978-1979 program. Pullman. Redundant with McConnell 1980a.
McConnell, B. B. (1980c). Individualized bilingual instruction for migrants. Paper presented at
the October meeting of the International Congress for Individualized
Instruction, Windsor.
McSpadden, J. R. (1980). Arcadia bilingual bicultural education program:
Interim evaluation report, 1979-80. Lafayette Parish. Redundant with
McSpadden 1979.
Teschner, R. V. (1990). Adequate motivation and bilingual education.
Southwest Journal of Instruction, 9, 1-42. Redundant with El Paso, 1990.
Studies Excluded Because They Are Unavailable
American Institutes for Research. (1975b). Bilingual education program
(Aprendamos En Dos Idiomas). Corpus Christi. Palo Alto: Identification and
Description of Exemplary Bilingual Education Programs.
Lambert, W. E., & Tucker, G. R. (1972). Bilingual education of children: The
St. Lambert experience. Rowley, MA: Newbury House.
McSpadden, J. R. (1979). Arcadia bilingual bicultural education program:
Interim evaluation report, 1978-79. Lafayette Parish.
Morgan, J. C. (1971). The effects of bilingual instruction on the English
language arts achievement of first grade children. Doctoral dissertation,
Northwestern State University of Louisiana.
Ramos, M., Aguilar, J. V., & Sibayan, B. F. (1967). The determination and
implementation of language policy (Monograph Series 2). Quezon City:
Philippine Center for Language Study.
Studies Excluded Because They Are Not Evaluations Of Bilingual
Programs
Becker, W. C. & Gersten, R. (1982). A follow-up of Follow Through: The later
effects of the Direct Instruction Model on children in fifth and sixth grades.
American Educational Research Journal, 19, 75-92.
Campeau, P. L., Roberts, A. O. H., Bowers, J. E., Austin, M., & Roberts, S. J.
(1975). The identification and description of exemplary bilingual education
programs. Palo Alto: American Institutes for Research.
Webb, J. A., Clerc, R. J., & Gavito, A. (1987). Houston Independent School
District: Comparison of bilingual and immersion programs using structural
modeling. Houston Independent School District.
Studies Excluded Because There Is Not An Appropriate Control Group
Barik, H., Swain, M. & Nwanunobi, E. A. (1977). English-French bilingual
education: The Elgin Study through grade five. Canadian Modern Language
Review, 33, 459-475.
Bruck, M., Lambert, W. E., & Tucker, G. R. (1977). Cognitive consequences
of bilingual schooling: The St. Lambert project through grade six. Linguistics,
24, 13-33.
Burkheimer, G. J., Conger, A. J., Dunteman, G. H., Elliott, B. G., & Mowbray,
K. A. (1989). Effectiveness of services for language-minority
limited-English-proficient students. Report to the U.S. Department of Education.
Day, E. M., & Shapson, S. M. (1988). Provincial assessment of early and late
French immersion programs in British Columbia, Canada. Paper presented at
the April meetings of the American Educational Research Association, New
Orleans. No background controls or individual level data reported.
El Paso Independent School District. (1987). Interim report of the five-year
bilingual education pilot 1986-1987 school year. El Paso: Office for Research
and Evaluation. No background or pretest controls.
Genesee, F., & Lambert, W. E. (1983). Trilingual education for
majority-language children. Child Development, 54, 105-114. No background
controls.
Genesee, F., Holobow, N. E., Lambert, W. E, & Chartrand, L. (1989). Three
elementary school alternatives for learning through a second language. The
Modern Language Journal, 73, 250-263. No background controls.
Gersten, R. (1985). Structured immersion for language-minority students:
Results of a longitudinal evaluation. Educational Evaluation and Policy
Analysis, 7, 187-196. No background controls.
Malherbe, E. C. (1946). The bilingual school. London: Longmans Green. No
background or pretest controls.
McConnell, B. B. (1980a). Effectiveness of individualized bilingual instruction
for migrant students. Doctoral dissertation, Washington State University.
Medina, M., & Escamilla, K. (1992). Evaluation of transitional and
maintenance bilingual programs. Urban Education, 27, 263-290.
Melendez, W. A. (1980). The effect of the language of instruction on the
reading achievement of limited English speakers in secondary schools.
Doctoral dissertation, Loyola University of Chicago. No background controls.
Stern, C. (1975). Final report to the Compton Unified School District's Title
VII Bilingual/Bicultural Project: September 1969 through June 1975.
Compton: Compton City Schools.
Vasquez, M. (1990). A longitudinal study of cohort academic success and
bilingual education. Doctoral dissertation, University of Rochester. No
background controls.
Studies Excluded Because The Effects Are Measured After An
Unreasonably Short Period
Barclay, L. (1969). The comparative efficacies of Spanish, English, and
Bilingual Cognitive Verbal Instruction with Mexican American Head Start
children. Doctoral dissertation, Stanford University. Positive Average Effect.
Layden, R. G. (1972). The relationship between the language of instruction
and the development of self-concept, classroom climate, and achievement of
Spanish speaking Puerto Rican children. Doctoral dissertation, University of
Maryland. Negative Average Effect.
Studies Excluded Because They Inadequately Control Differences
Between Bilingual And English-Only Students
Alvarez, J. (1975). Comparison of academic aspirations and achievement in
bilingual versus monolingual classrooms. Doctoral dissertation, University of
Texas at Austin. Negative Average Effect.
Ames, J., & Bicks, P. (1978). An evaluation of Title VII Bilingual/Bicultural
Program, 1977-1978 school year, final report. Community School District 22.
Brooklyn: School District of New York. Positive Average Effect.
Balasubramonian, K., Seelye, H., & Elizondo de Weffer, R. (1973). Do
bilingual education programs inhibit English language achievement: A report
on an Illinois experiment. Paper presented at the 7th Annual Convention of
Teachers of English to Speakers of Other Languages, San Juan. Positive
Average Effect.
Barik, H., & Swain, M. (1975). Three year evaluation of a large-scale early
grade French immersion program: The Ottawa Study. Language Learning, 25,
1-30. Negative Average Effect.
Bates, E. M. B. (1970). The effects of one experimental bilingual program on
verbal ability and vocabulary of first grade pupils. Doctoral dissertation, Texas
Tech University. Negative Average Effect.
Carsrud, K., & Curtis, J. (1980). ESEA Title VII Bilingual Program: Final
report. Austin: Austin Independent School District. No statistical tests
reported. Positive Average Effect.
Ciriza, F. (1990a). Evaluation report of the Preschool Project for
Spanish-speaking children, 1989-1990. San Diego: Planning, Research and
Evaluation Division. San Diego City Schools. Positive Average Effect.
Cohen, A. D. (1975). A sociolinguistic approach to bilingual education.
Rowley, MA: Newbury House Press. Negative Average Effect.
Cottrell, M. C. (1971). Bilingual education in San Juan Co., Utah: A
cross-cultural emphasis. Paper presented at the April meetings of the American
Educational Research Association, New York City. Negative Average Effect.
Curiel, H. (1979). A comparative study investigating achieved reading level,
self-esteem, and achieved grade point average given varying participation.
Doctoral dissertation, Texas A&M. Negative Average Effect.
de la Garza, J. V., & Marcella, M. (1985). Academic achievement as
influenced by bilingual instruction for Spanish-dominant Mexican American
children. Hispanic Journal of Behavioral Sciences, 7, 247-259. Positive
Average Effect.
de Weffer, R. C. E. (1972). Effects of first language instruction in academic
and psychological development of bilingual children. Doctoral dissertation,
Illinois Institute of Technology. Positive Average Effect.
Educational Operations Concepts, Inc. (1991a). An evaluation of the Title VII
ESEA Bilingual Education Program for Hmong and Cambodian students in
junior and senior high school. St. Paul. Positive Average Effect.
Lampman, H. P. (1973). Southeastern New Mexico bilingual program: Final
report. Artesia: Artesia Public Schools. Positive Average Effect.
Legarreta, D. (1979). The effects of program models on language acquisition
by Spanish-speaking children. TESOL Quarterly, 13, 521-534. Positive
Average Effect.
Lum, J. B. (1971). An effectiveness study of English as a second language
(ESL) and Chinese bilingual methods. Doctoral dissertation, University of
California, Berkeley. Negative Average Effect.
Maldonado, J. R. (1974). The effect of the ESEA Title VII Program on the
cognitive development of Mexican American students. Doctoral dissertation,
University of Houston. Negative Average Effect.
Matthews, T. (1979). An investigation of the effects of background
characteristics and special language services on the reading achievement and
English fluency of bilingual students. Seattle: Seattle Public Schools:
Department of Planning, Research and Evaluation. Negative Average Effect.
Moore, F. B. & Parr, G. D. (1978). Models of bilingual education:
Comparisons of effectiveness. The Elementary School Journal, 79, 93-97.
Negative Average Effect.
Pena-Hughes, E., & Solis, J. (1980). ABC's. McAllen: McAllen Independent
School District. Positive Average Effect.
Prewitt Diaz, J. O. (1979). An analysis of the effects of a bicultural curriculum
on monolingual Spanish ninth graders as compared with monolingual English
and bilingual ninth graders with regard to language development, attitude
toward school, and self-concept. Doctoral dissertation, University of
Connecticut. Positive Average Effect.
Stebbins, L. B., St. Pierre, R. G., Proper, E. C., Anderson, R. B., & Cerva,
T. R. (1977). Education as experimentation: A Planned Variation Model, Vol.
IV-A. An evaluation of follow through. Cambridge: ABT Associates. Positive
Average Effect.
Valladolid, L. A. (1991). The effects of bilingual education on students'
academic achievement as they progress through a bilingual program. Doctoral
dissertation, United States International University. No background or pretest
controls. Negative Average Effect.
Yap, K. O., Enoki, D. Y., & Ishitani, P. (1988). SLEP student achievement:
Some pertinent variables and policy implications. Paper presented at the April
meetings of the American Educational Research Association, New Orleans.
No background or pretest controls. Negative Average Effect.
Zirkel, P. A. (1972). An evaluation of the effectiveness of selected
experimental bilingual education programs in Connecticut. Doctoral
dissertation, University of Connecticut. Positive Average Effect.
Other Sources
Baker, K. A. & de Kanter, A. A. (1981). Effectiveness of bilingual education:
A review of the literature. Washington, D.C.: U.S. Department of Education,
Office of Planning, Budget and Evaluation.
Campbell, D. T. & Stanley, J. C. (1963). Experimental and quasi-experimental
designs for research. Chicago: Rand McNally.
Campbell, D. T. & Erlebacher, A. E. (1970). How regression artifacts in
quasi-experimental evaluations can mistakenly make compensatory education
look harmful. In J. Hellmuth (Ed.), Compensatory education: A national
debate, Vol. 3: Disadvantaged child. New York: Brunner/Mazel.
Greene, J. (1998). A meta-analysis of the effectiveness of bilingual education.
Tomás Rivera Policy Institute, Public Policy Clinic of the Department of
Government at the University of Texas at Austin, and the Program on
Education Policy and Governance at Harvard University. Available at http://
ourworld.compuserve.com/homepages/jwcrawford/greene.htm
Hanushek, E. A. (1996). School resources and student performance. In G.
Burtless (Ed.), Does money matter (pp. 43-73). Washington, DC: Brookings.
Hedges, L. V. & Greenwald, R. (1996). Have times changed? In G. Burtless
(Ed.), Does money matter (pp. 74-92). Washington, DC: Brookings.
National Research Council. (1997). Improving schooling for language-minority
children: A research agenda. Washington, DC: National Academy Press.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury
Park: Sage Publications.
Rossell, C. H., & Baker, K. (1996). The educational effectiveness of bilingual
education. Research in the Teaching of English, 30.
Willig, A. (1985). A meta-analysis of selected studies on the effectiveness of
bilingual education. Review of Educational Research, 55.
Author's Note
This research was made possible with the support of the Tomás Rivera Policy
Institute, the Harvard Program on Education Policy and Governance, and the
Public Policy Clinic of the Department of Government at the University of
Texas at Austin. An earlier version of this article appeared as Greene (1998).
The author would like to thank Rudy de la Garza, Elsa Del Valle-Gaster, Luis
Guevara, Kenji Hakuta, Christine Rossell, and Jim Yates for their helpful
comments and assistance with this project.
Notes
1. The published article actually lists 72, but a mimeo of the citations provided
by Christine Rossell lists 75.
2. Some of these 14 studies would have been excluded from this reanalysis for
other reasons as well.
3. Christine Rossell graciously agreed to swap studies that she had that were
otherwise difficult to find for studies she was missing. Yet even she did not
have copies of the five studies that could not be found.
4. Math results were not reported to enhance the readability of the table, given
that the combined effects for math are not statistically significant.